Sunday, July 25, 2010

Scrape proper content from Wikipedia

Scrape the 1st paragraph & image from a Wikipedia entry

Sometimes we need an automated way of getting a description, and possibly an image too, for the keywords that we use in our projects...

One good place to look for this content is Wikipedia. But...
When you search for the company Apple on Wikipedia, what you get is probably not the page you wanted.
And we can't expect our users to type the full, unambiguous name of the keyword, like 'Apple Inc.'.

The solution is to use a Google search combined with Wikipedia.
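
The trick is to restrict the Google query to en.wikipedia.org and take the first result as the Wikipedia page for the keyword. A quick sketch of just that query-building step (the full code below does the same thing inside get_wiki_name):

keyword = "Apple"
# Restrict the search to en.wikipedia.org so Google disambiguates the keyword for us
search_url = "http://www.google.com/search?q=#{keyword.strip.gsub(/\s+/, '+')}+site%3Aen.wikipedia.org"
# The href of the first organic result is the Wikipedia page we scrape next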

Here is the code for getting the description & image from Wikipedia (hoping there is a wiki page for the keyword we search for... unless it is something really crazy).

require 'hpricot'
require 'open-uri'

# Returns the scraped description text and the image URL for the given keyword
def fetch_description(query_item)
  page_title, uri_title = get_wiki_name(query_item)
  return get_wiki_description(page_title, uri_title)
end

def upload_photo(wiki_photo)
  begin
    base_uri = URI.parse(wiki_photo)
    uploaded_data = open(base_uri)
    # Singleton methods can't see local variables, so capture the filename
    # first and expose it through original_filename
    filename = base_uri.path.split('/').last
    uploaded_data.instance_variable_set(:@original_filename, filename)
    def uploaded_data.original_filename; @original_filename; end
    return uploaded_data.original_filename.to_s.empty? ? nil : uploaded_data
  rescue
    return nil
  end
end


# Method to fetch the wiki page, pick up the first image and strip the HTML
# tags from the first few <p> tags
def get_wiki_description(page_title, uri_title)
  url = uri_title
  final_content = ""
  wiki_photo = nil
  if url.size > 10
    buffer = Hpricot(open(url, "User-Agent" => "reader" + rand(10000).to_s).read)
    body_content = buffer.search("//div[@id='content']").search("//div[@id='bodyContent']")
    # The first image on the page is usually the infobox/thumbnail picture
    image = body_content.search("//img").first
    wiki_photo = image.attributes["src"] if image
    # Capture the first three paragraphs of text
    content = body_content.search("//p")[0..2]
    # Remove the extra spaces and strip HTML tags from the fetched content
    content.each do |c|
      final_content += c.inner_html.gsub(/<\/?[^>]*>/, '').gsub(/&#\d+;/, '').gsub(/\([^\)]+\)/, '').gsub(/\[[^\]]+\]/, '').gsub(/ +/, ' ') + "\n"
    end
  end
  return final_content, wiki_photo
end
# Method to get the link to the Wikipedia page from the Google search results
def get_wiki_name(query_item)
  search_keywords = query_item.strip.gsub(/\s+/, '+')
  url = "http://www.google.com/search?q=#{search_keywords}+site%3Aen.wikipedia.org"
  begin
    doc = Hpricot(open(url, "User-Agent" => "reader" + rand(10000).to_s).read)
    result = doc.search("//div[@id='ires']").search("//li[@class='g']").first.search("//a").first if doc
  rescue
    return '', ''
  end
  if result
    return result.inner_html.gsub(/<\/?[^>]*>/, '').gsub(/&#\d+;/, ''), result.attributes["href"]
  else
    return '', ''
  end
end


wiki_description, wiki_photo = fetch_description("Apple")
upload_photo(wiki_photo)
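
If you are calling this from a Rails app (which is where original_filename matters), the returned IO object can be handed more or less straight to an attachment plugin. A rough sketch, assuming a hypothetical Keyword model whose photo attachment (attachment_fu/Paperclip style) accepts an uploaded-file-like object:

# Hypothetical Keyword model; adjust the attribute names to your own setup
keyword = Keyword.new(:name => "Apple", :description => wiki_description)
keyword.photo = upload_photo(wiki_photo)   # the IO responds to original_filename
keyword.save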

Note: After all of this is done... please make sure to give credit to Wikipedia :)

