Sunday, July 25, 2010

Scrape proper content from Wikipedia

Scrape the 1st paragraph & image from a Wikipedia entry

Sometimes we need an automated way of getting a description, and possibly an image too, for the keywords that we use in our projects...

One good place to look for this content is Wikipedia. But...
When you search for the company Apple on Wikipedia, what you get is probably not the page you wanted.
And we can't expect our users to type the full, unambiguous name of the keyword, like 'Apple Inc.'.

The solution is to use a Google search combined with Wikipedia.
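
The trick is to restrict the Google query to en.wikipedia.org and take the first result as the Wikipedia page for the keyword. A quick sketch of just that query-building step (the full code below does the same thing inside get_wiki_name):

keyword = "Apple"
# Restrict the search to en.wikipedia.org so Google disambiguates the keyword for us
search_url = "http://www.google.com/search?q=#{keyword.strip.gsub(/\s+/, '+')}+site%3Aen.wikipedia.org"
# The href of the first organic result is the Wikipedia page we scrape next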

Here is the code for getting the description & image from Wikipedia (hoping there is a wiki page for the keyword we search for... unless it is something really crazy).

require 'hpricot'
require 'open-uri'

# Returns the scraped description text and the image URL for the given keyword
def fetch_description(query_item)
  page_title, uri_title = get_wiki_name(query_item)
  return get_wiki_description(page_title, uri_title)
end

def upload_photo(wiki_photo)
  begin
    base_uri = URI.parse(wiki_photo)
    uploaded_data = open(base_uri)
    # Singleton methods can't see local variables, so capture the filename
    # first and expose it through original_filename
    filename = base_uri.path.split('/').last
    uploaded_data.instance_variable_set(:@original_filename, filename)
    def uploaded_data.original_filename; @original_filename; end
    return uploaded_data.original_filename.to_s.empty? ? nil : uploaded_data
  rescue
    return nil
  end
end


# Method to fetch the wiki page, pick up the first image and strip the HTML
# tags from the first few <p> tags
def get_wiki_description(page_title, uri_title)
  url = uri_title
  final_content = ""
  wiki_photo = nil
  if url.size > 10
    buffer = Hpricot(open(url, "User-Agent" => "reader" + rand(10000).to_s).read)
    body_content = buffer.search("//div[@id='content']").search("//div[@id='bodyContent']")
    # The first image on the page is usually the infobox/thumbnail picture
    image = body_content.search("//img").first
    wiki_photo = image.attributes["src"] if image
    # Capture the first three paragraphs of text
    content = body_content.search("//p")[0..2]
    # Remove the extra spaces and strip HTML tags from the fetched content
    content.each do |c|
      final_content += c.inner_html.gsub(/<\/?[^>]*>/, '').gsub(/&#\d+;/, '').gsub(/\([^\)]+\)/, '').gsub(/\[[^\]]+\]/, '').gsub(/ +/, ' ') + "\n"
    end
  end
  return final_content, wiki_photo
end
# Method to get the link to the Wikipedia page from the Google search results
def get_wiki_name(query_item)
  search_keywords = query_item.strip.gsub(/\s+/, '+')
  url = "http://www.google.com/search?q=#{search_keywords}+site%3Aen.wikipedia.org"
  begin
    doc = Hpricot(open(url, "User-Agent" => "reader" + rand(10000).to_s).read)
    result = doc.search("//div[@id='ires']").search("//li[@class='g']").first.search("//a").first if doc
  rescue
    return '', ''
  end
  if result
    return result.inner_html.gsub(/<\/?[^>]*>/, '').gsub(/&#\d+;/, ''), result.attributes["href"]
  else
    return '', ''
  end
end


wiki_description, wiki_photo = fetch_description("Apple")
upload_photo(wiki_photo)
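
If you are calling this from a Rails app (which is where original_filename matters), the returned IO object can be handed more or less straight to an attachment plugin. A rough sketch, assuming a hypothetical Keyword model whose photo attachment (attachment_fu/Paperclip style) accepts an uploaded-file-like object:

# Hypothetical Keyword model; adjust the attribute names to your own setup
keyword = Keyword.new(:name => "Apple", :description => wiki_description)
keyword.photo = upload_photo(wiki_photo)   # the IO responds to original_filename
keyword.save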

Note: After all of this is done... please make sure to give credit to Wikipedia :)

