Screen scraping using the Mechanize gem

What do you do when you want to extract some data from a system, but there is no neat API for that particular service?

Well, you resort to screen scraping, of course! It's something that big sites have done since the dawn of internet time.

Screen scraping is simple to understand: you send a request for a particular page and then you analyze the HTML that you get back. You may have to send many requests to get Page 2, Page 3 and so on.

So, it’s easy to comprehend, but how do you do it?

Well, I do it using the Mechanize gem, a really capable gem that so far has solved all my scraping problems. Normally it takes less than a day to get the code working.

Here is a simple snippet (in Ruby, of course) that will get you started:


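require "mechanize"
require "active_support/core_ext/object/to_query" # provides Hash#to_query outside Rails
agent = Mechanize.new
# hash is a Hash holding the query parameters for the request
uri = URI::HTTPS.build(host: "somehost.com", path: "/some/path/", query: hash.to_query)
page = agent.get(uri)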
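Note that `hash.to_query` comes from ActiveSupport, so it is only available in Rails or when ActiveSupport is loaded. As a minimal sketch, here is one way the same URL could be built with just the standard library. The parameter names `q` and `page` are made up for illustration.

```ruby
require "uri"

# Hypothetical search parameters; the real names depend on the site you scrape.
params = { q: "bike", page: 2 }

# URI.encode_www_form turns the hash into an escaped query string.
uri = URI::HTTPS.build(
  host: "somehost.com",
  path: "/some/path/",
  query: URI.encode_www_form(params)
)

puts uri # prints https://somehost.com/some/path/?q=bike&page=2
```

The resulting `uri` can be passed to `agent.get` in the same way.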

Then you search the page with CSS selectors, using commands similar to these (String#remove comes from ActiveSupport):

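# Prices as plain digit strings, without the currency sign and thousands separators
prc = page.search(".listing h3 span.price").map { |x|
  x.text.remove("€").remove(",")
}

# Listing titles
hdr = page.search(".listing h3 a").map { |x|
  x.text
}

# Image URLs, taken from the inline background-image style
img = page.search("a.image_mask").map { |x|
  s = x.attributes["style"].to_s
  s.remove!("background-image:url('")
  s.remove!("');")
  s
}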
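If ActiveSupport is not available, the same clean-up can be done in plain Ruby. The sketch below is only one possible approach; the helper names `clean_price` and `background_image_url` are hypothetical, and the regular expression assumes the style attribute contains a single `url(...)`.

```ruby
# Plain-Ruby helpers, no gems required.

# Strips the currency sign and thousands separators, e.g. "€1,250" becomes "1250".
def clean_price(text)
  text.delete("€,").strip
end

# Pulls the URL out of an inline style such as
# "background-image:url('/img/1.jpg');". Returns nil when there is no url(...).
def background_image_url(style)
  match = style.to_s.match(/url\(\s*['"]?(.*?)['"]?\s*\)/)
  match && match[1]
end
```

These can be dropped into the blocks above, for example `page.search(".listing h3 span.price").map { |x| clean_price(x.text) }`.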

It’s actually quite easy to get going. The only obstacles may be working out how to access multiple pages and coping with unusual page structure.
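For the multiple-pages case, one common pattern is to follow the "next page" link until it disappears. The sketch below assumes the site has a link whose text is "Next"; the method name `scrape_all_pages`, the link text and the page limit are all assumptions, not something every site will match. It relies on Mechanize's `page.search`, `page.link_with` and `link.click`.

```ruby
# Collects the text of every node matching `selector`, following the
# "next page" link until there is none. `agent` is expected to behave like a
# Mechanize agent. The link text and the page limit are assumptions.
def scrape_all_pages(agent, start_url, selector:, next_text: "Next", max_pages: 50)
  results = []
  page = agent.get(start_url)

  # max_pages guards against sites whose "next" link never goes away.
  max_pages.times do
    results.concat(page.search(selector).map { |node| node.text.strip })

    next_link = page.link_with(text: next_text) # nil when no such link exists
    break unless next_link

    page = next_link.click # fetches the linked page
  end

  results
end

# Usage (requires the mechanize gem):
#   require "mechanize"
#   titles = scrape_all_pages(Mechanize.new, "https://somehost.com/some/path/",
#                             selector: ".listing h3 a")
```

Sites that paginate through a query parameter instead of a link can be handled by looping over page numbers and building a new URI for each request.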
