189 Gathering Ruby Quiz 2 Data

Description

I’m building the new Ruby Quiz website and I need your help…

This week’s quiz involves gathering the existing Ruby Quiz 2 data from the Ruby Quiz website.

Each quiz entry contains the following information:

  • an id and title
  • a description
  • a summary
  • a list of solutions

There are also many quiz solutions that belong to each quiz. The quiz solutions have the following:

  • an author
  • a ruby-talk message reference
  • the solution text/code

Matthew has some advice for getting at the data:

If you start at http://splatbang.com/rubyquiz/, you’ll see that the quizzes listed on the left are all links to the same quiz.rhtml file (embedded Ruby), but with different id parameters. Each parameter is the name of a subdirectory. So, for example, take quiz #184, which has a link like this:

http://splatbang.com/rubyquiz/quiz.rhtml?id=184_Befunge

So there is a subdirectory called “184_Befunge”. There are basically three files in every directory:

  • quiz.txt – the quiz description
  • sols.txt – a list of author names and the ruby-talk message # of the submission
  • summ.txt – the quiz summary

Examples:

  • http://splatbang.com/rubyquiz/184_Befunge/quiz.txt
  • http://splatbang.com/rubyquiz/184_Befunge/sols.txt
  • http://splatbang.com/rubyquiz/184_Befunge/summ.txt

Your program will collect and output this data as YAML (or your favorite data serialization format: XML, JSON, etc.).

Summary

This quiz was an exercise in web scraping. As more and more information becomes available on the internet, it is useful to have a programmatic way to access it. This can be done through web APIs, but not all websites have such APIs available, and not all information is available via the APIs that do exist. Scraping may be against the terms of use for some sites, and smaller sites may suffer if large amounts of data are being pulled, so be sure to ask permission and be prudent!

The one solution to this week’s quiz comes from Peter Szinek, using scRUBYt. Despite being just over fifty lines long, there is a lot packed in here, so let’s dive in.
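Before looking at the scRUBYt code, here is a minimal hand-rolled sketch (not part of Peter’s solution; the quiz_files helper is just for illustration) of fetching the three files for a single quiz directory with open-uri, to make the directory layout from the description concrete:

# A rough sketch, not from the solution: fetch the three files for one
# quiz directory by hand, following the layout described above.
require 'open-uri'

def quiz_files(quiz_dir, base = 'http://splatbang.com/rubyquiz')
  %w[quiz.txt sols.txt summ.txt].inject({}) do |files, name|
    files.merge(name => open("#{base}/#{quiz_dir}/#{name}").read)
  end
end

# quiz_files('184_Befunge').keys #=> ["quiz.txt", "sols.txt", "summ.txt"]

Peter’s solution does this and more, with the crawling and extraction expressed declaratively.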

Here we begin by setting up a scRUBYt Extractor and pointing it at the main Ruby Quiz 2 page.

require 'rubygems'
require 'scrubyt'
require 'nokogiri'
# scrape the stuff with scRUBYt!
data = Scrubyt::Extractor.define do
  fetch 'http://splatbang.com/rubyquiz/'

The ‘quiz’ pattern sets up a node in the XML document, retrieving elements that match the XPath. Here that yields all the links in the side area, that is, links to all the quizzes.

  quiz "//div[@id='side']/ol/li/a[1]" do
    link_url do
      quiz_id /id=(\d+)/
      quiz_link /id=(.+)/ do

These next two sections download the description and summary for each quiz. They are saved into temporary files to be loaded into the XML document at the end. Notice the use of lambda: it takes in the match from /id=(.+)/ in quiz_link. For example, when the link is ‘quiz.rhtml?id=157_The_Smallest_Circle’, the regex matches ‘157_The_Smallest_Circle’ and passes it into the lambda, which returns “http://splatbang.com/rubyquiz/157_The_Smallest_Circle/quiz.txt”, the URL of that quiz’s description text. The summary is gathered in the same fashion.

        quiz_desc_url(lambda {|quiz_dir| "http://splatbang.com/rubyquiz/#{quiz_dir}/quiz.txt"}, :type => :script) do
          quiz_dl 'descriptions', :type => :download
        end
        quiz_summary_url(lambda {|quiz_dir| "http://splatbang.com/rubyquiz/#{quiz_dir}/summ.txt"}, :type => :script) do
          quiz_dl 'summaries', :type => :download
        end
      end # closes the quiz_link block
    end   # closes the link_url block
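To make concrete what these patterns compute, here is the same regex and lambda logic applied to one quiz link in plain Ruby, outside of scRUBYt:

# Plain-Ruby illustration of the captures and the URL-building lambda above.
link = 'http://splatbang.com/rubyquiz/quiz.rhtml?id=157_The_Smallest_Circle'

link[/id=(\d+)/, 1]            #=> "157" (what quiz_id captures)
quiz_dir = link[/id=(.+)/, 1]  #=> "157_The_Smallest_Circle" (what quiz_link captures)

desc_url = lambda {|dir| "http://splatbang.com/rubyquiz/#{dir}/quiz.txt"}
desc_url.call(quiz_dir)        #=> "http://splatbang.com/rubyquiz/157_The_Smallest_Circle/quiz.txt"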

This next part gets all the solutions for each quiz. It follows the link_url from the side area. Once on the quiz page it creates a node for each solution, again using XPath to get all the links in the list on the side. Each solution is populated with an author, taken from the text of the HTML anchor tag, and a ruby_talk_reference, taken from the tag’s href attribute. To get the solution text it follows (resolves) the link and returns the text within the //pre[1] element, again specified with XPath. The text node is added as a child node of the solution.

    quiz_detail :resolve => "http://splatbang.com/rubyquiz" do
      solution "/html/body/div/div[2]/ol/li/a" do
        author lambda {|solution_link_text| solution_link_text}, :type => :script
        ruby_talk_reference "href", :type => :attribute
        solution_detail :resolve => :full do
          text "//pre[1]"
        end
      end
    end

This select_indices call limits the quiz gathering to just the first three quizzes, which is useful for testing since we don’t want to traverse the entire site just to see whether the code works. I removed it when gathering the full dataset.

  end.select_indices(0..2)
end
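To see what the solution pattern is pulling out of each quiz page, here is a small self-contained Nokogiri snippet; the markup is made up to stand in for the real page, with the anchor text playing the role of author and the href playing the role of ruby_talk_reference:

# Toy illustration with made-up markup; not the real quiz page.
require 'rubygems'
require 'nokogiri'

page = Nokogiri::HTML(<<-HTML)
  <div id="side"><ol>
    <li><a href="http://example.com/ruby-talk/123456">Jane Hacker</a></li>
    <li><a href="http://example.com/ruby-talk/123457">John Coder</a></li>
  </ol></div>
HTML

page.xpath('//ol/li/a').each do |a|
  puts "#{a.text} -> #{a['href']}"   # author -> ruby_talk_reference
end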

This next part, using Nokogiri, loads the temporarily saved description and summary files and inserts their contents into the XML document. It also removes the link_url nodes to clean up the final output so that it matches the format specified in the quiz.

result = Nokogiri::XML(data.to_xml)

(result/"//quiz").each do |quiz|
  quiz_id = quiz.text[/\s(\d+)\s/,1].to_i
  file_index = quiz_id > 157 ? "_#{(quiz_id - 157)}" : ""
  (quiz/"//link_url").first.unlink
  
  desc = Nokogiri::XML::Element.new("description", quiz.document)
  desc.content =open("descriptions/quiz#{file_index}.txt").read
  quiz.add_child(desc)
  
  summary = Nokogiri::XML::Element.new("summary", quiz.document)
  summary.content =open("summaries/summ#{file_index}.txt").read
  quiz.add_child(summary)
end
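The Nokogiri calls used here, element creation, content assignment, and add_child, can be tried out on a throwaway document:

# Throwaway example of the calls above; not part of the solution.
require 'rubygems'
require 'nokogiri'

doc  = Nokogiri::XML("<quiz><quiz_id>157</quiz_id></quiz>")
desc = Nokogiri::XML::Element.new("description", doc)
desc.content = "Find the smallest circle..."
doc.root.add_child(desc)
puts doc.to_xml   # the <quiz> root now contains a <description> child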

And finally, save the result to an XML file on the filesystem:

open("ruby_quiz_archive.xml", "w") {|f| f.write result}

This was my first experience with scRUBYt and it took me a little while to “get it”. It packs a lot of power into a concise syntax and is definitely worth considering for your next web scraping project.


Saturday, February 07, 2009