As the title says, I have some DOM manipulation tasks. For example, I want to:
- find all H1 elements that are blue.
- find all text with a font size of 12px.
- etc.
How can I do this with Rails?
Thank you. :)
Update
I have been researching how to extract web page content, based on this paper: http://www.springerlink.com/index/A65708XMUR9KN9EA.pdf
In summary, the steps are:
- get the URL of the web page I want to extract from (a single web page)
- grab some elements from the page based on visual rules (e.g. grab all H1 elements that are blue)
- process the elements with my algorithm
- save the result to my database.
-sorry for my bad English-
If what you're trying to do is manipulate HTML documents inside a Rails application, you should take a look at Nokogiri.
It lets you search a document with XPath (or CSS selectors). The following finds every link directly inside an h1 whose class attribute is exactly "blue". Note that this matches the class name, not the rendered color.
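require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; plain Kernel#open no longer fetches URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open('http://www.stackoverflow.com'))

# Print the text of each matching link.
doc.xpath('//h1/a[@class="blue"]').each do |link|
  puts link.content
end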
However, if what you want is to inspect the DOM of the page as it is rendered in the browser, you should take a look at JavaScript and jQuery. Rails runs on the server and can't do that.
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
To reliably determine what color an arbitrary element on a web page is, you would need to reverse engineer a browser (to accurately take into account stylesheets, markup hacks, broken tags, images, etc).
A far easier approach would be to embed an existing browser engine such as Gecko into a custom application of your own.
As your spider browses pages, it would pass them to your embedded Gecko instance, where you could use getComputedStyle to read the color of an individual element.
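As a rough illustration of the computed-style idea, here is a sketch that swaps the embedded Gecko for headless Firefox driven through the selenium-webdriver gem, which is a different technique from embedding. It assumes the gem, Firefox, and geckodriver are available. The bluish? helper and its thresholds are arbitrary assumptions, not a standard definition of "blue".

```ruby
# Hypothetical helper: true when a computed color string such as
# "rgb(0, 0, 255)" or "rgba(0, 0, 255, 1)" is predominantly blue.
# The thresholds below are arbitrary assumptions.
def bluish?(css_color)
  match = css_color.to_s.match(/rgba?\(\s*(\d+),\s*(\d+),\s*(\d+)/)
  return false unless match

  red, green, blue = match.captures.map(&:to_i)
  blue >= 128 && blue > red * 2 && blue > green * 2
end

# Load the page in headless Firefox and return the text of every h1
# whose computed color is bluish.
def blue_headings(url)
  require 'selenium-webdriver' # loaded lazily so bluish? works without the gem

  options = Selenium::WebDriver::Firefox::Options.new(args: ['-headless'])
  driver = Selenium::WebDriver.for(:firefox, options: options)
  begin
    driver.get(url)
    driver.find_elements(tag_name: 'h1')
          .select { |h1| bluish?(h1.css_value('color')) } # computed color
          .map(&:text)
  ensure
    driver.quit
  end
end

# Only starts a browser when a URL is passed on the command line.
if (url = ARGV.first)
  puts blue_headings(url)
end
```

Because the browser resolves stylesheets and inheritance before css_value is read, this covers cases that matching on class names or inline styles would miss, at the cost of starting a full browser for each page.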
You originally mentioned wanting to use Ruby on Rails for this project, but Rails is a framework for writing presentational applications and is really a bad fit for a project like this.
As a starting point, I'd recommend you check out RubyGnome, and in particular RubyGnome's Gtk::MozEmbed functionality.