Best Rails HTML Parser [closed]

Posted 2019-02-10 19:35

I know that Hpricot is still a standard, but I remember hearing about a faster, more expressive HTML parser for Ruby.

Does anybody know what it's called, and whether it is worth switching to from Hpricot?

Thanks in advance

4 answers
疯言疯语
#2 · 2019-02-10 20:20

There are multiple tools available. I use Nokogiri.

Demo:

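require 'rubygems' # only needed on Ruby 1.8; later versions load RubyGems automatically
require 'nokogiri'

# Parse an HTML fragment into a document
doc = Nokogiri::HTML(%{
  <h1 class="title">Hello, World</h1>
  <p>Some text</p>
  <a href="http://www.google.com/">Some link</a>
})

# at_css returns the first node matching a CSS selector
title   = doc.at_css("h1.title").text  # "Hello, World"
content = doc.at_css("p").text         # "Some text"
url     = doc.at_css("a")[:href]       # "http://www.google.com/"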

Ryan Bates made an excellent screencast about using it: #190: Screen Scraping with Nokogiri.

Documentation: http://nokogiri.org/

Tutorials: http://nokogiri.org/tutorials

贪生不怕死
#3 · 2019-02-10 20:23

There is also Rubyful Soup, which sells itself as a lightweight, quick-and-dirty parser. I found the interface very intuitive and 'Ruby-ish' when I used it for a project in the past, which is perhaps a little surprising given that it is a Python port.

Edit: unfortunately it looks like it's no longer maintained, so it's probably not the one you were looking for. Nokogiri is most likely the one you've been hearing about.

劫难
#4 · 2019-02-10 20:27

Don't use regular expressions -- Ruby's regex handling is way too slow for parsing HTML. Hpricot is awesome, and Nokogiri looks promising, though I haven't used it directly yet.

太酷不给撩
#5 · 2019-02-10 20:41

You are probably thinking of Nokogiri. I have not used it myself, but "everyone" is talking about it, and the benchmarks do look interesting:

                        user    system     total          real
hpricot:html:doc   48.930000  3.640000 52.570000 ( 52.900035)
hpricot2:html:doc   4.500000  0.020000  4.520000 (  4.518984)
nokogiri:html:doc   3.640000  0.130000  3.770000 (  3.770642)
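
If you want to produce numbers like these yourself: three values followed by one in parentheses is the layout Ruby's standard Benchmark module prints (user CPU, system CPU, total, and real elapsed time, all in seconds). Below is a minimal sketch of such a comparison, not the script behind the figures above. The names sample_html, PARSERS and time_parsers are made up for illustration, the input document is synthetic, and any gem that is not installed is simply skipped, so your absolute numbers will differ.

```ruby
require 'benchmark'

# Build a synthetic HTML document with `rows` paragraphs.
# (Hypothetical test input; the original benchmark's document is unknown.)
def sample_html(rows)
  body = (1..rows).map do |i|
    %(<p class="row">Row #{i} <a href="/item/#{i}">link</a></p>)
  end.join("\n")
  "<html><body>\n#{body}\n</body></html>"
end

# Parsers to compare, keyed by label. A parser is only registered if its
# gem loads, so the script still runs when one of them is missing.
PARSERS = {}

begin
  require 'nokogiri'
  PARSERS['nokogiri:html:doc'] = ->(html) { Nokogiri::HTML(html) }
rescue LoadError
  warn 'nokogiri is not installed, skipping it'
end

begin
  require 'hpricot'
  PARSERS['hpricot:html:doc'] = ->(html) { Hpricot(html) }
rescue LoadError
  warn 'hpricot is not installed, skipping it'
end

# Parse `html` `iterations` times with each parser and return a hash of
# label => Benchmark::Tms (which holds the user/system/total/real times).
def time_parsers(parsers, html, iterations)
  parsers.each_with_object({}) do |(label, parse), results|
    results[label] = Benchmark.measure(label) do
      iterations.times { parse.call(html) }
    end
  end
end

html = sample_html(200)
time_parsers(PARSERS, html, 20).each do |label, tms|
  # Tms#to_s already ends with a newline, so print rather than puts.
  print label.ljust(20), tms.to_s
end
```

Raise the row count and the iteration count until each run takes at least a few seconds; very short runs mostly measure noise.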