Parsing large HTML files with Nokogiri

Posted 2019-07-22 05:33

I'm trying to parse http://www.pro-medic.ru/index.php?ht=246&perpage=all with Nokogiri, but unfortunately I can't get all items from the page.

My simple test code is:

require 'open-uri'
require 'nokogiri'

# URI.open is used because Kernel#open no longer accepts URLs on Ruby 3.0+
html = Nokogiri::HTML(URI.open('http://www.pro-medic.ru/index.php?ht=246&perpage=all'))
p html.css('ul.products-grid-compact li .goods_container').count

It returns only 83 items but the real count is about 186.

I thought the problem might be in open, but that function seems to read the HTML page correctly.

Has anybody faced the same problem?

Tags: ruby nokogiri
1 answer
Viruses.
Reply #2 · 2019-07-22 06:16

The file seems to exceed the built-in limits of the parser Nokogiri uses (libxml2). You can relax those limits by adding the HUGE flag:

require 'open-uri'
require 'nokogiri'

url = 'http://www.pro-medic.ru/index.php?ht=246&perpage=all'
# URI.open is used because Kernel#open no longer accepts URLs on Ruby 3.0+
html = Nokogiri::HTML(URI.open(url)) do |config|
  config.options |= Nokogiri::XML::ParseOptions::HUGE
end
html.css('ul.products-grid-compact li .goods_container').count
#=> 186

Note that |= is the bitwise OR assignment operator; don't confuse it with the logical OR assignment operator ||=.
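To make the difference concrete, here is a small pure-Ruby sketch. The constants RECOVER and HUGE below are stand-in bit flags defined only for illustration; they are not loaded from Nokogiri, and the variable names are hypothetical.

```ruby
# Stand-in bit flags for illustration only (not loaded from Nokogiri).
RECOVER = 1 << 0
HUGE    = 1 << 19

# Bitwise OR assignment: keeps the bits already set and adds HUGE.
options = RECOVER
options |= HUGE
both_set = (options & RECOVER) != 0 && (options & HUGE) != 0

# Logical OR assignment: assigns only when the left side is nil or false.
# `other` already holds a truthy Integer, so HUGE is never added.
other = RECOVER
other ||= HUGE
huge_missing = (other & HUGE).zero?

puts "|=  sets both flags: #{both_set}"
puts "||= leaves HUGE unset: #{huge_missing}"
```

With ||= the existing options value is left untouched, so the parser would keep running with its default limits.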

According to the Nokogiri::XML::ParseOptions documentation, you can also set this flag by calling config.huge inside the block.
