Why does this JSON file get filled with the same object 1747 times?

Posted: 2019-08-06 02:43

Question:

I'm using the following code to generate a JSON file containing all category information for a particular website.

require 'mechanize'

@hashes = []

@categories_hash = {}
@categories_hash['category'] ||= {}
@categories_hash['category']['id'] ||= {}
@categories_hash['category']['name'] ||= {}
@categories_hash['category']['group'] ||= {}

# Initialize Mechanize object
a = Mechanize.new

# Begin scraping
a.get('http://www.marktplaats.nl/') do |page|
  groups = page.search('//*[(@id = "navigation-categories")]//a')

  groups.each_with_index do |group, index_1|
    a.get(group[:href]) do |page_2|
      categories = page_2.search('//*[(@id = "category-browser")]//a')

      categories.each_with_index do |category, index_2|
        @categories_hash['category']['id'] = "#{index_1}_#{index_2}"
        @categories_hash['category']['name'] = category.text
        @categories_hash['category']['group'] = group.text

        @hashes << @categories_hash['category']

        # Uncomment if you want to see what's being written
        puts @categories_hash['category'].to_json
      end
    end
  end
end

File.open("json/magic/#{Time.now.strftime '%Y%m%d%H%M%S'}_magic_categories.json", 'w') do |f|
  puts '# Writing category data to JSON file'
  f.write(@hashes.to_json)
  puts "|-----------> Done. #{@hashes.length} written."
end

puts '# Finished.'

But this code produces a JSON file filled with just the last category's data, repeated over and over. For the full JSON file take a look here. This is a sample:

[
   {
      "id":"36_17",
      "name":"Overige Diversen",
      "group":"Diversen"
   },
   {
      "id":"36_17",
      "name":"Overige Diversen",
      "group":"Diversen"
   },
   {
      "id":"36_17",
      "name":"Overige Diversen",
      "group":"Diversen"
   }, {...}
]

The question is, what's causing this and how can I solve it?

Answer 1:

The same object, the one returned by @categories_hash['category'], is being updated on every iteration of the loop.

Thus the array is filled with 1747 references to that single object, and when it is viewed later the object only reflects the mutations made on the last iteration.
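A minimal, self-contained sketch of the same aliasing effect (hypothetical names, not taken from the scraper):

item = {}
list = []

3.times do |i|
  item['id'] = i   # mutates the one and only hash object
  list << item     # appends another reference to that same object
end

puts list.inspect  # => [{"id"=>2}, {"id"=>2}, {"id"=>2}]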


While a fix might be to key the hash differently, e.g. @categories_hash[category_name] or similar (i.e. fetch or create a distinct object on each iteration), the following avoids both the problem described and the unused/misused nested 'category' hash.

categories.each_with_index do |category, index_2|
  # creates a new Hash object
  item = {
    id: "#{index_1}_#{index_2}",
    name: category.text,
    group: group.text
  }
  # adds the new (per iteration) object
  @hashes << item
end
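As a side note, the symbol keys used here (:id, :name, :group) serialize to the same JSON keys ("id", "name", "group") as the original string keys, so the format of the output file does not change.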

Alternatively, a more "functional" approach is to use map, which solves the problem in the same way: by creating new Hash objects. (This could be expanded to also cover the outer loop, as sketched after the snippet below; only the inner loop is shown here for a taste.)

h = categories.each_with_index.map do |category, index_2|
  {
    id: "#{index_1}_#{index_2}",
    name: category.text,
    group: group.text
  }
end
@hashes.concat(h)
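For completeness, a hedged sketch of expanding the same idea to the outer loop with flat_map, assuming the same a, groups, and XPath expressions from the question (not run against the live site):

@hashes = groups.each_with_index.flat_map do |group, index_1|
  page_2 = a.get(group[:href])  # Mechanize#get returns the fetched page
  categories = page_2.search('//*[(@id = "category-browser")]//a')

  # build a fresh hash per category; flat_map flattens the per-group arrays
  categories.each_with_index.map do |category, index_2|
    {
      id: "#{index_1}_#{index_2}",
      name: category.text,
      group: group.text
    }
  end
end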