I'm using the following code to generate a JSON file containing all category information for a particular website.
The goal is to have a JSON file with the following format:
[
  {
    "id":"36_17",
    "name":"Diversen Particulier",
    "group":"Diversen",
    "search_attributes":{
      "0":"Prijs van/tot",
      "1":"Groep en Rubriek",
      "2":"Conditie"
    }
  },
  {
    "id":"36_18",
    "name":"Diversen Zakelijk",
    "group":"Diversen",
    "search_attributes":{
      "0":"Prijs van/tot",
      "1":"Groep en Rubriek",
      "2":"Conditie"
    }
  },
  {
    "id":"36_19",
    "name":"Overige Diversen",
    "group":"Diversen",
    "search_attributes":{
      "0":"Prijs van/tot",
      "1":"Groep en Rubriek",
      "2":"Conditie"
    }
  }, {...}
]
But I keep getting this format:
[
  {
    "id":"36_17",
    "name":"Diversen Particulier",
    "group":"Diversen",
    "search_attributes":{"0":"Prijs van/tot"}
  },
  {
    "id":"36_17",
    "name":"Diversen Particulier",
    "group":"Diversen",
    "search_attributes":{"1":"Groep en Rubriek"}
  },
  {
    "id":"36_17",
    "name":"Diversen Particulier",
    "group":"Diversen",
    "search_attributes":{"2":"Conditie"}
  }, {...}
]
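The search_attributes are not being saved correctly: each attribute ends up in its own item, instead of all attributes of a category being grouped in one item.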
I'm using the following code:
require 'mechanize'
require 'json' # provides to_json
@hashes = []
# Initialize Mechanize object
a = Mechanize.new
# Begin scraping
a.get('http://www.marktplaats.nl/') do |page|
  groups = page.search('//*[(@id = "navigation-categories")]//a')
  groups.each_with_index do |group, index_1|
    a.get(group[:href]) do |page_2|
      categories = page_2.search('//*[(@id = "category-browser")]//a')
      categories.each_with_index do |category, index_2|
        a.get(category[:href]) do |page_3|
          search_attributes = page_3.search('//*[contains(concat( " ", @class, " " ), concat( " ", "heading", " " ))]')
          search_attributes.each_with_index do |attribute, index_3|
            item = {
              id: "#{index_1}_#{index_2}",
              name: category.text,
              group: group.text,
              :search_attributes => {
                index_3.to_s => "#{attribute.text unless attribute.text == 'Outlet '}"
              }
            }
            @hashes << item
            puts item
          end
        end
      end
    end
  end
end
# Open file and begin
File.open("json/light/#{Time.now.strftime '%Y%m%d%H%M%S'}_light_categories.json", 'w') do |f|
  puts '# Writing category data to JSON file'
  f.write(@hashes.to_json)
  puts "|-----------> Done. #{@hashes.length} written."
end
puts '# Finished.'
What is causing this, and how do I solve it?
Update
A big thanks to arie-shaw for his answer.
Here's the working code:
require 'mechanize'
require 'json' # provides to_json
@hashes = []
# Initialize Mechanize object
a = Mechanize.new
# Begin scraping
a.get('http://www.marktplaats.nl/') do |page|
  groups = page.search('//*[(@id = "navigation-categories")]//a')
  groups.each_with_index do |group, index_1|
    a.get(group[:href]) do |page_2|
      categories = page_2.search('//*[(@id = "category-browser")]//a')
      categories.each_with_index do |category, index_2|
        a.get(category[:href]) do |page_3|
          search_attributes = page_3.search('//*[contains(concat( " ", @class, " " ), concat( " ", "heading", " " ))]')
          # Collect all attributes of this category into a single hash
          attributes_hash = {}
          search_attributes.each_with_index do |attribute, index_3|
            attributes_hash[index_3.to_s] = "#{attribute.text unless attribute.text == 'Outlet '}"
          end
          # Build one item per category, after the attribute loop has finished
          item = {
            id: "#{index_1}_#{index_2}",
            name: category.text,
            group: group.text,
            :search_attributes => attributes_hash
          }
          @hashes << item
          puts item
        end
      end
    end
  end
end
# Open file and begin
File.open("json/light/#{Time.now.strftime '%Y%m%d%H%M%S'}_light_categories.json", 'w') do |f|
  puts '# Writing category data to JSON file'
  f.write(@hashes.to_json)
  puts "|-----------> Done. #{@hashes.length} written."
end
puts '# Finished.'
The innermost each_with_index should only be used to build the search_attributes hash, rather than to create a separate element of the top-level array on every iteration.
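
To see the pattern without hitting the website, here is a minimal sketch that uses in-memory sample data in place of the scraped pages. SAMPLE_CATEGORIES and build_items are hypothetical names made up for this illustration, and the group index 36 is just taken from the example output above; only the loop structure is the point.

```ruby
require 'json'

# Hypothetical stand-in for the scraped pages: category name => heading texts.
SAMPLE_CATEGORIES = {
  'Diversen Particulier' => ['Prijs van/tot', 'Groep en Rubriek', 'Conditie'],
  'Diversen Zakelijk'    => ['Prijs van/tot', 'Groep en Rubriek', 'Conditie']
}.freeze

# Hypothetical helper: returns one item per category.
# The inner loop only fills the nested hash; the item is built after it ends.
def build_items(categories, group_name, group_index)
  categories.each_with_index.map do |(name, headings), category_index|
    attributes_hash = {}
    headings.each_with_index do |heading, attribute_index|
      attributes_hash[attribute_index.to_s] = heading
    end

    {
      id: "#{group_index}_#{category_index}",
      name: name,
      group: group_name,
      search_attributes: attributes_hash
    }
  end
end

items = build_items(SAMPLE_CATEGORIES, 'Diversen', 36)
puts JSON.pretty_generate(items)
```

Each category shows up once in the resulting array, with all of its headings collected under search_attributes. Moving the item construction back inside the inner loop would reproduce the one-item-per-attribute output from the question.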