I'm trying to parse a table but I don't know how to save the data from it. I want the data from each row to be saved like this:
['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
The sample table is:
html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>2,094</td>
<td>0,017</td>
<td>0,098</td>
<td>0,113</td>
<td>0,452</td>
</tr>
.
.
.
<tr>
<th>Raw name 5</th>
<td>2,094</td>
<td>0,017</td>
<td>0,098</td>
<td>0,113</td>
<td>0,452</td>
</tr>
</table>
EOT
My scraper's code is:
doc = Nokogiri::HTML(open(html), nil, 'UTF-8')
tables = doc.css('div.open')
@tablesArray = []
tables.each do |table|
title = table.css('tr[1] > th').text
cell_data = table.css('tr > td').text
raw_name = table.css('tr > th').text
@tablesArray << Table.new(cell_data, raw_name)
end
render template: 'scrape_krasecology'
end
end
When I try to display the data in the HTML page, it looks like all the column names are stored in a single array element, and all the cell data ends up the same way.
I assume you borrowed some code from here or a related reference (apologies if this is the wrong one): http://quabr.com/34781600/ruby-nokogiri-parse-html-table.
However, if you want to capture all the rows, you can change the code as follows:
Hope this helps you solve your problem.
Best wishes
Your desired output is nonsense:
I'll assume you want quoted numbers.
After stripping the stuff that keeps the code from working, and reducing the HTML to a more manageable example, then running it:
Which results in:
The first thing to notice is you're not using `title` though you assign to it. Possibly that happened when you were cleaning up your code as an example.
`css`, like `search` and `xpath`, returns a NodeSet, which is akin to an array of Nodes. When you use `text` or `inner_text` on a NodeSet it returns the text of each node concatenated into a single string.
This is its behavior:
Instead, you should iterate over each node found, and extract its text individually. This is covered many times here on SO:
That can be reduced to:
See "How to avoid joining all text from Nodes when scraping" also.
The docs say this about `content`, `text` and `inner_text` when used with a Node:
Instead, you need to go after the individual node's text:
Which now results in:
You can figure out how to coerce the quoted numbers into decimals acceptable to Ruby, or manipulate the inner arrays however you want.
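For the coercion, a small sketch in plain Ruby follows. It assumes the comma is a decimal separator (so `2,094` means 2.094), which the question does not state; `to_decimal` is a hypothetical helper name.

```ruby
# Hypothetical helper: treats the comma as a decimal separator.
# Float() raises ArgumentError on anything that is not a valid number.
def to_decimal(text)
  Float(text.tr(',', '.'))
end

row = ['Raw name 1', '2,094', '0,017', '0,098', '0,113', '0,452']

# Split the row name from the numeric cells, then convert the cells.
name, *values = row
numbers = values.map { |value| to_decimal(value) }

converted = [name, *numbers]
p converted # => ["Raw name 1", 2.094, 0.017, 0.098, 0.113, 0.452]
```

If exact decimal arithmetic matters, `BigDecimal` from the standard library could be swapped in for `Float`.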
The crux of the problem is that calling `#text` on multiple results returns the concatenation of the `#text` of each individual element.
Let's examine what each step does:
Now that we know what is wrong, here is a possible solution: