The Problem
I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.
A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?
My First Idea
Since I know jQuery, I thought I might use it to strip out the table formatting on screen, insert commas and line breaks, and just copy the whole mess into Notepad and save it as a CSV. Any better ideas?
The Solution
Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.
Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.
Excel can open an HTTP page.
Eg:
Click File, Open
Under File name, paste the URL of the page containing the table
Click OK
Excel does its best to convert the HTML to a table.
It's not the most elegant solution, but it does work!
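If you'd rather script that same trick, pandas can fetch a page and convert its tables in one call. A minimal sketch, assuming a hypothetical URL and that lxml (or html5lib) is installed for the parser:

```python
import pandas as pd

# Hypothetical URL of the page containing the table.
url = "https://example.com/report.html"

# read_html returns one DataFrame per <table> found on the page
# (it needs lxml or html5lib installed as the underlying parser).
tables = pd.read_html(url)

# Write the first table to a CSV ready for a spreadsheet.
tables[0].to_csv("report.csv", index=False)
```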
However, the Excel route is a manual solution, not an automated one.
For bulk, automated scraping, you can combine grequests and BeautifulSoup to download and parse large quantities of pages from a structured website.
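A minimal sketch of that combination, assuming a hypothetical paginated URL pattern; grequests fetches the pages concurrently and BeautifulSoup pulls the table rows out of each response:

```python
import csv

import grequests  # grequests monkey-patches via gevent at import time, so import it early
from bs4 import BeautifulSoup

# Hypothetical URL pattern for a paginated, structured site.
urls = ["https://example.com/report?page=%d" % n for n in range(1, 51)]

# Fetch up to 10 pages concurrently.
responses = grequests.map((grequests.get(u) for u in urls), size=10)

with open("report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for resp in responses:
        if resp is None or not resp.ok:
            continue  # failed downloads come back as None
        soup = BeautifulSoup(resp.text, "html.parser")
        for tr in soup.select("table tr"):
            cells = tr.find_all(["td", "th"])
            writer.writerow([c.get_text(strip=True) for c in cells])
```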
This is my Python version using the (currently) latest version of BeautifulSoup, which can be obtained with, e.g., pip install beautifulsoup4.
The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.
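A minimal sketch of such a script, using Python's csv module so that quoting and embedded commas are handled correctly:

```python
#!/usr/bin/env python3
import csv
import sys

from bs4 import BeautifulSoup

def main():
    # Parse the HTML arriving on stdin.
    soup = BeautifulSoup(sys.stdin.read(), "html.parser")
    writer = csv.writer(sys.stdout)
    # Emit every row of every table as one CSV row.
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = tr.find_all(["td", "th"])
            writer.writerow([c.get_text(strip=True) for c in cells])

if __name__ == "__main__":
    main()
```

Save it as, say, table2csv.py (a hypothetical name) and run it as python table2csv.py < page.html > page.csv.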
If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the HTML along with some scripting to generate a CSV.
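A sketch of that approach, assuming a hypothetical table id of "results" and no nested tables; regex parsing of HTML is fragile, which is why the BeautifulSoup answers are generally preferable:

```python
import csv
import re
import sys

TABLE_ID = "results"  # hypothetical id of the table to extract

html = sys.stdin.read()

# Grab just the table with the given id (fragile: assumes no nested tables).
match = re.search(
    r'<table[^>]*\bid="%s"[^>]*>(.*?)</table>' % re.escape(TABLE_ID),
    html,
    re.S | re.I,
)
if not match:
    sys.exit("no table with id=%r found" % TABLE_ID)

writer = csv.writer(sys.stdout)
for row_html in re.findall(r"<tr\b[^>]*>(.*?)</tr>", match.group(1), re.S | re.I):
    cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row_html, re.S | re.I)
    # Strip any markup left inside each cell before writing it out.
    writer.writerow([re.sub(r"<[^>]+>", "", c).strip() for c in cells])
```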
Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:
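A minimal sketch of that idea, assuming a spanning cell's text should be repeated into every grid position it covers:

```python
import csv
import sys

from bs4 import BeautifulSoup

def table_to_rows(table):
    """Expand a <table> tag into a rectangular grid of strings,
    repeating each cell's text across its rowspan/colspan area."""
    grid = {}
    n_rows = n_cols = 0
    for r, tr in enumerate(table.find_all("tr")):
        c = 0
        for cell in tr.find_all(["td", "th"]):
            # Skip columns already claimed by a rowspan from a row above.
            while (r, c) in grid:
                c += 1
            text = cell.get_text(strip=True)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            # Fill every grid position this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            n_rows = max(n_rows, r + rowspan)
            n_cols = max(n_cols, c + colspan)
            c += colspan
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

soup = BeautifulSoup(sys.stdin.read(), "html.parser")
writer = csv.writer(sys.stdout)
for table in soup.find_all("table"):
    for row in table_to_rows(table):
        writer.writerow(row)
```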