The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?

My First Idea

Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into Notepad and save it as a CSV. Any better ideas?

The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.

Tags: screen-scraping

11 Answers

对你真心纯属浪费

Reply #2 · 2020-01-27 02:13

Even easier (because it saves it for you for next time) ...

In Excel

Data/Import External Data/New Web Query

will take you to a URL prompt. Enter your URL, and it will delimit the available tables on the page to import. Voilà.

小情绪 Triste *

Reply #3 · 2020-01-27 02:14

Two ways come to mind (especially for those of us that don't have Excel):

- Google Spreadsheets has an excellent importHTML function:
  - =importHTML("http://example.com/page/with/table", "table", index)
  - Index starts at 1
  - I recommend copying and pasting as values shortly after import
  - File -> Download as -> CSV
- Python's superb Pandas library has handy read_html and to_csv functions
  - Here's a basic Python3 script that prompts for the URL, which table at that URL, and a filename for the CSV.
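
The script itself is not reproduced in this post, so the following is only a minimal sketch of what such a script could look like, not the original. The names `table_to_csv` and `main` are made up for illustration. It assumes pandas is installed along with a parser that `read_html` can use (lxml, or beautifulsoup4 plus html5lib), and it keeps the 1-based table numbering used in the importHTML bullet.

```python
import sys

import pandas as pd  # read_html also needs lxml, or beautifulsoup4 + html5lib


def table_to_csv(source, index, csv_path):
    """Write table number `index` (1-based) found in `source` to `csv_path`.

    `source` may be a URL, a file path, or a file-like object holding HTML.
    """
    tables = pd.read_html(source)  # one DataFrame per <table> found
    if not 1 <= index <= len(tables):
        raise IndexError(f"found {len(tables)} table(s), asked for #{index}")
    frame = tables[index - 1]
    frame.to_csv(csv_path, index=False)  # index=False: no row-number column
    return frame


def main():
    # Prompt for the three inputs the post describes.
    url = input("URL of the page: ").strip()
    index = int(input("Which table (1 = first)? "))
    csv_path = input("CSV filename: ").strip()
    frame = table_to_csv(url, index, csv_path)
    print(f"Wrote {len(frame)} rows to {csv_path}")


# Only prompt when attached to a terminal, so importing the file does nothing.
if __name__ == "__main__" and sys.stdin.isatty():
    main()
```

If the page needs a login or cookies, `read_html` on the bare URL will likely not see the table; in that case, saving the page from the browser and passing the file path instead should still work.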

成全新的幸福

Reply #4 · 2020-01-27 02:16

Have you tried opening it with Excel? If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses. In a web app I wrote, I output this HTML format so the user can export to Excel.
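
To illustrate the export direction described here, below is a rough Python sketch that renders rows as a plain HTML table. This is an assumption-laden simplification: the markup Excel itself writes when saving as HTML is more elaborate, and `rows_to_html_table` is a hypothetical helper, not code from the web app mentioned above.

```python
from html import escape


def rows_to_html_table(rows):
    """Render rows (first row = header) as a bare HTML table document."""
    header, *body = rows
    lines = ["<html><body><table>"]
    # Header cells use <th>; escape() keeps <, > and & from breaking the markup.
    lines.append(
        "<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in header) + "</tr>"
    )
    for row in body:
        lines.append(
            "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        )
    lines.append("</table></body></html>")
    return "\n".join(lines)
```

Writing the returned string to a file and opening that file from Excel is the idea; how cleanly Excel imports it may depend on the Excel version.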

贼婆χ

Reply #5 · 2020-01-27 02:20

Quick and dirty:

Copy out of browser into Excel, save as CSV.

Better solution (for long term use):

Write a bit of code in the language of your choice that will pull the HTML contents down and scrape out the bits that you want. You could probably throw in all of the data operations (sorting, averaging, etc.) on top of the data retrieval. That way, you just have to run your code and you get the actual report that you want.

It all depends on how often you will be performing this particular task.
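
As one possible shape for the long-term approach, here is a sketch in Python using only the standard library. All names (`TableRows`, `scrape_rows`, `column_mean`, `save_csv`, `fetch`) are made up for illustration. It assumes the page holds a single, non-nested table whose first row is a header, and that the page is UTF-8.

```python
import csv
import statistics
import urllib.request
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> on the page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_rows(html):
    """Return all table rows in `html` as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def column_mean(rows, column):
    """Average a numeric column, skipping the header row."""
    return statistics.mean(float(row[column]) for row in rows[1:])


def save_csv(rows, path):
    """Write the rows to `path`; the csv module handles quoting."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def fetch(url):
    """Download a page and decode it (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

A run would then be something like `rows = scrape_rows(fetch(url))`, followed by `save_csv(rows, "report.csv")` and whatever sorting or averaging you need, for example `sorted(rows[1:], key=lambda r: float(r[1]))`.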

别忘想泡老子

Reply #6 · 2020-01-27 02:21

Using Python:

For example, imagine you want to scrape forex quotes in CSV form from a site like fxquotes.

Then...

# Python 3 version; needs: pip install beautifulsoup4
import os
import urllib.request

from bs4 import BeautifulSoup

# Build the query URL from its date-range and currency parts.
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# Download the page; the CSV text sits inside the first <pre> element.
html = urllib.request.urlopen(fx_url).read()
soup = BeautifulSoup(html, 'html.parser')
data = soup.find('pre').get_text()

# Save the extracted text; os.path.join adds the missing path separator.
file_location = '/Users/location_edit_this'
file_name = os.path.join(file_location, 'usd_aus.csv')
with open(file_name, 'w') as f:
    f.write(data)

Edit: to get values from a table (example from palewire):

# Python 3 version; needs: pip install mechanize beautifulsoup4
from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html, "html.parser")

# The data lives in the table that has border="1".
table = soup.find("table", border="1")

# Skip the header row, then pull the four fields out of each data row.
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print("|".join(record))
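
The loop above prints pipe-delimited lines. Since the question asks for a CSV file, one option is to hand the same tuples to Python's csv module, which takes care of quoting fields that contain commas. A small sketch, where `records_to_csv` is a hypothetical helper and the header names are only an assumption about what you would want to call the columns:

```python
import csv
import io


def records_to_csv(records, header=None):
    """Return CSV text for an iterable of row tuples, quoting where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header:
        writer.writerow(header)
    writer.writerows(records)
    return buffer.getvalue()
```

Collecting each `record` into a list inside the loop and then writing `records_to_csv(records, header=("rank", "artist", "album", "cover_link"))` to a file would give a spreadsheet-ready result.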

How can I scrape an HTML table to CSV?
