Is there a way I can specify and get data from a web site URL on to a CSV file for analysis using R?
问题:
回答1:
In the simplest case, just do
X <- read.csv(url("http://some.where.net/data/foo.csv"))
plus which ever options read.csv()
may need.
Long answer: Yes this can be done and many packages have use that feature for years. E.g. the tseries packages uses exactly this feature to download stock prices from Yahoo! for almost a decade:
R> library(tseries)
Loading required package: quadprog
Loading required package: zoo
‘tseries’ version: 0.10-24
‘tseries’ is a package for time series analysis and computational finance.
See ‘library(help="tseries")’ for details.
R> get.hist.quote("IBM")
trying URL 'http://chart.yahoo.com/table.csv? ## manual linebreak here
s=IBM&a=0&b=02&c=1991&d=5&e=08&f=2011&g=d&q=q&y=0&z=IBM&x=.csv'
Content type 'text/csv' length unknown
opened URL
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
........
downloaded 258 Kb
Open High Low Close
1991-01-02 112.87 113.75 112.12 112.12
1991-01-03 112.37 113.87 112.25 112.50
1991-01-04 112.75 113.00 111.87 112.12
1991-01-07 111.37 111.87 110.00 110.25
1991-01-08 110.37 110.37 108.75 109.00
1991-01-09 109.75 110.75 106.75 106.87
[...]
This is all exceedingly well documented in the manual pages for help(connection)
and help(url)
. Also see the manul on 'Data Import/Export' that came with R.
回答2:
Often data on webpages is in the form of an XML table. You can read an XML table into R using the package XML.
In this package, the function
readHTMLTable(<url>)
will look through a page for XML tables and return a list of data frames (one for each table found).
回答3:
Beside of read.csv(url("..."))
you also can use read.table("http://...")
.
Example:
> sample <- read.table("http://www.ats.ucla.edu/stat/examples/ara/angell.txt")
> sample
V1 V2 V3 V4 V5
1 Rochester 19.0 20.6 15.0 E
2 Syracuse 17.0 15.6 20.2 E
...
43 Atlanta 4.2 70.6 32.6 S
>
回答4:
base
read.csv
without the url
function just works fine. Probably I am missing something if Dirk Eddelbuettel included it in his answer:
ad <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
head(ad)
X TV radio newspaper sales
1 1 230.1 37.8 69.2 22.1
2 2 44.5 39.3 45.1 10.4
3 3 17.2 45.9 69.3 9.3
4 4 151.5 41.3 58.5 18.5
5 5 180.8 10.8 58.4 12.9
6 6 8.7 48.9 75.0 7.2
Another options using two popular packages:
data.table
library(data.table)
ad <- fread("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
head(ad)
V1 TV radio newspaper sales
1: 1 230.1 37.8 69.2 22.1
2: 2 44.5 39.3 45.1 10.4
3: 3 17.2 45.9 69.3 9.3
4: 4 151.5 41.3 58.5 18.5
5: 5 180.8 10.8 58.4 12.9
6: 6 8.7 48.9 75.0 7.2
readr
library(readr)
ad <- read_csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
head(ad)
# A tibble: 6 x 5
X1 TV radio newspaper sales
<int> <dbl> <dbl> <dbl> <dbl>
1 1 230.1 37.8 69.2 22.1
2 2 44.5 39.3 45.1 10.4
3 3 17.2 45.9 69.3 9.3
4 4 151.5 41.3 58.5 18.5
5 5 180.8 10.8 58.4 12.9
6 6 8.7 48.9 75.0 7.2
回答5:
scan
can read from a web page automatically; you don't necessarily have to mess with connections.