What datasets exist out on the internet that I can run statistical analysis on?
The datasets package is included with base R. Run `library(help = "datasets")` to see a full list.
Beyond that, there are many packages that can pull data, and many others that contain important data. Of these, you may want to start by looking at the HistData package, which "provides a collection of small data sets that are interesting and important in the history of statistics and data visualization".
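A minimal sketch of browsing these collections from the console. It assumes HistData may not be installed and guards for that; the variable names `hist_sets` and `base_sets` are only illustrative.

```r
# Assumption: HistData may not be installed, so check before querying it.
if (requireNamespace("HistData", quietly = TRUE)) {
  # data(package = ...) returns an object whose $results matrix has
  # the columns Package, LibPath, Item and Title.
  hist_sets <- data(package = "HistData")$results[, c("Item", "Title")]
  print(head(hist_sets))
} else {
  message("HistData is not installed; install.packages(\"HistData\") fetches it from CRAN.")
}

# The datasets package ships with base R, so this part needs no installation.
base_sets <- data(package = "datasets")$results[, c("Item", "Title")]
print(head(base_sets))
```

Each row pairs a dataset name with its one-line title, which makes it easy to skim for something interesting.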
For financial data, the quantmod package provides a common interface for pulling time series data from Google, Yahoo, FRED, and others. FRED (Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis) is really a goldmine of free economic data.
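As a rough sketch, assuming the package meant here is quantmod and that it is installed, a small wrapper around `getSymbols` could look like the following. The helper name `fetch_series` is hypothetical, and the example symbols are only illustrations; data sources change over time, so a given source may no longer respond.

```r
# Hypothetical helper around quantmod::getSymbols.
# Assumption: quantmod is installed and a network connection is available.
fetch_series <- function(symbol, src) {
  if (!requireNamespace("quantmod", quietly = TRUE)) {
    stop("quantmod is not installed")
  }
  # auto.assign = FALSE returns the series instead of
  # assigning it into the calling environment.
  quantmod::getSymbols(symbol, src = src, auto.assign = FALSE)
}

# Illustrative calls, commented out because they need network access:
# sp500  <- fetch_series("^GSPC", "yahoo")  # S&P 500 index from Yahoo Finance
# unrate <- fetch_series("UNRATE", "FRED")  # US unemployment rate from FRED
```

The same function signature serves every source, which is the "common interface" point: only the `src` argument changes.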
Many R packages come bundled with data that is specific to their goal. So if you're interested in genetics, multilevel models, etc., the relevant packages will frequently have the canonical example for that analysis. Also, the book packages typically ship with the data needed to reproduce all the examples.
Here are some examples of relevant packages:
Another good site is UN Data.
There is a broad selection on the Web. For instance, here's a massive directory of sports databases (all providing the data free of charge, at least in my experience). In that directory is databaseBaseball.com, which contains, among other things, complete datasets for every player who has played professional baseball since about 1915.
StatLib is another excellent resource, and beautifully convenient. This single web page lists 4-5 line summaries of over a hundred databases, all of which are available in flat-file form just by clicking the 'Table' link at the beginning of each data set summary.
The base distribution of R comes pre-packaged with a large and varied collection of datasets (122 in R 2.10). To get a list of them (as well as a one-line description), call `data(package = "datasets")`.
Likewise, most packages come with several data sets (sometimes a lot more). You can see those the same way, by passing that package's name as the `package` argument.
These data sets are the ones mentioned in the package manuals and vignettes for a given package, and used to illustrate the package features.
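To get a feel for which installed packages carry the most data, the listing can be run across every package in the library at once. This is a sketch under the assumption that a default R installation is present; the names `all_sets` and `counts` are illustrative, and the counts will differ from machine to machine.

```r
# List datasets across every installed package.
# suppressWarnings() hides notes about packages that ship no data.
all_sets <- suppressWarnings(
  data(package = .packages(all.available = TRUE))
)$results

# Count datasets per package, largest collections first.
counts <- sort(table(all_sets[, "Package"]), decreasing = TRUE)
print(head(counts, 10))

# Load one dataset by name and inspect its structure.
data("ToothGrowth", package = "datasets")
str(ToothGrowth)
```

Once a dataset name looks promising, `?ToothGrowth` (or the equivalent help page for any other dataset) gives its full description and source.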
A few R packages with a lot of datasets (which again are easy to scan so you can choose what's interesting to you): AER, DAAG, and vcd.
Another thing I find so impressive about R is its I/O. Suppose you want to get some very specific financial data via the Yahoo Finance API, say the opening and closing price of the S&P 500 for every month from 2001 to 2009: a single line of R is enough.
In this one line of code, R has fetched the tick data, shaped it into a data frame, and bound it to 'tick_data'. (Here's a handy cheat sheet with the Yahoo Finance API symbols used to build the URLs as above.)
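A minimal sketch of the general pattern, assuming the data is served as CSV: `read.csv` accepts a URL as well as a file path, so fetching, parsing, and binding can happen in one call. The helper name `read_remote_csv` is hypothetical, and the sample rows are made-up placeholders written to a temporary file so the example runs offline.

```r
# Hypothetical helper: read a CSV from a file path or a URL into a data frame.
read_remote_csv <- function(location) {
  read.csv(location, stringsAsFactors = FALSE)
}

# Offline demonstration with placeholder values (not real prices).
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Date,Open,Close",
    "2001-01-01,100.0,101.5",
    "2001-02-01,101.5,99.0"),
  tmp
)

# One call reads the file, shapes it into a data frame, and binds it to a name.
tick_data <- read_remote_csv(tmp)
str(tick_data)
```

Passing a URL string in place of `tmp` uses the same call; whether a given remote endpoint still serves CSV is something to check before relying on it.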
Recently set up by Tim Berners-Lee.
Obviously UK-based data, but that shouldn't matter. It covers everything from abandoned cars to school absenteeism to agricultural price indexes.
See the data competition set up by Hadley Wickham for the Data Expo of the ASA Statistical Computing and Statistical Graphics section. The competition is over, but the data is still there.
Another collection of datasets.