Large public datasets? [closed]

I am looking for some large public datasets, in particular:

Large sample web server logs that have been anonymized.
Datasets used for database performance benchmarking.

Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/

Answer 1:

1. Large sample web server logs that have been anonymized.

These work to start with:

UCI Machine Learning Repository
- Anonymous Microsoft Web Data
- MSNBC.com Anonymous Web Data
- Syskill and Webert Web Page Ratings

There are many, many more data sets available than these (see the gamut of other answers), but this is the lowest hanging fruit that meets your original criteria. As a bonus, they have a contact link if you have specific needs they may know of.

2. Datasets used for database performance benchmarking.

This sounds like a misnomer, because you're asking for empirical data sets that describe well-defined algorithmic problems. Specifically, it sounds like you're trying to find sets of data that you can use to test and benchmark various database systems in real time, using well-defined, normalized relational data that can be used as a set of test cases for determining the most efficient solution that meets your needs.

I don't agree with this approach. Instead of collecting a long list of database systems and their canned implementations, it's far better to explore the algorithmic guarantees of these systems as your first port of call. Once you've determined the algorithmic constraints that meet your needs, you can home in on a set of canned solutions that you can benchmark on the efficiency of, for example, indexing, sorting, searching, insertion, deletion, and retrieval.

Wikipedia provides a terse article on database testing concepts that you can use to determine and write test cases for benchmarking performance. For example, you might use an agnostic data access interface like JDBC and JDBC Benchmark to determine the relative timings of each operation. From here, you can home in on a correct solution.

In short, go to the research first for determining database guarantees. Once a set of candidate solutions has been identified, you can select amongst those by testing (or otherwise determining) the constant time performance of each desired operation.
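
As a rough illustration of timing each operation, here is a minimal sketch in Python. It swaps in the standard-library sqlite3 module with an in-memory database instead of JDBC, and the table name, column names, and row count are made up for the example. Treat it as a starting point for your own test cases, not as a rigorous benchmark.

```python
import sqlite3
import time


def time_operation(label, func):
    """Run func once and return (label, elapsed seconds)."""
    start = time.perf_counter()
    func()
    return label, time.perf_counter() - start


def run_benchmark(row_count=10_000):
    # In-memory SQLite stands in for whichever candidate database is under test.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
    # Hypothetical test data: score cycles through 0..99.
    rows = [(i, f"item-{i}", i % 100) for i in range(row_count)]

    def insert():
        conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
        conn.commit()

    def create_index():
        conn.execute("CREATE INDEX idx_score ON items (score)")

    def search():
        conn.execute("SELECT COUNT(*) FROM items WHERE score = 42").fetchone()

    def sort():
        conn.execute("SELECT id FROM items ORDER BY name").fetchall()

    def delete():
        conn.execute("DELETE FROM items WHERE score < 10")
        conn.commit()

    results = [
        time_operation("insert", insert),
        time_operation("index", create_index),
        time_operation("search", search),
        time_operation("sort", sort),
        time_operation("delete", delete),
    ]
    remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return results, remaining


if __name__ == "__main__":
    timings, remaining = run_benchmark()
    for label, seconds in timings:
        print(f"{label:>8}: {seconds * 1000:.2f} ms")
    print(f"rows remaining: {remaining}")
```

A single run per operation is noisy, so in practice you would repeat each measurement several times and run the same script against every candidate system with identical data.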

Answer 2:

Based on Quora answers and datasets I collected during my studies, the awesome-public-datasets repository was created and is actively updated on GitHub.

Below is a snapshot of this list. For the latest version, please visit GitHub.

This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.

Climate

Australian Weather: http://www.bom.gov.au/climate/dwo/
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
Global climate data since 1929: http://www.tutiempo.net/en/Climate
NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
NOAA climate datasets: http://ncdc.noaa.gov/data-access/quick-links
WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html

Economics

American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
Internet Product Code Database: http://www.upcdatabase.com/
World bank: http://data.worldbank.org/indicator

Finance

CBOE Futures Exchange: http://cfe.cboe.com/Data/
Google Finance: https://www.google.com/finance
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/
OSU Financial data: http://fisher.osu.edu/fin/osudata.htm
Quandl: http://www.quandl.com/
St Louis Federal: http://research.stlouisfed.org/fred2/
Yahoo Finance: http://finance.yahoo.com/

Biology

CRCNS: http://crcns.org/data-sets
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu/
UniGene: http://www.ncbi.nlm.nih.gov/unigene

Physics

NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html

Healthcare

EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
Gapminder: http://www.gapminder.org/data/
Medicare Data File: http://go.cms.gov/19xxPN4

GeoSpace

EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
Factual Global Location Data: http://www.factual.com/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/

Transportation

Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229
Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
OpenFlights (airport, airline and route data): http://openflights.org/data.html
RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp
Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm

Government

Archive-it: https://www.archive-it.org/explore?show=Collections
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
Chicago: https://data.cityofchicago.org/
FDA: https://open.fda.gov/index.html
Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian world governments: http://www.guardian.co.uk/world-government-data
HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
London Datastore, U.K: http://data.london.gov.uk/dataset
New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
NYC betanyc: http://betanyc.us/
NYC Open Data: http://nycplatform.socrata.com/
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
San Francisco Data sets: http://datasf.org/
The World Bank: http://wdronline.worldbank.org/
U.K. Government Data: http://data.gov.uk/data
U.S. Census Bureau: http://www.census.gov/data.html
U.S. Federal Government Agencies: http://www.data.gov/metric
U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
U.S. Open Government: http://www.data.gov/open-gov/
UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
United Nations: http://data.un.org/
US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm

Data Challenges

Challenges in Machine Learning: http://www.chalearn.org/
ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
Kaggle Competition Data: http://www.kaggle.com/
KDD Cup by Tencent 2012: https://www.kddcup2012.org/
Netflix Prize: http://www.netflixprize.com/leaderboard
Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge

Machine Learning

eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
IMDb database: http://www.imdb.com/interfaces
Keel Repository: http://sci2s.ugr.es/keel/datasets.php
Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
Machine Learning Data Set Repository: http://mldata.org/
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Data Sets: http://datahub.io/dataset/movielens
RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data
Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Natural Language

40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
Hansards: http://www.isi.edu/natural-language/download/hansard/
Machine Translation: http://statmt.org/wmt11/translation-task.html#download
SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
WordNet: http://wordnet.princeton.edu/wordnet/download/

Image Processing

2GB of photos of cats: http://bit.do/UJZZ
Face Recognition Benchmark: http://www.face-rec.org/databases/
ImageNet: http://www.image-net.org/

Time Series

Time Series data Library: https://datamarket.com/data/list/?q=provider:tsdl
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/

Social Sciences

China Hotel Checkin/out data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml
CMU Enron Email: http://www.cs.cmu.edu/~enron/
Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html
Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
General Social Survey (GSS): http://www3.norc.org/GSS+Website/
GetGlue (users rating TV shows): http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz
GitHub Archive: http://www.githubarchive.org/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
Titanic Survival Data Set: https://github.com/caesar0301/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip
Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
Universities Worldwide: http://univ.cc/
UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
Youtube Graph (2007,2008): http://netsg.cs.sfu.ca/youtubedata/

Complex Networks

CrossRef DOI URLs: https://archive.org/details/doi-urls
DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP
NBER Patent Citations: http://nber.org/patents/
NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html
Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
Scopus Citation Database: http://www.elsevier.com/online-tools/scopus
Stanford GraphBase (Steven Skiena): http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml
Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
The Koblenz Network Collection: http://konect.uni-koblenz.de/
UCI Network Data Repository: http://networkdata.ics.uci.edu/resources.php
UFL sparse matrix collection: http://www.cise.ufl.edu/research/sparse/matrices/
UNIMI Large Web Graph: http://law.di.unimi.it/datasets.php
WSU Graph Database: http://www.eecs.wsu.edu/mgd/gdb.html

Computer Networks

3.5B Web Pages: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
53.5B Web clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
CAIDA Internet Datasets: http://www.caida.org/data/overview/
ClueWeb09: http://lemurproject.org/clueweb09/
ClueWeb12: http://lemurproject.org/clueweb12/
CommonCrawl Web Data: http://commoncrawl.org/the-data/get-started/
Dartmouth CRAWDAD Wireless datasets: http://crawdad.cs.dartmouth.edu/
OpenMobileData (MobiPerf): https://console.developers.google.com/storage/openmobiledata_public/
UCSD Network Telescope: http://www.caida.org/projects/network_telescope/

Data Search Engines

Academic Torrents: http://academictorrents.com/
Datahub.io: http://datahub.io/dataset
DataMarket: https://datamarket.com/data/list/?q=all
Harvard Dataverse: http://thedata.harvard.edu/dvn/
Statista: http://www.statista.com/
Freebase: http://www.freebase.com/

Public Domains

Amazon: http://aws.amazon.com/datasets
Archive.org Datasets: https://archive.org/details/datasets
CMU JASA data archive: http://lib.stat.cmu.edu/jasadata/
CMU StatLab collections: http://lib.stat.cmu.edu/datasets/
Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
KDNuggets Data Collections: http://www.kdnuggets.com/datasets/index.html
Numbrary: http://numbrary.com/
RevolutionAnalytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
Stats4Stem R data sets: http://www.stats4stem.org/data-sets.html
StatSci.org: http://www.statsci.org/datasets.html
The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
UCLA SOCR data collection: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UFO Reports: http://www.nuforc.org/webreports.html
Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php

Complementary Collections

DataWrangling: http://www.datawrangling.com/some-datasets-available-on-the-web
Inside-r: http://www.inside-r.org/howto/finding-data-internet
Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
RS Collection 100+ : http://rs.io/2014/05/29/list-of-data-sets.html
StaTrek: http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/

Answer 3:

Here are several. Have fun.

http://archive.ics.uci.edu/ml/

http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1

http://crawdad.org/

http://data.austintexas.gov

http://data.cityofchicago.org

http://data.govloop.com

http://data.gov.uk/

http://data.medicare.gov

http://data.seattle.gov

http://data.sfgov.org

http://data.sunlightlabs.com

https://datamarket.azure.com/

http://ftp.ncbi.nih.gov/

http://gettingpastgo.socrata.com

http://books.google.com/ngrams/

http://linkeddata.org/

http://medihal.archives-ouvertes.fr

http://public.resource.org/

http://rechercheisidore.fr

http://reddit.com/r/datasets

http://timetric.com/public-data/

http://www2.jpl.nasa.gov/srtm

http://www.bls.gov/

http://www.crunchbase.com/

http://www.dartmouthatlas.org/

http://www.data.gov/

http://www.datakc.org

http://www.factual.com/

http://www.freebase.com/

http://www.infochimps.com

http://www.kaggle.com/

http://build.kiva.org/

http://www.imdb.com/interfaces

http://dbpedia.org

Answer 4:

Just a thought:

USGS Geographic Names database
USDA PLANTS checklist
Any one of the many state GIS repositories e.g. NH's GRANIT

Answer 5:

Well, for the web server logs you could always just generate them in the format you need. If you are going to test code against them, they will have to be tailored to the fields you want to store and parse anyway.
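
To give an idea of what generating the logs yourself might look like, here is a small Python sketch that emits synthetic lines in the Common Log Format. The paths, status codes, and the private 10.x.x.x address range are arbitrary choices for the example, so adjust the fields to whatever your parser expects.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical request mix; replace with values that resemble your own traffic.
PATHS = ["/", "/index.html", "/about", "/api/items", "/static/app.js"]
METHODS = ["GET", "GET", "GET", "POST"]  # GET appears more often than POST
STATUSES = [200, 200, 200, 301, 404, 500]


def fake_log_lines(count, seed=0, start=None):
    """Yield `count` synthetic access-log lines in Common Log Format."""
    rng = random.Random(seed)  # seeded so repeated runs give the same lines
    current = start or datetime(2024, 1, 1, tzinfo=timezone.utc)
    for _ in range(count):
        # 10.x.x.x is a private range, so no real client address is implied.
        ip = f"10.{rng.randint(0, 255)}.{rng.randint(0, 255)}.{rng.randint(1, 254)}"
        current += timedelta(seconds=rng.randint(0, 5))
        timestamp = current.strftime("%d/%b/%Y:%H:%M:%S %z")
        request = f"{rng.choice(METHODS)} {rng.choice(PATHS)} HTTP/1.1"
        status = rng.choice(STATUSES)
        size = rng.randint(200, 50_000)
        yield f'{ip} - - [{timestamp}] "{request}" {status} {size}'


if __name__ == "__main__":
    for line in fake_log_lines(5):
        print(line)
```

Because the generator is seeded, the same seed reproduces the same log, which helps when comparing parser versions against identical input.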

For the datasets used for database performance benchmarking, you'll probably want to look at a tool that can generate data for you. Red Gate has a great one for not too much money.
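
If you would rather script it yourself, a data generator can be quite small. The following Python sketch writes made-up "orders" rows as CSV that most databases can bulk-load. The column names, value ranges, and status values are assumptions for illustration only and are unrelated to any commercial tool.

```python
import csv
import io
import random


def write_fake_orders(out, row_count, seed=0):
    """Write `row_count` synthetic order rows as CSV to the file-like `out`."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    writer = csv.writer(out)
    writer.writerow(["order_id", "customer_id", "amount_cents", "status"])
    for order_id in range(1, row_count + 1):
        writer.writerow([
            order_id,
            rng.randint(1, max(1, row_count // 10)),  # roughly 10 orders per customer
            rng.randint(100, 100_000),
            rng.choice(["new", "paid", "shipped", "refunded"]),
        ])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_fake_orders(buffer, 5)
    print(buffer.getvalue())
```

Passing an open file instead of the StringIO buffer lets you scale row_count up to whatever size your benchmark needs.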

Answer 6:

Google Fusion Tables has a few.

http://tables.googlelabs.com/

Answer 7:

Datasets available here as well.

Answer 8:

Kaggle.com frequently has data mining challenges. The datasets cover a wide range of fields, from healthcare provider data to credit history information. Perhaps something there is what you're after.

Answer 9:

http://Quandl.com has over 10 million data sets gleaned from all over the internet. The great thing about this resource is that it gives a single way to access all of the data. The site has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.

Answer 10:

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

Answer 11:

I am surprised no one mentioned Google N-Grams. More on N-Grams at http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Answer 12:

Perhaps some databases used as training sets for face recognition algorithms: face-rec.org

Answer 13:

Well, this one is new and there is a challenge behind it:

Million song dataset challenge