I am looking for some large public datasets, in particular:
Large sample web server logs that have been anonymized.
Datasets used for database performance benchmarking.
Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/
Google Fusion Tables has a few.
http://tables.googlelabs.com/
http://Quandl.com has over 10 million data sets gleaned from all over the internet. The great thing about this resource is that it gives a single way to access all of the data. The site has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.
I am surprised no one mentioned Google N-Grams. More on N-Grams at http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Well, this one is new and there is a challenge behind it:
Million Song Dataset Challenge
Based on Quora answers and my own collection from my studies, an awesome-public-datasets repository was created on GitHub and is actively updated:
Below is a snapshot of the list. For the latest version, please visit GitHub:
This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.
Climate
Economics
Finance
Biology
Physics
Healthcare
GeoSpace
Transportation
Government
Data Challenges
Machine Learning
Natural Language
Image Processing
Time Series
Social Sciences
Complex Networks
Computer Networks
Data SEs
Public Domains
Complementary Collections
Datasets available here as well.