This page describes several places where you might look for interesting data to use in your SPIS projects.
Here is an overview, followed by more specific information about each one.
- Julian McAuley’s Amazon Data sets: http://jmcauley.ucsd.edu/data/amazon/
- The CORGIS datasets for Python: https://think.cs.vt.edu/corgis/python/index.html
  - CORGIS is a Collection of Really Great, Interesting, Situated Datasets, collected by Austin Cory Bart, a Ph.D. student in Computer Science Education at Virginia Tech, along with several other collaborators.
  - The datasets are updated periodically and cover many topics, including Art, Economics, Geography, History, Literature, Music, Politics, and Travel. There are over 40 datasets that come with Python code to access them.
  - Many of these datasets are large enough to be considered “big data”, but only if you set the `test=False` parameter. Read the documentation for each data set carefully. More info below.
- New York Times Data Journalism: http://www.nytimes.com/section/upshot
- Nate Silver’s Election analysis and more: http://fivethirtyeight.com/
Working with Reddit Data
- Reddit Data Visualization: https://www.reddit.com/r/dataisbeautiful/
- Articles from SPIS 2016 website that relate to getting Reddit Data:
CORGIS datasets for Python
When working with the CORGIS datasets for Python, be sure to read the part about the `test` parameter.
For many of these datasets, you only get a small sample of the data when you use this code:
    import cars
    list_of_car = cars.get_cars()
    list_of_car = cars.get_cars_by_year("2001")
    list_of_car = cars.get_cars_by_make("'Pontiac'")
If you instead set the `test` parameter to `False`, you get a much larger data set that could be considered “big data”:
    import cars
    # These may be slow!
    list_of_car = cars.get_cars(test=False)
    list_of_car = cars.get_cars_by_year("2001", test=False)
    list_of_car = cars.get_cars_by_make("'Pontiac'", test=False)
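Since the full data can be slow to load, one way to work is to develop against the small sample and switch to the full data only when your code is ready. Here is a minimal sketch of that idea. The helper `load_and_report` is a hypothetical name, not part of CORGIS; it assumes only that the loader you pass in accepts a `test` keyword, as `cars.get_cars` does above, and that `test=True` matches the default small-sample behavior.

```python
import time


def load_and_report(loader, *args, full=False):
    """Call a CORGIS-style loader and report how many records came back.

    `loader` is any function that accepts a `test` keyword argument
    (for example cars.get_cars). This helper is hypothetical and is
    not part of the CORGIS library.
    """
    start = time.perf_counter()
    # full=False asks for the small sample; full=True passes test=False.
    records = loader(*args, test=not full)
    elapsed = time.perf_counter() - start
    print(f"{len(records)} records in {elapsed:.2f} s (full={full})")
    return records


# Usage, assuming the cars module from CORGIS is available to your script:
#
#   import cars
#   sample = load_and_report(cars.get_cars)                    # quick sample
#   by_year = load_and_report(cars.get_cars_by_year, "2001")   # sample, filtered
#   everything = load_and_report(cars.get_cars, full=True)     # may be slow
```

Printing the record count and the load time each time makes it easy to see whether you are working with the sample or the full data set.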