A repository of publicly available datasets for testing & demonstration purposes.
airbnb
- New York City Airbnb Open Data
- src: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
- single-table: 49k listings, incl geo info
airlines
- 2015 Flight Delays and Cancellations
- src: https://www.kaggle.com/datasets/usdot/flight-delays
- single-table: 100k flights w/ 20 cols, incl geo info
arxiv
- 2007-2014 arXiv papers containing the words "synthetic" and "data"
- src: https://www.kaggle.com/datasets/Cornell-University/arxiv
- single-table: 23k records with 1 date, 1 categorical and 3 text fields
bank_marketing
- Direct marketing campaigns for a Portuguese bank
- src: https://archive.ics.uci.edu/ml/datasets/bank+marketing
- single-table: 45k records w/ 17 cols
baseball
- subset of the Lahman Baseball Stats dataset
- src: https://github.com/cdalzell/Lahman
- 3-table: 20k players w/ 110k batting seasons and 140k fielding seasons
berka
- 1999 Czech Bank Dataset
- src: https://data.world/lpetrocelli/czech-financial-dataset-real-anonymized-transactions
- 8-table: 5k customers w/ 1m transactions
cdnow
- CDNOW dataset
- src: https://www.brucehardie.com/datasets/
- 2-table: 24k users w/ 70k transactions
census
- 1994 US Census dataset, aka the "adult" dataset
- src: https://archive.ics.uci.edu/ml/datasets/census+income
- single-table: 49k subjects w/ 14 cols
creditcard_default
- Credit Card Default
- src: https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset
- single-table: 30k records w/ 24 cols
creditcard_fraud
- Credit Card Fraud Detection
- src: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- single-table: 284k records w/ 31 cols, incl highly-imbalanced target
fannie_mae
- Fannie Mae mortgage data used for an NVIDIA blog post
- src: https://docs.rapids.ai/datasets/mortgage-data
- 2-table: millions of loans and billions (!) of records
firstnames_at
- Records on given baby names in Austria from 2010-2016
- src: https://www.data.gv.at/katalog/dataset/603066f6-0f0a-3806-b394-f14b7d2cb437
- single-table: 534k records, incl free text column
gleif
- GLEIF records of business entities, and their relations to each other
- src: https://www.gleif.org/
- two-table: 2.2m organizations w/ 370k relations; self-referential
grocery
- 4y purchase records for an online grocery retailer
- 3-table: 40k customers w/ 303k orders w/ 4.3m items
headlines
- 201k news headlines with category and date
- src: https://www.kaggle.com/code/imdevskp/news-category-classification/notebook
housing_at
- 30y of geo-encoded property transactions in Austria
- src: self-scraped from https://www.data.gv.at/
- single-table: 80k transactions w/ 17 cols, incl geo data
instacart
- Instacart Market Basket Analysis
- src: https://www.kaggle.com/c/instacart-market-basket-analysis
- 3-table: 200k users w/ 3.2m orders w/ 32m products
marathon_at
- Marathon split times for Vienna City Marathon
- src: self-scraped from http://www.vienna-marathon.com/
- single-table: 5k runners w/ 14 cols
medical
- Medical Abstracts
- src: https://github.com/sebischair/Medical-Abstracts-TC-Corpus
- single-table: 2x 7k with 1 categorical and 1 text (up to 4000 chars)
netflix
- Netflix Prize data
- src: https://www.kaggle.com/netflix-inc/netflix-prize-data
- two-table: 470k users w/ 2.4m ratings
online_shoppers
- Online Shoppers Purchasing Intention Dataset
- src: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
- single-table: 12k records w/ 18 cols
physionet
- Mortality of ICU Patients based on cardiology records
- src: https://physionet.org/content/challenge-2012/1.0.0/
- two-table: 12k ICU patients w/ 37 cols across 48 hours
porto
- Porto Taxi dataset
- src: https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/
- two-table: 100k trips w/ 4.8m geo positions
- pull data from cloud bucket for full dataset of 1.7m trips
sacred
- Bible verses
- src: https://github.com/JohnCoene/sacred
- single-table: 31k verses w/ free text column (avg 120 chars, max 500 chars)
titanic
- Titanic Passenger Data, incl Survival
- src: https://www.kaggle.com/c/titanic/data
- single-table: 1309 records w/ 8 cols
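
Each entry above states an approximate record and column count. A minimal sketch of how such a summary could be checked for a single-table dataset, assuming the data is available as a CSV file with a header row. The file layout is not documented here, so the function name `summarize_csv` and the in-memory sample are purely illustrative:

```python
import csv
import io


def summarize_csv(handle):
    """Return (n_records, n_columns) for an open CSV file with a header row."""
    reader = csv.reader(handle)
    header = next(reader)  # first row holds the column names
    n_records = sum(1 for row in reader if row)  # skip blank lines
    return n_records, len(header)


# Tiny in-memory stand-in for a real file (assumption: real data would be
# opened with open(path, newline="") where `path` points at a local copy).
sample = io.StringIO("survived,pclass,age\n1,1,29\n0,3,22\n")
print(summarize_csv(sample))
```

Only the standard library is used, so the check runs without installing a dataframe package; for the larger multi-table datasets a streaming count like this also avoids loading everything into memory.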