Data Sources
Meta - Data sources about data sources
- Data is Plural Spreadsheet with all data organized
- Replies to Allen Downey's question about data sources
- The Ultimate Dataset Aggregator for Machine Learning
- List of datasets for machine-learning research
Web Data
- Web tables from Common Crawl data
- Domain names
- WDC Products: A Multi-Dimensional Entity Matching Benchmark - Paper
- WDC Product Data Corpus - V.2020
Entity Matching datasets
Location data
- Who's On First: CC BY gazetteer of places categorized by type
- Global Research Identifier Database: locations of research organizations around the world
- GeoNames: all world locations with latitude/longitude, name variants, and other details
Politics
Misc
- List of politician names and properties in Wikidata
- UK MPs on Twitter
- Women in Politics
- Danish Politicians on Twitter
- Global Party Survey
- Manifesto of political parties
- Global Parliamentary Report
- Data on world elections and political parties
- Global Party Facts
- World Election Calendar
- US Congress members info (includes images)
- The Database of Political Institutions over time
- World Political Cleavages and Inequality Database
- Constitution project: https://www.constituteproject.org/content/data?lang=en
- Comparative Constitutions Project (Changes in constitutions) Resources related to constitutions
- Voter Turnout dataset from International IDEA (Supporting Democracy Worldwide)
US
- Summaries of US bills and other congress datasets
- FB political ads dataset crowdsourced by Propublica
- Political datasets
- All congress bills (updated daily)
- Politician's social media accounts
- Federal Election Commission data
India
- Funding resources: https://myneta.info/party/
- Association for democratic reforms: https://adrindia.org/about-adr/who-we-are
Social economic data
Geolocation to census and other geolocation data
Datasets with gender, personality (preferably emotional stability), and an experimental manipulation whose effect plausibly varies by gender - Source tweet
Religion, Ethnicity, Demographics
- Religious history database
- Ethnic Power Relations (EPR) Dataset Family 2021
- Spatially Interpolated Data on Ethnicity - SIDE
- Geo-referencing of Ethnic Groups
World Events
- Historical Phoenix Event Data covers the period 1945-2019 and includes several million events extracted from 18.9 million news stories.
- Global Atrocity Prevention - Datasets https://www.topcoder.com/community/data-science/datasets/atrocity-prevention
- Violence Early-Warning System Data for each PRIO-GRID map cell across time
- GROWup - Geographical Research On War, Unified Platform
- The Armed Conflict Location & Event Data Project
- Uppsala Conflict Data Program (UCDP)
- Militarized Interstate Disputes (v4.3)
- Correlates of War Project
- Correlates of War External Links (Related datasets)
- CShapes 2.0 maps the borders and capitals of independent states and dependent territories from 1886 to 2019
- Behavioral Correlates of War, 1816-1979 (ICPSR 8606)
Personality prediction data
Economics data
Misc
- Catalogue For Predictive Models In The Humanitarian Sector
- Medical Data for Machine Learning
- Public data sets for testing and prototyping
- ML datasets Wikipedia page
- List of Datasets for Data Science & Machine Learning Projects
Name datasets
- Manually curated name gender list from multiple countries - Originally introduced here and here - Python package
- Large name dataset python program
- Name embeddings
- NYC baby names by gender and mother's ethnicity - Also available here
- Baby Names map (100,000 unique names from 14 different countries)
- Name master file latest
- Frequent first and last names in US census 1990
- US first and last names list by gender and race
- First name and last name data for Dutch, English, Portuguese, Italian, and Spanish
- Names from different ethnicities used in bias evaluation
- Gender by language wikidata
- US congress members name and gender
- Name pronunciation of US congress members
- Name gender race data from DBpedia SPARQL
- Name gender race data from Wikidata SPARQL
- List of Caucasian actors
- List of Caucasian actresses
- List of African-American actors
- List of African-American actresses
- Wiki List of African-American actors
- Wiki List of American actresses
- Artists in MoMA collection and their names/nationality
- Swedish names by gender/year with at least 10 individuals
- Given names by culture and language from Wikipedia
- Names of all politicians
- IMDb artists labeled as actors or actresses
- List of Olympic athlete names mapped with ethnicity and country
- 140k name/ethnicity associations
- First names, gender, and year in Indian electoral roll data
- Indian names with ethnicity, religion, gender extracted from SimplyMarry and CBSE data
- Name Nationalities from Wikipedia Categories and classifier to predict these
- WikiTree Data Dump (24M genealogies)
- List of names and surnames for Dutch, English, Portuguese and Spanish
- Chinese Names
- Details and relations between names: https://www.name-doctor.com/name-volodenka-meaning-of-volodenka-55917.html
- NameGrapher - Explore the historical popularity of United States baby names
- Baby Name Atlas: The Most Popular Names Around the World
- French Baby Names
- Russian names with gender
- Family History Resources from Forebears.io
- Name-Based Gender Classification (36 distinct sources—spanning over 150 countries and more than a century) - Github - Software
- EthniColr: Predict Race and Ethnicity Based on the Sequence of Characters in a Name
- World Gender Name Dictionary
- Genni + Ethnea for the Author-ity 2009 dataset
- demographicx: A Python package for estimating gender and ethnicity using deep learning transformers
- List of first names, genders and country-specific frequencies
- Validated Names for Experimental Studies on Race and Ethnicity
Images
- MoMA collection of images with name/gender/nationality
- Devnagri MNIST and http://archive.ics.uci.edu/ml/machine-learning-databases/00389/ - Devnagri MNIST large on Kaggle
- Families In the Wild: A Kinship Recognition Benchmark
- The Sixth Workshop on Fine-Grained Visual Categorization
- SALICON Saliency Prediction Challenge (LSUN 2017)
- MIT/Tübingen Saliency Benchmark datasets
- Visual story telling dataset
- UTKFace Large Scale Face Dataset - https://susanqq.github.io/UTKFace/
- Flickr Image Cropping Dataset
- Bob Ross Paintings with title and color palette names
- OCR Datasets: https://cneud.github.io/ocr-gt/
- LVIS: A Dataset for Large Vocabulary Instance Segmentation - a new dataset for long-tail object recognition
- YFCC100M 99M Image, Caption data
- LILA BC: Labeled Information Library of Alexandria: Biology and Conservation
- Ecoset – an ecologically more valid large-scale image dataset
- Natural Scenes Dataset (NSD): a large-scale fMRI dataset acquired at ultra-high field
Stream processing data
- Data from The Neural Hawkes Process - Google drive link
- Data from The Neural Hawkes Particle Smoothing - Google drive link
Controversial topics
Crowdtruth NLP datasets
- http://crowdtruth.org/data/
- Datasets on relation extraction, frame extraction, stories, NER, and novelty detection in Tweets
Humor data
Audio
- Acoustic and meta features of albums and songs on the Billboard 200
- Conversation: A Naturalistic Dataset of Online Recordings (CANDOR) corpus
- LAION Audio Dataset Project
Driving data
Sports
Football
Cricket stats
- Cricsheet - Also in XML format
- Python API for cricinfo JSON data - Example CricInfo JSON which includes commentary.
NFL
Food
- Portal:Food - Wikipedia
- Cookbook:Chiles - Wikibooks, open books for an open world
- Category:Ingredients - Wikibooks, open books for an open world
- Category:Recipes - Wikibooks, open books for an open world
- Recipe - Schema.org Type
- GitHub - cosylabiiit/recipe-knowledge-mining - NER for recipe - [2004.12184] A Named Entity Based Approach to Model Recipes
- RecipeDB: a resource for exploring recipes - NER Dataset from RecipeDB - recipedb - A resource for exploring recipes
- cosylabiiit/Recipedb-companion-data
- GitHub - cosylabiiit/recipe-knowledge-mining
- Training Recipe Ingredient NER with Transformers
- https://tasty.co/
- Recipes from Tasty | Kaggle
- RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation - ACL Anthology
- recipe_nlg - Datasets at Hugging Face
- SHARE: a System for Hierarchical Assistive Recipe Editing
- lishuyang/recipepairs - Datasets at Hugging Face
- Open Recipes: https://huggingface.co/datasets/napsternxg/openrecipes-20170107-061401-recipeitems
- Wikibook - Cookbook:Table_of_Contents
- Wikipedia Commons - Category:Nutrition
- Wikipedia Commons - Category:Food_and_drink
- Wikipedia Commons - Category:Beverages
- Wikipedia Commons - Category:Food
- 🪐 spaCy Project: Analyzing how mentions of ingredients change over time (Named Entity Recognition)
- TASTESet recipe NER model and dataset - TasteSet 2.0 - 1K annotated - Entities Linked to https://foodon.org/
- FINER: Food Ingredient NER Dataset - Paper: SMPT: A Semi-Supervised Multi-Model Prediction Technique for Food Ingredient Named Entity Recognition (FINER) Dataset Construction
- NYTimes - CRF Ingredient Phrase Tagger
- CulinaryDB - Data Analytics for World Cuisines
- FoodData Central - Datasheet
- Food and Nutrient Database for Dietary Studies - Food ingredients
- LexMapr - A Lexicon and Rule-Based Tool for Translating Short Biomedical Specimen Descriptions into Semantic Web Ontology Terms
- FoodBase corpus: A new resource of annotated food entities