Natural Language Processing
- Courses and Books
- Scientific NLP
- Tools
- Pretrained embeddings and models
- Links
- Lexicons
- Data
- Shared tasks, competitions
- Research Groups
- Mailing lists
Courses and Books
- Statistical NLP Book
- Course on word embeddings, variants, and applications
- A Course in Machine Learning
- Introduction to Natural Language Processing by Jacob Eisenstein
- Information extraction
- 601.765 Machine Learning: Linguistic & Sequence Modeling
- Variational Inference for NLP
- Probabilistic NLP course
- Oxford Deep Learning NLP
- CMU NLP
- David Bamman's Course on Applied NLP - Course Page
- NLP Course | For You by Lena Voita
- Course material for Machine Translation from UIUC, JHU
- Jason Eisner
- Introduction to Cultural Analytics & Python
- Computational Sociolinguistics by David Jurgens
- Ethics in NLP
- Multilingual Natural Language Processing
- CSE 704 - Applied Natural Language Processing and Computational Social Science
- Information Retrieval: Implementing and Evaluating Search Engines By Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack
- Search Result Diversification
- Efficient Query Processing for Scalable Web Search
Scientific NLP
Tools
- Using spacy and flashtext for fast lookup of dictionary items in text. https://github.com/mpuig/spacy-lookup
- Open IE tool
- Different evaluation techniques for NER
- Phrase extraction using POS patterns
- What's wrong with my NLP in Python: diff visualizer for NLP tasks
- Inform - interactive fiction based on natural language
- Linguistic Knowledge and Transferability of Contextual Representations
- Generalized brown clusters
- COGCOMP NLP: Online demo with multiple tasks Code
- Python keyphrase extraction library
- Neural Relation Extraction
- List of tools for corpus analysis
- WordMapper, evolution of words on Twitter
- Document subject indexing and thesaurus linking (also works for Finnish)
- BPEmb: Pre-trained Subword Embeddings in 275 Languages (LREC 2018)
- Cornell Conversational Analysis Toolkit
- Python based text summarization evaluation
- Python based sequence tagging evaluation
- NLP annotation tool
- Yake Single document keyword extraction
- Open IE Tool Other OpenIE tools
- Derive named entities from Wikipedia
- Bengali NLP
- Py Readability metrics
- iNLTK
- Fast spell correction and word segmentation
- NeuSpell: A Neural Spelling Correction Toolkit
- JamSpell -- Contains error model to generate spell errors. How does JamSpell correction work?
- NERD evaluation
- MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS
- PyTerrier - A Python framework for performing information retrieval experiments, building on http://terrier.org/
- JamSpell - Spelling correction (works in Python)
- Efficient Low Memory Aligner
- DeepTranslit: Towards better transliteration for Indic languages
- Google Trends Anchor Bank
- The Tatoeba Translation Challenge (v2021-08-07)
- Named Entity Recognition for Entity Linking: What Works and What's Next
- Inception: semantic annotation tool
- Language rules in XML format from languagetool.org
- REBL is an extension of the Radboud Entity Linker (REL) for Batch Entity Linking
- Odinson: A Fast Rule-based Information Extraction Framework
- SDSL - Succinct Data Structure Library
- ReFinED: An Efficient Zero-shot-capable Approach to End-to-End Entity Linking
Pretrained embeddings and models
- ELMO for Many Languages
- Flair for many languages
- Pretrained transformer models for many languages and domains
- FastText embeddings
Links
- The Definition of Sekine’s Extended Named Entities
- Bayesian NLP
- Rules for forming questions using text
- Intel NLP library: lots of great features using TensorFlow
- Information extraction from unstructured text
- Tutorial on deep latent variable models for NLP with PyTorch code
- On the viability of crowdsourcing NLP annotations in healthcare
- Contextual Word Representations: A Contextual Introduction
- Emotion vocabularies
- Emotion markup language
- NAACL 2019 tutorial on Modeling Language Change
- NAACL 2019 tutorial on Transfer Learning in Natural Language Processing
- Papers on Textual Adversarial Attack and Defense
- Resources for NRE: Neural Relation Extraction
- Online self-learning platform on vocabulary
- Awesome sentiment analysis
- List of papers on Textual Adversarial Attack and Defense
- Tutorial on NLP Approaches to Computational Argumentation
- Tutorial on Argumentation Mining
- ACL2019 Tutorial: Advances in Argument Mining
- International Linguistic Olympiad - Covers a lot of low resource and multilingual problems
- Tutorial by Xavier Carreras on Structured Prediction
- Visual Sentiment Ontology & Dataset
- NLP Roadmap
- Pretrained LM model papers
- EmojiMap
- Literature review and datasets on Text-Summarization
- Causal Inference using text data
- Tutorial on topic models and their extensions with animations of the Chinese restaurant process and Pólya's urn
- Relation extraction resources
- Wordbank - An open database of children's vocabulary development
- Comparison of web (text) annotation editors
- North American Computational Linguistics Open Competition
- Frequency lists of words in all languages
- The NLP Pandect
- Unicode text segmentation
- Datasets for Text Analysis
- Multilingual stopwords
- Entity Related Papers
- Entity Linking Recent Trends
- Fine-grained evaluation of Entity Linking
- Advanced String Matching and Burrows-Wheeler Indexing YouTube
- Succinct Data Structures for NLP-at-Scale
- Space-Efficient Data Structures for Top-k Completion
- Beginner's Crash Course to Elastic Stack for text analysis - YouTube
Lexicons
- HistEmo lexicon for valence, arousal and dominance scores of words across time
- Sentiment lexicon
- Polarity shifter lexicons
- Bootstrapped polarity shifter lexicons
- Lexicon of abusive words
- Finnish General Ontology YSO, several domain-specific vocabularies, and the KOKO ontology
- Entitypedia is an Extended Named Entity Dictionary from Wikipedia
- Extended Open Multilingual Wordnet
- Open Multilingual Wordnet
- Open License English Wordnet
- HurtLex: a lexicon of offensive, aggressive, and hateful words in over 50 languages
- The Evaluative Lexicon - emotionality, valence, and extremity
- CLICS: Database of Cross-Linguistic Colexifications
- FreeDict: Multilingual and Single Language Dictionaries
- Slob Dictionaries: Language dictionaries
- Hindi Language Stop Words List
- XMLittré: XML version of the French Dictionary Littré (1873–1877) Raw Text
- GNU Collaborative International Dictionary of English
- GNU Dico: GNU Dictionary Server which loads dictionaries from various formats
- Moby Thesaurus (See words.txt)
- OpenOffice MyThes thesaurus
- German POS dictionary
- CMU Dict (word pronunciation) JS format
- Gemoji: emoji descriptions and tags
- Emoji-emotion: emotion assigned to emojis
- Map of profane words to how likely each is to be used as either profanity or clean text (multilingual)
- The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene
- Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1 - Github
- Multilingual comparable corpora of parliamentary debates ParlaMint 2.1
- labMT lexicon with happiness scores and definitions
- Easy-to-use word translations for 3,564 language pairs across 62 unique languages
- FrequencyWords: Frequency Word List Generator
Data
- Multi-LexSum
- List of English Medical Terms
- HUNER: improving biomedical NER with pretraining
- WebIs Datasets
- WEXEA: an exhaustive Wikipedia entity annotation system
- Toloka Visual Question Answering Challenge
- KPWR-NER: Polish Fine-grained NER
- Universal NER Datasets
- Survey on English Entity Linking on Wikidata Zenodo
- Sentiment corpora
- Relation Extraction Corpora
- Summarization corpora
- Text normalization corpora
- Sentihood data - Aspect based sentiment for neighborhood text
- Pretrained emoji embeddings and Twitter sentiment data
- Fake News Corpus
- EmoBank - Sentiment from the perspective of the reader's vs. the writer's emotion
- Lot of argumentation mining corpora
- NER datasets
- More NER datasets
- Moral foundations Twitter corpus
- Keyphrase extraction https://github.com/LIAAD/KeywordExtractor-Datasets#theses
- Multi-target-specific sentiment recognition on Twitter
- Multilingual (700 languages) speech and text
- Large Scale text classification
- Multiple text classification data
- Data for lifelong ML for sentiment classification
- Aspect extraction data for Amazon reviews for 36 domains (1000 per domain)
- Perspectrum: dataset of claims and supporting and opposing sentences and paragraphs
- Multiple datasets curated by CogComp group and https://cogcomp.org/page/data/
- 204,135 articles from 18 American publications. Includes date, title, publication, article text, publication name, year, month, and URL (2013-2018)- https://components.one/datasets/all-the-news-articles-dataset/
- Legal Entity Detection
- New Yorker Cartoon Captions funny vs. not-funny dataset
- Semantic Text Similarity datasets
- Another list of NLP datasets
- Few-Shot Relation Classification Dataset (FewRel), consisting of 70,000 sentences on 100 relations derived from Wikipedia and annotated by crowdworkers
- Named dataset
- Aspect/Target based sentiment prediction
- Relation classification https://github.com/deepakn97/relationPrediction
- Multimodal knowledge graphs
- Distantly supervised relation extraction without false positives
- Part whole relationship data
- Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions for Web reviews from English Treebank
- English WikiHow instructional guides semantically annotated with Universal Conceptual Cognitive Annotation (UCCA)
- English Keyphrase generation dataset and https://github.com/kenchan0226/keyphrase-generation-rl
- Reverse dictionary evaluation data
- Verbal shifter disambiguation
- TextGraphs-13 Shared Task on Multi-Hop Inference Explanation Regeneration
- Named Entities in European Newspapers (German, French, and Dutch)
- Document and subject indexing corpora
- GENETAG
- Open Biomedical corpora
- Multilingual NER - PANX
- WikiANN 282 languages NER - Gdrive
- Amazon QA review based question answering corpus
- Arabic NER
- NERWebpagesColumns from CogComp UIUC
- Wikification evaluation data
- Named Entity Coreference resolution across documents data
- Seminars and Job posting Data (Named Entities)
- List of various English NER datasets
- Document level relation extraction data using English Wikipedia
- Medinify dataset of medical comments tagged with drug mentions and ease of use
- Nano particles entity annotated data (in brat format)
- Named entities dataset in brat format from [Systematic Review Information Extraction (SRIE) 2018](https://tac.nist.gov/2018/SRIE/data.html)
- Ontonotes and FIGER gold data - Wayback link
- FIGER training data
- KNET fine grained named entity data
- Drug Drug interaction data for TAC 2019 task (NER and RE)
- Modal Sense classification
- Entity Sentiment TAC KBP
- MPQA 3.0 Entity Sentiment data
- Cross-Lingual Sentiment (CLS) dataset comprising about 800,000 Amazon product reviews
- MLDoc: A Corpus for Multilingual Document Classification in Eight Languages
- Topical Chat: human-human dataset of open-domain conversations
- Human to Human actionable request dataset
- Human written weather summaries
- List of datasets for Natural Language Generation (NLG)
- EmotionX: emotions induced by dialogue utterances
- VU Amsterdam Metaphor Corpus Codalab
- Abusive language datasets
- Multilingual Surface Realization
- Argument mining microtexts (not social media; available in English and German)
- List of resources in argument mining
- Unibo corpora for argument mining
- Datasets on natural argumentation
- Dataset on natural arguments with emotions
- IBM Debater dataset (multi-task dataset)
- El Capitan corpus for sentiment and topic at both the review (document) and sentence level
- Argument Mining European Court of Human Rights
- Question Classification Labels for Science Questions
- Wikipedia Biographies data (for assessing text generation algorithms)
- TASKMASTER-1 DIALOG CORPUS: TOWARD A REALISTIC AND DIVERSE DATASET
- Parallelly Annotated Stylistic Language Dataset with Multiple Personas
- A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking (UKP Snopes Corpus) Paper
- Argument Aspect Similarity (UKP ASPECT) Corpus Paper
- VU Amsterdam Metaphor Corpus
- VUAMC_crowd Dataset for evaluating humour (containing 28,210 pairwise comparisons of 4030 texts)
- Argument convincingness from crowdsourced data
- Discourse-level argumentation annotations
- CNN/Daily Mail summarization
- IBM Debater Evidence Detection data
- QA data for benchmarking entity linking systems Code
- Wikipedia-Wikidata sentence-level relation annotations
- Live Blog Corpus for Summarization
- Wikidata/FrameNet Alignment
- Event time extraction
- MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims
- Conversation datasets in ConvoKit
- Dataset of personal narratives with Advice-Seeking Questions
- Ontonotes documents labeled using Freebase entity types (included as part of paper source files)
- Google book corpora catalogue
- Multiple AIDA datasets for Named Entity Linking
- Named Entity Disambiguation for Noisy Text
- Multiple datasets for Wikipedia-based disambiguation
- VideoStory: Text summaries of videos
- German Named Entity Linking dataset
- TAC KBP English Entity Linking Comprehensive Training and Evaluation Data 2010
- Temporal fact extraction datasets
- Updated and fixed Fake news challenge dataset
- Document Ranking datasets
- Japanese aspect based sentiment analysis corpora
- Political scaling dataset
- Cross lingual text similarity
- Topical segmentation of text
- Lexico-semantic relatedness (7 datasets)
- Text simplification
- Ultra-fine entity typing
- Capturing Discriminative Attributes
- IBM Debating datasets with labels on Sentiment, Argumentation
- LitBank is an entity annotated dataset of 100 works of English-language fiction - https://github.com/dbamman/NAACL2019-literary-entities
- Nested Named Entity database
- WikiNER
- English Web Treebank
- PAWS: Paraphrase Adversaries from Word Scrambling PAWS-X multilingual
- Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus
- IRC Conversation Disentanglement
- Multiple NER datasets including corrected CoNLL 2003 data
- The Big Bad NLP Database
- LRE Map A database containing around 6,000 language resources and tools published at LREC conferences
- OpenIE datasets
- Entity linking dataset in German based on news broadcasts transcripts
- Text summarization datasets
- Multi-ling summarization datasets from shared task
- TED-Parallel-Corpus based on translations of TED talks
- Audio books corpora for speech to text
- Target Based Speech Act Classification in Political Campaign Text
- Datasets for Aspect Level Sentiment Analysis
- Curation Corpus for Abstractive Text Summarisation
- List of NER datasets curated as split of train, dev, and test
- Portmanteau corpus of 1600 items
- NLP with human traits corpus
- Categorized Entity Linking Corpus - Outputs of various EL systems on the datasets
- Bio2RDF: Linked data which can be used for Entity Linking - https://download.bio2rdf.org/#/current/
- Clickbait headlines dataset
- Named Entities based on gaze prediction
- Nerwip Corpus (Manually annotated 408 Wikipedia biographies for Named Entities)
- Serial Speakers: a Dataset of TV Series
- Wikipedia Abusive Conversations
- Extracting Semantic Network Data from Newspaper Articles
- ChroniclItaly 2.0. A corpus of Italian American newspapers annotated for entities, 1898-1920
- English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset
- Various German NLP datasets
- Hierarchical Patent Classification
- LSHTC: A Benchmark for Large-Scale Text Classification
- GSCL Shared Task: Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media
- EU News Summary Dataset for 2006-2013
- NER on historical newspapers
- Salient information from news articles and tweets
- The Upworthy Research Archive dataset of headline A/B tests conducted by Upworthy from early 2013 into April 2015
- Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts More unified version
- Aspect Based Sentiment Analysis (Twitter, Product Reviews)
- Fewshot relation extraction corpus
- SemEval 2010 Task 8 - Multi-Way Classification of Semantic Relations Between Pairs of Nominals
- Wordnet Annotated Corpora
- (RC)^2 dataset for Review Conversational Reading Comprehension (RCRC) - data repo
- Review Reading Comprehension (RRC) - data repo
- Complementary Entity Recognition (CER) - QA large PCQA Reviews
- A Multilingual Multi-Target Dataset for Stance Detection
- Argument Mining manual annotation of judgments of the European Court on Human Rights
- SemCor and Masc documents annotated with NOAD (New Oxford American Dictionary) word senses
- Polish NLP datasets (NER, IE, WSD)
- PoKi: A Large Dataset of Poems by Children (divided by grade and gender inferred from name)
- Privacy policies of US companies judged by legal experts
- OpenSubtitles - subtitles in various languages
- OpenParallel corpus
- A Dataset of Petitions from Avaaz.org
- NLP Datasets on Indian Languages
- A Survey and Experiments on Annotated Corpora for Emotion Classification in Text
- Github Typo Corpus (A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors)
- Sentence Alignment in Text Simplification (Wiki and Newsela)
- Media Frame Corpus
- DAWT: Densely Annotated Wikipedia Texts across multiple languages
- News corpus of police killings: sentence-segmented, mention-level, distantly labeled data used in experiments
- Biographical Structure in Text (Wikipedia event data, inferred gender and date of birth)
- NER for South and South East Asian Languages (Hindi, Bengali, Oriya, Telugu, Urdu)
- Summarization datasets
- Same Side Stance Classification
- Information Extraction from Chemical Patents
- Open Advancement of Question Answering Systems
- Medical term similarity datasets based on SNOMED-CT.
- Large-Scale Multi-Label Text Classification on EU Legislation - EURLEX57K Paper
- O*NET® 25.0 Database - Jobtitle, Job Description to Job Codes https://github.com/afshinrahimi/jobdescription2jobtitle
- WikiUMLS: Aligning UMLS to Wikipedia
- Microsoft Research Paraphrase Corpus
- Microsoft Research Paraphrase Phrase Tables
- SimplePPDB++ paraphrases with readability scores
- Word Complexity Lexicon
- Multilingual paraphrase corpus
- chakki's Aspect-Based Sentiment Analysis dataset
- Sentiment Analysis in Russian
- Similarity and relatedness dataset for Wikipedia entities (WikiSRS)
- Argumentation corpora
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
- End to End Entity Linking datasets (WebQSPEL and GraphQEL) - http://dl.fbaipublicfiles.com/elq/EL4QA_data.tar.gz Blink paper
- Entity Typing Dataset and WikilinksNED Unseen-Mentions
- Zero Shot Entity Linking dataset using Wikia
- Norwegian NER
- Case Law Project (US Cases Open Text as well as Case Citation Networks)
- Multilingual LibriSpeech (MLS) - A large multilingual corpus derived from LibriVox audiobooks
- Event prediction from WikiHow
- IBM Debating data
- T-Rex : A Large Scale Alignment of Natural Language with Knowledge Base Triples
- Hindi Translated datasets
- Ontonotes CoNLL format data
- Japanese NLP datasets for multiple tasks
- Japanese parallel text data
- Ted transcripts
- WikiConv - Wikipedia Talk Page conversations in 5 languages
- NER on historic newspaper text
- Persuasion Techniques Annotation
- Temporal Privacy Policy dataset
- Groningen Meaning Bank (GMB) Public Domain English NER + other tags
- Parallel Meaning Bank (PMB)
- Finnish NLP corpora
- Finnish English Emotion Annotation Movie dialogues from OPUS
- Event coref bank data
- Newsreader project data (Wikinews)
- NER transliteration dataset in multiple languages
- Various paraphrase corpora
- Keyphrase Generation datasets
- Corpus of Russian documents (for LM training)
- Dakshina corpus of South Asian languages in Latin Script
- Economic Sentiment dataset
- Entity Linking datasets
- Multilingual Political Scaling
- Entity Aspect Linking
- TREC Complex Answer Retrieval
- Multilingual Text Similarity
- TREC News (Wapost related doc and wikification)
- CrossNER - Multi Domain NER data
- Geoparsing datasets
- ParCOR - Parallel EN-DE coreference corpus
- DWIE (Deutsche Welle corpus for Information Extraction) for document-level multi-task Information Extraction (IE) (NER, NED, Coref, RelEx)
- ParSent - paragraph level entity centric sentiment
- Targeted Sentiment Analysis
- African Languages NER
- Political Ads
- ProPublica political ads
- Speech NER
- Russian Corpora 20+ datasets
- WNED Corpora as reported in Paper footnote 7 - https://www.dropbox.com/s/987hmjdoq0cql9z/WNED.tar.gz
- Deep - ED Entity Disambiguation Dataset - https://drive.google.com/uc?id=0Bx8d3azIm_ZcbHMtVmRVc1o5TWM&export=download
- Large scale dataset of multi-lingual aligned NER annotations from common crawl
- ViralTexts project - identify why old newspaper text went viral
- Multimodal Knowledge Graph Completion Code
- Entity Linking on Question Answering Data
- WebQuestions Semantic Parses Dataset Code
- WebQuestions Full dataset
- Question Answering over Linked Data
- Romanian language datasets
- Entity Linking datasets
- GDELT webngrams
- GDELT new similarity graph
- Web Data Commons - Schema.org Table Corpus
- QuoteBank A corpus of quotations from a decade of news
- Indian Court Judgements annotated with Gender
- Sentiment in Firm Risk Reports
- Terms of services tracked over time from various websites
- News Haiku Dataset
- India Police Events about the Gujarat 2002 riots
- Wikipedia Entity Linking Editor Recommendations - Code Datasets Paper
- WikiCheck: Wikipedia based Fact Checking Paper
- AIDA Entity Linking (Mapped to Wikidata)
- CMU Movie Summary Corpus
- MS Marco Keyphrase Extraction
- Keyphrase Extraction datasets
- JTubeSpeech: Corpus of speech collected from YouTube
- WikiNews - Annotated at multiple levels
- Linked Hypernym dataset attaches entity articles in English, German and Dutch Wikipedia linked to DBPedia
- Entity Linking in Queries Resources
- WNUT 2020 NER on wet lab protocol data
- WNUT 2020 Relation Extraction on wet lab protocol data
- Stack Overflow NER
- Kensho derived wikimedia data [Wikipedia + Wikidata]
- Abstract Meaning Representation Corpus
- NER on Material Science Papers
- Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers Annotation guidelines
- Webis Query Interpretation Corpus 2022 (Webis-QInC-22)
- Webis Query Spelling Corpus 2017 (Webis-QSpell-17)
- Webis-WebSeg-20 - 42,450 crowdsourced segmentations for 8,490 web pages from the Webis-Web-Archive-17
- Webis Abstractive Snippet Corpus 2020 - More than 10 million pairs / 3.5 million pairs were collected.
- Webis TripAdvisor Corpus 2014 (Webis-Tripad-14) - includes user meta-data
- Webis Query Segmentation Corpus 2010 (Webis-QSeC-10)
- Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11)
- Webis Cross-Lingual Sentiment Dataset 2010 (Webis-CLS-10)
- Webis-Revenue-10 (Entity Linking for Revenue statements)
- Webis-Debate-16 (A collection of phrases classified as argumentative or non-argumentative)
- Same Side Stance Classification Resampled Datasets
- Benchmark for the evaluation of Named Entity Linking over ancient documents
- HOME-Alcar (Aligned and Annotated Cartularies) corpus (to train Handwritten Text Recognition (HTR) and Named Entity Recognition (NER))
- TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 - English - Chinese Spanish
- E3C Disorder Entity Recognizer - Multilingual
- Effective Crowdsourcing of Multiple Tasks for Comprehensive Information Extraction (NER, NEL, REL)
- Query Expansion benchmarks
- DBPedia-Entity: benchmark for entity query relevance
- Korean NER
- Few-NERD - Not only a Few-shot NER dataset
- Improving Named Entity Recognition in Noisy User-generated Text with Local Distance Neighbor Feature
- Finnish NER
- Romanian NER
- Bulgarian NER
- Hungarian NER
- MIM-GOLD-NER – Icelandic named entity recognition corpus
- Arabic Spanish, Arabic English, and English Spanish Parallel Corpus
- Dataset of code-switched datasets
- NE3L named entities Arabic corpus
- DISRPT/sharedtask2021 - Discourse Unit Segmentation, Connective Detection and Discourse Relation Classification
- NER For Entity Linking
- Tracking Knowledge Propagation Across Wikipedia Languages
- Hate Speech Data Catalogue
- WikiDataSets - Topic specific subgraphs in Wikidata
- Multilingual Reply Suggestion (MRS)
- African NLP Datasets
- TREC 2022 MS Marco Deep Learning Track
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
- Mr. TyDi is a multi-lingual benchmark dataset for mono-lingual retrieval
- The Upworthy Research Archive
- ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other
- Emotion Cause Pair Extraction
- Cline Center: Coups d'état are important events in the life of a country
- DyGIE++: Entity, Relation, and Event Extraction with Contextualized Span Representations
- A global and multi-lingual computational linguistic atlas
- Entity-Switched Datasets: An Approach to Auditing the In-Domain Robustness of Named Entity Recognition Models
- MATINF - Multitask Chinese NLP Dataset
- Wordbank: An open database of children's vocabulary development Book
- ArtEmis: Affective Language for Visual Art dataset
- In-group bias in the Indian judiciary Code
- SigTyp 2021: Shared task on predicting language IDs from speech
- IR ranking datasets
- ShadowLink: entity disambiguation evaluation on overshadowed entities - https://zenodo.org/record/5196175
- Named Entity Recognition systems for 11 languages
- Wikipedia - Image/Caption Matching
- Silver Data Creation for Multilingual NER
- Shared Task on Named Entity Transliteration
- ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization
- SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary
- Knowledge Enhanced Sports Game Summarization
- Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0
- 24sata Croatian news article archive 1.0 - Comments data - https://www.clarin.si/repository/xmlui/handle/11356/1399
- TermFrame: Terms, definitions and semantic annotations for karstology (English, Slovenian, Croatian)
- HuffPost News Category Dataset
- OpenKP Keyphrase Extraction Dataset
- All Digitized Texas Appeals Court Cases Since 1900 - https://www.judyrecords.com/info
- Richpedia: A Comprehensive Multi-Modal Knowledge Graph
- Entity Matching Deepmatcher Datasets (multiple domains)
- Multilingual GeoQuery: A multilingual dataset for Geoquery. Each instance is a sentence annotated with its meaning representations
- Better Modeling of Incomplete Annotation for Named Entity Recognition
- Chinese Address Parsing
- Distantly Supervised NER
- Distantly Supervised NER
- PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction
- Data, submissions, and intermediate files from TempEval-3 held in 2013
- WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles
- WiNER: Coarse Named Entities in Wikipedia and WiFiNE: Transforming Wikipedia into a Large-Scale Fine-Grained Entity Type Corpus
- Named Entity Recognition for Entity Linking: What Works and What's Next
- Timeline Summarization
- News Timeline Summarization
- Wikipedia Current Events Portal (WCEP) + Common Crawl Dataset - https://github.com/complementizer/wcep-mds-dataset
- NELA-Local: A Dataset of U.S. Local News Articles for the Study of County-level News Ecosystems
- NELA-GT-2021: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
- Music Dataset: Lyrics and Metadata from 1950 to 2019
- DL-HARD: Annotated Deep Learning Dataset For Passage and Document Retrieval
- LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
- PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
- IteraTeR: Understanding Iterative Revision from Human-Written Text
- Multilingual Named Entity Recognition (NER) Datasets with Gazetteer
- Low Context Named Entity Recognition (NER) Datasets with Gazetteer
- MultiCoNER Dataset
- The Massively Multilingual Image Dataset (MMID)
- ZEST: ZEroShot learning from Task descriptions
- EMBEDDIA Cross-Lingual Embeddings for Less-Represented Languages in European News Media
- Ekspress Meedia news archive (c.1.4M articles in Estonian and Russian)
- Latvian Delfi Article Archive (c.180k articles in Latvian and Russian)
- Styria 24sata news archive (c.650k articles in Croatian)
- STT news archive (c.2.8M articles in Finnish): urn.fi/urn:nbn:fi:lb-2019041501
- Ekspress Meedia Comment Archive (c.31M comments in Estonian and Russian)
- Latvian Delfi Comment Archive (c.12M comments in Latvian and Russian)
- Styria 24sata Comment Archive (c.20M comments in Croatian)
- Multi-lingual culture-independent word analogy dataset
- CoSimLex context-dependent similarity dataset
- Slovenian SimLex dataset
- Keyword extraction datasets for Croatian, Estonian, Latvian & Russian
- Information Retrieval Datasets
- Appraisal enISEAR dataset: A reannotation of the enISEAR corpus with Cognitive Appraisal
- Universal Anaphora Data Repositories
- WANDS is a Wayfair product search relevance dataset
- 3rd Shared Task on SlavNER Recognition, Normalization, Classification and Cross-lingual linking of Named Entities in Slavic Languages
- KIND (Kessler Italian Named-entities Dataset)
- Monolingual and Cross-Lingual Acceptability Judgments with the Italian Corpus of Linguistic Acceptability (CoLA)
- Knowledge Base Construction from Pre-trained Language Models (LM-KBC)
- The Shared Task on Understanding Figurative Language - https://huggingface.co/datasets/ColumbiaNLP/FigLang2022SharedTask
- Euphemism Detection Shared Task
- Multimodal Emotion Datasets
- DEAL: Detecting Entities in the Astrophysics Literature
- CodRED: A Cross-Document Relation Extraction Dataset for Acquiring Knowledge in the Wild
- Code-mixed machine translation (MixMT)
- Shared Task on Customized Chat Grounding Persona and Knowledge
- Causal News Corpus
- WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types
- Time-Aware Language Models as Temporal Knowledge Bases
- GoodNewsEveryone: A corpus of news headlines annotated with emotions, semantic roles, and reader perception
- MELD: Multimodal EmotionLines Dataset: A dataset for Emotion Recognition in Multiparty Conversations
- Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations
- NewsClaims: A New Benchmark for Claim Detection from News with Background Knowledge
- Relationship and Entity Extraction Evaluation Dataset - https://github.com/dstl/re3d
- xNED - Basque dataset
- HIPE – Identifying Historical People, Places and other Entities Shared Task on Named Entity Recognition and Linking in Multilingual Historical Documents - Zenodo 2022 - 2020 - Zenodo 2020
- MIM-GOLD-EL: an Icelandic Entity Linking (EL) corpus
- MIM-GOLD-NER: Icelandic named entity (NE) corpus
- MIM-GOLD: gold standard for PoS-tagging and lemmatizing Icelandic texts
- MS MARCO entity annotations and disambiguations
- ConEL-2: Conversational Entity Linking Dataset - Data
- Wizard of Wikipedia: Knowledge-Powered Conversational Agents
- A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs
- DBP1M: LargeEA: Aligning Entities for Large-scale Knowledge Graphs
- Knowledge Graph Embedding, Entity Typing, and Entity Alignment Task Datasets
- Nested-NER
- 🐺 COYO-700M: Image-Text Pair Dataset
- KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding
- Jejueo Datasets for Machine Translation and Speech Synthesis
- MIntRec: A New Dataset for Multimodal Intent Recognition
- CH-SIMS v2.0: A Fine-grained Multi-label Chinese Multimodal Sentiment Analysis Dataset
- Multimodal Sentiment Analysis Datasets
- Contract Understanding Atticus Dataset (CUAD): A dataset of legal contracts with rich expert annotations - Zenodo
- FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information
- Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks (MT EN-RU from Global Voices and Reddit) - https://github.com/Shifts-Project/shifts/tree/main/translation
- Multilingual Spoken Words
- People’s Speech Dataset: the world’s largest English speech recognition corpus
- Time-Sensitive Question Answering dataset
- Sentence Keywords dataset (5K)
- Bloom Library HuggingFace Hub
- Chinese NER dataset
- PreCo: a large-scale English dataset for coreference resolution, built on preschool vocabulary
- The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts
- CrossRE: A Cross-Domain Dataset for Relation Extraction
- LongtoNotes: OntoNotes with Longer Coreference Chains
- Intent Classification Datasets
- MLQA (MultiLingual Question Answering)
- STAPLE: Simultaneous Translation And Paraphrase for Language Education dataverse
- Spaced Repetition Model for Language Learning
- reStructured Pretraining datasets
- PanLex Database Model Extension
- GeNER (an automated dataset Generation framework for NER)
- COMETA: the corpus of online medical entities
- Pivot-based Entity Linking (54 training langs, 9 test langs)
- Wikipedia Wikidata Relation Extraction (Context-Aware Representations for Knowledge Base Relation Extraction)
- Hinglish-TOP Dataset (Semantic Parsing)
- CoNLL++ (CoNLL 03 EN NER Fixed dataset)
- CANARD: a dataset for question-in-context rewriting
- WikiWiki - a large scale entity typing dataset
- Expedia Group ECML/PKDD 2022 challenge - Zero-shot Cross-brand Lodging Recommendation
- MACCROBAT: NER dataset for 41 biomedical entities using PubMed abstracts
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
- MAVE: A Product Dataset for Multi-source Attribute Value Extraction
- KOBE v2: Towards Knowledge-Based Personalized Product Description Generation in E-commerce
- 🪐 spaCy Project: Analyzing how mentions of ingredients change over time (Named Entity Recognition)
- TASTESet recipe NER model and dataset
- Analysis of Gender Bias in Hollywood Movies
- P-Stance: A Large Dataset for Stance Detection in Political Domain
- MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation) - Model - SpanMarker for Named Entity Recognition
- WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER
- WANDS - Wayfair ANnotation Dataset - Wayfair product search relevance dataset
- Humset toponym annotations - Location NER 469 docs
- DISTEMIST: DISease TExt Mining Shared Task - NERD dataset
- SympTEMIST: SYMPtoms, signs and findings TExt MIning Shared Task - NERD dataset
- SemEval-2 - 2010 - Evaluation Exercises on Semantic Evaluation - ACL SigLex event
- Retail Products Classification: Classification of retail products by image and description
- ColorNames - NAME ALL THE COLORS!
- en-worldwide-newswire - 1100 non-Western news articles
- Stanford handparsed-treebank - Extra hand parsed data for training models
- ComFact: A Benchmark for Linking Contextual Commonsense Knowledge
- PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives
- ConvAI3: Clarifying Questions for Open-Domain Dialogue Systems (ClariQ)
Shared tasks, competitions
- In Pursuit of Happiness
- NIST TAC Knowledge Base Population (KBP2019) Entity Discovery and Linking Track
- Multilingual Surface Realization Shared Task (SR'19)
- Shared Task on Fine-Grained Propaganda Detection @ NLP4IF 2019
- PolEval
- GERMEVAL - German NLP shared tasks
- Hobbit: Holistic Benchmarking of Big Linked Data
- CLEF Shared tasks
- SemEval shared tasks
- TREC shared tasks
- Shared task on implicit and underspecified language
- Social Book Search Lab
- DSTC7: Dialog System Technology Challenges
Research Groups
- University of Copenhagen
- Statistical Social Language Analysis - Brendan O'Connor
- Stanford
- Harvard
- University College London
- Pázmány Péter Catholic University
- Noah Smith NLP Lab at University of Washington
- University of Washington
- Johns Hopkins University
- Natural Language Processing Lab at Tsinghua University
- OSU Speech and Language Technologies Laboratory