Natural Language Processing
- Courses and Books
- Scientific NLP
- Tools
- Pretrained embeddings and models
- Links
- Lexicons
- Data
- Shared tasks, competitions
- Research Groups
- Mailing lists
Courses and Books
- Statistical NLP Book: https://github.com/uclmr/stat-nlp-book/blob/python/overview.ipynb
- Course on word embeddings, variants, and applications: http://people.ds.cam.ac.uk/iv250/esslli2018.html
- A Course in Machine Learning: http://ciml.info/
- Introduction to Natural Language Processing by Jacob Eisenstein - https://github.com/jacobeisenstein/gt-nlp-class
- Information extraction: https://web.stanford.edu/~jurafsky/slp3/17.pdf (covers the main ideas of information extraction)
- 601.765 Machine Learning: Linguistic & Sequence Modeling - https://seq2class.github.io/
- Variational Inference for NLP - https://github.com/philschulz/VITutorial#general
- Probabilistic NLP course - https://uva-slpl.github.io/nlp2/syllabus.html
- Oxford Deep Learning NLP - https://github.com/oxford-cs-deepnlp-2017/lectures
- CMU NLP - http://phontron.com/class/nn4nlp2017/schedule.html
- David Bamman's Course on Applied NLP - https://github.com/dbamman/anlp19 - Course Page
- NLP Course | For You by Lena Voita - https://lena-voita.github.io/nlp_course.html
- Course material for Machine Translation from UIUC, JHU - http://mt-class.org/
- Jason Eisner - http://www.cs.jhu.edu/~jason/465/
- Introduction to Cultural Analytics & Python - https://melaniewalsh.github.io/Intro-Cultural-Analytics/
- Computational Sociolinguistics by David Jurgens - https://docs.google.com/document/d/1Ouyqz-emtOI-ohwTOdOZpcjcEFtPPilhJDso8sjgByU/edit
- Ethics in NLP - https://aclweb.org/aclwiki/Ethics_in_NLP
- Multilingual Natural Language Processing - http://demo.clab.cs.cmu.edu/11737fa20/
- CSE 704 - Applied Natural Language Processing and Computational Social Science - https://kennyjoseph.github.io/cse702
Scientific NLP
Tools
- Using spaCy and FlashText for fast lookup of dictionary items in text - https://github.com/mpuig/spacy-lookup
- Open IE tool - https://github.com/dair-iitd/OpenIE-standalone
- Different evaluation techniques for NER: https://github.com/davidsbatista/NER-Evaluation
- Phrase extraction using POS patterns: https://github.com/slanglab/phrasemachine
- What's Wrong With My NLP in Python: a diff visualizer for NLP tasks - https://github.com/ppke-nlpg/whats-wrong-python
- Inform - interactive fiction based on natural language: http://inform7.com/
- Linguistic Knowledge and Transferability of Contextual Representations - https://github.com/nelson-liu/contextual-repr-analysis
- Generalized Brown clusters - https://github.com/sean-chester/generalised-brown
- COGCOMP NLP: Online demo with multiple tasks - http://nlp.cogcomp.org/ Code: https://github.com/CogComp/cogcomp-nlp
- Python keyphrase extraction library - https://github.com/boudinfl/pke
- Neural Relation Extraction - https://github.com/thunlp/OpenNRE
- List of tools for corpus analysis - https://corpus-analysis.com/
- WordMapper, evolution of words on Twitter - https://sites.google.com/site/wordmapperinfo/
- Document subject indexing and thesaurus linking (also works for Finnish) - http://annif.org/
- BPEmb: Pre-trained Subword Embeddings in 275 Languages (LREC 2018) - https://doi.org/10.11588/data/V9CXPR
- Cornell Conversational Analysis Toolkit - https://zissou.infosci.cornell.edu/convokit/documentation/index.html
- Python based text summarization evaluation - https://github.com/chakki-works/sumeval
- Python based sequence tagging evaluation - https://github.com/chakki-works/seqeval
- NLP annotation tool - https://github.com/doccano/doccano
- Yake Single document keyword extraction - https://github.com/LIAAD/yake
- Derive named entities from Wikipedia - https://github.com/kno10/WikipediaEntities
- Bengali NLP - https://github.com/sagorbrur/bnlp
- Py Readability metrics - https://github.com/cdimascio/py-readability-metrics
- iNLTK - https://inltk.readthedocs.io/en/latest/
- Fast spell correction and word segmentation - https://github.com/wolfgarbe/symspell
- NERD evaluation - https://nerd.readthedocs.io/en/latest/evaluation.html
- MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS - https://github.com/CogStack/MedCAT
- PyTerrier - A Python framework for performing information retrieval experiments, building on http://terrier.org/ - https://github.com/terrier-org/pyterrier
- JamSpell - Spelling correction (works in Python) - https://github.com/bakwc/JamSpell
- Efficient Low Memory Aligner - https://github.com/robertostling/eflomal
- DeepTranslit: Towards better transliteration for Indic languages - https://github.com/notAI-tech/DeepTranslit
- Google Trends Anchor Bank - https://github.com/epfl-dlab/GoogleTrendsAnchorBank
- The Tatoeba Translation Challenge (v2021-08-07) - https://github.com/Helsinki-NLP/Tatoeba-Challenge
- Named Entity Recognition for Entity Linking: What Works and What's Next - https://github.com/Babelscape/ner4el
- Inception: semantic annotation tool - https://inception-project.github.io/
- Language rules in XML format from languagetool.org - https://dev.languagetool.org/languages
- REBL is an extension of the Radboud Entity Linker (REL) for Batch Entity Linking - https://github.com/informagi/REBL
- Odinson: A Fast Rule-based Information Extraction Framework - https://github.com/lum-ai/odinson/
- SDSL - Succinct Data Structure Library - https://github.com/simongog/sdsl-lite/
Pretrained embeddings and models
- ELMo for Many Languages - https://github.com/HIT-SCIR/ELMoForManyLangs
- Flair for many languages - https://github.com/flairNLP/flair-lms
- Pretrained transformer models for many languages and domains - https://huggingface.co/models
- FastText embeddings - https://fasttext.cc/
Links
- Bayesian NLP: https://homepages.inf.ed.ac.uk/sgwater/resources.html
- Rules for forming questions using text: https://ell.stackexchange.com/questions/1156/when-converting-a-statement-to-a-question-where-in-the-sentence-should-i-put-t/1198#1198
- Intel NLP Architect library, with lots of great features built on TensorFlow: http://nlp_architect.nervanasys.com/publications.html
- Information extraction from unstructured text: https://sites.google.com/site/keit2018kdd/
- Tutorial on deep latent variable models for NLP with pytorch code: https://nlp.seas.harvard.edu/latent-nlp-tutorial.html
- On the viability of crowdsourcing NLP annotations in healthcare: https://roamanalytics.com/2018/07/26/on-the-viability-of-crowdsourcing-nlp-annotations-in-healthcare/
- Contextual Word Representations: A Contextual Introduction - https://arxiv.org/abs/1902.06006
- Emotion vocabularies - https://www.w3.org/TR/emotion-voc/
- Emotion markup language - https://www.w3.org/TR/emotionml/
- NAACL 2019 tutorial on Modeling Language Change - https://github.com/jacobeisenstein/language-change-tutorial
- NAACL 2019 tutorial on Transfer Learning in Natural Language Processing - https://github.com/huggingface/naacl_transfer_learning_tutorial
- Papers on Textual Adversarial Attack and Defense: https://github.com/thunlp/TAADpapers
- Resources for NRE: Neural Relation Extraction: https://github.com/thunlp/NREPapers
- Online self-learning platform on vocabulary: http://vlearn.fed.cuhk.edu.hk/
- Awesome sentiment analysis - https://github.com/xiamx/awesome-sentiment-analysis
- Tutorial on NLP Approaches to Computational Argumentation - http://acl2016tutorial.arg.tech/index.php/tutorial-materials/
- Tutorial on Argumentation Mining - http://www.i3s.unice.fr/~villata/tutorialIJCAI2016.html
- ACL2019 Tutorial: Advances in Argument Mining - http://arg.tech/~chris/acl2019tut/index.html
- International Linguistic Olympiad - https://ioling.org/ - Covers a lot of low-resource and multilingual problems
- Tutorial by Xavier Carreras on Structured Prediction - https://www.youtube.com/watch?v=f6Gqr2UCG9k&list=PLSWgH7JB2-1G2h8wj-ecK8FfpX72Z80_B&index=3&t=0s
- Visual Sentiment Ontology & Dataset - http://visual-sentiment-ontology.appspot.com/?
- NLP Roadmap - https://github.com/graykode/nlp-roadmap
- Pretrained LM model papers - https://github.com/thunlp/PLMpapers
- EmojiMap - http://kt.ijs.si/data/Emoji_sentiment_ranking/emojimap.html
- Literature review and datasets on Text-Summarization - https://github.com/neulab/Text-Summarization-Papers
- Causal Inference using text data - https://github.com/jaeyk/ITS-Text-Classification
- Tutorial on topic models and their extensions, with animations of the Chinese restaurant process and Pólya's urn - http://topicmodels.west.uni-koblenz.de/
- Relation extraction resources - https://github.com/roomylee/awesome-relation-extraction
- Wordbank - An open database of children's vocabulary development - http://wordbank.stanford.edu/
- Comparison of web (text) annotation editors
- North American Computational Linguistics Open Competition - https://www.nacloweb.org/
- Frequency lists of words in all languages - https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
- The NLP Pandect - https://github.com/ivan-bilan/The-NLP-Pandect
- Unicode text segmentation - http://www.unicode.org/reports/tr29/
- Datasets for Text Analysis - https://docs.google.com/spreadsheets/d/1I7cvuCBQxosQK2evTcdL3qtglaEPc0WFEs6rZMx-xiE/edit#gid=0
- Multilingual stopwords - https://github.com/stopwords-iso/stopwords-iso
- Entity Related Papers - https://github.com/HelloRusk/entity-related-papers/blob/master/README.md
- Entity Linking Recent Trends - https://github.com/izuna385/Entity-Linking-Recent-Trends
- Fine-grained evaluation of Entity Linking - https://github.com/henryrosalesmendez/EL_exp
- Advanced String Matching and Burrows-Wheeler Indexing - https://langmead-lab.org/teaching-materials/ YouTube
- Succinct Data Structures for NLP-at-Scale - https://mpetri.github.io/coling16-tutorial/
- Space-Efficient Data Structures for Top-k Completion - https://www.microsoft.com/en-us/research/publication/space-efficient-data-structures-for-top-k-completion/
- Beginner's Crash Course to Elastic Stack for text analysis - https://github.com/LisaHJung/Beginners-Crash-Course-to-Elastic-Stack-Series-Table-of-Contents - YouTube
Lexicons
- HistEmo lexicon for valence, arousal and dominance scores of words across time - https://github.com/JULIELab/HistEmo
- Sentiment lexicon - https://github.com/juliasilge/tidytext/blob/master/data-raw/sentiments.csv
- Polarity shifter lexicons - https://github.com/uds-lsv/polarity-shifter-resources
- Bootstrapped polarity shifter lexicons - https://github.com/uds-lsv/bootstrapped-lexicon-of-english-polarity-shifters
- Lexicon of abusive words - https://github.com/uds-lsv/lexicon-of-abusive-words
- Finnish General Ontology YSO, several domain-specific vocabularies, and the KOKO ontology - https://github.com/NatLibFi/Finto-data
- Entitypedia is an Extended Named Entity Dictionary from Wikipedia - https://github.com/chakki-works/entitypedia
- Extended Open Multilingual Wordnet - http://compling.hss.ntu.edu.sg/omw/summx.html
- Open Multilingual Wordnet - http://compling.hss.ntu.edu.sg/omw/index.html
- Open License English Wordnet - https://github.com/globalwordnet/english-wordnet
- HurtLex: a lexicon of offensive, aggressive, and hateful words in over 50 languages - https://github.com/valeriobasile/hurtlex
- The Evaluative Lexicon - emotionality, valence, and extremity - http://www.evaluativelexicon.com/
- CLICS: Database of Cross-Linguistic Colexifications - https://clics.clld.org/
- FreeDict: Multilingual and Single Language Dictionaries - https://freedict.org/downloads/
- Slob Dictionaries: Language dictionaries - https://github.com/itkach/slob/wiki/Dictionaries
- Hindi Language Stop Words List - https://data.mendeley.com/datasets/bsr3frvvjc/1
- XMLittré: XML version of the French Dictionary Littré (1873–1877) - https://bitbucket.org/Mytskine/xmlittre-data Raw Text
- GNU Collaborative International Dictionary of English - https://gcide.gnu.org.ua/
- GNU Dico: GNU Dictionary Server which loads dictionaries from various formats - https://puszcza.gnu.org.ua/software/dico/modules.html
- Moby Thesaurus - https://moby-thesaurus.org/ (See words.txt)
- OpenOffice MyThes thesaurus - https://www.openoffice.org/lingucomponent/thesaurus.html
- German POS dictionary - https://github.com/languagetool-org/german-pos-dict
- CMU Dict (word pronunciation) - https://github.com/cmusphinx/cmudict JS format
- Gemoji: emoji descriptions and tags - https://github.com/github/gemoji/blob/master/db/emoji.json
- Emoji-emotion: emotion assigned to emojis - https://github.com/words/emoji-emotion
- Map of profane words to how likely each is to be used as profanity rather than clean text (multilingual) - https://github.com/words/cuss
- The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene - https://www.clarin.si/repository/xmlui/handle/11356/1318
- Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1 - https://www.clarin.si/repository/xmlui/handle/11356/1431 - Github
- Multilingual comparable corpora of parliamentary debates ParlaMint 2.1 - https://www.clarin.si/repository/xmlui/handle/11356/1432
- labMT lexicon with happiness scores and definitions - https://gitlab.com/compstorylab/sentiment-analysis/-/blob/master/data/labmt.tsv
- Easy-to-use word translations for 3,564 language pairs across 62 unique languages - https://github.com/kakaobrain/word2word
Data
- Multi-LexSum: https://multilexsum.github.io/
- List of English Medical Terms - https://github.com/glutanimate/wordlist-medicalterms-en
- HUNER: improving biomedical NER with pretraining - https://corposaurus.github.io/corpora/
- WebIs Datasets - https://webis.de/data.html
- WEXEA: an exhaustive Wikipedia entity annotation system - https://github.com/mjstrobl/WEXEA
- Toloka Visual Question Answering Challenge - https://toloka.ai/challenges/wsdm2023/
- KPWR-NER: Polish Fine-grained NER - https://huggingface.co/datasets/clarin-pl/kpwr-ner
- Universal NER Datasets - https://www.universalner.org/datasets/
- Survey on English Entity Linking on Wikidata - https://github.com/semantic-systems/ELEnglishWD Zenodo
- Sentiment corpora: https://www.w3.org/community/sentiment/wiki/Datasets
- Relation Extraction Corpora: https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets
- Summarization corpora: https://gist.github.com/napsternxg/2750479273e0621c5aa697bf89843428
- Text normalization corpora: https://github.com/rwsproat/text-normalization-data
- Sentihood data - Aspect based sentiment for neighborhood text - https://github.com/uclmr/jack/tree/master/data/sentihood
- Pretrained emoji embeddings and Twitter sentiment data - https://github.com/uclmr/emoji2vec
- Fake News Corpus: https://github.com/several27/FakeNewsCorpus
- EmoBank - Sentiment from the perspective of the reader's vs. the writer's emotion - https://github.com/JULIELab/EmoBank
- Many argumentation mining corpora - https://www.informatik.tu-darmstadt.de/ukp/research_6/research_areas/argumentation_mining/index.en.jsp
- NER datasets - https://github.com/juand-r/entity-recognition-datasets
- More NER datasets - http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html
- Moral foundations twitter corpus - https://psyarxiv.com/w4f72
- Keyphrase extraction - https://github.com/titipata/keyphrase_extraction https://github.com/LIAAD/KeywordExtractor-Datasets#theses
- Multi-target-specific sentiment recognition on Twitter - https://github.com/bluemonk482/tdparse
- Multilingual (700 languages) speech and text - https://github.com/festvox/datasets-CMU_Wilderness
- Large Scale text classification - http://lshtc.iit.demokritos.gr/
- Multiple text classification data - https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
- Data for lifelong ML for sentiment classification - https://www.aclweb.org/anthology/papers/P/P15/P15-2123/
- Aspect extraction data for Amazon reviews for 36 domains (1000 per domain) - https://www.aclweb.org/anthology/papers/P/P14/P14-1033/
- Perspectrum: dataset of claims and supporting and opposing sentences and paragraphs - https://github.com/CogComp/perspectrum
- Multiple datasets curated by CogComp group - https://cogcomp.org/page/corpora/ and https://cogcomp.org/page/data/
- 204,135 articles from 18 American publications (2013-2018). Includes date, title, publication name, article text, year, month, and URL - https://components.one/datasets/all-the-news-articles-dataset/
- Legal Entity Detection - https://github.com/hockeyjudson/Legal-Entity-Detection
- New Yorker Cartoon Captions funny vs. not-funny dataset - https://github.com/nextml/caption-contest-data/tree/master/contests
- Semantic Text Similarity datasets - https://github.com/brmson/dataset-sts
- Another list of NLP datasets - https://github.com/shashankg7/Deep-Learning-for-NLP-Resources
- Few-Shot Relation Classification Dataset (FewRel), consisting of 70,000 sentences on 100 relations derived from Wikipedia and annotated by crowdworkers - https://github.com/thunlp/FewRel
- Names dataset - https://github.com/philipperemy/name-dataset
- Aspect/Target based sentiment prediction - https://github.com/apmoore1/Bella
- Relation classification - https://github.com/zhangdongxu/kbp37 https://github.com/deepakn97/relationPrediction
- Multimodal knowledge graphs - https://github.com/nle-ml/mmkb
- Distantly supervised relation extraction without false positives - https://github.com/pvthuy/distantly-supervised-RE
- Part whole relationship data - https://github.com/pvthuy/part-whole-relations
- Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions for Web reviews from English Treebank - https://github.com/nert-nlp/streusle
- English WikiHow instructional guides semantically annotated with Universal Conceptual Cognitive Annotation (UCCA) - https://github.com/nert-nlp/Whow_UCCA
- English Keyphrase generation dataset - https://github.com/memray/seq2seq-keyphrase and https://github.com/kenchan0226/keyphrase-generation-rl
- Reverse dictionary evaluation data - https://github.com/uds-lsv/Multi-Sense-Embeddings-Reverse-Dictionaries
- Verbal shifter disambiguation - https://github.com/uds-lsv/disambiguation-of-verbal-shifters
- TextGraphs-13 Shared Task on Multi-Hop Inference Explanation Regeneration - https://competitions.codalab.org/competitions/20150
- Named Entities in European Newspapers (German, French, and Dutch) - https://github.com/EuropeanaNewspapers/ner-corpora
- Document and subject indexing corpora - https://github.com/NatLibFi/Annif-corpora
- GENETAG - https://github.com/openbiocorpora/genetag
- Open Biomedical corpora - https://github.com/openbiocorpora
- Multilingual NER - https://github.com/afshinrahimi/mmner - PANX - https://www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN?_encoding=UTF8&%2AVersion%2A=1&%2Aentries%2A=0&mgh=1
- WikiANN 282 languages NER - http://nlp.cs.rpi.edu/wikiann/ - Gdrive
- Amazon QA review based question answering corpus - https://github.com/amazonqa/amazonqa
- Arabic NER - http://www.cs.cmu.edu/~ark/ArabicNER/
- NERWebpagesColumns from CogComp UIUC - https://cogcomp.seas.upenn.edu/page/resource_view/28
- Wikification evaluation data - https://cogcomp.seas.upenn.edu/page/resource_view/4
- Named Entity Coreference resolution across documents data - https://cogcomp.seas.upenn.edu/page/resource_view/44
- Seminars and Job posting Data (Named Entities) - https://cogcomp.seas.upenn.edu/page/resource_view/31
- List of various English NER datasets - https://github.com/dice-group/FOX/tree/master/input
- Document level relation extraction data using English Wikipedia - https://github.com/thunlp/DocRED
- Medinify dataset of medical comments tagged with drug mentions and ease of use - https://github.com/NLPatVCU/medinify-datasets
- Nano particles entity annotated data (in brat format) - https://github.com/NLPatVCU/medaCy_dataset_end
- Named entities dataset in brat format from Systematic Review Information Extraction (SRIE) 2018 - https://github.com/NLPatVCU/medaCy_dataset_tac_2018
- Ontonotes and FIGER gold data - http://nlp.cs.rpi.edu/kbp/2019/data.html - Wayback link
- FIGER training data - https://github.com/xiaoling/figer
- KNET fine grained named entity data - https://github.com/thunlp/KNET
- Drug Drug interaction data for TAC 2019 task (NER and RE) - https://bionlp.nlm.nih.gov/tac2019druginteractions/
- Modal sense classification - https://github.com/amarasovic/modal-sense-classifcation
- Entity Sentiment TAC KBP - https://tac.nist.gov//2014/KBP/data.html
- MPQA 3.0 Entity Sentiment data - https://mpqa.cs.pitt.edu/corpora/mpqa_corpus/
- Cross-Lingual Sentiment (CLS) dataset, comprising about 800,000 Amazon product reviews - https://webis.de/data/webis-cls-10.html
- MLDoc: A Corpus for Multilingual Document Classification in Eight Languages - https://github.com/facebookresearch/MLDoc
- Topical Chat: human-human dataset of open-domain conversations - https://github.com/alexa/alexa-prize-topical-chat-dataset/
- Human to Human actionable request dataset - https://github.com/NervanaSystems/nlp-architect/tree/master/datasets/H2H%20requests%20detection
- Human written weather summaries - https://ehudreiter.files.wordpress.com/2016/12/sumtime.zip
- List of datasets for Natural Language Generation (NLG) - https://aclweb.org/aclwiki/Data_sets_for_NLG
- EmotionX: emotions induced by dialogue utterances - https://sites.google.com/view/emotionx2019/
- VU Amsterdam Metaphor Corpus - http://ota.ahds.ac.uk/headers/2541.xml Codalab
- Abusive language datasets - https://sites.google.com/view/alw3/resources?authuser=0
- Multilingual Surface Realization - http://taln.upf.edu/pages/msr2019-ws/SRST.html#data
- Argument mining microtexts (not social media; available in English and German) - https://github.com/peldszus/arg-microtexts
- List of resources in argument mining - https://www.informatik.tu-darmstadt.de/ukp/research_6/data/index.en.jsp
- Unibo corpora for argument mining - http://argumentationmining.disi.unibo.it/resources.html
- Datasets on natural argumentation - http://www-sop.inria.fr/NoDE/NoDE-xml.html
- Dataset on natural arguments with emotions - https://project.inria.fr/seempad/datasets/
- IBM Debater dataset (multi-task dataset) - http://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
- El Capitan corpus for sentiment and topic at both the review (document) and sentence level - https://github.com/eibeke/El-Capitan-Dataset
- Argument Mining European Court of Human Rights - https://github.com/PLN-FaMAF/ArgumentMiningECHR
- Question Classification Labels for Science Questions - http://cognitiveai.org/explanationbank/
- Wikipedia Biographies data (for assessing text generation algorithms) - https://github.com/DavidGrangier/wikipedia-biography-dataset
- Taskmaster-1 Dialog Corpus: Toward a Realistic and Diverse Dataset - https://mila.quebec/en/publication/taskmaster-1-dialog-corpus-toward-a-realistic-and-diverse-dataset/
- Parallelly Annotated Stylistic Language Dataset with Multiple Personas - https://github.com/dykang/PASTEL
- A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking (UKP Snopes Corpus) - https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2081 Paper
- Argument Aspect Similarity (UKP ASPECT) Corpus - https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/1998 Paper
- VU Amsterdam Metaphor Corpus - https://github.com/UKPLab/acl2019-GPPL-humour-metaphor/tree/master/data/VU%20Amsterdam%20Metaphor%20Corpus/2541
- VUAMC_crowd Dataset for evaluating humour (containing 28,210 pairwise comparisons of 4030 texts) - https://github.com/UKPLab/acl2019-GPPL-humour-metaphor/tree/master/data/vuamc_crowd
- Argument convincingness from crowdsourced data - https://github.com/UKPLab/tacl2018-preference-convincing
- Discourse-level argumentation annotations - https://github.com/UKPLab/naacl2019-argument-annotations
- CNN/Daily Mail summarization - https://github.com/UKPLab/emnlp2019-summary-reward/tree/master/data
- IBM Debater Evidence Detection data - https://github.com/UKPLab/fever2019-interactive-evidence-detection/tree/master/data
- QA data for benchmarking entity linking systems - https://public.ukp.informatik.tu-darmstadt.de/starsem18-entity-linking/EntityLinkingForQADatasets.zip Code
- Wikipedia-Wikidata sentence-level relation annotations - https://www.informatik.tu-darmstadt.de/ukp/research_6/data/lexical_resources/wikipedia_wikidata_relations/index.en.jsp
- Live Blog Corpus for Summarization - https://github.com/UKPLab/lrec2018-live-blog-corpus
- Wikidata/FrameNet Alignment - https://www.informatik.tu-darmstadt.de/ukp/research_6/data/lexical_resources/wikidata_framenet_alignments/index.en.jsp
- Event time extraction - https://github.com/UKPLab/tacl2017-event-time-extraction/tree/master/input
- MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims - https://copenlu.github.io/publication/2019_emnlp_augenstein/
- Conversation datasets in ConvoKit - https://zissou.infosci.cornell.edu/convokit/documentation/datasets.html
- Dataset of personal narratives with Advice-Seeking Questions - https://github.com/CornellNLP/ASQ
- Ontonotes documents labeled using Freebase entity types (included as part of paper source files) - https://arxiv.org/format/1412.1820
- Google book corpora catalogue - http://storage.googleapis.com/books/
- Multiple AIDA datasets for Named Entity Linking - https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/
- Named Entity Disambiguation for Noisy Text - https://github.com/yotam-happy/NEDforNoisyText
- Multiple datasets for wikipedia based disambiguation - http://rali.iro.umontreal.ca/rali/?q=en/wikipedia-ds-cont-emb
- VideoStory: Text summaries of videos - https://zenodo.org/record/2383739#.Xe5mO-hKh3g
- German Named Entity Linking dataset - https://github.com/linkedtv/videocorpus
- TAC KBP English Entity Linking Comprehensive Training and Evaluation Data 2010 - https://github.com/hellojet/TAC_KBP_English_EL_2010/tree/e274785013be529f0acee347b75d6076bda11126
- Temporal fact extraction datasets - https://github.com/dmsquare/tfwin
- Updated and fixed Fake news challenge dataset - https://github.com/UKPLab/coling2018_fake-news-challenge
- Document Ranking datasets - https://microsoft.github.io/TREC-2019-Deep-Learning/
- Japanese aspect based sentiment analysis corpora - https://github.com/chakki-works/chABSA-dataset
- Political scaling dataset - https://bitbucket.org/gg42554/cl-scaling/src/master/
- Cross lingual text similarity - https://bitbucket.org/gg42554/cl-sts/src/master/data/
- Topical segmentation of text - https://bitbucket.org/gg42554/graphseg/src/master/data/
- Lexico-semantic relatedness (7 datasets) - https://bitbucket.org/gg42554/dual-tensors/src/master/data/
- Text simplification - https://bitbucket.org/gg42554/embesimp/src/master/Data/
- Ultra-fine entity typing - https://homes.cs.washington.edu/~eunsol/open_entity.html
- Capturing Discriminative Attributes - https://competitions.codalab.org/competitions/17326#participate
- IBM Debating datasets with labels on Sentiment, Argumentation - https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
- LitBank is an entity annotated dataset of 100 works of English-language fiction - https://github.com/dbamman/litbank - https://github.com/dbamman/NAACL2019-literary-entities
- Nested Named Entity database - https://github.com/nickyringland/nested_named_entities
- WikiNER - http://web.archive.org/web/20141104075416/http://schwa.org/projects/resources/wiki/Wikiner
- English Web Treebank - https://github.com/UniversalDependencies/UD_English-EWT
- PAWS: Paraphrase Adversaries from Word Scrambling - https://github.com/google-research-datasets/paws PAWS-X multilingual
- Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus - https://traces1.inria.fr/oscar/
- IRC Conversation Disentanglement - https://jkk.name/irc-disentanglement/
- Multiple NER datasets including corrected CoNLL 2003 data - https://github.com/pfliu-nlp/Named-Entity-Recognition-NER-Papers
- The Big Bad NLP Database - https://quantumstat.com/dataset/dataset.html
- LRE Map: a database containing around 6,000 language resources and tools published at LREC conferences - http://lremap.elra.info/
- OpenIE datasets - https://github.com/gabrielStanovsky/supervised-oie/
- Entity linking dataset in German based on news broadcasts transcripts - https://github.com/linkedtv/videocorpus
- Text summarization datasets - http://pfliu.com/pl-summarization/summ_data.html
- Multi-ling summarization datasets from shared task - http://multiling.iit.demokritos.gr/pages/view/1666/multiling-2019-call-for-papers
- TED-Parallel-Corpus based on translations of TED talks - https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus
- Audio books corpora for speech to text - https://github.com/ajinkyakulkarni14/Audio-Book-Corpus-for-European-Languages-
- Target Based Speech Act Classification in Political Campaign Text - https://github.com/shivashankarrs/Speech-Acts
- Datasets for Aspect Level Sentiment Analysis - https://github.com/12190143/Deep-Learning-for-Aspect-Level-Sentiment-Classification-Baselines
- Curation Corpus for Abstractive Text Summarisation - https://github.com/CurationCorp/curation-corpus
- List of NER datasets curated as split of train, dev, and test - https://github.com/pfliu-nlp/Named-Entity-Recognition-NER-Papers
- Portmanteau corpus of 1600 items - https://github.com/vgtomahawk/Charmanteau-CamReady
- NLP with human traits corpus - https://osf.io/bm9gn/
- Categorized Entity Linking Corpus - https://github.com/henryrosalesmendez/categorized_EMNLP_datasets - Outputs of various EL systems on the datasets
- Bio2RDF: Linked data which can be used for Entity Linking - https://download.bio2rdf.org/files/release/3/release.html - https://download.bio2rdf.org/#/current/
- Clickbait headlines dataset - https://github.com/bhargaviparanjape/clickbait/tree/master/dataset
- Named Entities based on gaze prediction - https://github.com/DS3Lab/ner-at-first-sight/tree/master/data
- Nerwip Corpus (Manually annotated 408 Wikipedia biographies for Named Entities) - https://doi.org/10.6084/m9.figshare.1289791.v17
- Serial Speakers: a Dataset of TV Series - https://figshare.com/articles/TV_Series_Corpus/3471839
- Wikipedia Abusive Conversations - https://figshare.com/articles/Wikipedia_Abusive_Conversations/11299118
- Extracting Semantic Network Data from Newspaper Articles - https://doi.org/10.7910/DVN/GBUL0K
- ChroniclItaly 2.0. A corpus of Italian American newspapers annotated for entities, 1898-1920 - https://doi.org/10.24416/uu01-4mecro
- English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset - https://data.mendeley.com/datasets/cdcztymf4k/1
- Various German NLP datasets - https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data.html
- Hierarchical Patent Classification - https://dublin.zhaw.ch/~benf/HPC/
- LSHTC: A Benchmark for Large-Scale Text Classification - http://lshtc.iit.demokritos.gr/
- GSCL Shared Task: Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media - https://sites.google.com/site/empirist2015/home
- EU News Summary Dataset for 2006-2013 - http://lexhub.org/data_sets/14
- NER on historical newspapers - https://github.com/impresso/CLEF-HIPE-2020/tree/master/data
- Salient information from news articles and tweets - http://data.crowdtruth.org/salience-news-tweets/
- The Upworthy Research Archive dataset of headline A/B tests conducted by Upworthy from early 2013 into April 2015 - https://upworthy.natematias.com/about-the-archive
- Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts - https://github.com/NUSTM/ECPE More unified version
- Aspect Based Sentiment Analysis (Twitter, Product Reviews) - https://github.com/NUSTM/ABSC/tree/master/data/absa
- Few-shot relation extraction corpus - https://thunlp.github.io/1/fewrel1.html
- SemEval 2010 Task 8 - Multi-Way Classification of Semantic Relations Between Pairs of Nominals - https://drive.google.com/file/d/0B_jQiLugGTAkMDQ5ZjZiMTUtMzQ1Yy00YWNmLWJlZDYtOWY1ZDMwY2U4YjFk/view?sort=name&layout=list&num=50
- Wordnet Annotated Corpora - http://globalwordnet.org/resources/wordnet-annotated-corpora/
- (RC)^2 dataset for Review Conversational Reading Comprehension (RCRC) - data repo
- Review Reading Comprehension (RRC) - data repo
- Complementary Entity Recognition (CER) - QA large PCQA Reviews
- A Multilingual Multi-Target Dataset for Stance Detection - https://github.com/ZurichNLP/xstance
- Argument Mining manual annotation of judgments of the European Court on Human Rights - https://github.com/PLN-FaMAF/ArgumentMiningECHR
- SemCor and Masc documents annotated with NOAD (New Oxford American Dictionary) word senses - https://github.com/google-research-datasets/word_sense_disambigation_corpora
- Polish NLP datasets (NER, IE, WSD) - http://poleval.pl/tasks/
- PoKi: A Large Dataset of Poems by Children (divided by grade and gender inferred from name) - https://github.com/whipson/PoKi-Poems-by-Kids
- Privacy policies of US companies judged by legal experts - https://github.com/ansgarw/privacy
- OpenSubtitles - subtitles in various languages - http://opus.nlpl.eu/OpenSubtitles-v2018.php
- OPUS, the open parallel corpus - http://opus.nlpl.eu/index.php
- A Dataset of Petitions from Avaaz.org - https://dataverse.mpi-sws.org/dataset.xhtml?persistentId=doi:10.5072/FK2/CUSKCS&version=1.0
- NLP Datasets on Indian Languages - https://github.com/indicnlpweb/indicnlp_catalog
- A Survey and Experiments on Annotated Corpora for Emotion Classification in Text - https://github.com/sarnthil/unify-emotion-datasets/tree/master/datasets
- Github Typo Corpus (A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors) - https://github.com/mhagiwara/github-typo-corpus
- Sentence Alignment in Text Simplification (Wiki and Newsela) - https://github.com/chaojiang06/wiki-auto
- Media Frame Corpus - https://github.com/dallascard/media_frames_corpus
- DAWT: Densely Annotated Wikipedia Texts across multiple languages - https://github.com/klout/opendata/blob/master/wiki_annotation/README.md
- News corpus of police killings: sentence-segmented, mention-level, distantly labeled data used in experiments - http://slanglab.cs.umass.edu/PoliceKillingsExtraction/
- Biographical Structure in Text (Wikipedia event data, inferred gender and date of birth) - http://www.cs.cmu.edu/~ark/bio/
- NER for South and South East Asian Languages (Hindi, Bengali, Oriya, Telugu, Urdu) - http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
- Summarization datasets - https://github.com/recitalAI/summarizing_summarization
- Same Side Stance Classification - https://events.webis.de/sameside-19/#task
- Information Extraction from Chemical Patents - https://chemu-patent-ie.github.io/
- Open Advancement of Question Answering Systems - https://oaqa.github.io/
- Medical term similarity datasets based on SNOMED-CT. - https://github.com/babylonhealth/medisim
- Large-Scale Multi-Label Text Classification on EU Legislation - EURLEX57K - http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/index.html Paper
- O*NET® 25.0 Database - Jobtitle, Job Description to Job Codes - https://www.onetcenter.org/database.html#individual-files https://github.com/afshinrahimi/jobdescription2jobtitle
- WikiUMLS: Aligning UMLS to Wikipedia - https://github.com/afshinrahimi/wikiumls
- Microsoft Research Paraphrase Corpus - https://www.microsoft.com/en-us/download/details.aspx?id=52398
- Microsoft Research Paraphrase Phrase Tables - https://www.microsoft.com/en-us/download/details.aspx?id=52536
- SimplePPDB++ paraphrases with readability scores - https://github.com/mounicam/lexical_simplification/tree/master/SimplePPDBpp
- Word Complexity Lexicon - https://github.com/mounicam/lexical_simplification/tree/master/word_complexity_lexicon
- Multilingual paraphrase corpus - http://paraphrase.org/
- chakki's Aspect-Based Sentiment Analysis dataset - https://github.com/chakki-works/chABSA-dataset
- Sentiment Analysis in Russian - https://github.com/sismetanin/sentiment-analysis-in-russian
- Similarity and relatedness dataset for Wikipedia entities (WikiSRS) - https://slate.cse.ohio-state.edu/WikiSRS/
- Argumentation corpora - http://corpora.aifdb.org/
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos - https://github.com/frankxu2004/cooking-procedural-extraction
- End to End Entity Linking datasets (WebQSPEL and GraphQEL) - http://dl.fbaipublicfiles.com/elq/EL4QA_data.tar.gz Blink paper
- Entity Typing Dataset and WikilinksNED Unseen-Mentions - https://github.com/yasumasaonoe/ET4EL
- Zero Shot Entity Linking dataset using Wikia - https://github.com/lajanugen/zeshel
- Norwegian NER - https://github.com/ljos/navnkjenner
- Case Law Project (US Cases Open Text as well as Case Citation Networks) - https://case.law/download/
- Multilingual LibriSpeech (MLS) - A large multilingual corpus derived from LibriVox audiobooks - http://www.openslr.org/94/
- Event prediction from WikiHow - https://github.com/daiquocnguyen/EventPrediction
- IBM Debating data - https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
- T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples - https://github.com/hadyelsahar/RE-NLG-Dataset
- Hindi Translated datasets - http://www.cfilt.iitb.ac.in/iitb_parallel/
- Ontonotes CoNLL format data - https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO
- Japanese NLP datasets for multiple tasks - http://nlp.ist.i.kyoto-u.ac.jp/EN/?NLPresources#d40a0717
- Japanese parallel text data - http://phontron.com/japanese-translation-data.php?lang=en
- Ted transcripts - https://wit3.fbk.eu/
- WikiConv - Wikipedia Talk Page conversations in 5 languages - https://github.com/conversationai/wikidetox/tree/master/wikiconv
- NER on historic newspaper text - https://github.com/dbmdz/historic-ner
- Persuasion Techniques Annotation - https://propaganda.math.unipd.it/index.html
- Temporal Privacy Policy dataset - https://github.com/citp/privacy-policy-historical
- Groningen Meaning Bank (GMB) Public Domain English NER + other tags - https://gmb.let.rug.nl/
- Parallel Meaning Bank (PMB) - https://pmb.let.rug.nl/
- Finnish NLP corpora - https://www.kielipankki.fi/corpora/
- Finnish English Emotion Annotation Movie dialogues from OPUS - https://github.com/Helsinki-NLP/XED
- Event coref bank data - http://www.newsreader-project.eu/results/data/the-ecb-corpus/
- Newsreader project data (Wikinews) - http://www.newsreader-project.eu/results/data/
- NER transliteration dataset in multiple languages - http://workshop.colips.org/news2018/dataset.html
- Various paraphrase corpora - https://github.com/wasiahmad/paraphrase_identification/tree/master/dataset
- Keyphrase Generation datasets - https://github.com/wasiahmad/NeuralKpGen/blob/master/data/README.md
- Corpus of Russian documents (for LM training) - https://ruscorpora.ru/new/en/index.html
- Dakshina corpus of South Asian languages in Latin Script - https://github.com/google-research-datasets/dakshina
- Economic Sentiment dataset - https://github.com/vanatteveldt/ecosent
- Entity Linking datasets - https://github.com/dice-group/gerbil/wiki/Licences-for-datasets
- Multilingual Political Scaling - https://bitbucket.org/gg42554/cl-scaling/src/master/
- Entity Aspect Linking - https://federiconanni.com/eal-d/
- TREC Complex Answer Retrieval - http://trec-car.cs.unh.edu/
- Multilingual Text Similarity - https://bitbucket.org/gg42554/cl-sts/src/master/data/
- TREC News (Washington Post related documents and wikification) - http://trec-news.org/
- CrossNER - Multi Domain NER data - https://github.com/zliucr/CrossNER
- Geoparsing datasets - https://github.com/milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation
- ParCOR - Parallel EN-DE coreference corpus - https://github.com/chardmeier/parcor-full
- DWIE (Deutsche Welle corpus for Information Extraction) for document-level multi-task Information Extraction (IE) (NER, NED, Coref, RelEx) - https://github.com/klimzaporojets/DWIE/
- PerSenT - paragraph-level entity-centric sentiment - https://stonybrooknlp.github.io/PerSenT/
- Targeted Sentiment Analysis - https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Targeted%20Sentiment%20Analysis
- African Languages NER - https://github.com/masakhane-io/masakhane-ner
- Political Ads - http://lig-membres.imag.fr/gogao/www21.html
- ProPublica political ads - https://www.propublica.org/datastore/dataset/political-advertisements-from-facebook
- Speech NER - https://github.com/mdredze/speech_ner_entity_linking_data
- Russian Corpora 20+ datasets - https://github.com/natasha/corus#reference
- WNED Corpora as reported in Paper footnote 7 - https://www.dropbox.com/s/987hmjdoq0cql9z/WNED.tar.gz
- Deep-ED Entity Disambiguation Dataset - https://github.com/dalab/deep-ed - https://drive.google.com/uc?id=0Bx8d3azIm_ZcbHMtVmRVc1o5TWM&export=download
- Large scale dataset of multi-lingual aligned NER annotations from common crawl - http://data.statmt.org/xlent/
- ViralTexts project - identify why old newspaper text went viral - https://viraltexts.org/
- Multimodal Knowledge Graph Completion - https://public.ukp.informatik.tu-darmstadt.de/starsem18-multimodalKB/ Code
- Entity Linking on Question Answering Data - https://github.com/UKPLab/starsem2018-entity-linking
- WebQuestions Semantic Parses Dataset - https://www.microsoft.com/en-us/download/details.aspx?id=52763 Code
- WebQuestions Full dataset - https://nlp.stanford.edu/software/sempre/
- Question Answering over Linked Data - https://project-hobbit.eu/challenges/qald2017/
- Romanian language datasets - https://github.com/eemlcommunity/ro_benchmark_leaderboard
- Entity Linking datasets - https://github.com/kermitt2/entity-fishing/tree/master/data
- GDELT webngrams - https://blog.gdeltproject.org/announcing-the-web-news-ngram-datasets-web-ngram/
- GDELT Global Similarity Graph (television news sentence embeddings) - https://blog.gdeltproject.org/announcing-the-global-similarity-graph-television-news-sentence-embeddings-using-the-universal-sentence-encoder/
- Web Data Commons - Schema.org Table Corpus - http://webdatacommons.org/structureddata/schemaorgtables/
- QuoteBank A corpus of quotations from a decade of news - https://zenodo.org/record/4277311
- Indian Court Judgements annotated with Gender - https://www.devdatalab.org/judicial-data
- Sentiment in Firm Risk Reports - https://www.firmlevelrisk.com/home
- Terms of services tracked over time from various websites - https://github.com/ambanum/OpenTermsArchive-versions
- News Haiku Dataset - https://www.kaggle.com/newshaikus/dataset/version/3
- India Police Events about Gujarat 2002 riots - https://github.com/slanglab/IndiaPoliceEvents
- Wikipedia Entity Linking Editor Recommendations - Code Datasets Paper
- WikiCheck: Wikipedia based Fact Checking - https://github.com/trokhymovych/WikiCheck Paper
- AIDA Entity Linking (Mapped to Wikidata) - https://github.com/aaaton/herd/tree/master/src/main/resources/evaluation
- CMU Movie Summary Corpus - https://github.com/ucinlp/GenderQuant/tree/master/data/raw/summaries
- MS Marco Keyphrase Extraction - https://microsoft.github.io/msmarco/
- Keyphrase Extraction datasets - https://github.com/memray/OpenNMT-kpg-release
- JTubeSpeech: Corpus of speech collected from YouTube - https://github.com/sarulab-speech/jtubespeech
- WikiNews - Annotated at multiple levels - http://www.newsreader-project.eu/results/data/wikinews/
- Linked Hypernyms Dataset: entity articles in English, German and Dutch Wikipedia linked to DBpedia - https://ner.vse.cz/datasets/linkedhypernyms/
- Entity Linking in Queries Resources - http://hasibi.com/resources/
- WNUT 2020 NER on wet lab protocol data - https://github.com/jeniyat/WNUT_2020_NER
- WNUT 2020 Relation Extraction on wet lab protocol data - https://github.com/jeniyat/WNUT_2020_RE
- Stack Overflow NER - https://github.com/jeniyat/StackOverflowNER
- Kensho derived wikimedia data [Wikipedia + Wikidata] - https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data
- Abstract Meaning Representation Corpus - https://github.com/IBM/transition-amr-parser/tree/master
- NER on Material Science Papers - https://github.com/olivettigroup/annotated-materials-syntheses
- Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers - https://zenodo.org/record/4573313#.Yd_Ssb3MJ3g Annotation guidelines
- Webis Query Interpretation Corpus 2022 (Webis-QInC-22) - https://zenodo.org/record/5820673#.Yd_Tor3MJ3g
- Webis Query Spelling Corpus 2017 (Webis-QSpell-17) - https://zenodo.org/record/3570912
- Webis-WebSeg-20 - 42,450 crowdsourced segmentations for 8,490 web pages from the Webis-Web-Archive-17 - https://zenodo.org/record/3988124#.Yd_UNL3MJ3g
- Webis Abstractive Snippet Corpus 2020 - More than 10 million pairs / 3.5 million pairs were collected. - https://zenodo.org/record/3653834
- Webis TripAdvisor Corpus 2014 (Webis-Tripad-14) - includes user meta-data - https://zenodo.org/record/3266882
- Webis Query Segmentation Corpus 2010 (Webis-QSeC-10) - https://zenodo.org/record/3256198
- Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) - https://zenodo.org/record/3251771
- Webis Cross-Lingual Sentiment Dataset 2010 (Webis-CLS-10) - https://zenodo.org/record/3251672
- Webis-Revenue-10 (Entity Linking for Revenue statements) - https://zenodo.org/record/3257461
- Webis-Debate-16 (A collection of phrases classified as argumentative or non-argumentative) - https://zenodo.org/record/3251804
- Same Side Stance Classification Resampled Datasets - https://zenodo.org/record/5380989#.Yd_XGr3MJ3g
- Benchmark for the evaluation of Named Entity Linking over ancient documents - https://zenodo.org/record/3490333#.Yd_Xib3MJ3g
- HOME-Alcar (Aligned and Annotated Cartularies) corpus (to train Handwritten Text Recognition (HTR) and Named Entity Recognition (NER)) - https://zenodo.org/record/5600884#.Yd_Ylr3MJ3g
- TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 - https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/LCPM63 - English - Chinese Spanish
- E3C Disorder Entity Recognizer - Multilingual - https://live.european-language-grid.eu/catalogue/tool-service/9283
- Effective Crowdsourcing of Multiple Tasks for Comprehensive Information Extraction (NER, NEL, REL) - https://figshare.com/articles/dataset/Effective_Crowdsourcing_of_Multiple_Tasks_for_Comprehensive_Information_Extraction/7935185
- Query Expansion benchmarks - https://github.com/hosseinfani/ReQue/
- DBPedia-Entity: benchmark for entity query relevance - https://iai-group.github.io/DBpedia-Entity/
- Korean NER - https://github.com/kmounlp/NER
- Few-NERD - Not only a Few-shot NER dataset - https://ningding97.github.io/fewnerd/
- Improving Named Entity Recognition in Noisy User-generated Text with Local Distance Neighbor Feature - https://data.mendeley.com/datasets/nsfdt6m47j/1
- Finnish NER - https://turkunlp.org/turku-ner-corpus
- Romanian NER - https://github.com/dumitrescustefan/ronec
- Bulgarian NER - https://github.com/usmiva/bg-ner
- Hungarian NER - https://github.com/nytud/NYTK-NerKor
- MIM-GOLD-NER – Icelandic named entity recognition corpus - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/42
- Arabic Spanish, Arabic English, and English Spanish Parallel Corpus - http://www.lllf.uam.es/ING/Arabe_espa%C3%B1ol.html
- Dataset of code-switched datasets - https://ritual.uh.edu/lince/datasets
- NE3L named entities Arabic corpus - http://catalog.elra.info/en-us/repository/browse/ELRA-W0078/
- DISRPT/sharedtask2021 - Discourse Unit Segmentation, Connective Detection and Discourse Relation Classification - https://github.com/disrpt/sharedtask2021
- NER For Entity Linking - https://github.com/Babelscape/ner4el
- Tracking Knowledge Propagation Across Wikipedia Languages - https://zenodo.org/record/4433137
- Hate Speech Data Catalogue - https://github.com/leondz/hatespeechdata
- WikiDataSets - Topic specific subgraphs in Wikidata - https://graphs.telecom-paris.fr/Home_page.html
- Multilingual Reply Suggestion (MRS) - https://github.com/zhangmozhi/mrs
- African NLP Datasets - https://github.com/Andrews2017/africanlp-public-datasets
- TREC 2022 MS Marco Deep Learning Track - https://microsoft.github.io/msmarco/TREC-Deep-Learning.html
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging - https://github.com/xgeric/UCPhrase-exp
- Mr. TyDi is a multi-lingual benchmark dataset for mono-lingual retrieval - https://github.com/castorini/mr.tydi
- The Upworthy Research Archive - https://osf.io/jd64p/
- ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other - https://redialdata.github.io/website/
- Emotion Cause Pair Extraction - https://github.com/NUSTM/ECPE
- Cline Center Coup d'État Project dataset (coups d'état are important events in the life of a country) - https://databank.illinois.edu/datasets/IDB-5672473
- DyGIE++: Entity, Relation, and Event Extraction with Contextualized Span Representations - https://github.com/dwadden/dygiepp
- A global and multi-lingual computational linguistic atlas - http://www.earthlings.io/download_cglu.html
- Entity-Switched Datasets: An Approach to Auditing the In-Domain Robustness of Named Entity Recognition Models - https://github.com/oagarwal/entity-switched-ner
- MATINF - Multitask Chinese NLP Dataset - https://github.com/WHUIR/MATINF
- Wordbank: An open database of children's vocabulary development - http://wordbank.stanford.edu/ Book
- ArtEmis: Affective Language for Visual Art dataset - https://www.artemisdataset.org/
- In-group bias in the Indian judiciary - https://www.devdatalab.org/judicial-data Code
- SigTyp 2021: Shared task on predicting language IDs from speech - https://sigtyp.github.io/st2021.html
- IR ranking datasets - https://ir-datasets.com/
- ShadowLink: entity disambiguation evaluation on overshadowed entities - https://zenodo.org/record/5196175
- Named Entity Recognition systems for 11 languages - https://multiconer.github.io/
- Wikipedia - Image/Caption Matching - https://www.kaggle.com/c/wikipedia-image-caption
- Silver Data Creation for Multilingual NER - https://github.com/Babelscape/wikineural
- Shared Task on Named Entity Transliteration - http://workshop.colips.org/news2018/dataset.html
- ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization - https://github.com/krystalan/ClidSum
- SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary - https://github.com/krystalan/SportsSum2.0
- Knowledge Enhanced Sports Game Summarization - https://github.com/krystalan/K-SportsSum
- Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0 - https://www.clarin.si/repository/xmlui/handle/11356/1403
- 24sata Croatian news article archive 1.0 - https://www.clarin.si/repository/xmlui/handle/11356/1410 - Comments data - https://www.clarin.si/repository/xmlui/handle/11356/1399
- TermFrame: Terms, definitions and semantic annotations for karstology (English, Slovenian, Croatian) - https://www.clarin.si/repository/xmlui/handle/11356/1463
- HuffPost News Category Dataset - https://www.kaggle.com/rmisra/news-category-dataset
- OpenKP Keyphrase Extraction Dataset - https://github.com/microsoft/OpenKP
- All Digitized Texas Appeals Court Cases Since 1900 - https://www.kaggle.com/judyrecords/all-digitized-texas-appeals-court-cases-since-1900 - https://www.judyrecords.com/info
- Richpedia: A Comprehensive Multi-Modal Knowledge Graph - https://github.com/wangmengsd/richpedia
- Entity Matching Deepmatcher Datasets (multiple domains) - https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md
- Multilingual GeoQuery: A multilingual dataset for Geoquery. Each instance is a sentence annotated with its meaning representations - https://github.com/statnlp-research/statnlp-datasets
- Better Modeling of Incomplete Annotation for Named Entity Recognition - https://github.com/allanj/ner_incomplete_annotation
- Chinese Address Parsing - https://github.com/leodotnet/neural-chinese-address-parsing/
- Distantly Supervised NER - https://github.com/zwkatgithub/DSCAU
- Distantly Supervised NER - https://github.com/cliang1453/BOND/tree/master/dataset
- PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction - https://figshare.com/articles/dataset/PROCAT_Product_Catalogue_Dataset_for_Implicit_Clustering_Permutation_Learning_and_Structure_Prediction/14709507
- Data, submissions, and intermediate files from TempEval-3 held in 2013 - https://figshare.com/articles/dataset/TempEval-3_data/9586532
- WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles - http://rali.iro.umontreal.ca/rali/?q=en/wikicoref
- WiNER: Coarse Named Entities in Wikipedia and WiFiNE: Transforming Wikipedia into a Large-Scale Fine-Grained Entity Type Corpus - http://rali.iro.umontreal.ca/rali/en/wikipedia-main-concept
- Named Entity Recognition for Entity Linking: What Works and What's Next - https://github.com/Babelscape/ner4el
- Timeline Summarization - http://www.l3s.de/~gtran/timeline/
- News Timeline Summarization - https://github.com/complementizer/news-tls
- Wikipedia Current Events Portal (WCEP) + Common Crawl Dataset - https://github.com/complementizer/wcep-mds-dataset
- NELA-Local: A Dataset of U.S. Local News Articles for the Study of County-level News Ecosystems - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GFE66K
- NELA-GT-2021: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RBKVBM
- Music Dataset: Lyrics and Metadata from 1950 to 2019 - https://data.mendeley.com/datasets/3t9vbwxgr5/2
- DL-HARD: Annotated Deep Learning Dataset For Passage and Document Retrieval - https://github.com/grill-lab/DL-HARD
- LexGLUE: A Benchmark Dataset for Legal Language Understanding in English - https://github.com/coastalcph/lex-glue
- PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them - https://github.com/facebookresearch/PAQ
- IteraTeR: Understanding Iterative Revision from Human-Written Text - https://github.com/vipulraheja/IteraTeR
- Multilingual Named Entity Recognition (NER) Datasets with Gazetteer - https://registry.opendata.aws/code-mixed-ner/
- Low Context Named Entity Recognition (NER) Datasets with Gazetteer - https://registry.opendata.aws/lowcontext-ner-gaz/
- MultiCoNER Dataset - https://registry.opendata.aws/multiconer/
- The Massively Multilingual Image Dataset (MMID) - http://multilingual-images.org/
- ZEST: ZEroShot learning from Task descriptions - https://registry.opendata.aws/allenai-zest/
- EMBEDDIA Cross-Lingual Embeddings for Less-Represented Languages in European News Media - http://embeddia.eu/outputs/
- Ekspress Meedia news archive (c. 1.4M articles in Estonian and Russian): http://hdl.handle.net/11356/1408
- Latvian Delfi Article Archive (c. 180k articles in Latvian and Russian): http://hdl.handle.net/11356/1409
- Styria 24sata news archive (c. 650k articles in Croatian): http://hdl.handle.net/11356/1410
- STT news archive (c. 2.8M articles in Finnish): http://urn.fi/urn:nbn:fi:lb-2019041501
- Ekspress Meedia Comment Archive (c. 31M comments in Estonian and Russian): http://hdl.handle.net/11356/1401
- Latvian Delfi Comment Archive (c. 12M comments in Latvian and Russian): http://hdl.handle.net/11356/1407
- Styria 24sata Comment Archive (c. 20M comments in Croatian): http://hdl.handle.net/11356/1399
- Multi-lingual culture-independent word analogy dataset: http://hdl.handle.net/11356/1261
- CoSimLex context-dependent similarity dataset: http://hdl.handle.net/11356/1308
- Slovenian SimLex dataset: http://hdl.handle.net/11356/1309
- Keyword extraction datasets for Croatian, Estonian, Latvian & Russian: http://hdl.handle.net/11356/1403
- Information Retrieval Datasets - https://github.com/irgroup/datasets
- Appraisal enISEAR dataset: A reannotation of the enISEAR corpus with Cognitive Appraisal - https://github.com/bluzukk/appraisal-emotion-classification
- Universal Anaphora Data Repositories - https://github.com/UniversalAnaphora/UniversalAnaphora/blob/main/data/data.md
- WANDS is a Wayfair product search relevance dataset - https://github.com/wayfair/WANDS
- 3rd Shared Task on SlavNER Recognition, Normalization, Classification and Cross-lingual linking of Named Entities in Slavic Languages - http://bsnlp.cs.helsinki.fi/shared-task.html
- KIND (Kessler Italian Named-entities Dataset) - https://github.com/dhfbk/KIND
- Monolingual and Cross-Lingual Acceptability Judgments with the Italian Corpus of Linguistic Acceptability (ItaCoLA) - https://github.com/dhfbk/ItaCoLA-dataset
- Knowledge Base Construction from Pre-trained Language Models (LM-KBC) - https://lm-kbc.github.io/
- The Shared Task on Understanding Figurative Language - https://codalab.lisn.upsaclay.fr/competitions/5908 - https://huggingface.co/datasets/ColumbiaNLP/FigLang2022SharedTask
- Euphemism Detection Shared Task - https://codalab.lisn.upsaclay.fr/competitions/5726
- Multimodal Emotion Datasets - https://github.com/A2Zadeh/CMU-MultimodalSDK
- DEAL: Detecting Entities in the Astrophysics Literature - https://ui.adsabs.harvard.edu/WIESP/2022/SharedTasks
- CodRED: A Cross-Document Relation Extraction Dataset for Acquiring Knowledge in the Wild - https://codalab.lisn.upsaclay.fr/competitions/3770
- Code-mixed Machine Translation (MixMT) - https://codalab.lisn.upsaclay.fr/competitions/2861
- Shared Task on Customized Chat Grounding Persona and Knowledge - https://codalab.lisn.upsaclay.fr/competitions/3754
- Causal News Corpus - https://codalab.lisn.upsaclay.fr/competitions/2299
- WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types - https://github.com/wangxw5/wikiDiverse
- Time-Aware Language Models as Temporal Knowledge Bases - https://github.com/google-research/language/tree/master/language/templama
- GoodNewsEveryone: A corpus of news headlines annotated with emotions, semantic roles, and reader perception - https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/goodnewseveryone/
- MELD: Multimodal EmotionLines Dataset: A dataset for Emotion Recognition in Multiparty Conversations - https://affective-meld.github.io/
- Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations - https://github.com/LCS2-IIITD/MSH-COMICS
- NewsClaims: A New Benchmark for Claim Detection from News with Background Knowledge - https://github.com/blender-nlp/NewsClaims
- Relationship and Entity Extraction Evaluation Dataset - https://github.com/dstl/re3d
- xNED - Basque dataset - https://github.com/anderbarrena/xNED
- HIPE – Identifying Historical People, Places and other Entities Shared Task on Named Entity Recognition and Linking in Multilingual Historical Documents - https://hipe-eval.github.io/HIPE-2022/ - Zenodo 2022 - 2020 - Zenodo 2020
- MIM-GOLD-EL: an Icelandic Entity Linking (EL) corpus - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/168
- MIM-GOLD-NER: Icelandic named entity (NE) corpus - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/230
- MIM-GOLD: gold standard for PoS-tagging and lemmatizing Icelandic texts - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/113
- MS MARCO entity annotations and disambiguations - https://github.com/informagi/mmead
- ConEL-2: Conversational Entity Linking Dataset - https://github.com/informagi/conversational-entity-linking-2022 - Data
- Wizard of Wikipedia: Knowledge-Powered Conversational Agents - https://parl.ai/projects/wizard_of_wikipedia/
- A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs - https://github.com/nju-websoft/OpenEA#dataset-overview
- DBP1M: LargeEA: Aligning Entities for Large-scale Knowledge Graphs - https://github.com/ZJU-DAILY/LargeEA
- Knowledge Graph Embedding, Entity Typing, and Entity Alignment Task Datasets - https://github.com/nju-websoft/muKG#datasets-hub-
- Nested-NER - https://github.com/bplank/nested-ner
- 🐺 COYO-700M: Image-Text Pair Dataset - https://github.com/kakaobrain/coyo-dataset
- KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding - https://github.com/kakaobrain/KorNLUDatasets
- Jejueo Datasets for Machine Translation and Speech Synthesis - https://github.com/kakaobrain/jejueo
- MIntRec: A New Dataset for Multimodal Intent Recognition - https://github.com/thuiar/MIntRec
- CH-SIMS v2.0: A Fine-grained Multi-label Chinese Multimodal Sentiment Analysis Dataset - https://thuiar.github.io/sims.github.io/chsims
- Multimodal Sentiment Analysis Datasets - https://github.com/thuiar/AWESOME-MSA#related-datasets
- Contract Understanding Atticus Dataset (CUAD): A dataset of legal contracts with rich expert annotations - https://www.atticusprojectai.org/cuad - Zenodo
- FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information - https://fever.ai/dataset/feverous.html
- Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks (MT EN-RU from Global Voices and Reddit) - https://github.com/Shifts-Project/shifts/tree/main/translation
- Multilingual Spoken Words - https://mlcommons.org/en/multilingual-spoken-words/
- People’s Speech Dataset: the world’s largest English speech recognition corpus - https://mlcommons.org/en/peoples-speech/
- Time-Sensitive Question Answering dataset - https://github.com/wenhuchen/Time-Sensitive-QA
- Sentence Keywords dataset (5K) - https://github.com/naister/Keyword-OpenSource-Data
- Bloom Library - https://bloomlibrary.org/ HuggingFace Hub
- Chinese NER dataset - https://github.com/PKUnlp-icler/SCL-RAI
- PreCo is a large-scale English dataset for coreference resolution from pre-schoolers - https://preschool-lab.github.io/PreCo/
- The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts - https://github.com/Priya22/pdnc-lrec2022
- CrossRE: A Cross-Domain Dataset for Relation Extraction - https://github.com/mainlp/CrossRE
- LongtoNotes: OntoNotes with Longer Coreference Chains - https://github.com/kumar-shridhar/LongtoNotes
- Intent Classification Datasets - https://github.com/kumar-shridhar/Know-Your-Intent
- MLQA (MultiLingual Question Answering) - https://github.com/facebookresearch/MLQA
- STAPLE: Simultaneous Translation And Paraphrase for Language Education - https://sharedtask.duolingo.com/ dataverse
- Spaced Repetition Model for Language Learning - https://github.com/duolingo/halflife-regression
- reStructured Pretraining datasets - https://github.com/ExpressAI/reStructured-Pretraining#download-restructured-signals
- PanLex Database - https://panlex.org/snapshot/ Model Extension
- GeNER (an automated dataset Generation framework for NER) - https://github.com/dmis-lab/GeNER
- COMETA: the corpus of online medical entities - https://github.com/cambridgeltl/cometa
- Pivot-based Entity Linking (54 training langs, 9 test langs) - https://github.com/shrutirij/pivot-based-entity-linking
- Wikipedia Wikidata Relation Extraction (Context-Aware Representations for Knowledge Base Relation Extraction) - https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2776
- Hinglish-TOP Dataset (Semantic Parsing) - https://github.com/google-research-datasets/Hinglish-TOP-Dataset
- CoNLL++ (CoNLL 03 EN NER Fixed dataset) - https://github.com/ZihanWangKi/CrossWeigh
Shared tasks, competitions
- IN PURSUIT OF HAPPINESS - https://sites.google.com/view/affcon2019/cl-aff-shared-task
- NIST TAC Knowledge Base Population (KBP2019) Entity Discovery and Linking Track - http://nlp.cs.rpi.edu/kbp/2019/index.html
- Multilingual Surface Realization Shared Task (SR'19) - http://taln.upf.edu/pages/msr2019-ws/SRST.html#data
- SHARED TASK ON FINE-GRAINED PROPAGANDA DETECTION @NLP4IF 2019 - https://propaganda.qcri.org/nlp4if-shared-task/
- PolEval - http://poleval.pl/tasks/
- GERMEVAL - German NLP shared tasks - https://projects.fzai.h-da.de/iggsa/germeval/
- Hobbit: Holistic Benchmarking of Big Linked Data - https://project-hobbit.eu/
- CLEF Shared tasks - http://clef2021.clef-initiative.eu/index.php?page=Pages/labs.html
- SemEval shared tasks - https://semeval.github.io/
- TREC shared tasks - https://trec.nist.gov/pubs/call2021.html
- Shared task on implicit and underspecified language - https://unimplicit.github.io/
Research Groups
- https://supernlp.github.io/ - University of Copenhagen
- http://slanglab.cs.umass.edu/ - Statistical Social Language Analysis - Brendan O'Connor
- https://nlp.stanford.edu/ - Stanford
- http://nlp.seas.harvard.edu/ - Harvard
- https://mr.cs.ucl.ac.uk/ - University College London
- http://nlpg.itk.ppke.hu/projects - Pázmány Péter Catholic University
- http://www.ark.cs.washington.edu/ - Noah Smith NLP Lab at University of Washington
- https://www.cs.washington.edu/research/nlp - University of Washington
- http://wiki.clsp.jhu.edu/view/NLP_Reading_Group - Johns Hopkins University
- https://github.com/thunlp - Natural Language Processing Lab at Tsinghua University
- https://github.com/OSU-slatelab - OSU Speech and Language Technologies Laboratory
Mailing lists
- https://mailman.uib.no/public/corpora/ - Corpora mailing list