Information extraction from digital social trace data with applications to social media and scholarly communication data
- Author: Shubhanshu Mishra
- Contact: @TheShubhanshu
- Defended: 24 June 2019
- URL:
Information extraction (IE) aims to extract structured data from unstructured or semi-structured data. The thesis starts by identifying social media data and scholarly communication data as special cases of digital social trace data (DSTD). This identification allows us to use the graph structure of the data (e.g., user connected to a tweet, author connected to a paper, author connected to other authors) to develop new information extraction tasks. The thesis focuses on information extraction from DSTD, first using only the text of tweets and scholarly paper abstracts, and then using the full graph structure of Twitter and scholarly communication datasets. This thesis makes three major contributions.
First, new IE tasks based on the DSTD representation of the data are introduced. For scholarly communication data, methods are developed to identify article-level and author-level novelty and expertise. Furthermore, interfaces for examining the extracted information are introduced. A social communication temporal graph (SCTG) is introduced for comparing different kinds of communication data, such as tweets tagged with sentiment, tweets about a search query, and Facebook group posts. For social media, new text classification categories are introduced, with the aim of identifying enthusiastic and supportive users via their tweets. Additionally, the correlation between sentiment classes and Twitter meta-data in public corpora is analyzed, leading to the development of a better model for sentiment classification.
Second, methods are introduced for extracting information from social media and scholarly data. For scholarly data, a semi-automatic method is introduced for constructing a large-scale taxonomy of computer science concepts. The method relies on the Wikipedia category tree. The constructed taxonomy is used to identify key computer science phrases in scholarly papers and to track their evolution over time. Similarly, for social media data, machine learning models based on human-in-the-loop learning, semi-supervised learning, and multi-task learning are introduced for identifying sentiment, named entities, part-of-speech tags, phrase chunks, and super-sense tags. The machine learning models are developed with a focus on leveraging all available data. The multi-task models presented here perform competitively with other methods on most of the tasks while reducing inference-time computational costs.
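As a rough illustration of the taxonomy idea described above, the following sketch walks a small category tree breadth-first, skips branches that a reviewer has marked as off-topic, and then looks up the resulting concept names in a piece of text. It is a minimal sketch under stated assumptions, not the method used in the thesis: the toy `CATEGORY_CHILDREN` data, the function names `build_taxonomy` and `find_concepts`, the depth limit, and the simple substring matching are all hypothetical choices made for illustration.

```python
from collections import deque

# Toy stand-in for a category tree (hypothetical data, not real Wikipedia output).
CATEGORY_CHILDREN = {
    "Computer science": ["Algorithms", "Machine learning", "Computer scientists"],
    "Algorithms": ["Sorting algorithms", "Graph algorithms"],
    "Machine learning": ["Neural networks", "Semi-supervised learning"],
    "Computer scientists": ["Turing Award laureates"],
}


def build_taxonomy(root, children, max_depth, excluded=frozenset()):
    """Breadth-first walk from root; returns {category: (parent, depth)}.

    `excluded` models the manual part of a semi-automatic process:
    categories a reviewer has decided to prune, along with their subtrees.
    """
    taxonomy = {root: (None, 0)}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        depth = taxonomy[node][1]
        if depth >= max_depth:
            continue  # do not expand below the depth limit
        for child in children.get(node, []):
            if child in excluded or child in taxonomy:
                continue  # pruned by the reviewer, or already reached
            taxonomy[child] = (node, depth + 1)
            queue.append(child)
    return taxonomy


def find_concepts(text, taxonomy):
    """Return taxonomy entries whose names occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(name for name in taxonomy if name.lower() in lowered)


taxonomy = build_taxonomy(
    "Computer science",
    CATEGORY_CHILDREN,
    max_depth=2,
    excluded={"Computer scientists"},
)
abstract = "We combine semi-supervised learning with graph algorithms."
concepts = find_concepts(abstract, taxonomy)
print(concepts)
```

Storing the parent and depth with each concept keeps the hierarchy available, so matched phrases can later be grouped by their broader category, for example when counting how often a concept family appears in papers from a given year.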
Finally, this thesis has resulted in the creation of multiple open-source tools and public data sets that can be used by the research community. The thesis aims to act as a bridge between the research questions and techniques applied to DSTD in different domains. The methods and tools presented here can help advance work in the areas of social media and scholarly data analysis. All resources related to this thesis are available at
Committee
- Associate Professor Jana Diesner, iSchool - Chair & Director of Research
- Associate Professor Vetle I. Torvik, iSchool
- Professor Karrie G. Karahalios, Computer Science
- Professor Robert J. Brunner, Accountancy
List of resources
- SAIL - For active human-in-the-loop learning
- GIMLI - For novelty of PubMed articles
- Legolas - For expertise of PubMed articles and authors
- SCTG - Social Communication Temporal Graph visualization tool
- SocialMediaIE - Project with multi-task learning and active human-in-the-loop learning models and tools
Thesis Citation
Mishra, Shubhanshu. 2020. “Information Extraction from Digital Social Trace Data with Applications to Social Media and Scholarly Communication Data.” University of Illinois at Urbana-Champaign.
Shubhanshu Mishra. 2021. Information extraction from digital social trace data with applications to social media and scholarly communication data. SIGWEB Newsl., Spring, Article 3 (Spring 2021), 4 pages.
Shubhanshu Mishra. 2021. Information extraction from digital social trace data with applications to social media and scholarly communication data. SIGIR Forum 54, 1, Article 17 (June 2020), 2 pages.
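@phdthesis{mishra2020thesis,
  author = {Mishra, Shubhanshu},
  school = {University of Illinois at Urbana-Champaign},
  title = {Information Extraction from Digital Social Trace Data with Applications to Social Media and Scholarly Communication Data},
  url = {{\_}thesis/},
  note = {\url{}},
  year = {2020}
}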
@article{mishra2021sigweb,
  title = {Information extraction from digital social trace data with applications to social media and scholarly communication data},
  copyright = {All rights reserved},
  issn = {1931-1745, 1931-1435},
  url = {},
  doi = {10.1145/3460304.3460307},
  language = {en},
  number = {Spring},
  urldate = {2022-01-28},
  journal = {ACM SIGWEB Newsletter},
  author = {Mishra, Shubhanshu},
  month = mar,
  year = {2021},
  pages = {1--4}
}
@article{mishra2021sigirforum,
  author = {Mishra, Shubhanshu},
  title = {Information Extraction from Digital Social Trace Data with Applications to Social Media and Scholarly Communication Data},
  year = {2021},
  issue_date = {June 2020},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {54},
  number = {1},
  issn = {0163-5840},
  url = {},
  doi = {10.1145/3451964.3451981},
  journal = {SIGIR Forum},
  month = {feb},
  articleno = {17},
  numpages = {2}
}
Related papers
- Mishra, Shubhanshu, and Jana Diesner. 2019. “Capturing Signals of Enthusiasm and Support Towards Social Issues from Twitter.” In Proceedings of the 5th International Workshop on Social Media World Sensors - SIdEWayS’19, 19–24. New York, New York, USA: ACM Press.
- Mishra, Shubhanshu. 2019. “Multi-Dataset-Multi-Task Neural Sequence Tagging for Information Extraction from Tweets.” In Proceedings of the 30th ACM Conference on Hypertext and Social Media - HT ’19, 283–84. New York, New York, USA: ACM Press.
- Mishra, Shubhanshu, Jana Diesner, Jason Byrne, and Elizabeth Surbeck. 2015. “Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization.” In Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15, 323–25. New York, New York, USA: ACM Press.
- Mishra, Shubhanshu, and Vetle I. Torvik. 2016. “Quantifying Conceptual Novelty in the Biomedical Literature.” D-Lib Magazine: The Magazine of the Digital Library Forum 22 (9–10).
- Mishra, Shubhanshu, and Jana Diesner. 2016. “Semi-Supervised Named Entity Recognition in Noisy-Text.” In Proceedings of the 2nd Workshop on Noisy User-Generated Text (WNUT), 203–12. Osaka, Japan: The COLING 2016 Organizing Committee.
- Mishra, Shubhanshu, Brent D. Fegley, Jana Diesner, and Vetle I. Torvik. 2018. “Expertise as an Aspect of Author Contributions.” In Workshop on Informetric and Scientometric Research (SIG/MET). Vancouver.
- Mishra, Shubhanshu, and Jana Diesner. 2018. “Detecting the Correlation between Sentiment and User-Level as Well as Text-Level Meta-Data from Benchmark Corpora.” In Proceedings of the 29th on Hypertext and Social Media - HT ’18, 2–10. New York, New York, USA: ACM Press.
- Mishra, Shubhanshu, Brent D. Fegley, Jana Diesner, and Vetle I. Torvik. 2018. “Self-Citation Is the Hallmark of Productive Authors, of Any Gender.” Edited by Niels O. Schiller. PLOS ONE 13 (9): e0195773.
- Mishra, Shubhanshu, Sneha Agarwal, Jinlong Guo, Kirstin Phelps, and Johna Picco. 2014. “SENTINETS: User Classification Based on Sentiment for Social Causes within a Twitter Network.” IDEALS UIUC.
- Mishra, Shubhanshu, and Vetle I. Torvik. 2016. “Measures of Novelty in Biomedical Literature.” Washington DC: IDEALS UIUC.
- Mishra, Shubhanshu, Sneha Agarwal, Jinlong Guo, Kirstin Phelps, Johna Picco, and Jana Diesner. 2014. “Enthusiasm and Support: Alternative Sentiment Classification for Social Movements on Social Media.” In Proceedings of the 2014 ACM Conference on Web Science - WebSci ’14, 261–62. Bloomington, Indiana, USA: ACM Press.