Research Summary:

I conduct fundamental, empirical and applied research at the nexus of data science, network analysis, natural language processing, machine learning and computational social science. My research agenda is driven by my belief that progress in data science requires interdisciplinary, mixed-methods approaches that enable the joint consideration of the structure and content of social interactions, and that assess both the impact of data provenance on research outcomes and the impact of information products on social agents. In my research lab, we develop computational solutions that are grounded in theories from the social sciences, humanities and linguistics. We bring these solutions into different application contexts to test their generalizability and to advance theory.

Area 1: Impact Assessment

What impact do information products have on people beyond what simple frequency-based metrics capture?

We have been developing, implementing, evaluating and applying a theory-driven, computational solution for assessing the impact of issue-focused information products on people, groups and society. For example, we built a theory-grounded framework and probabilistic prediction model for identifying different types of impact on individuals, such as changes in versus reinforcement of personal behavior, cognition and emotions...[Learn More]
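As a rough, hypothetical illustration of what a probabilistic prediction model over audience reactions could look like, the sketch below trains a multinomial logistic regression on TF-IDF features of comments. The label set, toy comments and model choice are assumptions for illustration only and do not reflect our actual framework or data.

```python
# Illustrative sketch only: a simple probabilistic classifier for impact types.
# The labels and toy comments are hypothetical; the actual framework, features
# and data used in our work differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "After watching this, I started volunteering at the local shelter.",
    "This confirmed what I already thought about the issue.",
    "I never knew the problem was this widespread; it changed my mind.",
    "The film made me angry all over again.",
]
labels = ["behavior_change", "cognitive_reinforcement",
          "cognitive_change", "emotional_response"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(comments, labels)

# predict_proba yields a probability distribution over impact types per comment.
new_comment = ["I signed the petition right after the screening."]
for impact_type, prob in zip(model.classes_, model.predict_proba(new_comment)[0]):
    print(f"{impact_type}: {prob:.2f}")
```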

Area 2: Impact of Data Quality and Provenance

How do limitations and a lack of transparency in data quality and data provenance bias research outcomes, and how can we detect and mitigate these limitations?

For example, we have been investigating the impact of entity resolution errors on network analysis results. We found that commonly reported network metrics, and the implications derived from them, can deviate strongly from the ground truth (established from gold-standard data or approximations thereof), depending on the effort dedicated to entity resolution...[Learn More]
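To make concrete how an unresolved entity reference can distort a standard network metric, here is a minimal sketch with made-up data: the same person appears under two name variants, and their degree centrality is understated until the two nodes are merged.

```python
# Minimal sketch with made-up data: the same person appears under two name
# variants ("J. Smith" and "John Smith"), which splits their ties across two
# nodes and understates their prominence until the variants are merged.
import networkx as nx

unresolved = nx.Graph()
unresolved.add_edges_from([
    ("J. Smith", "Alice"), ("J. Smith", "Bob"),
    ("John Smith", "Carol"), ("John Smith", "Dave"),
    ("Alice", "Bob"),
])

# Entity resolution step: merge the two name variants into one node.
resolved = nx.contracted_nodes(unresolved, "John Smith", "J. Smith",
                               self_loops=False)

print("Unresolved:", nx.degree_centrality(unresolved))
print("Resolved:  ", nx.degree_centrality(resolved))
```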

Area 3: Natural Language Processing for Building and Enhancing Graph Data and Theory

How can we use user-generated content to construct, infer or refine network data?

We have been tackling this problem by leveraging communication content produced and disseminated in social networks to enhance graph data. For example, we have used domain-adjusted sentiment analysis to label graphs with valence values in order to enable triadic balance assessment. The resulting method enables fast and systematic sign detection, eliminates the need for surveys or manual link labeling, and reduces issues with leveraging user-generated (meta)data...[Learn More]
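As a toy illustration of the idea, not our actual pipeline or sentiment model, the sketch below turns pre-computed sentiment scores into edge signs and checks each closed triad for structural balance (a triad is balanced when the product of its edge signs is positive).

```python
# Toy illustration: derive edge signs from (pre-computed) sentiment scores and
# check triads for structural balance. The scores and graph are made up; the
# actual work uses domain-adjusted sentiment analysis on real communication data.
from itertools import combinations
import networkx as nx

# Hypothetical average sentiment of messages exchanged between each pair.
sentiment = {("A", "B"): 0.6, ("B", "C"): -0.4, ("A", "C"): 0.2, ("C", "D"): 0.7}

G = nx.Graph()
for (u, v), score in sentiment.items():
    G.add_edge(u, v, sign=1 if score >= 0 else -1)

# A triad is balanced if the product of its three edge signs is positive.
for u, v, w in combinations(G.nodes, 3):
    if G.has_edge(u, v) and G.has_edge(v, w) and G.has_edge(u, w):
        product = G[u][v]["sign"] * G[v][w]["sign"] * G[u][w]["sign"]
        status = "balanced" if product > 0 else "unbalanced"
        print(f"Triad ({u}, {v}, {w}) is {status}")
```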

Area 4: Practical Ethics for Working with Human-Centered and Online Data

How can researchers comply with applicable rules and still innovate?

The collection and analysis of human-centered and/or online data are governed by multiple sets of norms and regulations. Problems can arise when researchers are unaware of applicable rules, uninformed about their practical meaning and compatibility, or insufficiently skilled in implementing them. We are developing and delivering educational modules to address this issue...[Learn More]

Area 5: Evaluation and Mapping of Medical Information and Patient Language Use

In our current work funded by IMO (Intelligent Medical Objects), a private company specializing in developing, managing and licensing medical vocabularies, we evaluate the coverage and accuracy of various medical terminologies and test strategies for increasing the precision of mapping medical reports to standardized terminologies...[Learn More]
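For a flavor of the mapping task, here is a minimal sketch that matches report terms against a toy terminology with simple fuzzy string matching and reports coverage; the terminologies, reports and matching strategies in the actual project are substantially richer.

```python
# Toy sketch of mapping report terms to a standardized terminology and
# measuring coverage. The mini-terminology, report terms and matching strategy
# (plain fuzzy string matching) are illustrative assumptions only.
from difflib import get_close_matches

terminology = ["myocardial infarction", "type 2 diabetes mellitus",
               "hypertension", "chronic kidney disease"]

report_terms = ["heart attack", "type II diabetes", "high blood pressure",
                "chronic kidney disease stage 3"]

mapped = {}
for term in report_terms:
    match = get_close_matches(term.lower(), terminology, n=1, cutoff=0.6)
    mapped[term] = match[0] if match else None

coverage = sum(m is not None for m in mapped.values()) / len(report_terms)
for term, match in mapped.items():
    print(f"{term!r} -> {match}")
print(f"Coverage: {coverage:.0%}")
```

Note how surface-level matching maps some variants correctly but misses synonyms such as "heart attack", which is exactly the kind of gap that motivates better mapping strategies.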

Completed Projects

  • Socio-technical data analytics for improving impact and impact assessment
    • PI: Jana Diesner (2014-2015)
    • Funder: Anheuser Busch, grant # 2014-04922
    • Technologies Developed: SAIL (Sentiment Analysis and Incremental Learning) [GitHub]
  • Entity Extractor for the Scalable Construction of Semantically Rich Socio-Technical Network Data (2013)
    • PI: Jana Diesner, Co-PI: Brent Fegley (Informatics PhD student)
    • Funder: Start-up allocation award from the "Extreme Science and Engineering Discovery Environment" (XSEDE)
    • Short description: Network data can be extracted from text corpora, a process also known as relation extraction. As part of this process, instances of relevant entity classes need to be located and classified in a robust, accurate and automated fashion. One problem in this domain is that, for distilling socio-technical network data suitable for answering substantive questions, e.g. about culture or geopolitical conflicts, the set of entity classes to be considered needs to include and go beyond the standard set of classes (agents, organizations, locations) by also covering: a) additional classes relevant for modeling socio-technical systems, such as “tasks”, “resources” and “knowledge”, and b) entities that are not referred to by a name, e.g. “protestors” or “climate change”. With this project, we built an entity extractor that overcomes these limitations (a minimal illustration of the idea follows this list).
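The sketch below, using spaCy as a stand-in rather than the extractor built in this project, illustrates the underlying idea: combine standard named-entity recognition with noun-phrase candidates so that entities without proper names, such as "protestors" or "climate change", are not missed.

```python
# Illustrative stand-in (using spaCy) for the idea of capturing both named
# entities and non-named entities such as "protestors" or "climate change".
# The actual extractor built in this project is a separate system.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Protestors gathered in Springfield, where Mayor Jane Doe spoke "
        "about climate change and the resources needed to address it.")
doc = nlp(text)

# Standard named-entity classes (people, organizations, locations, ...).
for ent in doc.ents:
    print("NAMED:", ent.text, ent.label_)

# Noun chunks provide candidate non-named entities; a full system would
# classify them into classes such as "task", "resource" or "knowledge".
named_spans = {ent.text.lower() for ent in doc.ents}
for chunk in doc.noun_chunks:
    if chunk.text.lower() not in named_spans:
        print("CANDIDATE:", chunk.text)
```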