Mining Structured Knowledge from Massive Text Data: A Data-Driven Approach

Jiawei Han | Tuesday, July 16
Michael Aiken Chair Professor, Computer Science, University of Illinois at Urbana-Champaign

Real-world big data, in the form of natural language text, is largely unstructured, interconnected, and dynamic. It is highly desirable to transform such massive unstructured data into structured knowledge. Many researchers rely on labor-intensive labeling and curation to extract knowledge from such data. However, such approaches may not scale, especially since many text corpora are highly dynamic and domain-specific. Fortunately, massive text data itself may disclose a large body of hidden patterns, structures, and knowledge. Equipped with domain-independent and domain-dependent knowledge bases, we can explore the power of the massive data itself to turn unstructured data into structured knowledge.

In this lecture we introduce a data-driven approach and a set of recently developed methods that exploit the power of big text data, including quality phrase mining, recognition and typing of entities and relations by distant supervision, pattern-based information extraction, multi-faceted taxonomy discovery, construction of multi-dimensional text cubes and networks, and the associated knowledge generation. We show that massive text data can be powerful at disclosing patterns and structures, and that exploiting this power is a promising way to turn such data into structured knowledge.

Outline of the lecture:

PART 1: Introduction

  • Why is mining structures in text a key problem for “turning big data to knowledge”?
  • Why a data-driven approach to text mining?

PART 2: Automated Phrase Mining

  • Different approaches to mining quality phrases
  • AutoPhrase: Exploring the power of distant supervision

PART 3: Automated Entity/Relation Recognition

  • Entity/relation recognition: weakly/distantly supervised approaches
  • Meta-pattern discovery and embedding in entity recognition

PART 4: Text Classification and Text Cube Construction

  • Embedding and text similarity
  • Text classification: Doc2Cube and WeSTClass approaches
  • Taxonomy generation: set expansion, synonym discovery, and taxonomy mining
  • Text cube construction

PART 5: Exploring Multidimensional Structures for Knowledge Discovery

  • Multidimensional text analysis
  • User-guided topic mining

PART 6: Looking into the Future

  • Multi-dimensional text-intensive knowledge network construction and exploration