MDI Data Workshop

Text as Data: Measurement and Inference Issues

Le Bao

Massive Data Institute, Georgetown University

September 19, 2023

MDI Data Workshop

  • Today and tomorrow:
    • Text as Data: Measurement and Inference Issues with Text Data
  • October 23 & 24: Advanced Models Using Text with Dr. Helge Marahrens
  • November 13 & 14: Cutting Large Language Models Down to Size with Dr. Nathan Wycoff

The Plan

  • Text as data
    • vs. text analysis, natural language processing (NLP)
    • The challenges of using textual data
    • Connections to other statistical methods

 

  • A prequel and sequel
    • How can we use text as data?
    • Measurement issues with text as data
    • Inference issues with text as data

Text as Data

  • A pre-2000s view of text in social science
    • The debate over closed- vs. open-ended questions in survey research (Lazarsfeld, 1944; Geer, 1991; Krosnick, 1999; etc.)
      • Closed-ended questions were easier to ask, code and analyze than their open-ended counterparts (Schuman & Presser, 1981).
    • Social interaction often occurs in texts
    • Social scientists avoided studying texts/speech
      • Hard to find
      • Time-consuming
      • Not generalizable (each new data set…new coding scheme)
      • Difficult to store/search
      • Idiosyncratic to coders/researchers
      • Statistical methods/algorithms were computationally intensive

Text as Data

  • A post-2000s view of text in social science:
    • Massive collections of texts are increasingly used as a data source in social science:
      • Congressional speeches, press releases, newsletters, …
      • Facebook posts, tweets, emails, cell phone records, …
      • Newspapers, magazines, news broadcasts, …
      • Foreign news sources, treaties, sermons, fatwas, …
    • Massive increase in availability of unstructured text
    • Massive improvement in computational power and storage capability
      • The iPhone 6 is reportedly 32,600 times faster than the Apollo Guidance Computer (AGC), which had about 4KB of RAM and roughly 72KB of read-only memory (and no hard disk).
    • Explosion in methods and programs to analyze texts
      • Generalizable, systematic, cheap, …

The Challenges of Analyzing Text

  • Data generation process for text → unknown

    • Complexity of language
    • Models necessarily fail to capture language fully, but they can still be useful for specific tasks
  • Most of the methods are designed to augment humans

    • Quantitative methods organize, direct, and suggest
    • Humans: read and interpret
  • There is no globally best method

    • When methods yield different results …
  • Text analysis requires constant validation

  • An agnostic approach to text analysis

Text Data Preparation

  • Finding text data
    • Goal: a plain-text (.txt) file encoded as UTF-8 or ASCII (or an XML or JSON file)
    • Web scraping
    • Prepackaged data sources & APIs
    • Converting other formats to text:
      • Optical Character Recognition (OCR)
      • Audio/video to text
        • Is text the best way to represent them?
        • Tarr, Hwang, & Imai (2022): issue mentions, opponent appearance, and negativity in political campaign advertisement videos
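
To illustrate the goal of ending up with plain UTF-8 text, here is a minimal sketch using only the Python standard library. It assumes the raw material is either an HTML page (as returned by a scraper) or a JSON array of records (as returned by many APIs). The names `TextExtractor`, `html_to_text`, `json_to_texts`, and `save_utf8`, the `"text"` field, and the sample inputs are all hypothetical and chosen for illustration only; real pages and API responses will usually need more careful handling.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one plain-text string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def json_to_texts(raw, field="text"):
    """Pull one text field out of a JSON array of records (field name assumed)."""
    return [record[field] for record in json.loads(raw)]


def save_utf8(text, path):
    """Write text to a .txt file, stating the encoding explicitly."""
    Path(path).write_text(text, encoding="utf-8")


# Hypothetical inputs standing in for a scraped page and an API response.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Press release</h1><p>Jobs &amp; wages are up.</p></body></html>"
)
sample_json = '[{"id": 1, "text": "First post"}, {"id": 2, "text": "Second post"}]'

page_text = html_to_text(sample_html)
post_texts = json_to_texts(sample_json)
```

Calling `save_utf8(page_text, "press_release.txt")` would then write the cleaned text to disk. Stating the encoding explicitly avoids depending on the platform default, which matters once the corpus contains non-ASCII characters.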

Finding Text Data

  • Examples of image texts (Tarr, Hwang, & Imai, 2022)