MDI Data Workshop

Text as Data: Measurement and Inference Issues

Le Bao

Massive Data Institute, Georgetown University

September 19, 2023

MDI Data Workshop

  • Today and tomorrow:
    • Text as Data: Measurement and Inference Issues with Text Data

 

  • October 23 & 24: Advanced Models Using Text with Dr. Helge Marahrens
  • November 13 & 14: Cutting Large Language Models Down to Size with Dr. Nathan Wycoff

The Plan

  • Text as data
    • vs. text analysis, natural language processing (NLP)
    • The challenges of using textual data
    • Connections to other statistical methods

 

  • A prequel and sequel
    • How can we use text as data?
    • Measurement issues with text as data
    • Inference issues with text as data

Text as Data

  • A pre-2000’s view of text in social science
    • The debate over closed- vs. open-ended questions in survey research (Lazarsfeld, 1944; Geer, 1991; Krosnick, 1999; etc.)
      • Closed-ended questions were easier to ask, code and analyze than their open-ended counterparts (Schuman & Presser, 1981).
    • Social interaction often occurs in texts
    • Social Scientists avoided studying texts/speech
      • Hard to find
      • Time Consuming
      • Not generalizable (each new data set…new coding scheme)
      • Difficult to store/search
      • Idiosyncratic to coders/researcher
      • Statistical methods/algorithms were computationally intensive

Text as Data

  • A post-2000’s view of text in social science:
    • Massive collections of texts are increasingly used as a data source in social science:
      • Congressional speeches, press releases, newsletters, …
      • Facebook posts, tweets, emails, cell phone records, …
      • Newspapers, magazines, news broadcasts, …
      • Foreign news sources, treaties, sermons, fatwas, …
    • Massive increase in availability of unstructured text
    • Massive improvement in computational power and storage capability
      • The iPhone 6 is 32,600 times faster than the Apollo Guidance Computer (AGC), which had 4 KB of RAM and 32 KB of storage.
    • Explosion in methods and programs to analyze texts
      • Generalizable, systematic, cheap, …

The Challenges of Analyzing Text

  • Data generation process for text \leadsto unknown

    • Complexity of language
    • Models necessarily fail to capture all of language, but can still be useful for specific tasks
  • Most of the methods are designed to augment humans

    • Quantitative methods organize, direct, and suggest
    • Humans: read and interpret
  • There is no globally best method

    • When methods yield different results …
  • Requires constant validation

  • An agnostic approach to text analysis

Text Data Preparation

  • Finding text data
    • Goal: a plain text (.txt) file (UTF-8, ASCII). (Or an XML or JSON file)
    • Web scraping
    • Prepackaged data sources & APIs
    • Other formats to texts:
      • Optical Character Recognition (OCR)
      • Audio/video to text
        • Is text the best way to represent them?
        • Tarr, Hwang, & Imai (2022): issue mentions, opponent appearance, and negativity in political campaign advertisement videos

Finding Text Data

  • Examples of image texts (Tarr, Hwang, & Imai, 2022)

Finding Text Data

  • Web scraping: BeautifulSoup, Selenium, etc.
  • Web interface and API: requests, json, etc.
  • OCR, image, audio, video: Pillow (PIL), pytesseract, SpeechRecognition, MoviePy
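  • A minimal web-scraping sketch with requests and BeautifulSoup (the URL and tags below are placeholders, not part of the workshop materials):
import requests
from bs4 import BeautifulSoup

# Hypothetical example: fetch a page and extract the paragraph text
page = requests.get("https://example.com/speeches")  # placeholder URL
soup = BeautifulSoup(page.text, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs[:3])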

Basics of Measurement

  • As long as there have been people, people have been measuring things

    • Euclid: “father of geometry”
  • Throughout history, measurement was a counting operation

  • People started giving it thought when they wanted to measure things that were not amenable to a simple counting process (especially during the Scientific Revolution)

    • For fun reading, there is a rich literature on how difficult it was to devise a reliable metric by which temperature could be measured
  • S.S. Stevens: measurement is the process of assigning numbers to objects according to rules.

Exercise: Counting the Words

  • What if we ignored everything we know about language and just counted the words? Would that get us anywhere?

 

- https://tinyurl.com/mditext

What to do with Text?

  • The Bag of Words
    • We ignore word order, syntax, punctuation, meaning, and context, focusing only on the frequency with which words appear in the text.
  • The “standard recipe” for representing a text corpus as a bag of words:
    • Choose a unit of analysis
    • Tokenize
    • Reduce complexity
    • Create a document-feature matrix

The Bag of Words

  • Unit of analysis (document)
    • Person, sentence, time period, etc.
Candidate "Donald Trump"
0 Amy Klobuchar 49
1 Elizabeth Warren 36
2 Kamala Harris 35
3 Pete Buttigieg 29
4 Joe Biden 24
5 Bernie Sanders 21
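  • A sketch of how counts like those above could be produced (assuming debate_data has 'speaker' and 'speech' columns):
# Count the phrase "Donald Trump" in each candidate's speeches and sum by speaker
trump_mentions = (debate_data.assign(dt=debate_data['speech'].str.count('Donald Trump'))
                             .groupby('speaker')['dt'].sum()
                             .sort_values(ascending=False))
print(trump_mentions)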

The Bag of Words

  • Tokenize
    • A “token” in natural language terms is “an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.”
from nltk.tokenize import word_tokenize
# nltk.download('punkt')

debate_data['token'] = debate_data['speech'].apply(word_tokenize)
debate_data['token']
5       [Well, ,, you, ’, re, right, ,, the, economy, ...
9       [Oh, ,, Mr., Bloomberg, ., Let, me, tell, Mr.,...
11      [We, know, what, the, President, …, what, Russ...
12      [Look, ,, the, way, I, see, this, is, that, Be...
13      [I, dug, in, ,, I, did, the, work, and, then, ...
                              ...                        
5893    [Three, things, to, know, about, me, ., First,...
5894    [Secondly, ,, I, ’, m, someone, that, can, win...
5895    [nd, finally, ,, yeah, ,, I, am, not, the, est...
5907    [Thank, you, ., It, ’, s, a, great, honor, to,...
5908    [But, I, got, my, chance, ., It, was, a, 50, d...
Name: token, Length: 2267, dtype: object

The Bag of Words

  • Reduce complexity
    • Remove capitalization, punctuation, etc.
debate_data['token'] = debate_data['token'].apply(lambda tokens: [word.lower() for word in tokens if word.isalpha()])
debate_data['token']
5       [well, you, re, right, the, economy, is, doing...
9       [oh, bloomberg, let, me, tell, putin, okay, i,...
11      [we, know, what, the, president, what, russia,...
12      [look, the, way, i, see, this, is, that, berni...
13      [i, dug, in, i, did, the, work, and, then, ber...
                              ...                        
5893    [three, things, to, know, about, me, first, i,...
5894    [secondly, i, m, someone, that, can, win, and,...
5895    [nd, finally, yeah, i, am, not, the, establish...
5907    [thank, you, it, s, a, great, honor, to, be, h...
5908    [but, i, got, my, chance, it, was, a, dollar, ...
Name: token, Length: 2267, dtype: object

The Bag of Words

  • Reduce complexity
    • Remove capitalization, punctuation, etc.
    • Stop Words: common English placeholder words (e.g., the, it, if, a, able, at, be, because, ...)
      • Note of Caution: she, he, her, his (Monroe, Colaresi, and Quinn 2008)

Stop Words

from nltk.corpus import stopwords
#nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
debate_data['token'] = debate_data['token'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
debate_data['token']
5       [well, right, economy, really, great, people, ...
9       [oh, bloomberg, let, tell, putin, okay, good, ...
11                [know, president, russia, wants, chaos]
12      [look, way, see, bernie, winning, right, democ...
13      [dug, work, bernie, team, trashed, need, presi...
                              ...                        
5893    [three, things, know, first, listen, people, g...
5894    [secondly, someone, win, beat, donald, trump, ...
5895    [nd, finally, yeah, establishment, party, cand...
5907    [thank, great, honor, never, million, years, t...
5908    [got, chance, dollar, semester, commuter, coll...
Name: token, Length: 2267, dtype: object

The Bag of Words

  • Reduce complexity
    • Remove capitalization, punctuation, etc.
    • Stop Words
    • Equivalence Class of Words
      • Words that refer to the same basic concept: family, families, familial -> famili
      • Stemming/Lemmatization algorithms: Many-to-one mapping from words to stem/lemma

Stemming and Lemmatization

from nltk.stem.snowball import EnglishStemmer
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')

stemmer = EnglishStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_tokens = debate_data['token'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])
lemmatized_tokens = debate_data['token'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])
print(stemmed_tokens)
5       [well, right, economi, realli, great, peopl, l...
9       [oh, bloomberg, let, tell, putin, okay, good, ...
11                     [know, presid, russia, want, chao]
12      [look, way, see, berni, win, right, democrat, ...
13      [dug, work, berni, team, trash, need, presid, ...
                              ...                        
5893    [three, thing, know, first, listen, peopl, get...
5894    [second, someon, win, beat, donald, trump, eve...
5895    [nd, final, yeah, establish, parti, candid, go...
5907    [thank, great, honor, never, million, year, th...
5908    [got, chanc, dollar, semest, commut, colleg, l...
Name: token, Length: 2267, dtype: object

Stemming and Lemmatization

print(lemmatized_tokens)
5       [well, right, economy, really, great, people, ...
9       [oh, bloomberg, let, tell, putin, okay, good, ...
11                 [know, president, russia, want, chaos]
12      [look, way, see, bernie, winning, right, democ...
13      [dug, work, bernie, team, trashed, need, presi...
                              ...                        
5893    [three, thing, know, first, listen, people, ge...
5894    [secondly, someone, win, beat, donald, trump, ...
5895    [nd, finally, yeah, establishment, party, cand...
5907    [thank, great, honor, never, million, year, th...
5908    [got, chance, dollar, semester, commuter, coll...
Name: token, Length: 2267, dtype: object

The Bag of Words

  • Reduce complexity
    • Remove capitalization, punctuation, etc.
    • Stop Words
    • Equivalence Class of Words
    • *Discard less useful features \leadsto depends on application
    • *Other reduction, specialization

The Bag of Words

  • Document-feature matrix
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Re-join the cleaned tokens into strings so CountVectorizer can process them
debate_data['text'] = debate_data['token'].apply(' '.join)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(debate_data['text'])
debate_cleaned = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=debate_data['speaker'])
debate_cleaned.head()

The Bag of Words

people going president get think one need country make right ... nose confirm notch noted nother congratulated notorious nowhere confiscation aa
speaker
Amy Klobuchar 166 117 120 120 172 119 67 69 98 71 ... 0 0 0 1 1 0 1 0 1 1
Bernie Sanders 253 97 62 43 85 61 77 150 45 90 ... 0 0 0 0 0 0 0 0 0 0
Elizabeth Warren 222 132 62 141 79 98 165 84 116 70 ... 0 0 1 0 0 0 0 0 0 0
Joe Biden 151 156 118 143 84 123 34 49 116 68 ... 1 0 0 0 0 1 0 0 0 0
Kamala Harris 82 42 47 32 27 32 56 26 14 25 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 5501 columns

The Bag of Words

  • Unigram, Bigram, Trigram, N-gram
from nltk import bigrams
from nltk import trigrams
from nltk import ngrams

# Form n-grams within each document's token list (not across documents)
text_bi = debate_data['token'].apply(lambda tokens: list(bigrams(tokens)))
text_tri = debate_data['token'].apply(lambda tokens: list(trigrams(tokens)))
text_n = debate_data['token'].apply(lambda tokens: list(ngrams(tokens, 4)))

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_bigrams = CountVectorizer(ngram_range=(2, 2))
text_bi = vectorizer_bigrams.fit_transform(debate_data['text'])

vectorizer = CountVectorizer(
    lowercase=True,  
    stop_words='english',  
    ngram_range=(2, 2),
    # max_features=N  # Optionally restricts to top N tokens
)
text_bi = vectorizer.fit_transform(debate_data['speech'])

Document-Feature Matrix with Bigrams

donald trump united state make sure american people climate change president united million people insurance company making sure get done ... good size good spot good standard good sure good union good using good world good year goodbye family zone said
speaker
Amy Klobuchar 49 15 28 6 15 11 9 1 8 24 ... 1 0 0 0 0 0 1 0 0 0
Bernie Sanders 21 25 10 42 28 11 26 23 1 0 ... 0 1 1 0 1 0 0 0 0 0
Elizabeth Warren 36 33 12 8 11 12 10 23 5 5 ... 0 0 0 0 0 1 0 0 0 0
Joe Biden 24 57 48 17 6 15 9 5 24 15 ... 0 0 0 0 0 0 0 1 0 1
Kamala Harris 35 42 1 7 2 13 7 7 2 1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 42900 columns

The Bag of Words: Assumption

  • Word order doesn’t matter
# Set up CountVectorizer to capture only trigrams
vectorizer_trigrams = CountVectorizer(ngram_range=(3, 3))

# Fit and transform the text data
text_tri = vectorizer_trigrams.fit_transform(debate_data['text'])

# Convert the matrix to a DataFrame with trigram columns
dtm_trigrams = pd.DataFrame(text_tri.toarray(),
                            columns=vectorizer_trigrams.get_feature_names_out(),
                            index=debate_data['speaker'])

['pharmaceutical provision oppose', 'won given given', 'relationship entire world', 'final thing comprehensive', 'com foreign policy', 'word climate change', 'pursued grace interested', 'right question going', 'straightforward beat president', 'activities required healthcare', 'campaign right speak', 'say senator klobuchar', 'said hope mitch', 'pharma companies got', 'mass shootings american', 'money ve got', 'running president restore', 'candidates guess beat', 'say government backs', 'favorite woman president', 'time rebuilding distressed', 'paso beto god', 'assault weapon ban', 'cases charges times', 'going ability reform']

Exercise: Counting the Words

  • Continue with all the major candidates and maybe bigrams/trigrams/n-grams.

Measurement

  • Measurement begins with a classification process; it’s about categorizing

  • When we conduct measurement, we start with a construct to be measured and a property (characteristic) of that construct that we can use to distinguish constructs from one another

  • All measurement is theory-testing (because the construction and evaluation of measures rely on theory)

    • So, all measurement is a tentative statement, or conjecture, about the state of reality
    • That is, measurement is falsifiable
    • No such thing as the “true” measure of an object/characteristic
  • We are already making measurement decisions when we pre-process the text.

Distance

Text and Geometry

  • A document-term matrix

\boldsymbol{X} = \begin{pmatrix} 1 & 2 & 0 & \dots & 0 \\ 0 & 0 & 3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \dots & 3 \end{pmatrix}

Text and Geometry

  • A document-term matrix

  • Suppose documents live in a space \leadsto rich set of results from linear algebra

    • Provides a geometry \leadsto modify with word weighting
    • Natural notions of distance
    • A Multi-Dimensional Scaling (MDS) problem (van Langren, 1635; Torgerson, 1958)
    • Building block for clustering, supervised learning, and scaling

Text and Geometry: Measuring Similarity

  • What properties should similarity measure have?
    • Maximum: document with itself
    • Minimum: documents have no words in common (orthogonal)
    • Increasing when more of same words used
    • s(a,b)=s(b,a)

Text and Geometry: Measuring Similarity

 

 

Candidate     Word 1   Word 2
Candidate 1   2        1
Candidate 2   1        4

Text and Geometry: Measuring Similarity

  • Measure 1: inner product (2, 1)'\cdot (1,4) = 6

Text and Geometry: Measuring Similarity

  • Problem: length dependent

  • Inner product: (4, 2)'\cdot (1,4) = 12

Text and Geometry: Measuring Similarity

  • Length dependent

(4, 2)' \cdot (1,4) = 12

a \cdot b = ||a|| \times ||b|| \times \cos \theta

Text and Geometry: Measuring Similarity

  • Cosine Similarity

\cos \theta = \left(\frac{a}{||a||}\right) \cdot \left(\frac{b}{||b||}\right)

\frac{(4,2)}{||(4,2)||} = (0.89, 0.45) \qquad \frac{(2,1)}{||(2,1)||} = (0.89, 0.45) \qquad \frac{(1,4)}{||(1,4)||} = (0.24, 0.97)

(0.89, 0.45)' \cdot (0.24, 0.97) = 0.65
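  • The same toy calculation in code (a sketch using scikit-learn's cosine_similarity):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[2, 1]])  # Candidate 1
b = np.array([[1, 4]])  # Candidate 2
print(cosine_similarity(a, b))  # ~0.65; rescaling a to (4, 2) gives the same value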

Text and Geometry

speaker           Amy Klobuchar  Bernie Sanders  Elizabeth Warren   
speaker                                                             
Amy Klobuchar      3.330669e-16    2.305125e-01      1.411393e-01  \
Bernie Sanders     2.305125e-01    1.110223e-16      1.899815e-01   
Elizabeth Warren   1.411393e-01    1.899815e-01      2.220446e-16   
Joe Biden          1.813508e-01    3.081170e-01      1.867826e-01   
Kamala Harris      2.365191e-01    2.399044e-01      2.002306e-01   
Pete Buttigieg     1.233278e-01    2.461802e-01      1.441456e-01   

speaker              Joe Biden  Kamala Harris  Pete Buttigieg  
speaker                                                        
Amy Klobuchar     1.813508e-01   2.365191e-01        0.123328  
Bernie Sanders    3.081170e-01   2.399044e-01        0.246180  
Elizabeth Warren  1.867826e-01   2.002306e-01        0.144146  
Joe Biden         2.220446e-16   2.501987e-01        0.159975  
Kamala Harris     2.501987e-01   1.110223e-16        0.235334  
Pete Buttigieg    1.599755e-01   2.353338e-01        0.000000  
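  • A matrix like the one above can be computed from the document-feature matrix; a sketch (assuming the debate_cleaned DFM built earlier, with speakers as the index):
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

# Aggregate word counts by speaker, then compute pairwise cosine distances
speaker_dfm = debate_cleaned.groupby(level=0).sum()
dist = pd.DataFrame(cosine_distances(speaker_dfm),
                    index=speaker_dfm.index, columns=speaker_dfm.index)
print(dist)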

Exercise: Similarities between Candidates

Other Considerations: Weights

  • Are all words created equal?
    • Treat all words equally \leadsto lots of noise
    • Reweight words \leadsto accentuate words that are likely to be informative, making specific assumptions about characteristics of informative words
  • How to generate weights?
    • Assumptions about separating words
    • Use training set to identify separating words (Monroe, Ideology measurement)

Weights: TF-IDF Weighting

N = \text{Total number of documents} \\ n_{j} = \text{Number of documents in which word } j \text{ occurs} \\ \text{idf}_{j} = \log \frac{N}{n_j} \\ \textbf{idf} = (\text{idf}_{1}, \text{idf}_{2}, ..., \text{idf}_{J})

  • Why \log ?
    • Maximum at n_j = 1
    • Decreases at rate 1/n_j \leadsto diminishing “penalty” for more common use of word j
    • Other functional forms are fine, embed assumptions about penalization of common use

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(debate_data['text'])  # TF-IDF-weighted document-feature matrix

Word Embeddings

  • The bag of words gives each word a separate index in the word-count vector, implicitly assuming that words have completely unique meanings
    • E.g. monarchy, king, queen, president, prime minister, executive, …
  • Word embedding: high-dimensional vectors representing words.
    • They capture semantic meaning based on a word’s context.
    • Mathematical operations on words (e.g., King - Man + Woman = Queen).
    • Distributional/contextual characteristics: words that appear in similar contexts tend to have similar meanings
    • Essential for dimension reduction, transfer learning, RNNs or Transformers, etc.
    • Pre-trained models: Word2Vec, FastText, and GloVe.
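  • As an illustration, pre-trained embeddings can be queried with gensim (a sketch; the model name is one of gensim's downloadable options and is an assumption, not part of the slides):
import gensim.downloader as api

# Load a small pre-trained GloVe model (downloads on first use)
model = api.load("glove-wiki-gigaword-100")

# "King - Man + Woman ≈ Queen"
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Words used in similar contexts sit close together
print(model.similarity("president", "minister"))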

Clustering

  • To (further) reduce dimensionality
    • Distance metric \leadsto when are documents close?
    • Clustering \leadsto how do we summarize distances
  • K-Means: partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
    • Unsupervised method

K-Means Clustering

 

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(debate_cleaned)  # standardize the document-feature matrix

kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(scaled_data)

Clustering

  • Do we know the number of clusters we should expect?
  • Can statistics guide us?
    • Sum squared errors decreases as K increases
    • Each document in own cluster (a useless model)
    • Modeling problem: fit often increases with features
  • No one statistic captures how you want to use your data
    • But they can help guide your selection (see the sketch below)
    • Combination of theory + statistic + discovery
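  • For example, one common (imperfect) heuristic is to plot the within-cluster sum of squares as K grows and look for an “elbow” (a sketch, continuing from the scaled data above):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for K = 1..10: it always decreases,
# so look for an "elbow" rather than a minimum
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(scaled_data).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Within-cluster SSE")
plt.show()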

Exercise: Clustering Words

Large Language Model

  • Each word vector still represents a point in the “word space,” and words with more similar meanings are placed closer together.
  • High dimensionality and high-dimensional computation
  • Word meaning depends on context
    • Attention: weighting context and relevancy
    • Transforming word vectors into word predictions

Scaling Ideologies of Politicians

  • Another multi-dimensional scaling problem.
  • Distance: people who vote together are closer (see the sketch below).
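  • A sketch of the idea with classical MDS on a toy roll-call matrix (the votes below are made up for illustration):
import numpy as np
from sklearn.manifold import MDS

# Hypothetical roll-call matrix: rows = legislators, columns = votes (1 = yea, 0 = nay)
votes = np.array([[1, 0, 1, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 0]])

# Distance = share of votes on which two legislators disagree
disagree = (votes[:, None, :] != votes[None, :, :]).mean(axis=2)

# Project into one dimension: legislators who vote together end up close
coords = MDS(n_components=1, dissimilarity="precomputed", random_state=0).fit_transform(disagree)
print(coords)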

Scaling Ideologies of Politicians

  • Problem:

Scaling Ideologies of Politicians

  • Alternatives:
    • Hopkins and Noel (2022): Perceived ideology
    • Bonica (2016): campaign finance scores (CFscores)

 

  • What about using texts (and context)?
  • Wu et al. - Large Language Models Can Be Used to Scale the Ideologies of Politicians in a Zero-Shot Learning Setting
    • Asking ChatGPT to place politicians.

Prompt

Which senator is more liberal/conservative: (Senator1) or (Senator2)?

Scaling Ideologies of Politicians

ChatGPT API

  • Chat completions API
import openai
# openai.api_key = "YOUR_API_KEY"  # set your API key first

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where was the world series in 2020?"}
    ]
)
  • Response format
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The 2020 World Series was played in Texas at Globe Life Field in Arlington.",
        "role": "assistant"
      }
    }
  ],
  "created": 1677664795,
  "id": "chatcmpl-7QyqpwdfhqwajicIEznoc6Q47XAyW",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": 57,
    "total_tokens": 74
  }
}
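  • The assistant’s answer can then be pulled out of the response object (a sketch using the same legacy ChatCompletion response):
# Extract the generated text from the first choice
answer = response["choices"][0]["message"]["content"]
print(answer)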

Exercise: ChatGPT API

 

- https://tinyurl.com/midtext2

Measurement as a Classification Problem

  • Topic: What is this text about?
    • Policy area: {Agriculture, Crime, Environment, …}
    • Campaign agendas: {Abortion, Campaign, Finance, Taxing, … }
  • Sentiment: What is said in this text?
    • Positions on legislation: { Support, Ambiguous, Oppose }
    • Positions on Court Cases: { Agree with Court, Disagree with Court }
    • Liberal/Conservative Blog Posts: { Liberal, Middle, Conservative, No Ideology Expressed }
  • Style/Tone: How is it said?
    • Taunting in floor statements: { Partisan Taunt, Intra party taunt, Agency taunt, … }
    • Negative campaigning: { Negative ad, Positive ad}

Different Methods

  • Dictionary methods

    • Dictionary methods are context invariant (see the sketch below)
  • Supervised and semi-supervised methods

  • “Wisdom of the crowd”

    • Crowd-sourced labeling
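  • A minimal sketch of the dictionary idea: count matches against fixed word lists, regardless of context (the word lists below are made-up examples, not a published dictionary):
# Toy sentiment dictionary: score = (# positive words - # negative words) / # tokens
positive = {"support", "great", "honor", "win"}
negative = {"oppose", "trashed", "chaos", "wrong"}

def dictionary_score(tokens):
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

debate_data['tone'] = debate_data['token'].apply(dictionary_score)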

Inference Problems

  • Regression models: text features can serve as predictors or outcomes
    • High-dimensional
    • Many correlated variables
    • \leadsto variable selection methods, e.g. Ridge/Lasso regression, Elastic-Net, etc. (see the sketch below)
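  • A sketch of using text features as predictors with an L1 (Lasso-style) penalty (assuming the debate_cleaned DFM from earlier and a hypothetical binary outcome y aligned with its rows):
from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression selects a sparse subset of word features
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso_clf.fit(debate_cleaned, y)
kept = (lasso_clf.coef_ != 0).sum()
print(f"{kept} features retained out of {debate_cleaned.shape[1]}")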

Causal Inference

  • Causal effects
  • Potential outcome/DAG framework
  • Design-based inference
  • Randomized designs: blocked, conjoint, list, and multiarm bandit experiments
  • Appropriate designs for causal inference in observational data: matching, instrumental variables, regression discontinuities, and synthetic controls
  • Sensitivity of estimates to unmeasured confounders

The Fundamental Problem of Causal Inference

  • Suppose we ask, “Would a canvassing policy increase enrollment in a health insurance program?”
Citizen Canvassed? Enrolled?
1 Yes Yes
2 Yes Yes
3 No No
4 No No
  • Is it causal?

The Fundamental Problem of Causal Inference

 

Citizen Canvassed? If Canvass? If No Canvass? Enrolled?
1 Yes Yes (Yes) Yes
2 Yes Yes (No) Yes
3 No (Yes) No No
4 No (No) No No

 

  • What is the true causal effect of canvassing?

  • We can never observe both outcomes for canvass and no canvass

  • The Fundamental Problem of Causal Inference: We can never observe more than one potential outcome for a given unit.

The Fundamental Problem of Causal Inference

  • The problem: “Canvass” group \neq “No canvass” group
  • The potential outcomes help predict the treatment condition
    • People who would enroll if canvassed are more likely to be canvassed.
Citizen Canvassed? If Canvass? If No Canvass? Enrolled?
1 Yes Yes (Yes) Yes
2 Yes Yes (No) Yes
3 No (Yes) No No
4 No (No) No No

When We Can Recover Causal Effects

  • Potential outcomes do not predict treatment
    • Knowing whether a citizen would enroll if canvassed does not predict whether they were canvassed.
Citizen Canvassed? If Canvass? If No Canvass? Enrolled?
1 Yes Yes (Yes) Yes
2 Yes Yes (No) Yes
3 No (Yes) Yes Yes
4 No (Yes) No No
  • Randomized Controlled Trials (RCT)
  • Causal inference with observational data: natural experiments, matching, instrumental variables, regression discontinuities, synthetic controls, etc.
  • Sensitivity analysis: how strong must violations be to invalidate inference about canvassing?

Causal Inference with Text Data

  • The problems:
    • Text data is high-dimensional.
    • Complexity and ambiguity
    • Confounding: textual content might be influenced by unobserved variables.
  • We lose nuance and ambiguity when reducing the dimensions.
  • Causal inference methods typically deal with categorical or numerical variables.

Adaptive Design

  • Adaptive clinical trials
  • Goal: to identify the best treatment arm
  • Estimand: not the ATE, but \Pr\left(E(R_{\text{this arm}}) > E(R_{\text{other arms}})\right)
  • Testing many treatment arms simultaneously and adaptively (see the sketch below)
    • Eliminating poorly performing arms; adding new arms
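  • A sketch of the adaptive idea with Thompson sampling over binary rewards (Beta posteriors; the arm probabilities below are made up for illustration):
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.10, 0.12, 0.18]  # unknown enrollment rates for three message arms (made up)
successes = np.ones(3)
failures = np.ones(3)  # Beta(1, 1) priors

for _ in range(2000):
    draws = rng.beta(successes, failures)  # sample one plausible rate per arm
    arm = int(np.argmax(draws))            # play the arm that currently looks best
    reward = rng.random() < true_p[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("posterior means:", successes / (successes + failures))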

Exercise: Using Textual Treatment in Adaptive Design

Concluding Notes

  • Challenges and opportunities of text in the research process
    • Measuring text (processing)
    • Using text as measurement
    • Making inferences from text
    • Using text for (causal) inferences
  • An agnostic approach to text analysis
    • Unknown data generating process
    • Methods to augment humans not replace humans
    • High-dimensional and ambiguous

Exit Ticket

https://bit.ly/MDIWorkshopSept2023ExitTicket