Candidate | "Donald Trump" | |
---|---|---|
0 | Amy Klobuchar | 49 |
1 | Elizabeth Warren | 36 |
2 | Kamala Harris | 35 |
3 | Pete Buttigieg | 29 |
4 | Joe Biden | 24 |
5 | Bernie Sanders | 21 |
Text as Data: Measurement and Inference Issues
Massive Data Institute, Georgetown University
September 19, 2023
Data generation process for text \leadsto unknown
Most of these methods are designed to augment humans, not replace them
There is no globally best method
Constant validation is required
An agnostic approach to text analysis
Useful Python libraries for collecting and converting text:
Web scraping: BeautifulSoup, Selenium, etc.
APIs and structured data: requests, json, etc.
Images and OCR: Pillow (PIL), pytesseract
Audio and video: SpeechRecognition, MoviePy
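As a minimal illustration of collecting raw text from a web page (a sketch only; the URL and page structure are placeholders, not from the workshop):

import requests
from bs4 import BeautifulSoup

# Fetch a page and keep only its visible text (the URL is a placeholder)
resp = requests.get("https://example.com/debate-transcript")
soup = BeautifulSoup(resp.text, "html.parser")
page_text = soup.get_text(separator=" ", strip=True)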
As long as there have been people, people have been measuring things
Throughout history, measurement was a counting operation
People started giving it thought when they wanted to measure things that were not amenable to a simple counting process (especially during the Scientific Revolution)
S.S. Stevens: measurement is the process of assigning numbers to objects according to rules.
Candidate | "Donald Trump" | |
---|---|---|
0 | Amy Klobuchar | 49 |
1 | Elizabeth Warren | 36 |
2 | Kamala Harris | 35 |
3 | Pete Buttigieg | 29 |
4 | Joe Biden | 24 |
5 | Bernie Sanders | 21 |
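The tokenized speeches shown below could be produced with something along these lines (a sketch; it assumes a debate_data DataFrame whose 'speech' column holds the raw debate text):

import nltk
from nltk.tokenize import word_tokenize
# nltk.download('punkt')  # run once if the tokenizer models are missing
# Split each speech into a list of word and punctuation tokens
debate_data['token'] = debate_data['speech'].apply(word_tokenize)
debate_data['token']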
5 [Well, ,, you, ’, re, right, ,, the, economy, ...
9 [Oh, ,, Mr., Bloomberg, ., Let, me, tell, Mr.,...
11 [We, know, what, the, President, …, what, Russ...
12 [Look, ,, the, way, I, see, this, is, that, Be...
13 [I, dug, in, ,, I, did, the, work, and, then, ...
...
5893 [Three, things, to, know, about, me, ., First,...
5894 [Secondly, ,, I, ’, m, someone, that, can, win...
5895 [nd, finally, ,, yeah, ,, I, am, not, the, est...
5907 [Thank, you, ., It, ’, s, a, great, honor, to,...
5908 [But, I, got, my, chance, ., It, was, a, 50, d...
Name: token, Length: 2267, dtype: object
# Lowercase every token and keep only alphabetic tokens (drops punctuation and numbers)
debate_data['token'] = debate_data['token'].apply(lambda tokens: [word.lower() for word in tokens if word.isalpha()])
debate_data['token']
5 [well, you, re, right, the, economy, is, doing...
9 [oh, bloomberg, let, me, tell, putin, okay, i,...
11 [we, know, what, the, president, what, russia,...
12 [look, the, way, i, see, this, is, that, berni...
13 [i, dug, in, i, did, the, work, and, then, ber...
...
5893 [three, things, to, know, about, me, first, i,...
5894 [secondly, i, m, someone, that, can, win, and,...
5895 [nd, finally, yeah, i, am, not, the, establish...
5907 [thank, you, it, s, a, great, honor, to, be, h...
5908 [but, i, got, my, chance, it, was, a, dollar, ...
Name: token, Length: 2267, dtype: object
Common stop words: the, it, if, a, able, at, be, because, ...
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stop-word lists are missing
stop_words = set(stopwords.words('english'))
# Drop English stop words from every token list
debate_data['token'] = debate_data['token'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
debate_data['token']
5 [well, right, economy, really, great, people, ...
9 [oh, bloomberg, let, tell, putin, okay, good, ...
11 [know, president, russia, wants, chaos]
12 [look, way, see, bernie, winning, right, democ...
13 [dug, work, bernie, team, trashed, need, presi...
...
5893 [three, things, know, first, listen, people, g...
5894 [secondly, someone, win, beat, donald, trump, ...
5895 [nd, finally, yeah, establishment, party, cand...
5907 [thank, great, honor, never, million, years, t...
5908 [got, chance, dollar, semester, commuter, coll...
Name: token, Length: 2267, dtype: object
Stemming reduces related words to a common root: family, families, familial -> famili
from nltk.stem.snowball import EnglishStemmer
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # run once if WordNet is missing
stemmer = EnglishStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming truncates words to a root; lemmatization maps them to a dictionary form
stemmed_tokens = debate_data['token'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])
lemmatized_tokens = debate_data['token'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])
print(stemmed_tokens)
5 [well, right, economi, realli, great, peopl, l...
9 [oh, bloomberg, let, tell, putin, okay, good, ...
11 [know, presid, russia, want, chao]
12 [look, way, see, berni, win, right, democrat, ...
13 [dug, work, berni, team, trash, need, presid, ...
...
5893 [three, thing, know, first, listen, peopl, get...
5894 [second, someon, win, beat, donald, trump, eve...
5895 [nd, final, yeah, establish, parti, candid, go...
5907 [thank, great, honor, never, million, year, th...
5908 [got, chanc, dollar, semest, commut, colleg, l...
Name: token, Length: 2267, dtype: object
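The lemmatized tokens can be printed the same way:

print(lemmatized_tokens)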
5 [well, right, economy, really, great, people, ...
9 [oh, bloomberg, let, tell, putin, okay, good, ...
11 [know, president, russia, want, chaos]
12 [look, way, see, bernie, winning, right, democ...
13 [dug, work, bernie, team, trashed, need, presi...
...
5893 [three, thing, know, first, listen, people, ge...
5894 [secondly, someone, win, beat, donald, trump, ...
5895 [nd, finally, yeah, establishment, party, cand...
5907 [thank, great, honor, never, million, year, th...
5908 [got, chance, dollar, semester, commuter, coll...
Name: token, Length: 2267, dtype: object
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Join each token list back into a single string for the vectorizer
debate_data['text'] = debate_data['token'].apply(' '.join)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(debate_data['text'])
debate_cleaned = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=debate_data['speaker'])
# Aggregate rows by speaker so each candidate gets one row of word counts
debate_cleaned = debate_cleaned.groupby(level='speaker').sum()
debate_cleaned.head()
| speaker | people | going | president | get | think | one | need | country | make | right | ... | nose | confirm | notch | noted | nother | congratulated | notorious | nowhere | confiscation | aa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amy Klobuchar | 166 | 117 | 120 | 120 | 172 | 119 | 67 | 69 | 98 | 71 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| Bernie Sanders | 253 | 97 | 62 | 43 | 85 | 61 | 77 | 150 | 45 | 90 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Elizabeth Warren | 222 | 132 | 62 | 141 | 79 | 98 | 165 | 84 | 116 | 70 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Joe Biden | 151 | 156 | 118 | 143 | 84 | 123 | 34 | 49 | 116 | 68 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Kamala Harris | 82 | 42 | 47 | 32 | 27 | 32 | 56 | 26 | 14 | 25 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 5501 columns
from nltk import bigrams, trigrams, ngrams
# n-grams must be built within each speech's token list, not across the whole column
text_bi = debate_data['token'].apply(lambda tokens: list(bigrams(tokens)))
text_tri = debate_data['token'].apply(lambda tokens: list(trigrams(tokens)))
text_n = debate_data['token'].apply(lambda tokens: list(ngrams(tokens, 4)))
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer can also build the n-grams itself
vectorizer_bigrams = CountVectorizer(ngram_range=(2, 2))
text_bi = vectorizer_bigrams.fit_transform(debate_data['text'])
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(2, 2),
    # max_features=N  # optionally restrict to the N most frequent tokens
)
# Fit on the same pre-processed text column for consistency
text_bi = vectorizer.fit_transform(debate_data['text'])
| speaker | donald trump | united state | make sure | american people | climate change | president united | million people | insurance company | making sure | get done | ... | good size | good spot | good standard | good sure | good union | good using | good world | good year | goodbye family | zone said |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amy Klobuchar | 49 | 15 | 28 | 6 | 15 | 11 | 9 | 1 | 8 | 24 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Bernie Sanders | 21 | 25 | 10 | 42 | 28 | 11 | 26 | 23 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Elizabeth Warren | 36 | 33 | 12 | 8 | 11 | 12 | 10 | 23 | 5 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Joe Biden | 24 | 57 | 48 | 17 | 6 | 15 | 9 | 5 | 24 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| Kamala Harris | 35 | 42 | 1 | 7 | 2 | 13 | 7 | 7 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 42900 columns
# Set up CountVectorizer to capture only trigrams
vectorizer_trigrams = CountVectorizer(ngram_range=(3, 3))
# Fit and transform the text data
text_tri = vectorizer_trigrams.fit_transform(debate_data['text'])
# Convert the matrix to a DataFrame with trigram columns
dtm_trigrams = pd.DataFrame(text_tri.toarray(),
                            columns=vectorizer_trigrams.get_feature_names_out(),
                            index=debate_data['speaker'])
['pharmaceutical provision oppose', 'won given given', 'relationship entire world', 'final thing comprehensive', 'com foreign policy', 'word climate change', 'pursued grace interested', 'right question going', 'straightforward beat president', 'activities required healthcare', 'campaign right speak', 'say senator klobuchar', 'said hope mitch', 'pharma companies got', 'mass shootings american', 'money ve got', 'running president restore', 'candidates guess beat', 'say government backs', 'favorite woman president', 'time rebuilding distressed', 'paso beto god', 'assault weapon ban', 'cases charges times', 'going ability reform']
Measurement begins with a classification process; it’s about categorizing
When we conduct measurement, we start with a construct to be measured and a property (characteristic) of that construct that we can use to distinguish constructs from one another
All measurement is theory-testing (the construction and evaluation of measurement models rely on theory)
We are already making measurement decisions when we pre-process the text.
\boldsymbol{X} = \begin{bmatrix} 1 & 2 & 0 & \cdots & 0 \\ 0 & 0 & 3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 3 \end{bmatrix}
A document-term matrix
Suppose documents live in a space \leadsto rich set of results from linear algebra
| Candidate | Word 1 | Word 2 |
|---|---|---|
| Candidate 1 | 2 | 1 |
| Candidate 2 | 1 | 4 |
(4, 2)' \cdot (1, 4) = 12

a \cdot b = \|a\| \times \|b\| \times \cos \theta
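A pairwise distance matrix like the one below can be computed with scikit-learn (a sketch; it assumes the per-speaker document-term matrix debate_cleaned from above and uses cosine distance, i.e., one minus cosine similarity):

from sklearn.metrics.pairwise import cosine_distances
# Cosine distance between every pair of speakers' word-count vectors
dist = pd.DataFrame(cosine_distances(debate_cleaned),
                    index=debate_cleaned.index,
                    columns=debate_cleaned.index)
print(dist)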
| speaker | Amy Klobuchar | Bernie Sanders | Elizabeth Warren | Joe Biden | Kamala Harris | Pete Buttigieg |
|---|---|---|---|---|---|---|
| Amy Klobuchar | 3.330669e-16 | 2.305125e-01 | 1.411393e-01 | 1.813508e-01 | 2.365191e-01 | 0.123328 |
| Bernie Sanders | 2.305125e-01 | 1.110223e-16 | 1.899815e-01 | 3.081170e-01 | 2.399044e-01 | 0.246180 |
| Elizabeth Warren | 1.411393e-01 | 1.899815e-01 | 2.220446e-16 | 1.867826e-01 | 2.002306e-01 | 0.144146 |
| Joe Biden | 1.813508e-01 | 3.081170e-01 | 1.867826e-01 | 2.220446e-16 | 2.501987e-01 | 0.159975 |
| Kamala Harris | 2.365191e-01 | 2.399044e-01 | 2.002306e-01 | 2.501987e-01 | 1.110223e-16 | 0.235334 |
| Pete Buttigieg | 1.233278e-01 | 2.461802e-01 | 1.441456e-01 | 1.599755e-01 | 2.353338e-01 | 0.000000 |
\text{N} = \text{total number of documents}
\text{n}_j = \text{number of documents in which word } j \text{ occurs}
\text{idf}_j = \log \frac{N}{n_j}
\textbf{idf} = (\text{idf}_1, \text{idf}_2, \ldots, \text{idf}_J)
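In practice, tf-idf weights can be computed directly with scikit-learn rather than building the idf vector by hand (a sketch, reusing the joined 'text' column from above):

from sklearn.feature_extraction.text import TfidfVectorizer
# Term counts reweighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(debate_data['text'])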
Word embedding models: Word2Vec, FastText, and GloVe.
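A word-embedding model can be trained on the debate tokens with gensim (a sketch, not code from the workshop; the parameters are illustrative):

from gensim.models import Word2Vec
# Learn 100-dimensional embeddings from the tokenized speeches
w2v = Word2Vec(sentences=debate_data['token'].tolist(), vector_size=100, window=5, min_count=5)
# Words used in similar contexts get similar vectors
w2v.wv.most_similar('president')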
Prompt
Which senator is more liberal/conservative: (Senator1) or (Senator2)?
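A JSON response like the one below comes back from a call along these lines (a sketch using the openai Python package as of 2023; the model name is illustrative):

import openai
openai.api_key = "YOUR_API_KEY"  # placeholder
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Which senator is more liberal/conservative: (Senator1) or (Senator2)?"}],
)
print(response)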
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "The 2020 World Series was played in Texas at Globe Life Field in Arlington.",
"role": "assistant"
}
}
],
"created": 1677664795,
"id": "chatcmpl-7QyqpwdfhqwajicIEznoc6Q47XAyW",
"model": "gpt-3.5-turbo-0613",
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": 57,
"total_tokens": 74
}
}
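The answer text itself sits in the nested choices field:

answer = response["choices"][0]["message"]["content"]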
Dictionary methods (a minimal sketch follows this list)
Supervised and semi-supervised methods
“Wisdom of the crowd”
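As a minimal illustration of the first approach, a dictionary method scores each document by counting words from pre-defined lists (the mini-dictionaries here are hypothetical, drawn from tokens seen above):

# Hypothetical mini-dictionaries for illustration only
positive_words = {'great', 'good', 'win', 'honor'}
negative_words = {'chaos', 'trashed', 'trash'}

def dictionary_score(tokens):
    # Net tone: positive hits minus negative hits
    return sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)

debate_data['tone'] = debate_data['token'].apply(dictionary_score)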
| Citizen | Canvassed? | Enrolled? |
|---|---|---|
| 1 | Yes | Yes |
| 2 | Yes | Yes |
| 3 | No | No |
| 4 | No | No |
| Citizen | Canvassed? | If Canvass? | If No Canvass? | Enrolled? |
|---|---|---|---|---|
| 1 | Yes | Yes | ? | Yes |
| 2 | Yes | Yes | ? | Yes |
| 3 | No | ? | No | No |
| 4 | No | ? | No | No |
What is the true causal effect of canvassing?
We can never observe both outcomes for canvass and no canvass
The Fundamental Problem of Causal Inference: We can never observe more than one potential outcome for a given unit.
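In standard potential-outcomes notation (a textbook formulation, not from the slides), with Y_i(1) the enrollment outcome if citizen i is canvassed and Y_i(0) if not, the unit-level effect and the average treatment effect are:

\tau_i = Y_i(1) - Y_i(0) \qquad \text{ATE} = \mathbb{E}[Y_i(1) - Y_i(0)]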
| Citizen | Canvassed? | If Canvass? | If No Canvass? | Enrolled? |
|---|---|---|---|---|
| 1 | Yes | Yes | ? | Yes |
| 2 | Yes | Yes | ? | Yes |
| 3 | No | ? | No | No |
| 4 | No | ? | No | No |
| Citizen | Canvassed? | If Canvass? | If No Canvass? | Enrolled? |
|---|---|---|---|---|
| 1 | Yes | Yes | ? | Yes |
| 2 | Yes | Yes | ? | Yes |
| 3 | No | Yes | No | No |
| 4 | No | ? | No | No |
https://bit.ly/MDIWorkshopSept2023ExitTicket
Le Bao · MDI Workshop · https://baole.io/