Candidate | "Donald Trump" | |
---|---|---|
0 | Amy Klobuchar | 49 |
1 | Elizabeth Warren | 36 |
2 | Kamala Harris | 35 |
3 | Pete Buttigieg | 29 |
4 | Joe Biden | 24 |
5 | Bernie Sanders | 21 |
Text as Data: Measurement and Inference Issues
Massive Data Institute, Georgetown University
September 19, 2023
Data generation process for text \leadsto unknown
Most of these methods are designed to augment humans, not replace them
There is no globally best method
Constant validation is required
An agnostic approach to text analysis
Useful Python libraries for collecting and converting text:
Web scraping: BeautifulSoup, Selenium, etc.
APIs and structured data: requests, json, etc.
Images and OCR: Pillow (PIL), pytesseract
Audio and video: SpeechRecognition, MoviePy
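As a minimal illustration of collecting raw text from a web page (a sketch only; the URL and page structure are placeholders, not from the workshop):

import requests
from bs4 import BeautifulSoup

# Fetch a page and keep only its visible text (the URL is a placeholder)
resp = requests.get("https://example.com/debate-transcript")
soup = BeautifulSoup(resp.text, "html.parser")
page_text = soup.get_text(separator=" ", strip=True)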
As long as there have been people, people have been measuring things
Throughout history, measurement was a counting operation
People started giving it thought when they wanted to measure things that were not amenable to a simple counting process (especially during the Scientific Revolution)
S.S. Stevens: measurement is the process of assigning numbers to objects according to rules.
Candidate | "Donald Trump" | |
---|---|---|
0 | Amy Klobuchar | 49 |
1 | Elizabeth Warren | 36 |
2 | Kamala Harris | 35 |
3 | Pete Buttigieg | 29 |
4 | Joe Biden | 24 |
5 | Bernie Sanders | 21 |
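The tokenized speeches shown below could be produced with something along these lines (a sketch; it assumes a debate_data DataFrame whose 'speech' column holds the raw debate text):

import nltk
from nltk.tokenize import word_tokenize
# nltk.download('punkt')  # run once if the tokenizer models are missing
# Split each speech into a list of word and punctuation tokens
debate_data['token'] = debate_data['speech'].apply(word_tokenize)
debate_data['token']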
5 [Well, ,, you, ’, re, right, ,, the, economy, ...
9 [Oh, ,, Mr., Bloomberg, ., Let, me, tell, Mr.,...
11 [We, know, what, the, President, …, what, Russ...
12 [Look, ,, the, way, I, see, this, is, that, Be...
13 [I, dug, in, ,, I, did, the, work, and, then, ...
...
5893 [Three, things, to, know, about, me, ., First,...
5894 [Secondly, ,, I, ’, m, someone, that, can, win...
5895 [nd, finally, ,, yeah, ,, I, am, not, the, est...
5907 [Thank, you, ., It, ’, s, a, great, honor, to,...
5908 [But, I, got, my, chance, ., It, was, a, 50, d...
Name: token, Length: 2267, dtype: object
# Lowercase every token and keep only alphabetic tokens (drops punctuation and numbers)
debate_data['token'] = debate_data['token'].apply(lambda tokens: [word.lower() for word in tokens if word.isalpha()])
debate_data['token']
5 [well, you, re, right, the, economy, is, doing...
9 [oh, bloomberg, let, me, tell, putin, okay, i,...
11 [we, know, what, the, president, what, russia,...
12 [look, the, way, i, see, this, is, that, berni...
13 [i, dug, in, i, did, the, work, and, then, ber...
...
5893 [three, things, to, know, about, me, first, i,...
5894 [secondly, i, m, someone, that, can, win, and,...
5895 [nd, finally, yeah, i, am, not, the, establish...
5907 [thank, you, it, s, a, great, honor, to, be, h...
5908 [but, i, got, my, chance, it, was, a, dollar, ...
Name: token, Length: 2267, dtype: object
Common stop words: the, it, if, a, able, at, be, because, ...
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stop-word lists are missing
stop_words = set(stopwords.words('english'))
# Drop English stop words from every token list
debate_data['token'] = debate_data['token'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
debate_data['token']
5 [well, right, economy, really, great, people, ...
9 [oh, bloomberg, let, tell, putin, okay, good, ...
11 [know, president, russia, wants, chaos]
12 [look, way, see, bernie, winning, right, democ...
13 [dug, work, bernie, team, trashed, need, presi...
...
5893 [three, things, know, first, listen, people, g...
5894 [secondly, someone, win, beat, donald, trump, ...
5895 [nd, finally, yeah, establishment, party, cand...
5907 [thank, great, honor, never, million, years, t...
5908 [got, chance, dollar, semester, commuter, coll...
Name: token, Length: 2267, dtype: object
Stemming reduces related words to a common root: family, families, familial -> famili
from nltk.stem.snowball import EnglishStemmer
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # run once if WordNet is missing
stemmer = EnglishStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming truncates words to a root; lemmatization maps them to a dictionary form
stemmed_tokens = debate_data['token'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])
lemmatized_tokens = debate_data['token'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])
print(stemmed_tokens)
5 [well, right, economi, realli, great, peopl, l...
9 [oh, bloomberg, let, tell, putin, okay, good, ...
11 [know, presid, russia, want, chao]
12 [look, way, see, berni, win, right, democrat, ...
13 [dug, work, berni, team, trash, need, presid, ...
...
5893 [three, thing, know, first, listen, peopl, get...
5894 [second, someon, win, beat, donald, trump, eve...
5895 [nd, final, yeah, establish, parti, candid, go...
5907 [thank, great, honor, never, million, year, th...
5908 [got, chanc, dollar, semest, commut, colleg, l...
Name: token, Length: 2267, dtype: object
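The lemmatized tokens can be printed the same way:

print(lemmatized_tokens)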
5 [well, right, economy, really, great, people, ...
9 [oh, bloomberg, let, tell, putin, okay, good, ...
11 [know, president, russia, want, chaos]
12 [look, way, see, bernie, winning, right, democ...
13 [dug, work, bernie, team, trashed, need, presi...
...
5893 [three, thing, know, first, listen, people, ge...
5894 [secondly, someone, win, beat, donald, trump, ...
5895 [nd, finally, yeah, establishment, party, cand...
5907 [thank, great, honor, never, million, year, th...
5908 [got, chance, dollar, semester, commuter, coll...
Name: token, Length: 2267, dtype: object
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Join each token list back into a single string for the vectorizer
debate_data['text'] = debate_data['token'].apply(' '.join)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(debate_data['text'])
debate_cleaned = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=debate_data['speaker'])
# Aggregate rows by speaker so each candidate gets one row of word counts
debate_cleaned = debate_cleaned.groupby(level='speaker').sum()
debate_cleaned.head()
| speaker | people | going | president | get | think | one | need | country | make | right | ... | nose | confirm | notch | noted | nother | congratulated | notorious | nowhere | confiscation | aa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amy Klobuchar | 166 | 117 | 120 | 120 | 172 | 119 | 67 | 69 | 98 | 71 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| Bernie Sanders | 253 | 97 | 62 | 43 | 85 | 61 | 77 | 150 | 45 | 90 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Elizabeth Warren | 222 | 132 | 62 | 141 | 79 | 98 | 165 | 84 | 116 | 70 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Joe Biden | 151 | 156 | 118 | 143 | 84 | 123 | 34 | 49 | 116 | 68 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Kamala Harris | 82 | 42 | 47 | 32 | 27 | 32 | 56 | 26 | 14 | 25 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 5501 columns
from nltk import bigrams, trigrams, ngrams
# n-grams must be built within each speech's token list, not across the whole column
text_bi = debate_data['token'].apply(lambda tokens: list(bigrams(tokens)))
text_tri = debate_data['token'].apply(lambda tokens: list(trigrams(tokens)))
text_n = debate_data['token'].apply(lambda tokens: list(ngrams(tokens, 4)))
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer can also build the n-grams itself
vectorizer_bigrams = CountVectorizer(ngram_range=(2, 2))
text_bi = vectorizer_bigrams.fit_transform(debate_data['text'])
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(2, 2),
    # max_features=N  # optionally restrict to the N most frequent tokens
)
# Fit on the same pre-processed text column for consistency
text_bi = vectorizer.fit_transform(debate_data['text'])
| speaker | donald trump | united state | make sure | american people | climate change | president united | million people | insurance company | making sure | get done | ... | good size | good spot | good standard | good sure | good union | good using | good world | good year | goodbye family | zone said |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amy Klobuchar | 49 | 15 | 28 | 6 | 15 | 11 | 9 | 1 | 8 | 24 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Bernie Sanders | 21 | 25 | 10 | 42 | 28 | 11 | 26 | 23 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Elizabeth Warren | 36 | 33 | 12 | 8 | 11 | 12 | 10 | 23 | 5 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Joe Biden | 24 | 57 | 48 | 17 | 6 | 15 | 9 | 5 | 24 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| Kamala Harris | 35 | 42 | 1 | 7 | 2 | 13 | 7 | 7 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 42900 columns
# Set up CountVectorizer to capture only trigrams
vectorizer_trigrams = CountVectorizer(ngram_range=(3, 3))
# Fit and transform the text data
text_tri = vectorizer_trigrams.fit_transform(debate_data['text'])
# Convert the matrix to a DataFrame with trigram columns
dtm_trigrams = pd.DataFrame(text_tri.toarray(),
                            columns=vectorizer_trigrams.get_feature_names_out(),
                            index=debate_data['speaker'])
['pharmaceutical provision oppose', 'won given given', 'relationship entire world', 'final thing comprehensive', 'com foreign policy', 'word climate change', 'pursued grace interested', 'right question going', 'straightforward beat president', 'activities required healthcare', 'campaign right speak', 'say senator klobuchar', 'said hope mitch', 'pharma companies got', 'mass shootings american', 'money ve got', 'running president restore', 'candidates guess beat', 'say government backs', 'favorite woman president', 'time rebuilding distressed', 'paso beto god', 'assault weapon ban', 'cases charges times', 'going ability reform']
Measurement begins with a classification process; it’s about categorizing
When we conduct measurement, we start with a construct to be measured and a property (characteristic) of that construct that we can use to distinguish constructs from one another
All measurement is theory-testing (the construction and evaluation of measurement models rely on theory)
We are already making measurement decisions when we pre-process the text.
\boldsymbol{X} = \begin{bmatrix} 1 & 2 & 0 & \cdots & 0 \\ 0 & 0 & 3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 3 \end{bmatrix}
A document-term matrix
Suppose documents live in a space \leadsto rich set of results from linear algebra
| Candidate | Word 1 | Word 2 |
|---|---|---|
| Candidate 1 | 2 | 1 |
| Candidate 2 | 1 | 4 |
(4, 2)' \cdot (1, 4) = 12

a \cdot b = \|a\| \times \|b\| \times \cos \theta
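A pairwise distance matrix like the one below can be computed with scikit-learn (a sketch; it assumes the per-speaker document-term matrix debate_cleaned from above and uses cosine distance, i.e., one minus cosine similarity):

from sklearn.metrics.pairwise import cosine_distances
# Cosine distance between every pair of speakers' word-count vectors
dist = pd.DataFrame(cosine_distances(debate_cleaned),
                    index=debate_cleaned.index,
                    columns=debate_cleaned.index)
print(dist)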
| speaker | Amy Klobuchar | Bernie Sanders | Elizabeth Warren | Joe Biden | Kamala Harris | Pete Buttigieg |
|---|---|---|---|---|---|---|
| Amy Klobuchar | 3.330669e-16 | 2.305125e-01 | 1.411393e-01 | 1.813508e-01 | 2.365191e-01 | 0.123328 |
| Bernie Sanders | 2.305125e-01 | 1.110223e-16 | 1.899815e-01 | 3.081170e-01 | 2.399044e-01 | 0.246180 |
| Elizabeth Warren | 1.411393e-01 | 1.899815e-01 | 2.220446e-16 | 1.867826e-01 | 2.002306e-01 | 0.144146 |
| Joe Biden | 1.813508e-01 | 3.081170e-01 | 1.867826e-01 | 2.220446e-16 | 2.501987e-01 | 0.159975 |
| Kamala Harris | 2.365191e-01 | 2.399044e-01 | 2.002306e-01 | 2.501987e-01 | 1.110223e-16 | 0.235334 |
| Pete Buttigieg | 1.233278e-01 | 2.461802e-01 | 1.441456e-01 | 1.599755e-01 | 2.353338e-01 | 0.000000 |
\text{N} = \text{total number of documents}
\text{n}_j = \text{number of documents in which word } j \text{ occurs}
\text{idf}_j = \log \frac{N}{n_j}
\textbf{idf} = (\text{idf}_1, \text{idf}_2, \ldots, \text{idf}_J)
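In practice, tf-idf weights can be computed directly with scikit-learn rather than building the idf vector by hand (a sketch, reusing the joined 'text' column from above):

from sklearn.feature_extraction.text import TfidfVectorizer
# Term counts reweighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(debate_data['text'])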
Word embedding models: Word2Vec, FastText, and GloVe.
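A word-embedding model can be trained on the debate tokens with gensim (a sketch, not code from the workshop; the parameters are illustrative):

from gensim.models import Word2Vec
# Learn 100-dimensional embeddings from the tokenized speeches
w2v = Word2Vec(sentences=debate_data['token'].tolist(), vector_size=100, window=5, min_count=5)
# Words used in similar contexts get similar vectors
w2v.wv.most_similar('president')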
Prompt
Which senator is more liberal/conservative: (Senator1) or (Senator2)?
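A JSON response like the one below comes back from a call along these lines (a sketch using the openai Python package as of 2023; the model name is illustrative):

import openai
openai.api_key = "YOUR_API_KEY"  # placeholder
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Which senator is more liberal/conservative: (Senator1) or (Senator2)?"}],
)
print(response)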
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "The 2020 World Series was played in Texas at Globe Life Field in Arlington.",
"role": "assistant"
}
}
],
"created": 1677664795,
"id": "chatcmpl-7QyqpwdfhqwajicIEznoc6Q47XAyW",
"model": "gpt-3.5-turbo-0613",
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": 57,
"total_tokens": 74
}
}
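The answer text itself sits in the nested choices field:

answer = response["choices"][0]["message"]["content"]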
Dictionary methods (a minimal sketch follows this list)
Supervised and semi-supervised methods
“Wisdom of the crowd”
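As a minimal illustration of the first approach, a dictionary method scores each document by counting words from pre-defined lists (the mini-dictionaries here are hypothetical, drawn from tokens seen above):

# Hypothetical mini-dictionaries for illustration only
positive_words = {'great', 'good', 'win', 'honor'}
negative_words = {'chaos', 'trashed', 'trash'}

def dictionary_score(tokens):
    # Net tone: positive hits minus negative hits
    return sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)

debate_data['tone'] = debate_data['token'].apply(dictionary_score)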
| Citizen | Canvassed? | Enrolled? |
|---|---|---|
| 1 | Yes | Yes |
| 2 | Yes | Yes |
| 3 | No | No |
| 4 | No | No |
| Citizen | Canvassed? | If Canvass? | If No Canvass? | Enrolled? |
|---|---|---|---|---|
| 1 | Yes | Yes | ? | Yes |
| 2 | Yes | Yes | ? | Yes |
| 3 | No | ? | No | No |
| 4 | No | ? | No | No |
What is the true causal effect of canvassing?
We can never observe both outcomes for canvass and no canvass
The Fundamental Problem of Causal Inference: We can never observe more than one potential outcome for a given unit.
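In standard potential-outcomes notation (a textbook formulation, not from the slides), with Y_i(1) the enrollment outcome if citizen i is canvassed and Y_i(0) if not, the unit-level effect and the average treatment effect are:

\tau_i = Y_i(1) - Y_i(0) \qquad \text{ATE} = \mathbb{E}[Y_i(1) - Y_i(0)]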
| Citizen | Canvassed? | If Canvass? | If No Canvass? | Enrolled? |
|---|---|---|---|---|
| 1 | Yes | Yes | ? | Yes |
| 2 | Yes | Yes | ? | Yes |
| 3 | No | ? | No | No |
| 4 | No | ? | No | No |
| Citizen | Canvassed? | If Canvass? | If No Canvass? | Enrolled? |
|---|---|---|---|---|
| 1 | Yes | Yes | ? | Yes |
| 2 | Yes | Yes | ? | Yes |
| 3 | No | Yes | No | No |
| 4 | No | ? | No | No |
https://bit.ly/MDIWorkshopSept2023ExitTicket
Le Bao · MDI Workshop · https://baole.io/