Subnational Measurement with Geospatial Data

 

Le Bao

Massive Data Institute

October 18-19, 2022

More Exciting Events at MDI

  • MDI Distinguished Lecture
  • Next MDI Data Workshop
    • “Modeling Propensity to Use or Sell Drugs at the Block Group Level”
    • Dr. J.J. Naddeo
    • Tuesday, Nov 15 and Wednesday, Nov 16: 4-5:30pm

Front Matter

  • Introduction
    • Postdoctoral fellow at MDI
    • Political methodology
    • Polarization, environmental politics, inequality
    • Journal Political Analysis

  • The workshop materials are available at: bit.ly/mdi22ws
    • Slides, Google Colab, Python and R
    • The materials are created using Quarto

Quarto & Multi-Language Programming

  • Python code
h = "Hello"
w = "World"
msg = h + " " + w
print(msg)
Hello World


  • R code
h <- "Hello"
w <- "World"
msg <- paste(h, w)
print(msg)
[1] "Hello World"

The Goal & Plan

  • Connecting the dots
  • Looking at a problem from different perspectives
  • Weighting, aggregation, spatial statistics, multilevel regression, Bayesian statistical methods, small area estimation, MrP, kriging, …

  • Today
    • The problem of subnational measurement; different solutions; an introduction to multilevel regression and poststratification (MrP)
  • Tomorrow:
    • The problem from a spatial perspective; Bayesian spatial methods; Bayesian universal kriging

The Problem(s)

  • Some questions:
    • How to predict the presidential election results of a state based on national polls (\(N=1500\); \(\overline{n} \approx 30\))?
      • What about other public opinions in general?
    • How to estimate the air quality of a city based on sporadically located monitors?
      • What about census tracts?
    • Average housing price? Predictions of covid cases?

The Problem(s)

  • Point-to-block realignment (Banerjee, Carlin and Gelfand, 2014)
    • To use point-level data to produce aggregate estimates at the area/block level (e.g. means, counts, proportions, etc.)
    • E.g. from survey respondents to state-level presidential approval

  • Small Area Estimation (Ghosh and Rao, 1994)
    • To produce reliable estimates of variables of interest when only small samples, or no samples at all, are available.
    • “Small area” can be a small geographical area or a “small domain” (i.e. subgroup).

ANES Data

import pandas as pd
import plotly.graph_objects as go
import plotly.express as px


anes = pd.read_csv('data/anes2020.csv')
anes
           state state_abb  ...                                  educ pres_vote
0       Oklahoma        OK  ...                     Bachelor's degree     Trump
1       Virginia        VA  ...                High school credential     Biden
2     California        CA  ...  Some post-high school, no bachelor's     Biden
3       Colorado        CO  ...                       Graduate degree     Trump
4          Texas        TX  ...  Some post-high school, no bachelor's     Biden
...          ...       ...  ...                                   ...       ...
5739        Iowa        IA  ...                     Bachelor's degree     Trump
5740     Florida        FL  ...  Some post-high school, no bachelor's     Trump
5741       Idaho        ID  ...                     Bachelor's degree     Trump
5742  California        CA  ...                High school credential     Biden
5743    Virginia        VA  ...                       Graduate degree     Biden

[5744 rows x 10 columns]

ANES Data

## Vote percentage
#anes.pres_vote.value_counts()
anes.pres_vote.value_counts(normalize=True)
Biden    0.554492
Trump    0.445508
Name: pres_vote, dtype: float64


## Number of respondents by states
state_counts = anes.groupby('state_abb').size().reset_index().rename(columns={0: 'obs'})
state_counts.head(10)
  state_abb  obs
0        AK    8
1        AL   91
2        AR   42
3        AZ  105
4        CA  544
5        CO  119
6        CT   63
7        DE   17
8        FL  363
9        GA  167

The Problem(s)


Code
p1 = go.Figure(data=go.Choropleth(
    locations=state_counts['state_abb'],
    z=state_counts['obs'].astype(float),
    locationmode='USA-states',
    colorscale='bluyl',
    autocolorscale=False,
    marker_line_color='black',
    colorbar_title="Obs"
))

#### Confine map to US states and update layout
p1.update_layout(
    title_text='Number of respondents by states',
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, ),
)

#p1.show()

The Problem(s)

Code
p2 = px.histogram(anes.loc[anes['pres_vote']=="Biden"], x="educ", color="sex", marginal="box", 
                  category_orders=dict(educ = ['Less than high school credential', 
                                                  'High school credential', 
                                                  "Some post-high school, no bachelor's", 
                                                  "Bachelor's degree", 
                                                  "Graduate degree"]),
                  hover_data=anes.loc[anes['pres_vote']=="Biden"].columns)

p2 = p2.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1),
    barmode='group',
    legend_title_text="Sex",
    title_text='Support for Biden by Education Group and Sex')

p2.show()

The Solutions

  • Design-based methods
    • Simple and rely on direct observations \[ \hat{Y}_j = \frac{\sum_{i = 1}^{n_j} y_{ij}}{n_j} \qquad Var[\hat{Y}_j \mid n_j] = \frac{S_j^2}{n_j} \left(1 - \frac{n_j}{N_j}\right)\]
    • E.g. disaggregation, large samples, pooling samples, etc.
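As a toy illustration of the direct estimator above (all data made up), the point estimate and its finite-population variance can be computed in a few lines:

```python
def direct_estimate(y, N_j):
    """Design-based direct estimate for area j and its variance,
    with the finite population correction (1 - n_j / N_j)."""
    n_j = len(y)
    y_bar = sum(y) / n_j                                  # point estimate
    s2 = sum((v - y_bar) ** 2 for v in y) / (n_j - 1)     # sample variance S_j^2
    var = (s2 / n_j) * (1 - n_j / N_j)                    # Var of the estimate
    return y_bar, var

# A state with only 8 respondents (hypothetical data): the point
# estimate is easy to compute, but its variance stays large.
est, var = direct_estimate([1, 0, 0, 1, 1, 0, 1, 1], N_j=500_000)
```

With so few observations per area, the variance term dominates, which is exactly the problem the later model-based methods address.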

Disaggregation


### Recoding support for Biden
anes['pres_vote_b'] = anes['pres_vote'].map({'Trump': 0, 'Biden': 1})
anes_votes = (anes.groupby('state_abb')['pres_vote_b'].mean()*100).reset_index().rename(columns={'pres_vote_b': 'biden_share'})
anes_votes
   state_abb  biden_share
0         AK    12.500000
1         AL    35.164835
2         AR    30.952381
3         AZ    46.666667
4         CA    69.852941
5         CO    60.504202
6         CT    61.904762
7         DE    64.705882
8         FL    50.688705
9         GA    49.101796
10        HI    72.222222
11        IA    50.980392
12        ID    50.000000
13        IL    61.603376
14        IN    51.700680
15        KS    48.192771
16        KY    42.708333
17        LA    38.028169
18        MA    78.343949
19        MD    74.825175
20        ME    61.538462
21        MI    58.415842
22        MN    56.692913
23        MO    48.461538
24        MS    43.103448
25        MT    42.105263
26        NC    44.019139
27        ND    18.750000
28        NE    47.368421
29        NH    61.764706
30        NJ    66.165414
31        NM    57.142857
32        NV    57.142857
33        NY    65.371025
34        OH    44.588745
35        OK    33.783784
36        OR    78.750000
37        PA    54.000000
38        RI    50.000000
39        SC    52.577320
40        SD    23.076923
41        TN    38.931298
42        TX    47.890819
43        UT    48.529412
44        VA    60.869565
45        VT    75.000000
46        WA    71.621622
47        WI    54.385965
48        WV    31.250000
49        WY    75.000000

Disaggregation

  • A straightforward way to make state-level predictions is to just use the observed responses.
Code
p3 = go.Figure(data=go.Choropleth(
    locations=anes_votes['state_abb'],
    z=anes_votes['biden_share'].astype(float),
    locationmode='USA-states',
    colorscale='blues',
    autocolorscale=False,
    marker_line_color='white', # line markers between states
    colorbar_title="Biden's Two-Party Share"))

p3.update_layout(
    title_text="Raw Estimates of Support for Biden",
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)'),)

The Solutions

  • Design-based methods
    • Simple and rely on direct observations \[ \hat{Y}_j = \frac{\sum_{i = 1}^{n_j} y_{ij}}{n_j} \qquad Var[\hat{Y}_j \mid n_j] = \frac{S_j^2}{n_j} \left(1 - \frac{n_j}{N_j}\right) \]
    • E.g. disaggregation, large samples, pooling samples, etc.
    • The problems may remain: bias and efficiency (variance).

Design-Based Methods

The Solutions

  • Design-based methods
    • Simple and rely on direct observations \[ \hat{Y}_j = \frac{\sum_{i = 1}^{n_j} y_{ij}}{n_j} \qquad Var[\hat{Y}_j \mid n_j] = \frac{S_j^2}{n_j} \left(1 - \frac{n_j}{N_j}\right)\]
    • E.g. disaggregation, large samples, pooling samples, etc.
    • The problems may remain: bias and efficiency (variance).
    • Note that a large national sample per se would not help: the per-state subsamples remain small.
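A back-of-the-envelope calculation (assuming an even 50/50 split) shows why a typical national poll is too small at the state level:

```python
import math

# N = 1500 respondents spread across 50 states leaves about 30 per state.
N, n_states, p = 1500, 50, 0.5
n_per_state = N / n_states                       # average state subsample
se_state = math.sqrt(p * (1 - p) / n_per_state)  # state-level standard error
# se_state is about 0.09, i.e. roughly +/- 18 points at 2 standard errors
```

Doubling or tripling the national sample barely moves this: the state-level standard error shrinks only with the square root of the subsample size.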

Exercise

  • EPA: respondents by states and demographics
    • Create a plot for Trump support using disaggregation
    • What are the proportions of female and male in each state?
    • What about race and ethnicity?
    • Is there a pattern for age group and sex?

The Solutions

  • Model-based methods
    • Model-based methods posit a model of the data-generating process behind the sample data.
    • Inferences are made based on the distributions assumed by the models.
    • Advantages:
      • Increased precision of the estimation.
      • Associated variability measures can also be derived under an assumed model.

Model-Based Methods

  • Multilevel Regression with Poststratification (MRP) (today)
  • (Bayesian) Spatial Methods (tomorrow)

  • Tons of models under the umbrella of SAE

Multi-Level Regression with Poststratification

  • Often called “Mister P”
  • Modern version: Park, Gelman and Bafumi 2004
  • Like disaggregation, MRP still relies on national survey data.
  • MRP can produce more accurate and robust estimates when working with small and medium-sized samples (\(N \approx 1400\)), strongly outperforming disaggregation.
  • MRP begins by using multilevel regression to model individual survey responses as a function of demographic and geographic predictors.
  • The next step is poststratification, in which the estimates for each demographic-geographic respondent type are weighted (poststratified) by the percentages of each type in the actual state populations.

Multi-Level Regression

  • Also known as hierarchical models, mixed-effects models, etc.
  • Parameters can vary at more than one level.
  • The idea: individuals are nested in contexts.
  • E.g. students in schools, residents in states, citizens of countries, demographic groups, polls, etc.

\[\text{Pr}(y_i = 1) = logit^{-1} (\beta_0 + \alpha^{race, sex}_{j[i]} + \alpha^{age}_{k[i]} + \alpha^{edu}_{l[i]} + \alpha^{age, edu}_{k[i],l[i]} + \alpha^{state}_{s[i]})\]

\[\alpha_s^{state} \sim N(\alpha_{m[s]}^{region} + \beta^{prev} \cdot X^{prev}_s,\ \sigma^2_{state}) \]

Multi-Level Regression

  • Modelling contextual effects
  • Partially pooling respondents across states
    • Complete pooling, no pooling
    • Borrow strength from similar cases to an extent determined by the data

Multi-Level Regression

  • Modelling contextual effects
  • Partially pooling respondents across states
    • Complete pooling, no pooling
    • Borrow strength from similar cases to an extent determined by the data
  • Explicit estimation of variances
    • Comparing to robust/clustered standard errors
  • In the context of MRP:
    • To include contextual effects
    • Allow partial pooling between states—states with fewer respondents can borrow strength from other states.
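The partial-pooling idea can be sketched as a precision-weighted average between a state's raw mean and the overall mean. This is a deliberately simplified sketch of what the multilevel model does, with the variance components treated as known and all numbers hypothetical:

```python
def partial_pool(y_bar_s, n_s, mu, sigma2_y, sigma2_state):
    """Shrink a state's raw mean toward the overall mean mu.

    y_bar_s: raw state mean; n_s: state sample size;
    sigma2_y: within-state variance; sigma2_state: between-state variance.
    The weight w grows with n_s: big states keep their own mean,
    small states are pulled toward mu.
    """
    w = (n_s / sigma2_y) / (n_s / sigma2_y + 1 / sigma2_state)
    return w * y_bar_s + (1 - w) * mu

# A state with only 8 respondents and a raw mean of 0.125 is pulled
# strongly toward the overall mean of 0.55 (all numbers hypothetical):
shrunk = partial_pool(y_bar_s=0.125, n_s=8, mu=0.55,
                      sigma2_y=0.25, sigma2_state=0.01)
```

Complete pooling corresponds to w = 0 (every state gets the national mean), no pooling to w = 1 (every state keeps its raw mean); the multilevel model sits in between, with the weight estimated from the data.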

Poststratification

  • Poststratify the demographic-geographic types.
  • The multi-level logistic regression estimates the probability that any person will support Biden given the person’s sex, race, age, education, and state.
  • Compute weighted averages of these probabilities to estimate the proportion of supporters in each state.

\[y^{MRP}_{s} = \frac{\sum_{c\in s} N_c \theta_c}{\sum_{c\in s} N_c}\]
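A worked toy example of this poststratification step, where each cell's model-based prediction is weighted by its population count (all counts and cell estimates are hypothetical):

```python
# Hypothetical cells c within one state:
# (population count N_c, model-based cell estimate theta_c)
cells = [(110_000, 0.80),
         (250_000, 0.55),
         (140_000, 0.35)]

# Population-weighted average of the cell-level predictions
mrp_estimate = (sum(N_c * theta for N_c, theta in cells)
                / sum(N_c for N_c, _ in cells))
```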

Intuition

  • Multilevel regression:
    • An 18-29 year old African American female with a bachelor’s degree in California has an 80% chance of voting for Biden.
  • Poststratification:
    • 11% of voters in California belong to this specific demographic group.

MRP: Implementation

  1. Gather national polls.
  2. Organize polling data and aggregate-level variables into cells.
  3. Fit a multilevel regression model for individual survey responses given demographics and geography.
  4. Use census data to create cells for poststratification.
  5. Make predictions for census cells based on model estimates.

MRP: Implementation

  • Cells:
Code
anes_sub = anes.groupby(['state_abb','race_ethnicity', 'sex', 'educ', 'age_grp', 'region', 
                    'state_clinton'])['pres_vote_b'].agg(pres_vote_b='sum').reset_index()
anes_sub['n'] = anes.groupby(['state_abb','race_ethnicity', 'sex', 'educ', 'age_grp', 'region', 
                    'state_clinton'])['pres_vote_b'].agg(n='count').reset_index()['n']
anes_sub[["state_abb", "race_ethnicity", "sex", "educ", "age_grp", "pres_vote_b", "n"]]
     state_abb race_ethnicity     sex  ...       age_grp pres_vote_b  n
0           AK          White  Female  ...         50-59           1  1
1           AK          White  Female  ...  70 and older           0  1
2           AK          White  Female  ...  70 and older           0  1
3           AK          White    Male  ...  70 and older           0  1
4           AK          White    Male  ...         18-29           0  1
...        ...            ...     ...  ...           ...         ... ..
2626        WV          White    Male  ...         60-69           0  1
2627        WV          White    Male  ...  70 and older           0  1
2628        WY          White  Female  ...         60-69           2  2
2629        WY          White  Female  ...         40-49           0  1
2630        WY          White    Male  ...  70 and older           1  1

[2631 rows x 7 columns]

MRP: Implementation



Google Colab

MRP: Additional Thoughts

  • It is a generic approach for stabilizing aggregate estimates.
    • Wang et al. (2015) used MRP to obtain estimates of voting behavior in the 2012 US Presidential election based on a sample of 350,000 Xbox users, empaneled 45 days prior to the election.

MRP: Additional Thoughts

  • Data quality is still an extremely important factor.
    • What if the poll doesn’t tell us anything about a 40-49 year old white male with a high school degree in WY?
    • MRP can only correct it to a degree.

MRP: Additional Thoughts

  • Methodological issues:
    • Rely on the quality of census data
    • No (real) geographical information
    • The curse of dimensionality: deep interactions


Bayesian Spatial Methods

  • Spatially referenced data (latitude–longitude, Easting–Northing, etc.)

  • Geographically neighboring observations are more similar than distant observations
    • Neighboring units are similar because of the shared contexts
    • Spatial dependence can be the result of interaction and diffusion
    • Providing information in addition to observed data

  • To incorporate spatial correlation as an important modeling ingredient.
    • With the recent advances in computational methods (particularly in the area of Monte Carlo methods)

Bayesian and MCMC: A Teaser

  • Bayesian statistical methods (mid-20th century):

\[\pi(\theta|\mathbf{X}) = \frac{p(\theta)L(\theta|\mathbf{X})}{\int_{\theta}p(\theta)L(\theta|\mathbf{X})d\theta} \propto p(\theta)L(\theta|\mathbf{X})\] \[\text{Posterior Probability} \propto \text{Prior Probability} \times \text{Likelihood Function}\]  
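A minimal conjugate example of this updating rule (a Beta prior with a binomial likelihood; all numbers made up):

```python
# Beta(a, b) prior on a proportion, then observe k successes in n trials.
# The posterior is Beta(a + k, b + n - k): prior times likelihood,
# renormalized, exactly as in the formula above.
a, b = 2, 2                                   # hypothetical prior pseudo-counts
k, n = 7, 10                                  # hypothetical data
post_a, post_b = a + k, b + n - k             # conjugate update
posterior_mean = post_a / (post_a + post_b)   # between prior mean and k/n
```

The posterior mean (9/14, about 0.64) sits between the prior mean (0.5) and the sample proportion (0.7), another form of the pooling idea from earlier.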

Bayesian and MCMC: A Teaser

  • A cohesive framework for combining complex data models and external and prior knowledge.

  • Unknown parameters are random variables; the observed data is finite and fixed.
    • Frequentist approach: fixed parameters and (hypothetically) infinite data

  • Bayesian framework also handles prediction cohesively and elegantly
    • Measurement as a prediction problem
    • Bayesian framework treats the prediction as the same process as the marginal distribution of observed data.

Bayesian and MCMC: A Teaser

  • Markov chain Monte Carlo (MCMC)
    • To solve high-dimensional problems
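A toy Metropolis sampler for a one-dimensional standard normal target illustrates the core idea (a pedagogical sketch, not a production sampler):

```python
import math
import random

random.seed(0)

def log_target(x):
    # Unnormalized log-density of a standard normal target
    return -0.5 * x * x

x, draws = 0.0, []
for _ in range(20_000):
    proposal = x + random.uniform(-1.0, 1.0)   # symmetric random-walk proposal
    # Accept with probability min(1, target(proposal) / target(x))
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        x = proposal
    draws.append(x)

sample_mean = sum(draws) / len(draws)          # should be close to 0
```

The same accept/reject logic scales to high-dimensional posteriors where the normalizing integral in Bayes' rule is intractable, which is why MCMC made Bayesian spatial models practical.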

Spatial Model and Gaussian Process

  • \(s\) is a finite set of \(n\) observed sites \(\{s_1, s_2, ..., s_n\}\) over a domain \(\mathcal{D}\) (\(s \in \mathcal{D}\))
  • \(\boldsymbol{\omega}\) is a random field that accounts for the spatial correlation across the spatial domain \(\mathcal{D}\): \(\{\boldsymbol{\omega}(\mathbf{s}): \mathbf{s} \in \mathcal{D}\}\)
  • Outcome: \(\mathbf{y}(\mathbf{s}) = \{\mathbf{y}(\mathbf{s}_1),\mathbf{y}(\mathbf{s}_2),...,\mathbf{y}(\mathbf{s}_n)\}\)
  • Predictors: \(\mathbf{X}(\mathbf{s}) = \{\mathbf{X}(\mathbf{s}_1),\mathbf{X}(\mathbf{s}_2),...,\mathbf{X}(\mathbf{s}_n)\}\)

Spatial Model and Gaussian Process

  • \(s\) is a finite set of \(n\) observed sites \(\{s_1, s_2, ..., s_n\}\) over a domain \(\mathcal{D}\) (\(s \in \mathcal{D}\))
  • \(\boldsymbol{\omega}\) is a random field that accounts for the spatial correlation across the spatial domain \(\mathcal{D}\): \(\{\boldsymbol{\omega}(\mathbf{s}): \mathbf{s} \in \mathcal{D}\}\)
  • Outcome: \(\mathbf{y}(\mathbf{s}) = \{\mathbf{y}(\mathbf{s}_1),\mathbf{y}(\mathbf{s}_2),...,\mathbf{y}(\mathbf{s}_n)\}\)
  • Predictors: \(\mathbf{X}(\mathbf{s}) = \{\mathbf{X}(\mathbf{s}_1),\mathbf{X}(\mathbf{s}_2),...,\mathbf{X}(\mathbf{s}_n)\}\)

\[ \mathbf{y}(\mathbf{s}) = \mathbf{X}(\mathbf{s})\mathbf{\beta} + \boldsymbol{\omega}(\mathbf{s}) + \boldsymbol{\epsilon}(\mathbf{s}) \]

\[\boldsymbol\omega(\mathbf{s}) \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{H}(\phi))\]

\[\boldsymbol\epsilon(\mathbf{s}) \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\]

Spatial Model and Gaussian Process

\[ \mathbf{y}(\mathbf{s}) = \mathbf{X}(\mathbf{s})\mathbf{\beta} + \boldsymbol{\omega}(\mathbf{s}) + \boldsymbol{\epsilon}(\mathbf{s}) \]

\[\boldsymbol\omega(\mathbf{s}) \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{H}(\phi))\]

\[\boldsymbol\epsilon(\mathbf{s}) \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\]

  • Idiosyncratic error term \(\boldsymbol\epsilon\):
    • \(\tau^2\): nugget, error variance independent from spatial separation.
  • Spatial error term \(\boldsymbol\omega\):
    • \(\sigma^2\): partial sill, the variance that can be driven by geographic distance between two observations
    • \(\phi\): decay, governing how quickly correlation declines with distance
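Under an exponential correlation function (one common choice), these three parameters combine into the covariance between two sites a distance \(d\) apart; a minimal sketch:

```python
import math

def exponential_cov(d, sigma2, phi, tau2):
    """Covariance between two sites at distance d under an exponential
    model: partial sill sigma2, decay phi, nugget tau2."""
    nugget = tau2 if d == 0 else 0.0      # nugget contributes only at d = 0
    return sigma2 * math.exp(-phi * d) + nugget

# At distance 0 the total variance (the sill) is sigma2 + tau2;
# as d grows, the spatial part decays toward 0 at a rate set by phi.
```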

Bayesian Kriging Regression

  • Kriging models, named after statistician and mining engineer Daniel G. Krige, originated in the areas of mining and geostatistics that involve spatially correlated data.

  • Create a smooth surface over two non-overlapping contiguous geographic regions.

Bayesian Kriging Regression

  • Also known as Gaussian Random Field (GRF) or Gaussian Process (GP)

  • Other covariance structures or special cases of the Gaussian process can also be used, such as exponential, spherical, wave, or Matérn.

  • Advantages:

    • A correlation structure on the error terms.
    • An optimized (linear) combination of the data at observed points to fill in unsampled areas.
    • Built-in prediction process.
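The "optimized linear combination" can be made concrete with a toy simple-kriging prediction in one dimension, assuming an exponential covariance and a known zero mean (all numbers hypothetical):

```python
import math

def cov(d, sigma2=1.0, phi=1.0):
    # Exponential covariance (no nugget, for simplicity)
    return sigma2 * math.exp(-phi * abs(d))

sites = [0.0, 2.0]          # 1-D coordinates of the observed sites
y = [1.0, -0.5]             # observed values (mean assumed known and zero)
s0 = 0.8                    # unobserved site to predict

# Solve the 2x2 kriging system K w = c0 by Cramer's rule:
# K holds covariances among observed sites, c0 between them and s0.
K = [[cov(0.0), cov(sites[0] - sites[1])],
     [cov(sites[1] - sites[0]), cov(0.0)]]
c0 = [cov(s0 - sites[0]), cov(s0 - sites[1])]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
w0 = (c0[0] * K[1][1] - K[0][1] * c0[1]) / det
w1 = (K[0][0] * c0[1] - K[1][0] * c0[0]) / det

prediction = w0 * y[0] + w1 * y[1]   # optimal linear combination of the data
```

Note how the nearer site (at 0.0) gets the larger weight; with many sites the same system is solved with a full covariance matrix, and the Bayesian version places priors on the covariance parameters.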

Bayesian Kriging Regression: Implementation

  • Training: estimate the model with observed points
  • Testing: make predictions for unobserved points based on the estimates
  • A collection of points \(\leadsto\) aggregate level estimates



Google Colab