Subnational Measurement with Geospatial Data

 

Le Bao

Massive Data Institute

October 18-19, 2022

More Exciting Events at MDI

  • MDI Distinguished Lecture
  • Next MDI Data Workshop
    • “Modeling Propensity to Use or Sell Drugs at the Block Group Level”
    • Dr. J.J. Naddeo
    • Tuesday, Nov 15 and Wednesday, Nov 16: 4-5:30pm

Front Matter

  • Introduction
    • Postdoctoral fellow at MDI
    • Political methodology
    • Polarization, environmental politics, inequality
    • Journal Political Analysis

  • The workshop materials are available at: bit.ly/mdi22ws
    • Slides, Google Colab, Python and R
    • The materials are created using Quarto

Quarto & Multi-Language Programming

  • Python code
h = "Hello"
w = "World"
msg = h + " " + w
print(msg)
Hello World


  • R code
h <- "Hello"
w <- "World"
msg <- paste(h, w)
print(msg)
[1] "Hello World"

The Goal & Plan

  • Connecting the dots
  • Looking at a problem from different perspectives
  • Weighting, aggregation, spatial statistics, multilevel regression, Bayesian statistical methods, small area estimation, MrP, kriging, …

  • Today
    • The problem of subnational measurement; different solutions; an introduction to multilevel regression and poststratification (MrP)
  • Tomorrow:
    • The problem from a spatial perspective; Bayesian spatial methods; Bayesian universal kriging

The Problem(s)

  • Some questions:
    • How to predict the presidential election results of a state based on national polls (\(N=1500\); \(\overline{n} \approx 30\))?
      • What about other public opinions in general?
    • How to estimate the air quality of a city based on sporadically located monitors?
      • What about census tracts?
    • Average housing price? Predictions of covid cases?

The Problem(s)

  • Point-to-block realignment (Banerjee, Carlin and Gelfand, 2014)
    • To use point-level data to produce aggregate estimates at the area/block level (e.g. means, counts, proportions, etc.)
    • E.g. from survey respondents to state-level presidential approval

  • Small Area Estimation (Ghosh and Rao, 1994)
    • To produce reliable estimates of variables of interest when only small samples, or no samples at all, are available.
    • “Small area” can be a small geographical area or a “small domain” (i.e. subgroup).

ANES Data

import pandas as pd
import plotly.graph_objects as go
import plotly.express as px


anes = pd.read_csv('data/anes2020.csv')
anes
           state state_abb  ...                                  educ pres_vote
0       Oklahoma        OK  ...                     Bachelor's degree     Trump
1       Virginia        VA  ...                High school credential     Biden
2     California        CA  ...  Some post-high school, no bachelor's     Biden
3       Colorado        CO  ...                       Graduate degree     Trump
4          Texas        TX  ...  Some post-high school, no bachelor's     Biden
...          ...       ...  ...                                   ...       ...
5739        Iowa        IA  ...                     Bachelor's degree     Trump
5740     Florida        FL  ...  Some post-high school, no bachelor's     Trump
5741       Idaho        ID  ...                     Bachelor's degree     Trump
5742  California        CA  ...                High school credential     Biden
5743    Virginia        VA  ...                       Graduate degree     Biden

[5744 rows x 10 columns]

ANES Data

## Vote percentage
#anes.pres_vote.value_counts()
anes.pres_vote.value_counts(normalize=True)
Biden    0.554492
Trump    0.445508
Name: pres_vote, dtype: float64


## Number of respondents by states
state_counts = anes.groupby('state_abb').size().reset_index().rename(columns={0: 'obs'})
state_counts.head(10)
  state_abb  obs
0        AK    8
1        AL   91
2        AR   42
3        AZ  105
4        CA  544
5        CO  119
6        CT   63
7        DE   17
8        FL  363
9        GA  167

The Problem(s)


Code
p1 = go.Figure(data=go.Choropleth(
    locations=state_counts['state_abb'],
    z=state_counts['obs'].astype(float),
    locationmode='USA-states',
    colorscale='bluyl',
    autocolorscale=False,
    marker_line_color='black',
    colorbar_title="Obs"
))

#### Confine map to US states and update layout
p1.update_layout(
    title_text='Number of respondents by states',
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, ),
)

#p1.show()

The Problem(s)

Code
p2 = px.histogram(anes.loc[anes['pres_vote']=="Biden"], x="educ", color="sex", marginal="box", 
                  category_orders=dict(educ = ['Less than high school credential', 
                                                  'High school credential', 
                                                  "Some post-high school, no bachelor's", 
                                                  "Bachelor's degree", 
                                                  "Graduate degree"]),
                  hover_data=anes.loc[anes['pres_vote']=="Biden"].columns)

p2 = p2.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1),
    barmode='group',
    legend_title_text="Sex",
    title_text='Support for Biden by Education Group and Sex')

p2.show()

The Solutions

  • Design-based methods
    • Simple and rely on direct observations \[ \hat{Y}_j = \frac{\sum_{i = 1}^{n_j} y_{ij}}{n_j} \qquad Var[\hat{Y}_j \mid n_j] = \frac{S_j^2}{n_j} \left(1 - \frac{n_j}{N_j}\right)\]
    • E.g. disaggregation, large samples, pooling samples, etc.
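As a toy illustration of the direct estimator above (all data made up), the point estimate and its finite-population variance can be computed in a few lines:

```python
def direct_estimate(y, N_j):
    """Design-based direct estimate for area j and its variance,
    with the finite population correction (1 - n_j / N_j)."""
    n_j = len(y)
    y_bar = sum(y) / n_j                                  # point estimate
    s2 = sum((v - y_bar) ** 2 for v in y) / (n_j - 1)     # sample variance S_j^2
    var = (s2 / n_j) * (1 - n_j / N_j)                    # Var of the estimate
    return y_bar, var

# A state with only 8 respondents (hypothetical data): the point
# estimate is easy to compute, but its variance stays large.
est, var = direct_estimate([1, 0, 0, 1, 1, 0, 1, 1], N_j=500_000)
```

With so few observations per area, the variance term dominates, which is exactly the problem the later model-based methods address.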

Disaggregation


### Recoding support for Biden
anes['pres_vote_b'] = anes['pres_vote'].map({'Trump': 0, 'Biden': 1})
anes_votes = (anes.groupby('state_abb')['pres_vote_b'].mean()*100).reset_index().rename(columns={'pres_vote_b': 'biden_share'})
anes_votes
   state_abb  biden_share
0         AK    12.500000
1         AL    35.164835
2         AR    30.952381
3         AZ    46.666667
4         CA    69.852941
5         CO    60.504202
6         CT    61.904762
7         DE    64.705882
8         FL    50.688705
9         GA    49.101796
10        HI    72.222222
11        IA    50.980392
12        ID    50.000000
13        IL    61.603376
14        IN    51.700680
15        KS    48.192771
16        KY    42.708333
17        LA    38.028169
18        MA    78.343949
19        MD    74.825175
20        ME    61.538462
21        MI    58.415842
22        MN    56.692913
23        MO    48.461538
24        MS    43.103448
25        MT    42.105263
26        NC    44.019139
27        ND    18.750000
28        NE    47.368421
29        NH    61.764706
30        NJ    66.165414
31        NM    57.142857
32        NV    57.142857
33        NY    65.371025
34        OH    44.588745
35        OK    33.783784
36        OR    78.750000
37        PA    54.000000
38        RI    50.000000
39        SC    52.577320
40        SD    23.076923
41        TN    38.931298
42        TX    47.890819
43        UT    48.529412
44        VA    60.869565
45        VT    75.000000
46        WA    71.621622
47        WI    54.385965
48        WV    31.250000
49        WY    75.000000

Disaggregation

  • A straightforward way to make state-level predictions is to just use the observed responses.
Code
p3 = go.Figure(data=go.Choropleth(
    locations=anes_votes['state_abb'],
    z=anes_votes['biden_share'].astype(float),
    locationmode='USA-states',
    colorscale='blues',
    autocolorscale=False,
    marker_line_color='white', # line markers between states
    colorbar_title="Biden's Two-Party Share"))

p3.update_layout(
    title_text="Raw Estimates of Support for Biden",
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)'),)

The Solutions

  • Design-based methods
    • Simple and rely on direct observations \[ \hat{Y}_j = \frac{\sum_{i = 1}^{n_j} y_{ij}}{n_j} \qquad Var[\hat{Y}_j \mid n_j] = \frac{S_j^2}{n_j} \left(1 - \frac{n_j}{N_j}\right) \]
    • E.g. disaggregation, large samples, pooling samples, etc.
    • The problems may remain: bias and efficiency (variance).

Design-Based Methods

The Solutions

  • Design-based methods
    • Simple and rely on direct observations \[ \hat{Y}_j = \frac{\sum_{i = 1}^{n_j} y_{ij}}{n_j} \qquad Var[\hat{Y}_j \mid n_j] = \frac{S_j^2}{n_j} \left(1 - \frac{n_j}{N_j}\right)\]
    • E.g. disaggregation, large samples, pooling samples, etc.
    • The problems may remain: bias and efficiency (variance).
    • Note that a large national sample per se would not help: the per-state subsamples remain small.
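A back-of-the-envelope calculation (assuming an even 50/50 split) shows why a typical national poll is too small at the state level:

```python
import math

# N = 1500 respondents spread across 50 states leaves about 30 per state.
N, n_states, p = 1500, 50, 0.5
n_per_state = N / n_states                       # average state subsample
se_state = math.sqrt(p * (1 - p) / n_per_state)  # state-level standard error
# se_state is about 0.09, i.e. roughly +/- 18 points at 2 standard errors
```

Doubling or tripling the national sample barely moves this: the state-level standard error shrinks only with the square root of the subsample size.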

Exercise

  • EPA: respondents by states and demographics
    • Create a plot for Trump support using disaggregation
    • What are the proportions of female and male in each state?
    • What about race and ethnicity?
    • Is there a pattern for age group and sex?

The Solutions

  • Model-based methods
    • Model-based methods posit a model of the data-generating process behind the sample data.
    • Inferences are made based on the distributions assumed by the models.
    • Advantages:
      • Increased precision of the estimation.
      • Associated variability measures can also be derived under an assumed model.

Model-Based Methods

  • Multilevel Regression with Poststratification (MRP) (today)
  • (Bayesian) Spatial Methods (tomorrow)

  • Tons of models under the umbrella of SAE

Multi-Level Regression with Poststratification

  • Often called “Mister P”
  • Modern version: Park, Gelman and Bafumi 2004
  • Like disaggregation, MRP still relies on national survey data.
  • MRP can produce more accurate and robust estimates when working with small and medium-sized samples (\(N \approx 1400\)), strongly outperforming disaggregation.
  • MRP begins by using multilevel regression to model individual survey responses as a function of demographic and geographic predictors.
  • The next step is poststratification, in which the estimates for each demographic-geographic respondent type are weighted (poststratified) by the percentages of each type in the actual state populations.

Multi-Level Regression

  • Also known as hierarchical models, mixed-effects models, etc.
  • Parameters can vary at more than one level.
  • The idea: individuals are nested in contexts.
  • E.g. students in schools, residents in states, citizens of countries, demographic groups, polls, etc.

\[\text{Pr}(y_i = 1) = logit^{-1} (\beta_0 + \alpha^{race, sex}_{j[i]} + \alpha^{age}_{k[i]} + \alpha^{edu}_{l[i]} + \alpha^{age, edu}_{k[i],l[i]} + \alpha^{state}_{s[i]})\]

\[\alpha_s^{state} \sim N(\alpha_{m[s]}^{region} + \beta^{prev} \cdot X^{prev}_s,\ \sigma^2_{state}) \]

Multi-Level Regression

  • Modelling contextual effects
  • Partially pooling respondents across states
    • Complete pooling, no pooling
    • Borrow strength from similar cases to an extent determined by the data

Multi-Level Regression

  • Modelling contextual effects
  • Partially pooling respondents across states
    • Complete pooling, no pooling
    • Borrow strength from similar cases to an extent determined by the data
  • Explicit estimation of variances
    • Comparing to robust/clustered standard errors
  • In the context of MRP:
    • To include contextual effects
    • Allow partial pooling between states—states with fewer respondents can borrow strength from other states.
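The partial-pooling idea can be sketched as a precision-weighted average between a state's raw mean and the overall mean. This is a deliberately simplified sketch of what the multilevel model does, with the variance components treated as known and all numbers hypothetical:

```python
def partial_pool(y_bar_s, n_s, mu, sigma2_y, sigma2_state):
    """Shrink a state's raw mean toward the overall mean mu.

    y_bar_s: raw state mean; n_s: state sample size;
    sigma2_y: within-state variance; sigma2_state: between-state variance.
    The weight w grows with n_s: big states keep their own mean,
    small states are pulled toward mu.
    """
    w = (n_s / sigma2_y) / (n_s / sigma2_y + 1 / sigma2_state)
    return w * y_bar_s + (1 - w) * mu

# A state with only 8 respondents and a raw mean of 0.125 is pulled
# strongly toward the overall mean of 0.55 (all numbers hypothetical):
shrunk = partial_pool(y_bar_s=0.125, n_s=8, mu=0.55,
                      sigma2_y=0.25, sigma2_state=0.01)
```

Complete pooling corresponds to w = 0 (every state gets the national mean), no pooling to w = 1 (every state keeps its raw mean); the multilevel model sits in between, with the weight estimated from the data.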

Poststratification

  • Poststratify the demographic-geographic types.
  • The multi-level logistic regression estimates the probability that any person will support Biden given the person’s sex, race, age, education, and state.
  • Compute weighted averages of these probabilities to estimate the proportion of supporters in each state.

\[y^{MRP}_{s} = \frac{\sum_{c\in s} N_c \theta_c}{\sum_{c\in s} N_c}\]
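A worked toy example of this poststratification step, where each cell's model-based prediction is weighted by its population count (all counts and cell estimates are hypothetical):

```python
# Hypothetical cells c within one state:
# (population count N_c, model-based cell estimate theta_c)
cells = [(110_000, 0.80),
         (250_000, 0.55),
         (140_000, 0.35)]

# Population-weighted average of the cell-level predictions
mrp_estimate = (sum(N_c * theta for N_c, theta in cells)
                / sum(N_c for N_c, _ in cells))
```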

Intuition

  • Multilevel regression:
    • An 18-29 year old African American female with a bachelor’s degree in California has an 80% chance of voting for Biden.
  • Poststratification:
    • 11% of voters in California belong to this specific demographic group.

MRP: Implementation

  1. Gather national polls.
  2. Organize polling data and aggregate-level variables into cells.
  3. Fit a multilevel regression model for individual survey responses given demographics and geography.
  4. Use census data to create cells for poststratification.
  5. Make predictions for census cells based on model estimates.

MRP: Implementation

  • Cells:
Code
anes_sub = anes.groupby(['state_abb','race_ethnicity', 'sex', 'educ', 'age_grp', 'region', 
                    'state_clinton'])['pres_vote_b'].agg(pres_vote_b='sum').reset_index()
anes_sub['n'] = anes.groupby(['state_abb','race_ethnicity', 'sex', 'educ', 'age_grp', 'region', 
                    'state_clinton'])['pres_vote_b'].agg(n='count').reset_index()['n']
anes_sub[["state_abb", "race_ethnicity", "sex", "educ", "age_grp", "pres_vote_b", "n"]]
     state_abb race_ethnicity     sex  ...       age_grp pres_vote_b  n
0           AK          White  Female  ...         50-59           1  1
1           AK          White  Female  ...  70 and older           0  1
2           AK          White  Female  ...  70 and older           0  1
3           AK          White    Male  ...  70 and older           0  1
4           AK          White    Male  ...         18-29           0  1
...        ...            ...     ...  ...           ...         ... ..
2626        WV          White    Male  ...         60-69           0  1
2627        WV          White    Male  ...  70 and older           0  1
2628        WY          White  Female  ...         60-69           2  2
2629        WY          White  Female  ...         40-49           0  1
2630        WY          White    Male  ...  70 and older           1  1

[2631 rows x 7 columns]

MRP: Implementation



Google Colab

MRP: Additional Thoughts

  • It is a generic approach for stabilizing aggregate estimates.
    • Wang et al. (2015) used MRP to obtain estimates of voting behavior in the 2012 US Presidential election based on a sample of 350,000 Xbox users, empaneled 45 days prior to the election.

MRP: Additional Thoughts

  • Data quality is still an extremely important factor.
    • What if the poll doesn’t tell us anything about a 40-49 year old white male with a high school degree in WY?
    • MRP can only correct it to a degree.

MRP: Additional Thoughts

  • Methodological issues:
    • Rely on the quality of census data
    • No (real) geographical information
    • The curse of dimensionality: deep interactions


Bayesian Spatial Methods

  • Spatially referenced data (latitude–longitude, Easting–Northing, etc.)

  • Geographically neighboring observations are more similar than distant observations
    • Neighboring units are similar because of the shared contexts
    • Spatial dependence can be the result of interaction and diffusion
    • Providing information in addition to observed data

  • To incorporate spatial correlation as an important modeling ingredient.
    • With the recent advances in computational methods (particularly in the area of Monte Carlo methods)

Bayesian and MCMC: A Teaser

  • Bayesian statistical methods (mid-20th century):

\[\pi(\theta|\mathbf{X}) = \frac{p(\theta)L(\theta|\mathbf{X})}{\int_{\theta}p(\theta)L(\theta|\mathbf{X})d\theta} \propto p(\theta)L(\theta|\mathbf{X})\] \[\text{Posterior Probability} \propto \text{Prior Probability} \times \text{Likelihood Function}\]  
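A minimal conjugate example of this updating rule (a Beta prior with a binomial likelihood; all numbers made up):

```python
# Beta(a, b) prior on a proportion, then observe k successes in n trials.
# The posterior is Beta(a + k, b + n - k): prior times likelihood,
# renormalized, exactly as in the formula above.
a, b = 2, 2                                   # hypothetical prior pseudo-counts
k, n = 7, 10                                  # hypothetical data
post_a, post_b = a + k, b + n - k             # conjugate update
posterior_mean = post_a / (post_a + post_b)   # between prior mean and k/n
```

The posterior mean (9/14, about 0.64) sits between the prior mean (0.5) and the sample proportion (0.7), another form of the pooling idea from earlier.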

Bayesian and MCMC: A Teaser

  • A cohesive framework for combining complex data models and external and prior knowledge.

  • Unknown parameters are random variables; the observed data is finite and fixed.
    • Frequentist approach: fixed parameters and (hypothetically) infinite data

  • Bayesian framework also handles prediction cohesively and elegantly
    • Measurement as a prediction problem
    • Bayesian framework treats the prediction as the same process as the marginal distribution of observed data.

Bayesian and MCMC: A Teaser

  • Markov chain Monte Carlo (MCMC)
    • To solve high-dimensional problems
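A toy Metropolis sampler for a one-dimensional standard normal target illustrates the core idea (a pedagogical sketch, not a production sampler):

```python
import math
import random

random.seed(0)

def log_target(x):
    # Unnormalized log-density of a standard normal target
    return -0.5 * x * x

x, draws = 0.0, []
for _ in range(20_000):
    proposal = x + random.uniform(-1.0, 1.0)   # symmetric random-walk proposal
    # Accept with probability min(1, target(proposal) / target(x))
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        x = proposal
    draws.append(x)

sample_mean = sum(draws) / len(draws)          # should be close to 0
```

The same accept/reject logic scales to high-dimensional posteriors where the normalizing integral in Bayes' rule is intractable, which is why MCMC made Bayesian spatial models practical.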

Spatial Model and Gaussian Process

  • \(s\) is a finite set of \(n\) observed sites \(\{s_1, s_2, ..., s_n\}\) over a domain \(\mathcal{D}\) (\(s \in \mathcal{D}\))
  • \(\boldsymbol{\omega}\) is a random field that accounts for the spatial correlation across the spatial domain \(\mathcal{D}\): \(\{\boldsymbol{\omega}(\mathbf{s}): \mathbf{s} \in \mathcal{D}\}\)
  • Outcome: \(\mathbf{y}(\mathbf{s}) = \{\mathbf{y}(\mathbf{s}_1),\mathbf{y}(\mathbf{s}_2),...,\mathbf{y}(\mathbf{s}_n)\}\)
  • Predictors: \(\mathbf{X}(\mathbf{s}) = \{\mathbf{X}(\mathbf{s}_1),\mathbf{X}(\mathbf{s}_2),...,\mathbf{X}(\mathbf{s}_n)\}\)

Spatial Model and Gaussian Process

  • \(s\) is a finite set of \(n\) observed sites \(\{s_1, s_2, ..., s_n\}\) over a domain \(\mathcal{D}\) (\(s \in \mathcal{D}\))
  • \(\boldsymbol{\omega}\) is a random field that accounts for the spatial correlation across the spatial domain \(\mathcal{D}\): \(\{\boldsymbol{\omega}(\mathbf{s}): \mathbf{s} \in \mathcal{D}\}\)
  • Outcome: \(\mathbf{y}(\mathbf{s}) = \{\mathbf{y}(\mathbf{s}_1),\mathbf{y}(\mathbf{s}_2),...,\mathbf{y}(\mathbf{s}_n)\}\)
  • Predictors: \(\mathbf{X}(\mathbf{s}) = \{\mathbf{X}(\mathbf{s}_1),\mathbf{X}(\mathbf{s}_2),...,\mathbf{X}(\mathbf{s}_n)\}\)

\[ \mathbf{y}(\mathbf{s}) = \mathbf{X}(\mathbf{s})\mathbf{\beta} + \boldsymbol{\omega}(\mathbf{s}) + \boldsymbol{\epsilon}(\mathbf{s}) \]

\[\boldsymbol\omega(\mathbf{s}) \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{H}(\phi))\]

\[\boldsymbol\epsilon(\mathbf{s}) \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\]

Spatial Model and Gaussian Process

\[ \mathbf{y}(\mathbf{s}) = \mathbf{X}(\mathbf{s})\mathbf{\beta} + \boldsymbol{\omega}(\mathbf{s}) + \boldsymbol{\epsilon}(\mathbf{s}) \]

\[\boldsymbol\omega(\mathbf{s}) \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{H}(\phi))\]

\[\boldsymbol\epsilon(\mathbf{s}) \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\]

  • Idiosyncratic error term \(\boldsymbol\epsilon\):
    • \(\tau^2\): nugget, error variance independent from spatial separation.
  • Spatial error term \(\boldsymbol\omega\):
    • \(\sigma^2\): partial sill, the variance that can be driven by geographic distance between two observations
    • \(\phi\): decay, governing how quickly correlation declines with distance
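Under an exponential correlation function (one common choice), these three parameters combine into the covariance between two sites a distance \(d\) apart; a minimal sketch:

```python
import math

def exponential_cov(d, sigma2, phi, tau2):
    """Covariance between two sites at distance d under an exponential
    model: partial sill sigma2, decay phi, nugget tau2."""
    nugget = tau2 if d == 0 else 0.0      # nugget contributes only at d = 0
    return sigma2 * math.exp(-phi * d) + nugget

# At distance 0 the total variance (the sill) is sigma2 + tau2;
# as d grows, the spatial part decays toward 0 at a rate set by phi.
```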

Bayesian Kriging Regression

  • Kriging models, named after statistician and mining engineer Daniel G. Krige, originated in the areas of mining and geostatistics that involve spatially correlated data.

  • Create a smooth surface over two non-overlapping contiguous geographic regions.

Bayesian Kriging Regression

  • Also known as Gaussian Random Field (GRF) or Gaussian Process (GP)

  • Other covariance structures or special cases of the Gaussian process can also be used, such as exponential, spherical, wave, or Matérn.

  • Advantages:

    • A correlation structure on the error terms.
    • An optimized (linear) combination of the data at observed points to fill in unsampled areas.
    • Built-in prediction process.
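The "optimized linear combination" can be made concrete with a toy simple-kriging prediction in one dimension, assuming an exponential covariance and a known zero mean (all numbers hypothetical):

```python
import math

def cov(d, sigma2=1.0, phi=1.0):
    # Exponential covariance (no nugget, for simplicity)
    return sigma2 * math.exp(-phi * abs(d))

sites = [0.0, 2.0]          # 1-D coordinates of the observed sites
y = [1.0, -0.5]             # observed values (mean assumed known and zero)
s0 = 0.8                    # unobserved site to predict

# Solve the 2x2 kriging system K w = c0 by Cramer's rule:
# K holds covariances among observed sites, c0 between them and s0.
K = [[cov(0.0), cov(sites[0] - sites[1])],
     [cov(sites[1] - sites[0]), cov(0.0)]]
c0 = [cov(s0 - sites[0]), cov(s0 - sites[1])]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
w0 = (c0[0] * K[1][1] - K[0][1] * c0[1]) / det
w1 = (K[0][0] * c0[1] - K[1][0] * c0[0]) / det

prediction = w0 * y[0] + w1 * y[1]   # optimal linear combination of the data
```

Note how the nearer site (at 0.0) gets the larger weight; with many sites the same system is solved with a full covariance matrix, and the Bayesian version places priors on the covariance parameters.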

Bayesian Kriging Regression: Implementation

  • Training: estimate the model with observed points
  • Testing: make predictions for unobserved points based on the estimates
  • A collection of points \(\leadsto\) aggregate level estimates



Google Colab