MDI Data Workshop

Monte Carlo Simulation and Resampling Methods

Le Bao

Massive Data Institute, Georgetown University

February 16, 2022

MDI Events

  • The goal of MDI workshops

  • MDI Spring Workshops

    • Monte Carlo simulations and resampling methods
    • Bayesian Simulation with Dr. Nathan Wycoff on March 14 & 15
    • Causal Inference with Dr. J.J. Naddeo on April 18 & 19

  • MDI Distinguished Lecture on March 23rd (stay tuned!)

Workshop Materials

 

Slides and Google Colab notebooks are available at:

https://bit.ly/mdi-sim

 

Motivation

  • Why simulation?
    • We want the computer to do the hard work that we don’t want to do.
  • Compared to (deterministic) numerical methods:
    • Simulation methods can mimic the repeated sampling/data-generating process.
      • The results reflect the probabilistic aspects of the problem.
    • They allow for simultaneous estimation of several features.
      • High-dimensional, multimodal data, etc.
  • The advantages of simulation methods:
    • Flexible: These tools let us control random processes by counting and drawing.
    • Simple: Most of these ideas are mathematically simple and easy to implement in code.
    • Intuitive: They make it easy to illustrate problems and to learn statistics.
  • Plan:
    • Monte Carlo simulation: MC integration, rejection sampling, model simulations, etc.
    • Resampling methods: bootstrapping, permutation tests, cross-validation, etc.
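As a preview of Monte Carlo integration, here is a minimal sketch (not from the workshop materials; the seed is an arbitrary choice for reproducibility) that estimates π by drawing uniform points in the unit square:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
n = 100_000

# Draw n points uniformly in the unit square [0, 1] x [0, 1].
x = rng.uniform(0, 1, size=n)
y = rng.uniform(0, 1, size=n)

# The fraction of points falling inside the quarter circle x^2 + y^2 <= 1
# approximates pi/4, so multiplying by 4 gives an estimate of pi.
inside = (x**2 + y**2) <= 1
pi_hat = 4 * inside.mean()
print(pi_hat)
```

The estimate gets closer to π as n grows — the same Law of Large Numbers logic that drives the experiments below.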

Some Background

  • Law of Large Numbers:
    • The average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
      • vs. “Big Data Paradox” (Meng, 2018)

Some Background

  • Central Limit Theorem:
    • The sum of many small independent random variables is a random variable with an approximately normal distribution (even if the original variables themselves are not normally distributed).

Experiment: Law of Large Numbers

  • Law of Large Numbers
import numpy as np

np.random.randint(0, 2)  # a single fair coin flip: 0 or 1
0
coins = np.random.randint(low=0, high=2, size=20)
print(coins)
[1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0]
np.average(coins)  # average of 20 flips
0.35
coins = np.random.randint(low=0, high=2, size=1000)
np.average(coins)  # average of 1,000 flips: closer to the expected value 0.5
0.476
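The convergence can be made explicit by tracking the running average as the number of flips grows — a sketch, not from the workshop materials, with an arbitrary seed for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
flips = rng.integers(low=0, high=2, size=10_000)  # 10,000 fair coin flips

# Running average after each additional flip.
running_avg = np.cumsum(flips) / np.arange(1, len(flips) + 1)

# The running average drifts toward the expected value 0.5 as n grows.
print(running_avg[9], running_avg[99], running_avg[-1])  # n = 10, 100, 10,000
```

Plotting `running_avg` against n gives the classic Law of Large Numbers picture: large early swings that settle near 0.5.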

Experiment: Central Limit Theorem

  • Central Limit Theorem
import numpy as np

np.random.uniform(0, 1)  # a single draw from Uniform(0, 1)
0.1005848227813454
num = np.random.uniform(0, 1, size=20)
print(num[0:5])
[0.53896269 0.6107311  0.63005872 0.34595959 0.86314885]
mu = np.average(num)  # mean of 20 draws
print(mu)
0.4759697612263528
num = np.random.uniform(0, 1, size=1000)
np.average(num)  # mean of 1,000 draws: close to the expected value 0.5
0.5065218284679716
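To see the normal shape the theorem predicts, one can repeat the experiment many times and look at the distribution of the sample means — a minimal sketch, not from the workshop materials, with an arbitrary seed for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

# 5,000 replications: each draws 20 Uniform(0, 1) values and records the mean.
means = rng.uniform(0, 1, size=(5_000, 20)).mean(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of 20 draws is
# approximately Normal with mean 0.5 and sd sqrt((1/12)/20) ≈ 0.0645.
print(means.mean(), means.std())
```

A histogram of `means` is approximately bell-shaped around 0.5 even though each underlying draw is uniform, which is exactly the Central Limit Theorem at work.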

Monte Carlo Method: A Little History

      Stanislaw Ulam (1909-1984)