MDI Data Workshop

Monte Carlo Simulation and Resampling Methods

Le Bao

Massive Data Institute, Georgetown University

February 16, 2022

MDI Events

  • The goal of MDI workshops

  • MDI Workshop Spring

    • Monte Carlo simulations and resampling methods
    • Bayesian Simulation with Dr. Nathan Wycoff on March 14 & 15
    • Causal Inference with Dr. J.J. Naddeo on April 18 & 19

  • MDI Distinguished Lecture on March 23rd (stay tuned!)

Workshop Materials

 

Slides, Google Colab are available at:

https://bit.ly/mdi-sim

 

Motivation

  • Why simulation?
    • We want the computer to do hard work that we don’t want to.
  • Comparing to (deterministic) numerical methods:
    • These methods can resemble the repeated sampling/data generation process.
      • The results reflect probabilistic aspects of the problem.
    • They allow for simultaneously estimation of several features.
      • High-dimension, multi-modal data, etc.
  • The advantages of simulation methods:
    • Flexible: These tools allow us to control random processes with counting and drawing.
    • Simple: Most of these ideas are mathematically simple and easy to do via programming.
    • Intuitive: It is very intuitive for illustration of problems and learning statistics.
  • Plan:
    • Monte Carlo simulation: MC integration, rejection sampling, model simulations, etc.
    • Resampling methods: bootstraping, permutation tests, cross-validation, etc.

Some Background

  • Law of Large Numbers:
    • The average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
      • vs. “Big Data Paradox” (Meng, 2018)

Some Background

  • Central Limit Theorem:
    • The sum of many small independent random variables will be a random variable with an approximate normal distribution, (even if the original variables themselves are not normally distributed).

Experiment: Law of Large Numbers

  • Law of Large Numbers
np.random.randint(0,2)
0
coins = np.random.randint(low=0,high=2,size=20)
print(coins)
[1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0]
np.average(coins)
0.35
coins = np.random.randint(low=0,high=2,size=1000)
np.average(coins)
0.476

Experiment: Central Limit Theorem

  • Central Limit Theorem
np.random.uniform(0, 1)
0.1005848227813454
num = np.random.uniform(0, 1, size = 20)
print(num[0:5])
[0.53896269 0.6107311  0.63005872 0.34595959 0.86314885]
mu = np.average(num)
print(mu)
0.4759697612263528
num = np.random.uniform(0, 1, size = 1000)
np.average(num)
0.5065218284679716

Monte Carlo Method: A Little History

      Stanislaw Ulam (1909-1984)

    John von Neumann (1903-1957)

      Nicholas Metropolis (1915-1999)

  • Monte Carlo method is invented during the Manhattan Project at Los Alamos and named after the Monte Carlo Casino in Monaco.
  • It is instrumental in the development of Hydrogen bomb.

Monte Carlo Method

  • If a distribution is difficult or impossible to manipulate analytically, then it is often possible to create a set of simulated values that share the same distributional properties, and describe the posterior by using empirical summaries of these simulated values.

  • Simulation work in applied statistics replaces analytical work with repetitious, low-level effort by the computer.

  • Simulation methods are revolutionalized because of the growing computing power.
    • iPhone 6 is 32,600 times faster than Apollo Guidance Computer (AGC).
    • The development of Machov Chain Monte Carlo (MCMC) and Bayesian Statistics

Monte Carlo Method

  • A method of estimating the value of an unknown quantity using the principles of inferential statistics.

  • Inferential statistics
    • Population: a set of examples
    • Sample: a proper subset of a population
    • Key fact: a random sample tends to exhibit the same properties as the population from which it is drawn. (\leadsto law of large numbers & central limite theorem)

Random Variable Generation

  • Python
import numpy as np
np.random.uniform(low = 0, high = 1, size = 1)
array([0.34466712])
np.random.normal(loc = 1, scale = 1, size = 1)
array([2.37987444])
np.random.gamma(shape = 5, size = 1)
array([5.23509673])
np.random.binomial(n = 20, p = 0.5, size = 10)
array([ 7, 12, 14, 13,  9,  9,  5,  9,  8,  8])
  • Other distributional functions: beta, chisquare, dirichlet, logistic, multinomial, multivariate_normal, negative_binomial, poisson, standard_t, etc.

Random Variable Generation

  • R
runif(n = 1, min = 0, max = 1)
[1] 0.09056845
rnorm(n = 1, mean = 0, sd = 1)
[1] -0.501218
rgamma(n = 1, shape = 5)
[1] 1.35786
rbinom(n = 10, size = 20, 0.5)
 [1] 11  9  8 10 10 12  9  8 13 11
  • Other distributional functions: rpois, rmvnorm, rnbinom, rbeta, rchisq, rexp, rgamma, rlogis, rstab, rt, rgeom, rhyper, rwilcox, rweibull, etc.

Monte Carlo Integration

  • Suppose g(x) is difficult to express or manipulate but for which we could generate samples on an arbitrary support of interest:[a,b].

  • A common quantity of interest is \begin{equation*} I[a,b] = \int_{a}^{b} g(x)h(x)dx \end{equation*} that is, the expected value of some function, h(x), of x distributed g(x).

  • A substitute for analytically calculating this is to randomly generate n values of x from g(x) and calculate: \widehat{I} [a,b] = \frac{1}{n}\sum_{i=1}^{n}h(x_i)

  • The idea is to replace analytical integration with summartion from a large number of simulated values, rejecting (ignoring) values outside of [a:b].

Monte Carlo Integration

  • By the law of large numbers, \widehat{I}[a,b] converges with probability one to the desired value, I[a,b].

  • Although \widehat{I}[a,b] now has “simulation error,” this error is measured by the empirical variance of the simulation estimate:

\begin{equation*} Var(\widehat{I}[a,b]) = \frac{1}{n(n-1)}\sum^{n}_{i=1}(h(x_i)-\widehat{I}[a,b])^2 \end{equation*}

  • Because the central limit theorem applies here as long as Var(I [a, b]) is finite, credible intervals can be easily calculated by

\begin{equation*} [95\%_{lower}, 95\%_{upper}] = [\widehat{I}[a,b] - 1.96\sqrt{Var(\widehat{I}[a,b])}, \widehat{I}[a,b] + 1.96\sqrt{Var(\widehat{I}[a,b])}] \end{equation*}

Monte Carlo Integration: Example

  • Normal PDF over a specific region, suppose we want: \begin{align*} I[1,2] &= \int_{1}^{2}\mathcal{N}(x;0,1) \\ &= \frac{1}{n}\sum^{n}_{i=1}x_i^2 \end{align*}

  • In Python:

normal_sims = np.random.normal(loc = 0, scale = 1, size = 10000)
np.average((normal_sims >= 1) & (normal_sims <= 2))
0.1384

Monte Carlo Integration: Example

 

 

Rejection Sampling

  • If we cannot produce random values from the posterior of interest, we cannot do MC integration as just described.

  • New idea: rejection sampling according to …
    1. find some other distribution that is convenient to sample from,
    2. initially it must enclose the target distribution,
    3. compare generated values with PDF form and decide which values to keep and which to reject,
    4. integral of interest is the normalized ratio of these values.

  • “Rejection sampling” used in two contexts: obtaining dimensional integral quantities, and generating random variables.

  • Rejection sampling can be particularly useful in determining the normalizing factor for non-normalized posterior distributions.

Rejection Sampling

 

 

Rejection Sampling

 

 

Rejection Sampling

 

 

Monte Carlo Integration: Exercise

  • Find the probabilities of the following regions from standard normal PDF:

\begin{equation*} \int_{-\infty}^{-1.96}\mathcal{N}(x;0,1); \int_{1.96}^{\infty}\mathcal{N}(x;0,1) \end{equation*}

Monte Carlo Experiment

  • Simulating statistical models, in particular, the data generating process in order to:
    • Estimate parameters or statistical measures
    • Examine the properties of the estimates

  • In statistical inference we use a sample of data to “guess” population parameter.

  • For most of the time, we don’t know the ground truth of population parameter and data generating process \leadsto we don’t know how the models perform.

  • Use Monte Carlo methods, we can control the whole process including both population parameter and the data generation process of sampling data.

Model Simulation

  • We assume a linear model (data generating process):

\begin{equation*} Y_i = \alpha + \beta X_{i} + \epsilon_i \end{equation*}

\begin{align*} \alpha &= 1.71\\ \beta &= 5.6\\ \bar{\epsilon_i} &= 0\\ \leadsto Y_i &= 1.71 + 5.6 \cdot X_{i} \end{align*}

Model Simulation

n_sim = 5000; n_sample = 2000
alpha = 1.71; beta = 5.6 #truth
est_alpha = []; se_alpha = []; est_beta = []; se_beta = []

for i in range(n_sim):
    eps = np.random.normal(0, 1, n_sample)
    x1 = np.random.uniform(0, 1, n_sample)
    y = (alpha + beta * x1 + eps)
    X = sm.add_constant(x1.reshape((-1, 1)))

    model = sm.OLS(y.reshape((-1, 1)), X)
    ols_result = model.fit()
    coefs = ols_result.params
    ses = ols_result.bse

    est_alpha.append(coefs[0]); est_beta.append(coefs[1])
    se_alpha.append(ses[0]); se_beta.append(ses[1])

Model Simulation: Coefficient Estimates

\begin{align*} \alpha &= 1.71\\ \beta &= 5.6\\ Y_i &= 1.71 + 5.6 \cdot X_{i} \end{align*}

est_alpha[0:5]
[1.7288135742126856, 1.6676334606975451, 1.664387217168158, 1.7604366985784397, 1.6922445964171642]
np.mean(est_alpha)
1.7101211084143129
est_beta[0:5]
[5.643558526288774, 5.655339840853529, 5.66184453990171, 5.541365742564899, 5.568625996561849]
np.mean(est_beta)
5.599126645728215

Model Simulation: Coefficient Estimates

 

 

Model Simulation: Standard Error

  • Variance (standard deviations)
    • Standard error from the sample close to population standard deviation.
  • The standard deviation of the estimates \approx the average standard error from the simulation:
np.mean(se_alpha)
0.04471881239569965
np.std(est_alpha)
0.04470670372765645
np.mean(se_beta)
0.07748244015736212
np.std(est_beta)
0.0768999254172155

Model Simulation: Confidence Interval

  • Which of the following is the correct interpretation of a (1 - \alpha) confidence interval?
    • An interval that has a (1 - \alpha) chance of containing the true value of the parameter.
    • An interval that over (1 - \alpha) of replications contains the true value of the parameter, on average.

  • Coverage
    • The proportion of simulated samples for which the estimated confidence interval includes the true parameter.
    • Coverage probabilities: the (1 - \alpha) confidence interval based upon the standard error estimates will enclose the true value in (around) (1 - \alpha) of replications.

Coverage Probability

def coverage(beta, se, truth,  df, level=0.95,):
  conf_level = level + (1 - level)/2
  lower = np.array(beta) - qt.ppf(conf_level, df = df)*np.array(se)
  upper = np.array(beta) + qt.ppf(conf_level, df = df)*np.array(se)
  covered = np.where((truth >= lower) & (truth <= upper), 1, 0)
  cover_prob = np.mean(covered)
  conf_int = np.column_stack((lower, upper))
  return cover_prob

Coverage Probability

n_sim = 100; n_sample = 2000; alpha = 1.71; beta = 5.6 
est_alpha = []; se_alpha = []; est_beta = []; se_beta = []

for i in range(n_sim):
    eps = np.random.normal(0, 1, n_sample)
    x1 = np.random.uniform(0, 1, n_sample)
    y = (alpha + beta * x1 + eps)
    X = sm.add_constant(x1.reshape((-1, 1)))

    model = sm.OLS(y.reshape((-1, 1)), X)
    ols_result = model.fit()
    coefs = ols_result.params; ses = ols_result.bse

    est_alpha.append(coefs[0]); est_beta.append(coefs[1])
    se_alpha.append(ses[0]); se_beta.append(ses[1])

coverage(est_beta, se_beta, truth = beta, df = n_sample-1-1)
0.95

Coverage Probability

Model Simulation: Assumption and Property

  • Homoskedasticity: Var[\epsilon] = \sigma^2{I}
    • Heteroskedasticity: the variance of the errors is unequal as the values of variables vary.

Model Simulation: Assumption and Property

  • Heteroskedasticity
Model Variable MAE MSE Coverage
Homoskedastic Alpha 0.0353149 0.0019719 0.9554
Homoskedastic Beta 0.0618423 0.0059748 0.9524
Heteroskedastic Alpha 0.0651124 0.0067536 0.7170
Heteroskedastic Beta 0.1737809 0.0473037 0.5226

Model Simulation: Assumption and Property

  • Omitted variable bias
    • The bias in the estimator that arises when included predictor X is correlated with an omitted variable.
    • The omited variable is correlated with both X and y.

Model Simulation: Generalized Linear Models

  • Generalized linear models:
    • Dichotomous outcome models
    • Ordered outcome models
    • Count models

  • Two parametrizations of linear model:

\begin{align*} Y &= \mathbf{X} \mathcal{\beta} + \epsilon\text{, } \epsilon \sim N(0, \sigma)\\ Y &\sim N(\mu, \sigma)\text{, } \mu = \mathbf{X} \beta \end{align*}

Logit Model

  • Logit:


\begin{align*} Pr(Y = 1|X) &= logit^{-1}(\mathbf{X} \beta)\\ &= \frac{exp(\mathbf{X} \beta)}{1 + exp(\mathbf{X} \beta)} \end{align*}

def inv_logit(p):
    return np.exp(p) / (1 + np.exp(p))

Model Simulation: Logit Model

n_sim = 100; n_sample = 2000
alpha = 1.71; beta = 5.6 #truth
est_alpha = []; se_alpha = []; est_beta = []; se_beta = []

for i in range(n_sim):
    x1 = np.random.uniform(0, 1, n_sample)
    y = np.random.binomial(n = 1, p = inv_logit(alpha + beta*x1), size = n_sample)
    X = sm.add_constant(x1.reshape((-1, 1)))

    model = sm.Logit(y.reshape((-1, 1)), X)
    logit_result = model.fit()
    coefs = logit_result.params; ses = logit_result.bse

    est_alpha.append(coefs[0]); est_beta.append(coefs[1])
    se_alpha.append(ses[0]); se_beta.append(ses[1])

    est_alpha.append(coefs[0]); est_beta.append(coefs[1])
    se_alpha.append(ses[0]); se_beta.append(ses[1])

Model Simulation: Probit Model

  • Probit model:

\begin{align*} Pr(Y = 1) &= probit^{-1}(\mathbf{X} \beta)\\ &= (2\pi)^{-\frac{1}{2}}\int_{-\infty}^{\mathbf{X} \beta} exp[-\frac{1}{2} (\mathbf{X} \beta)^2] d(\mathbf{X} \beta) \end{align*}

  • In Python:
norm.pdf(alpha + beta * x) # Probit link function
y = np.random.binomial(n = 1, p = norm.pdf(alpha + beta * x), size = n_sample)

Model Simulation: Count Model

  • Poisson model:

\begin{equation*} E(Y|X) = exp(\mathbf{X} \beta) \end{equation*}

  • In Python:
y = np.random.poisson(lam = exp(alpha + beta*x), size = n_sample)

Resampling Methods

  • Resampling can be seen as a variant of simulation methods
    • To (re)-draw (sub) samples from the observed distribution.
  • It resembles the repeated and random sampling process by producing “samples” of data that consist of different mixes of the cases in the original data.
    • The idea is to use the sample dataset as a proxy for the population, and then draw repeated subsamples from the sample data to estimate various statistics or to build models.

  • Some common resampling types:
    • Resampling: e.g. bootstrapping
    • Spliting sample(s): e.g. cross-validation
    • Reshuffling: e.g. permutation test

Where to Use Resampling Methods?

  • Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data or drawing randomly with replacement from a set of data points
    • Jackknifing
    • Bootstrapping

  • Performing statistical tests by reshuffling without replacement
    • Permutation tests

  • Validating models by using random subsets
    • Cross validation
    • Bootstrapping

  • Key: randomization (\leadsto simulation)

Bootstrapping

  • Resample the sample data with replacement
  • Efron (1979): “to pull oneself up by the bootstraps”
  • Common procedure:
    • A sample S = {x_1, x_2, ..., x_n} of size n drawn from a population P.
    • A statistic of interest, \theta, that describes P can be estimated by calculating \hat{\theta} from S
    • Bootstrap:
      1. Draw a sample of size n from S with replacement such that each element is selected with probability. This “bootstrap sample” is S_{b1}.
      2. Calculate a new estimate of \theta from S_{b1}, \hat{\theta}^*_1.
      3. Repeat Steps 1 through 2 J times, storing \{\hat{\theta}^*_1, \hat{\theta}^*_2, ..., \hat{\theta}^*_j\}.
    • When J is sufficiently large, according to the law of large numbers, we can estimate the center, variance, or confidence interval of the \theta.

Bootstraping Standard Error

  • Instrument variable model

  • Two Stage Least Squares (2SLS)

    • Y_i = \lambda_0 + \lambda_1 Z_i + u_i (reduced form)
    • X_i = \pi_0 + \pi_1 Z_i + v_i (first stage)
    • where \widehat{\beta}_1^{X} = \frac{\lambda_1}{\pi_1}. That is,

\begin{align*} \beta_{2SLS} &= (\hat{X'}X)^{-1} \hat{X'}Y\\ &= (X'Z)Z'Z^{-1}Z'X^{-1}X'Z(Z'Z)^{-1}Z'Y \end{align*}

Bootstraping Standard Error: AJR (2001)

  • Example: Acemoglu, Johnson, & Robinson (2001)
    • The colonial origins of development
    • The effect of institutions on economic performance
    • Instrument variable: settler mortality rates
    • High mortality rates \leadsto extractive institution
      • In places where Europeans faced high mortality rates, they could not settle and were more likely to set up extractive institutions

Permutation Tests

  • Permutation Tests (aka. randomization test)
  • Key idea: in permutation test, we build, rather than assume, sampling distribution (called the “permutation distribution”) by resampling the observed data.
Number Treatment Outcome
1 Treated 4
2 Control 7
3 Treated 8
4 Control 3
5 Treated 7
6 Control 3
7 Control 2
8 Treated 5

\begin{equation*} TE = \frac{4 + 8 + 7 + 5}{4} - \frac{7 + 3 + 3 + 2}{4} = 2.25 \end{equation*}

Permutation Tests

Number Treatment Outcome Random 1 Random 2 Random 3 Random 100
1 Treated 4 Control Treated Treated ... Control
2 Control 7 Control Treated Treated ... Control
3 Treated 8 Treated Control Control ... Treated
4 Control 3 Treated Control Control ... Treated
5 Treated 7 Control Control Control ... Control
6 Control 3 Control Treated Treated ... Treated
7 Control 2 Treated Treated Control ... Control
8 Treated 5 Treated Control Treated ... Treated
ATE 2.25 -0.75 -1.75 -0.25 -0.25

Cross-Validation

  • Predictive modeling and classification
    • Model evaluation: how well does the model predict the new data, i.e. out-of-sample testing.
  • The training-testing paradigm
    • Keep the “testing” data independent of the “training” data
  • Cross-validation
    • To estimate an unbiased generalization of model performance based on resampling the testing and training sets.

Cross-Validation

  • Exhaustive cross-validation: All possible ways to divide the original sample into a training and a validation set
    • Leave-one-out CV (LOOCV)
    • Leave-p-out CV (LpOCV)
  • Non-exhaustive cross-validation: Spilt the data but not to exhaust all the possibilities
    • K-fold CV
    • Monte Carlo CV
  • Nested cross-validation

Leave-One-Out Cross-Validation

  • LOOCV
    • Training set: n-1; Testing set: 1.
    • {n \choose 1}

Leave-P-Out Cross-Validation

  • Leave-p-out CV (LpOCV)
    • {n \choose p}
    • A generalized case of LOOCV
    • Computational issue

K-fold Cross-Validation

  • K-fold CV
    • Partition: divide all the data into k groups of subsamples (folds)
    • Use k-1 for training, use 1 for testing
    • Until every fold has been used for testing
    • When k=n —> LOOCV

Monte Carlo Cross-Validation

  • Monte Carlo CV
    • Repeated random sub-sampling
    • Comparing to K-fold (or repeated k-fold CV):
      • The data points within folds are not fixed
      • Partitions are independently drew for each iteration

Nested Cross-Validation

  • Outer loop: K-fold
  • Inner loop: L-fold

Concluding Note

  • Simulation is powerful.
    • It can resemble the “natural” process of data generation.
    • It is particularly capable of handling high-dimension, multi-modal, and complex data in general.
  • There is also no fixed way of applying simulation.
    • Integration
    • Model
    • MCMC
    • Resampling methods