5  Generalized Method of Moments (GMM)

5.1 The Method of Moments (MOM)

A population moment \(\gamma\) can be defined as the expectation of some continuous function \(g\) of a random variable \(x\): \[ \gamma={\mathrm{E}} [g(x)] \]

On the other hand, a sample moment is the sample analog of the population moment, computed from a particular sample \(x_1, \dots, x_n\): \[ \hat \gamma=\frac{1}{n} \sum_{i=1}^{n} g(x_i) \]
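As a minimal illustration, the sketch below compares a population moment with its sample counterpart. The simulated data, the choice \(g(x)=x^2\), and all variable names are assumptions made for this example only.

```r
# Illustrative sketch: population vs. sample moment for g(x) = x^2.
# Assumption: x ~ N(0, 2^2), so the population moment E[x^2] equals 4.
set.seed(1)
n <- 100000
x <- rnorm(n, mean = 0, sd = 2)

g <- function(x) x^2           # the function defining the moment
gamma_pop <- 4                 # E[g(x)] = Var(x) + E[x]^2 = 4 + 0
gamma_hat <- mean(g(x))        # (1/n) * sum of g(x_i)

c(population = gamma_pop, sample = gamma_hat)
```

With a large sample the two numbers should be close, which is the property the method of moments relies on.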

5.2 OLS as a moment problem

Consider the simple linear regression \[ \bf y=X\beta+u, \quad u \sim IID(0, \sigma^2). \]

If the model is correctly specified, then \[ \rm E (\bf X'u)=0. \]

The MOM principle suggests that we replace the left-hand side with its sample analog \(\frac{1}{n} \bf X'(y-X\beta)\).

Since the true \(\bf \beta\) sets the population moment to zero, a natural choice of \(\bf \hat \beta\) is one that sets the sample moment to zero. The MOM procedure suggests an estimate of \(\bf \beta\) that solves \[ \frac{1}{n} \bf X'(y-X \hat \beta)=0. \]

The MOM estimator is \[ \bf \hat \beta=(X'X)^{-1}X'y, \] which is the same as the OLS estimator.
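The equivalence can be checked numerically. The following is a sketch on simulated data (the true coefficients, sample size, and variable names are assumptions for illustration): it solves the sample moment condition directly and compares the result with the coefficients from lm().

```r
# Illustrative sketch: solve the sample moment condition X'(y - X b) = 0.
# Assumption: simulated data with true coefficients (1, 2).
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(1, x)                              # column of ones for the intercept
beta_mom <- solve(t(X) %*% X, t(X) %*% y)     # solves X'X b = X'y
beta_ols <- coef(lm(y ~ x))

# The sample moment evaluated at the solution is numerically zero
moment_at_solution <- t(X) %*% (y - X %*% beta_mom) / n

cbind(mom = as.vector(beta_mom), ols = as.vector(beta_ols))
```

The two columns should agree up to floating-point error.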

5.3 IV as a moment problem

Consider the simple linear regression \[ \bf y=X\beta+u, \quad u \sim IID(0, \sigma^2). \]

If \(\bf X\) is endogenous, for example because a relevant variable is omitted, then \[ \rm E (\bf X'u)\neq 0. \]

We then need an instrumental variable \(\bf Z\) that satisfies \[ \rm E (\bf Z'u)= 0, \] or, equivalently, \[ \rm E (\bf Z'(y-X\beta))= 0. \]

The sample analog of this is \[ \frac{1}{n} \bf Z'(y-X \hat \beta)=0. \]

Provided \(\bf Z\) has as many columns as \(\bf X\) and \(\bf Z'X\) is invertible, this gives the IV estimator \[ \bf \hat \beta=(Z'X)^{-1}Z'y. \]
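A small simulation makes the formula concrete. This is a sketch under assumed values (true slope 1, an endogenous regressor built by hand, one instrument); none of the names or numbers come from the text above.

```r
# Illustrative sketch: IV estimator (Z'X)^{-1} Z'y on simulated data.
# Assumptions: true slope is 1; x is endogenous because it shares the
# component e with the error u; w is a valid instrument.
set.seed(3)
n <- 20000
w <- rnorm(n)                 # instrument: drives x, unrelated to u
e <- rnorm(n)                 # source of endogeneity
x <- 0.8 * w + e
u <- 0.6 * e + rnorm(n)       # correlated with x through e
y <- 1 * x + u

X <- cbind(1, x)
Z <- cbind(1, w)              # same number of columns as X (exact identification)

beta_ols <- solve(t(X) %*% X, t(X) %*% y)
beta_iv  <- solve(t(Z) %*% X, t(Z) %*% y)

cbind(ols = as.vector(beta_ols), iv = as.vector(beta_iv))
```

In this setup the OLS slope is biased upward, while the IV slope should be close to the true value of 1.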

5.4 The Generalized Method of Moments

The expectation \({\rm E}(Y^r)\) for any \(r=1,2, \dots\) is called the \(r^{th}\) (raw) moment of \(Y\). The expectation \({\rm E} [(Y-{\rm E}(Y))^r]\) is called the \(r^{th}\) centered moment of \(Y\).

The mean is the first raw moment.

The variance is the second centered moment.

The third centered moment measures the skewness of the distribution.

The fourth centered moment measures the kurtosis of the distribution, which is interpreted as a measure of “fatness of tails”.

The standardized kurtosis is \[ k=\frac{{\rm E}[(Y-{\rm E}(Y))^4]}{\left({\rm E}[(Y-{\rm E}(Y))^2]\right)^2}. \]

For a normal distribution, \(k=3.\)

For a \(t\) distribution with \(v \geq 5\) degrees of freedom, \(k=3+6/(v-4) > 3\), i.e., the \(t\) distribution has fatter tails than a normal distribution.
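The kurtosis values can be explored with a short sketch. The helper names kurt_t and sample_kurtosis are hypothetical, introduced only for this illustration; the sample version divides the fourth centered sample moment by the squared second centered sample moment.

```r
# Illustrative sketch: standardized kurtosis, theoretical and simulated.
# Theoretical kurtosis of a t distribution with v > 4 degrees of freedom
kurt_t <- function(v) 3 + 6 / (v - 4)
kurt_t(c(5, 10, 30, 100))     # decreases toward 3 as v grows

# Sample analog: fourth centered moment over squared second centered moment
sample_kurtosis <- function(y) {
  d <- y - mean(y)
  mean(d^4) / mean(d^2)^2
}

set.seed(4)
k_normal <- sample_kurtosis(rnorm(200000))   # should be close to 3
k_normal
```

For small \(v\) the \(t\) distribution is far from normal (for example \(k=9\) at \(v=5\)), and the excess over 3 shrinks as \(v\) increases.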

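The distribution function of a random variable captures all information about the random variable. It can be shown that, under suitable regularity conditions, the complete set of moments captures the same information.

ML works with the full distribution, whereas GMM works with only a few selected moments. This distinction underlies the relative strengths and weaknesses of ML and GMM.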

5.5 GMM

The statistical model takes the general form \[ E[m(Y_i; \theta_0)]=0 \] where

  - \(Y_1, \cdots, Y_n\) are random variables from which the sample \(y_1, \cdots, y_n\) is drawn,

  - \(m(Y; \theta)\) is a function specifying the model,

  - \(\theta_0\) is the “true value” of the parameter.

\(E[m(Y_i; \theta_0)]=0\) are called the population moment conditions.

Two ideas behind GMM:

  1. Replace the population mean \(E[.]\) with the sample mean calculated from the observed sample \(y_1, \cdots, y_n\).

  2. Since \(E[m(Y_i; \theta_0)]=0\), choose \(\hat \theta_{GMM}\) to make \(\frac{1}{n}\sum_{i=1}^{n}m(y_i; \hat \theta_{GMM})\) as close to zero as possible.

Define the notation

\[ \bar m(\theta)=\frac{1}{n} \sum_{i=1}^n m(y_i; \theta). \]

\(\hat \theta_{GMM}\) is chosen to make \(\bar m(\theta)'\bar m(\theta)\) as close to zero as possible.

More generally, \(\hat \theta_{GMM}\) is chosen to minimize \(\bar m(\theta)'W \bar m(\theta)\) for some positive definite weighting matrix \(W\); the previous criterion is the special case \(W=I\).
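The minimization can be sketched directly in base R. The example below is an illustration under assumptions not taken from the text: normally distributed data, the parameter vector \(\theta=(\mu,\sigma^2)\), two moment conditions, an identity weighting matrix, and the general-purpose optimizer optim().

```r
# Illustrative sketch: minimize m_bar(theta)' W m_bar(theta) numerically.
# Assumptions: y ~ N(2, 1.5^2); theta = (mu, sigma2); W is the identity.
set.seed(5)
y <- rnorm(5000, mean = 2, sd = 1.5)

# Sample moment vector m_bar(theta) for the two conditions
#   E[Y - mu] = 0  and  E[(Y - mu)^2 - sigma2] = 0
m_bar <- function(theta) {
  c(mean(y - theta[1]),
    mean((y - theta[1])^2 - theta[2]))
}

W <- diag(2)                                   # weighting matrix
gmm_objective <- function(theta) {
  m <- m_bar(theta)
  as.numeric(t(m) %*% W %*% m)
}

fit <- optim(par = c(0, 1), fn = gmm_objective, method = "BFGS")
theta_hat <- fit$par

# With as many moments as parameters the objective can be driven to zero,
# so the solution matches the sample mean and the (1/n) sample variance.
direct <- c(mean(y), mean((y - mean(y))^2))
rbind(gmm = theta_hat, direct = direct)
```

When there are more moment conditions than parameters the objective generally cannot reach zero, and the choice of \(W\) then affects the estimate.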

5.5.1 An example

Let’s see an example of GMM, using the same simulated data as before. As before, \(X\) is endogenous, and we estimate the GMM version of 2SLS.

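Here I use R’s “gmm” library, which makes things easy. Its main function takes two arguments, “g” and “x”. When “g” is a formula, it specifies the regression whose residual is \(u\), and “x” supplies the instrument \(W\). The moment condition in this example is \(E(W u) = 0\), where \(W\) is the instrument and \(u = y - \beta_0 - \beta_1 X\) is the error of the equation in which \(X\) is endogenous. (Here \(W\) denotes the instrument, not the weighting matrix of the previous section.)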

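## DGP: data$y <- data$x + data$z + data$u

library(MASS)   # mvrnorm()
library(gmm)    # gmm()
library(ivreg)  # ivreg(); also available from the AER package
set.seed(66)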
nobs=10000
nDim = 3
sdxx = 1
sdww=1
sdzz=1

## Here we have three variables: x, z, and w.
## z is the omitted variable; x and z are correlated. w is the instrument: it is correlated with x but not with z. u is independent of everything else.
crxz=.6
crzw=0
crxw=.8

covarMat = matrix( c(sdxx^2, crxz, crxw, crxz, sdzz^2, crzw,  crxw, crzw, sdww^2 ) , nrow=nDim , ncol=nDim )
covarMat
     [,1] [,2] [,3]
[1,]  1.0  0.6  0.8
[2,]  0.6  1.0  0.0
[3,]  0.8  0.0  1.0
data  = data.frame(mvrnorm(n=nobs, mu=rep(0,nDim), Sigma=covarMat ))
names(data) <- c('x','z','w')
data$u <- rnorm(nobs,0,1)
# dgp
data$y <- data$x +  data$z + data$u

gmm.fit = gmm(data$y~data$x, data$w)
summary(gmm.fit)

Call:
gmm(g = data$y ~ data$x, x = data$w)


Method:  twoStep 

Kernel:  Quadratic Spectral

Coefficients:
             Estimate   Std. Error  t value    Pr(>|t|) 
(Intercept)   0.011962   0.014070    0.850182   0.395224
data$x        0.969342   0.018531   52.308415   0.000000

J-Test: degrees of freedom is 0 
                J-test                P-value             
Test E(g)=0:    9.45254013696014e-28  *******             
# It returns the same estimates as the 2sls results.
tsls.model <- ivreg(y ~ x | w, data=data)
summary(tsls.model)

Call:
ivreg(formula = y ~ x | w, data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-5.885042 -0.945863 -0.008974  0.955665  5.307483 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01196    0.01427   0.838    0.402    
x            0.96934    0.01837  52.766   <2e-16 ***

Diagnostic tests:
                  df1  df2 statistic p-value    
Weak instruments    1 9998     16954  <2e-16 ***
Wu-Hausman          1 9997      6590  <2e-16 ***
Sargan              0   NA        NA      NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.427 on 9998 degrees of freedom
Multiple R-Squared: 0.5056, Adjusted R-squared: 0.5055 
Wald test:  2784 on 1 and 9998 DF,  p-value: < 2.2e-16 

In the OLS case, the moment condition would be \(E(X u) = 0\), that is, the regressor serves as its own instrument.

gmm.ols = gmm(data$y~data$x, data$x)
summary(gmm.ols)

Call:
gmm(g = data$y ~ data$x, x = data$x)


Method:  twoStep 

Kernel:  Quadratic Spectral

Coefficients:
             Estimate     Std. Error   t value      Pr(>|t|)   
(Intercept)  -9.7566e-04   1.2624e-02  -7.7286e-02   9.3840e-01
data$x        1.6041e+00   1.3143e-02   1.2205e+02   0.0000e+00

J-Test: degrees of freedom is 0 
                J-test                P-value             
Test E(g)=0:    6.09365424169492e-26  *******             
# It returns the same estimates as the OLS results.
ols <- lm(y ~ x, data=data)
summary(ols)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5738 -0.8574 -0.0150  0.8508  5.5521 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0009757  0.0128451  -0.076    0.939    
x            1.6040701  0.0131145 122.312   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.284 on 9998 degrees of freedom
Multiple R-squared:  0.5994,    Adjusted R-squared:  0.5994 
F-statistic: 1.496e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
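
The two slope estimates can be traced back to the covariance matrix of the simulation. The following is a short worked calculation under the stated DGP (unit variances, \(Cov(x,z)=0.6\), \(Cov(x,w)=0.8\), \(Cov(z,w)=0\), and \(u\) independent of everything else), using the standard probability limits of the OLS and IV slopes. Because \(z\) is omitted, the error of the estimated equation is \(z+u\), so

\[ \hat \beta_{OLS} \xrightarrow{p} 1+\frac{Cov(x, z+u)}{Var(x)}=1+\frac{0.6}{1}=1.6 \]

\[ \hat \beta_{IV} \xrightarrow{p} \frac{Cov(w,y)}{Cov(w,x)}=\frac{Cov(w,x)+Cov(w,z)+Cov(w,u)}{Cov(w,x)}=\frac{0.8+0+0}{0.8}=1 \]

These limits are in line with the estimates 1.604 and 0.969 reported above. In both fits the number of moment conditions equals the number of parameters, which is why the J-test has zero degrees of freedom.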

5.6 A few concepts of conditioning

5.6.1 Independence

If \(X\) and \(Y\) are independent then \[ f(x,y)=f(x)f(y) \] and hence \[ f(y|x)=f(y). \]

If \(X\) and \(Y\) are independent then \[ E[g(X)h(Y)]=E[g(X)]\cdot E[h(Y)] \] and hence \[ Cov[g(X),h(Y)]=0, \] i.e. every function of \(X\) is uncorrelated with every function of \(Y\).

5.6.2 Law of Iterated Expectations

\[ E[Y]=E[E(Y|X)]. \]

5.6.3 Dependence Concepts

\(X\), \(Y\) independent: \[ Cov[g(X),h(Y)]=0 \]

\(X\), \(Y\) uncorrelated: \[ Cov[X,Y]=0 \]

\(E[Y|X]=0\) (mean independence): \[ Cov[g(X),Y]=0 \]
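The gap between these concepts can be seen in a simulation. The sketch below uses an assumed example, \(X \sim N(0,1)\) and \(Y=X^2-1\), in which \(X\) and \(Y\) are uncorrelated but \(E[Y|X]\neq 0\), so a function of \(X\) is still correlated with \(Y\).

```r
# Illustrative sketch: uncorrelated does not imply independent.
# Assumption: X ~ N(0, 1) and Y = X^2 - 1, so Y is a function of X.
set.seed(6)
n <- 200000
x <- rnorm(n)
y <- x^2 - 1

cov_xy   <- cov(x, y)        # Cov[X, Y] = E[X^3] = 0: uncorrelated
cov_x2_y <- cov(x^2, y)      # Cov[X^2, Y] = Var(X^2) = 2: not independent

c(cov_xy = cov_xy, cov_x2_y = cov_x2_y)
```

The first covariance should be near zero and the second near 2, so only the weakest of the three conditions holds here.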

5.6.4 Regression

A regression model is a model of \(E[Y_i|X_i]\). For example, \[ Y_i=\beta_0+\beta_1X_i+u_i \] where \(E[u_i|X_i]=0\).

5.6.5 GMM regression

The regression model \[ Y_i=\beta_0+\beta_1X_i+u_i, \quad E[u_i|X_i]=0 \] implies the moment conditions \[ E[u_i]=0 \quad \mbox{and} \quad E[X_i u_i]=0 \]
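
The step from the conditional mean restriction to the two unconditional moment conditions uses the law of iterated expectations. Spelled out: \[ E[u_i]=E[E(u_i|X_i)]=E[0]=0 \] \[ E[X_i u_i]=E[E(X_i u_i|X_i)]=E[X_i E(u_i|X_i)]=E[X_i \cdot 0]=0 \]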

That is, \[ E[Y_i-\beta_0-\beta_1X_i]=0 \] \[ E[X_i(Y_i-\beta_0-\beta_1X_i)]=0 \]

The sample moment conditions are \[ \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat \beta_0-\hat \beta_1x_i)=0 \] \[ \frac{1}{n}\sum_{i=1}^{n}x_i(y_i-\hat \beta_0-\hat \beta_1x_i)=0 \]

These are just the normal equations for OLS.

A characteristic of GMM is that the specification of the model generates the estimator: only \(E[Y_i|X_i]=\beta_0+\beta_1 X_i\) is assumed.

Note that we do not assume that \(u_i\) is homoscedastic, serially uncorrelated, or normally distributed. These properties affect the statistical properties of the GMM estimator, not its definition.
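To illustrate this point, the sketch below solves the two sample moment conditions on simulated data with heteroscedastic errors. The true coefficients, the error structure, and the variable names are assumptions chosen for the example.

```r
# Illustrative sketch: the sample moment conditions define the same estimator
# whether or not the errors are homoscedastic.
# Assumptions: true (beta0, beta1) = (1, 2); error s.d. grows with |x|.
set.seed(7)
n <- 5000
x <- rnorm(n)
u <- rnorm(n, sd = 0.5 + abs(x))     # heteroscedastic, but E[u | x] = 0
y <- 1 + 2 * x + u

# The two sample moment conditions form a linear system A b = rhs:
#   mean(y)     = b0 * 1       + b1 * mean(x)
#   mean(x * y) = b0 * mean(x) + b1 * mean(x^2)
A <- matrix(c(1,       mean(x),
              mean(x), mean(x^2)), nrow = 2, byrow = TRUE)
rhs <- c(mean(y), mean(x * y))
beta_gmm <- solve(A, rhs)
beta_ols <- unname(coef(lm(y ~ x)))

rbind(gmm = beta_gmm, ols = beta_ols)
```

The point estimates coincide with OLS; what heteroscedasticity changes is the appropriate standard error, not the estimator itself.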