5 Generalized Method of Moments (GMM)

5.1 The Method of Moments (MOM)

A population moment \(\gamma\) can be defined as the expectation of some continuous function \(g\) of a random variable \(x\): \[ \gamma={\mathrm{E}} [g(x)] \]

A sample moment is the sample analog of the population moment, computed from a particular sample \(x_1, \cdots, x_n\): \[ \hat \gamma=\frac{1}{n} \sum_{i=1}^{n} g(x_i) \]
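A minimal sketch of the definition above, assuming simulated standard normal data (the data and the helper name `sample_moment` are my own, for illustration only). The sample moment is just the average of \(g\) applied to each observation.

```r
# Sample moments as averages of g(x_i); the data are simulated for illustration.
set.seed(1)
n <- 10000
x <- rnorm(n)                                  # assumed example data: x ~ N(0, 1)

# Hypothetical helper: (1/n) * sum of g(x_i)
sample_moment <- function(x, g) mean(g(x))

gamma_hat_1 <- sample_moment(x, function(v) v)    # estimates E[x]   = 0
gamma_hat_2 <- sample_moment(x, function(v) v^2)  # estimates E[x^2] = 1
c(gamma_hat_1, gamma_hat_2)
```

With a large sample both numbers should be close to the population values 0 and 1.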

5.2 OLS as a moment problem

Consider the simple linear regression \[ \bf y=X\beta+u, \quad u \sim IID(0, \sigma^2). \]

If the model is correctly specified, then \[ \rm E (\bf X'u)=0. \]

The MOM principle suggests that we replace the left-hand side with its sample analog \(\frac{1}{n} \bf X'(y-X\beta)\).

Since the true \(\bf \beta\) sets the population moment to zero, a natural choice of \(\bf \hat \beta\) is the value that sets the sample moment to zero. The MOM procedure suggests an estimate of \(\bf \beta\) that solves \[ \frac{1}{n} \bf X'(y-X \hat \beta)=0. \]

The MOM estimator is \[ \bf \hat \beta=(X'X)^{-1}X'y, \] which is the same as the OLS estimator.
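The equivalence can be checked numerically. The following is a sketch on simulated data (the DGP with \(\beta = (1, 2)\) is an assumption made for illustration): it solves the sample moment condition as a linear system and compares the result with `lm()`.

```r
# MOM for the linear model: solve X'(y - X b) = 0, i.e. (X'X) b = X'y.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)          # assumed DGP with beta = (1, 2)
X <- cbind(1, x)                   # design matrix with an intercept column

beta_mom <- solve(crossprod(X), crossprod(X, y))

# The sample moments are zero at the solution (up to rounding error)
moments_at_solution <- crossprod(X, y - X %*% beta_mom) / n

# OLS via lm() for comparison
beta_ols <- coef(lm(y ~ x))
cbind(beta_mom, beta_ols)
```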

5.3 IV as a moment problem

Consider the simple linear regression \[ \bf y=X\beta+u, \quad u \sim IID(0, \sigma^2). \]

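If the model is mis-specified in such a way that the regressors are correlated with the error term (that is, \(\bf X\) is endogenous), then \[ \rm E (\bf X'u)\neq 0. \]

We then have to find an instrumental variable \(\bf Z\) that is correlated with \(\bf X\) but uncorrelated with the error, \[ \rm E (\bf Z'u)= 0. \] Or, \[ \rm E (\bf Z'(y-X\beta))= 0. \]

The sample analog of this is \[ \frac{1}{n} \bf Z'(y-X \hat \beta)=0. \]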

When the model is exactly identified – the number of instruments equals the number of regressors, so \(\bf Z'X\) is square and nonsingular – the sample moment condition has a unique solution, the simple IV estimator \[ \bf \hat \beta=(Z'X)^{-1}Z'y. \] When the model is overidentified – more instruments than regressors – \(\bf Z'X\) is no longer square and we cannot set all sample moments to zero simultaneously. We instead minimize a quadratic form in the moments, which gives the two-stage least squares (2SLS) estimator \[ \bf \hat \beta=(X'P_Z X)^{-1}X'P_Z y, \qquad P_Z = Z(Z'Z)^{-1}Z', \] the special case of the GMM estimator below with weighting matrix \((\bf Z'Z)^{-1}\).
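The formulas above can be sketched in base R. The simulated DGP below is my own assumption, chosen to mimic an omitted-variable problem: `z` is omitted, `w1` and `w2` are valid instruments, and the true slope is 1. The last step checks that 2SLS coincides with the linear GMM formula \((X'ZWZ'X)^{-1}X'ZWZ'y\) when \(W=(Z'Z)^{-1}\).

```r
set.seed(3)
n  <- 10000
z  <- rnorm(n)                       # omitted variable
w1 <- rnorm(n)                       # instrument 1
w2 <- rnorm(n)                       # instrument 2
x  <- 0.6 * z + 0.5 * w1 + 0.3 * w2 + sqrt(0.30) * rnorm(n)
y  <- x + z + rnorm(n)               # true slope on x is 1; z ends up in the error
X  <- cbind(1, x)

# OLS is inconsistent here because x is correlated with the omitted z
beta_ols <- coef(lm(y ~ x))

# Exactly identified: one instrument, simple IV estimator (Z'X)^{-1} Z'y
Z1 <- cbind(1, w1)
beta_iv <- solve(crossprod(Z1, X), crossprod(Z1, y))

# Overidentified: two instruments, 2SLS estimator (X'P_Z X)^{-1} X'P_Z y
Z2   <- cbind(1, w1, w2)
PZ_X <- Z2 %*% solve(crossprod(Z2), crossprod(Z2, X))   # P_Z X, first-stage fitted values
beta_2sls <- solve(crossprod(PZ_X, X), crossprod(PZ_X, y))

# Same estimator written as linear GMM with weighting matrix W = (Z'Z)^{-1}
W  <- solve(crossprod(Z2))
XZ <- crossprod(X, Z2)                                   # X'Z
beta_gmm <- solve(XZ %*% W %*% t(XZ), XZ %*% W %*% crossprod(Z2, y))

cbind(beta_iv, beta_2sls, beta_gmm)
```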

5.4 The Generalized Method of Moments

The expectation \({\rm E}(Y^r)\) for any \(r=1,2, \dots\) is called the \(r^{th}\) (raw) moment of \(Y\). The expectation \({\rm E} [(Y-{\rm E}(Y))^r]\) is called the \(r^{th}\) centered moment of \(Y\).

The mean is the first raw moment.

The variance is the second centered moment.

The third centered moment measures the skewness of the distribution.

The fourth centered moment measures the kurtosis of the distribution, which is interpreted as a measure of “fatness of tails”.

The standardized kurtosis is \[k=\frac{E[(Y-E(Y))^4]}{\left(E[(Y-E(Y))^2]\right)^2}.\]

For a normal distribution, \(k=3.\)

For a \(t\) distribution with \(v > 4\) degrees of freedom, \(k=3+6/(v-4) > 3\), i.e., the \(t\) distribution has fatter tails than a normal distribution (the excess kurtosis \(6/(v-4)\) shrinks toward 0 as \(v\) grows).
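As a rough numerical illustration, the sketch below estimates the standardized kurtosis from simulated draws. The helper name `kurtosis` and the choice \(v=10\) are my own; \(v=10\) is used because the sample kurtosis is very noisy for small \(v\).

```r
set.seed(4)

# Sample analog of k: fourth centered moment over squared second centered moment
kurtosis <- function(y) {
  d <- y - mean(y)
  mean(d^4) / mean(d^2)^2
}

n <- 100000
k_normal <- kurtosis(rnorm(n))          # population value is 3
k_t10    <- kurtosis(rt(n, df = 10))    # population value is 3 + 6 / (10 - 4)

k_t10_theory <- 3 + 6 / (10 - 4)        # formula from the text with v = 10
c(k_normal = k_normal, k_t10 = k_t10, k_t10_theory = k_t10_theory)
```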

The distribution function of a random variable captures all information about the random variable. If the moment generating function exists in a neighborhood of zero, the full set of moments also determines the distribution (without such a condition this can fail: the lognormal distribution is not determined by its moments).

This matters directly for GMM: GMM only ever matches a finite set of moment conditions, and if the underlying distribution is not uniquely pinned down by its (even infinite) sequence of moments, then no amount of moment-matching – however many moment conditions we add – can recover the full data-generating distribution. This is precisely the sense in which GMM asks for less than MLE: MLE assumes we know the full distributional family and estimates its parameters (recovering the whole distribution when correctly specified), while GMM only requires correctly specified moment conditions and only ever identifies the parameters entering those moments, not the full distribution. When the distributional assumption behind MLE is correct, MLE is efficient and gives you more (the whole distribution); when it is not, GMM’s weaker requirements make it more robust.

This distinction underlies the relative strengths and weaknesses of ML and GMM.

5.5 GMM

The statistical model takes the general form \[ E[m(Y_i; \theta_0)]=0 \] where

- \(Y_1, \cdots, Y_n\) are random variables from which the sample \(y_1, \cdots, y_n\) is drawn,
- \(m(Y, \theta)\) is a function specifying the model,
- \(\theta_0\) is the “true value” of the parameter.

\(E[m(Y_i; \theta_0)]=0\) are called the population moment conditions.

Two ideas behind GMM:

- Replace the population mean \(E[.]\) with the sample mean calculated from the observed sample \(y_1, \cdots, y_n\).
- Since \(E[m(Y_i; \theta_0)]=0\), choose \(\hat \theta_{GMM}\) to make \(\frac{1}{n}\sum_{i=1}^{n}m(y_i; \hat \theta_{GMM})\) as close to zero as possible.

Define the notation

\[ \bar m(\theta)=\frac{1}{n} \sum_{i=1}^n m(y_i; \theta). \]

\(\hat \theta_{GMM}\) is chosen to make \(\bar m(\theta)'\bar m(\theta)\) as close to zero as possible.

More generally, \(\hat \theta_{GMM}\) is chosen to minimize \(\bar m(\theta)'W \bar m(\theta)\) for some weighting matrix \(W\).
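To make the minimization concrete, here is a small sketch with one parameter and two moment conditions, so the model is overidentified. The Poisson setup is my own choice for illustration: for \(Y \sim Poisson(\lambda)\) we have \(E[Y-\lambda]=0\) and \(E[Y^2-\lambda-\lambda^2]=0\). The weighting matrix is the identity, which is a simple but not efficient choice.

```r
set.seed(5)
n <- 5000
y <- rpois(n, lambda = 2)          # assumed DGP: true lambda is 2

# Sample moment vector m_bar(lambda)
m_bar <- function(lambda) {
  c(mean(y) - lambda,
    mean(y^2) - lambda - lambda^2)
}

# GMM objective m_bar' W m_bar, with W = identity by default
Q <- function(lambda, W = diag(2)) {
  m <- m_bar(lambda)
  as.numeric(t(m) %*% W %*% m)
}

# One-dimensional minimization over a bracket that contains the true value
fit <- optimize(Q, interval = c(0.01, 10))
lambda_gmm <- fit$minimum
c(lambda_gmm = lambda_gmm, objective = fit$objective)
```

With two moments and one parameter the objective generally cannot be driven exactly to zero, which is why a weighting matrix is needed to trade the moments off against each other.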

5.5.1 An example

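Let’s look at an example of GMM, using the same simulated data as before. The situation is also the same as before: \(x\) is endogenous. We estimate the GMM version of 2SLS.

Here I use R’s “gmm” package, which makes things easy. Its two main arguments are “g” and “x”: “g” specifies the model (here the formula y ~ x, which defines the residual \(u = y - \beta_0 - \beta_1 x\)), and “x” specifies the instruments (here \(w\)). The moment condition in this example is \(E(w u) = 0\), where \(w\) is the instrument and \(u\) is the error of the equation for \(y\) with the endogenous regressor \(x\).

library(MASS)   # mvrnorm()
library(gmm)    # gmm()
library(ivreg)  # ivreg(); assumed here, the AER package also provides an ivreg()

## DGP: data$y <- data$x + data$z + data$u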

set.seed(66)
nobs=10000
nDim = 3
sdxx = 1
sdww=1
sdzz=1

## Here we have three variables: x, z, w.
## z is the omitted variable; x and z are correlated.
## w is the instrument, which is correlated with x but not with z.
## u is independent of everything else.
crxz=.6
crzw=0
crxw=.5

covarMat = matrix( c(sdxx^2, crxz, crxw, crxz, sdzz^2, crzw,  crxw, crzw, sdww^2 ) , nrow=nDim , ncol=nDim )
covarMat

     [,1] [,2] [,3]
[1,]  1.0  0.6  0.5
[2,]  0.6  1.0  0.0
[3,]  0.5  0.0  1.0

data  = data.frame(mvrnorm(n=nobs, mu=rep(0,nDim), Sigma=covarMat ))
names(data) <- c('x','z','w')
data$u <- rnorm(nobs,0,1)
# dgp
data$y <- data$x +  data$z + data$u

gmm.fit = gmm(y ~ x, x = ~w, data = data)
summary(gmm.fit)


Call:
gmm(g = y ~ x, x = ~w, data = data)


Method:  twoStep 

Kernel:  Quadratic Spectral

Coefficients:
             Estimate     Std. Error   t value      Pr(>|t|)   
(Intercept)   1.1531e-02   1.4151e-02   8.1481e-01   4.1518e-01
x             9.5129e-01   2.9971e-02   3.1740e+01  4.3442e-221

J-Test: degrees of freedom is 0 
                J-test                P-value             
Test E(g)=0:    1.16401567357162e-26  *******

# It returns the same estimates as the 2sls results.
tsls.model <- ivreg(y ~ x | w, data=data)
summary(tsls.model)


Call:
ivreg(formula = y ~ x | w, data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-6.144530 -0.962464  0.003414  0.944728  5.340560 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01153    0.01433   0.804    0.421    
x            0.95129    0.03007  31.637   <2e-16 ***

Diagnostic tests:
                  df1  df2 statistic p-value    
Weak instruments    1 9998    3074.7  <2e-16 ***
Wu-Hausman          1 9997     807.8  <2e-16 ***
Sargan              0   NA        NA      NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.431 on 9998 degrees of freedom
Multiple R-Squared: 0.5007, Adjusted R-squared: 0.5006 
Wald test:  1001 on 1 and 9998 DF,  p-value: < 2.2e-16
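The J-test reported above has zero degrees of freedom because the model is exactly identified (one instrument for one endogenous regressor). As a sketch of what changes with an extra instrument, the base R code below runs two-step GMM by hand on its own simulated data. The second instrument `w2` and the DGP are assumptions for illustration, and the weighting matrix is the simple heteroskedasticity-robust one rather than the HAC (Quadratic Spectral) estimate that the gmm output above reports.

```r
set.seed(6)
n  <- 10000
z  <- rnorm(n); w1 <- rnorm(n); w2 <- rnorm(n)
x  <- 0.6 * z + 0.5 * w1 + 0.3 * w2 + sqrt(0.30) * rnorm(n)
y  <- x + z + rnorm(n)              # true slope is 1, z is omitted
X  <- cbind(1, x)
Z  <- cbind(1, w1, w2)              # 3 moment conditions, 2 parameters

# Linear GMM estimator for a given weighting matrix W
linear_gmm <- function(W) {
  XZ <- crossprod(X, Z)             # X'Z
  solve(XZ %*% W %*% t(XZ), XZ %*% W %*% crossprod(Z, y))
}

# Step 1: preliminary estimate with the 2SLS weighting matrix
b1 <- linear_gmm(solve(crossprod(Z) / n))
u1 <- as.numeric(y - X %*% b1)

# Step 2: re-estimate with W = S^{-1}, S = (1/n) * sum of u_i^2 z_i z_i'
S  <- crossprod(Z * u1) / n
b2 <- linear_gmm(solve(S))

# J statistic: n * m_bar' S^{-1} m_bar, chi-squared with (#moments - #parameters) df
u2      <- as.numeric(y - X %*% b2)
m_bar   <- crossprod(Z, u2) / n
J       <- as.numeric(n * t(m_bar) %*% solve(S) %*% m_bar)
df      <- ncol(Z) - ncol(X)
p_value <- pchisq(J, df = df, lower.tail = FALSE)
list(coef = as.numeric(b2), J = J, df = df, p_value = p_value)
```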

In the OLS case, the moment condition would be \(E(X u) = 0\); that is, \(x\) is used as its own instrument.

gmm.ols = gmm(y ~ x, x = ~x, data = data)
summary(gmm.ols)


Call:
gmm(g = y ~ x, x = ~x, data = data)


Method:  twoStep 

Kernel:  Quadratic Spectral

Coefficients:
             Estimate     Std. Error   t value      Pr(>|t|)   
(Intercept)   -0.0042841    0.0126399   -0.3389314    0.7346614
x              1.5961141    0.0128364  124.3424084    0.0000000

J-Test: degrees of freedom is 0 
                J-test                P-value             
Test E(g)=0:    2.66671052451529e-28  *******

# It returns the same estimates as the OLS results.
ols <- lm(y ~ x, data=data)
summary(ols)


Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3506 -0.8612  0.0123  0.8504  4.9096 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.004284   0.012843  -0.334    0.739    
x            1.596114   0.013079 122.035   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.284 on 9998 degrees of freedom
Multiple R-squared:  0.5983,    Adjusted R-squared:  0.5983 
F-statistic: 1.489e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
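As a rough check on the two sets of estimates above, the probability limits of the slope can be worked out from the population values in the DGP (\(Var(x)=1\), \(Cov(x,z)=0.6\), \(Cov(w,x)=0.5\), \(Cov(w,z)=0\), and \(u\) independent of everything else). This is a sketch under those population values, not an output of the code: \[ \mathrm{plim}\, \hat \beta_{OLS}=\frac{Cov(x,y)}{Var(x)}=\frac{Var(x)+Cov(x,z)}{Var(x)}=\frac{1+0.6}{1}=1.6, \] \[ \mathrm{plim}\, \hat \beta_{IV}=\frac{Cov(w,y)}{Cov(w,x)}=\frac{Cov(w,x)+Cov(w,z)}{Cov(w,x)}=\frac{0.5+0}{0.5}=1. \] This is consistent with the OLS slope of about 1.596 and the GMM/2SLS slope of about 0.951: the OLS moment condition \(E(Xu)=0\) fails in this DGP, while the instrument moment condition holds.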

5.6 A few concepts of conditioning

5.6.1 Independence

If \(X\) and \(Y\) are independent then \[ f(x,y)=f(x)f(y) \] and hence \[ f(y|x)=f(y). \]

If \(X\) and \(Y\) are independent then \[ E[g(X)h(Y)]=E[g(X)]\cdot E[h(Y)] \] and hence \[ Cov[g(X),h(Y)]=0, \] i.e., every function of \(X\) is uncorrelated with every function of \(Y\) (provided the expectations exist).

5.6.2 Law of Iterated Expectations

\[ E[Y]=E[E(Y|X)]. \]
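The sample version of this identity holds exactly when \(X\) is discrete: the overall mean equals the average of the group means weighted by the group frequencies. A small sketch on simulated data (the DGP is an assumption for illustration):

```r
set.seed(7)
n <- 10000
x <- sample(1:3, n, replace = TRUE)    # discrete conditioning variable
y <- 2 * x + rnorm(n)                  # so E[Y | X = x] = 2x

cond_means <- as.numeric(tapply(y, x, mean))   # sample analog of E[Y | X = x]
p_x        <- as.numeric(table(x)) / n         # sample analog of P(X = x)

lie_lhs <- mean(y)                     # E[Y]
lie_rhs <- sum(cond_means * p_x)       # E[ E(Y | X) ]
c(lie_lhs, lie_rhs)
```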

5.6.3 Dependence Concepts

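\(X\), \(Y\) independent: \[ Cov[g(X),h(Y)]=0 \quad \mbox{for all functions } g, h \]

\(X\), \(Y\) uncorrelated: \[ Cov[X,Y]=0 \]

\(E[Y|X]=0\) (mean independence): \[ Cov[g(X),Y]=0 \quad \mbox{for all functions } g \]

Mean independence sits between the other two: it is weaker than independence but stronger than zero correlation.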
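The three concepts can be told apart in a simulation. The three constructed variables below are my own examples: `y_indep` is independent of `x`, `y_mi` satisfies \(E[Y|X]=0\) without being independent, and `y_unc` is uncorrelated with `x` but fails \(E[Y|X]=0\).

```r
set.seed(8)
n <- 100000
x <- rnorm(n)

y_indep <- rnorm(n)          # independent of x
y_mi    <- x * rnorm(n)      # E[y_mi | x] = 0, but Var(y_mi | x) = x^2
y_unc   <- x^2 - 1           # Cov(x, y_unc) = E[x^3] = 0, but E[y_unc | x] = x^2 - 1

# Independence: functions of x and functions of y are uncorrelated
c_indep <- cov(x^2, cos(y_indep))

# Mean independence: any g(x) is uncorrelated with y itself ...
c_mi_g  <- cov(x^2, y_mi)
# ... but not necessarily with functions of y
c_mi_gh <- cov(x^2, y_mi^2)

# Uncorrelatedness only: x and y are uncorrelated, but g(x) and y need not be
c_unc   <- cov(x, y_unc)
c_unc_g <- cov(x^2, y_unc)

round(c(c_indep, c_mi_g, c_mi_gh, c_unc, c_unc_g), 3)
```

The first, second and fourth covariances should be near zero, while the third and fifth should be clearly positive (both have population value 2 under these assumptions).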

5.6.4 Regression

A regression model is a model of \(E[Y_i|X_i]\). For example, \[ Y_i=\beta_0+\beta_1X_i+u_i \] where \(E[u_i|X_i]=0\).

5.6.5 GMM regression

The regression model \[ Y_i=\beta_0+\beta_1X_i+u_i, \quad E[u_i|X_i]=0 \] implies, by the law of iterated expectations, the moment conditions \[ E[u_i]=0 \quad \mbox{and} \quad E[X_i u_i]=0 \]

That is, \[ E[Y_i-\beta_0-\beta_1X_i]=0 \] \[ E[X_i(Y_i-\beta_0-\beta_1X_i)]=0 \]

The sample moment conditions are \[ \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat \beta_0-\hat \beta_1x_i)=0 \] \[ \frac{1}{n}\sum_{i=1}^{n}x_i(y_i-\hat \beta_0-\hat \beta_1x_i)=0 \]

These are just the normal equations for OLS.

A characteristic of GMM is that the specification of the model generates the estimator: only \(E[Y_i|X_i]=\beta_0+\beta_1 X_i\) is assumed.

Note that there is no assumption that \(u_i\) is homoscedastic, serially uncorrelated, or normally distributed. These properties affect the statistical properties of the GMM estimator (for example, its variance), not its definition.
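A sketch of this last point on simulated data with heteroskedastic errors (the DGP, with \(Var(u|x)=x^2\), is an assumption for illustration). The point estimate is still the OLS solution of the moment conditions, but the heteroskedasticity-robust (HC0 sandwich) standard errors differ from the classical ones.

```r
set.seed(9)
n <- 5000
x <- rnorm(n)
u <- x * rnorm(n)                   # E[u | x] = 0 but Var(u | x) = x^2
y <- 1 + 2 * x + u
X <- cbind(1, x)

# The estimator is defined by the moment conditions alone: (X'X) b = X'y
XtX_inv  <- solve(crossprod(X))
beta_hat <- XtX_inv %*% crossprod(X, y)
res      <- as.numeric(y - X %*% beta_hat)

# Classical variance, valid only under homoscedasticity
V_classical <- sum(res^2) / (n - ncol(X)) * XtX_inv

# Sandwich (HC0) variance, valid under heteroskedasticity
meat     <- crossprod(X * res)      # sum of res_i^2 * x_i x_i'
V_robust <- XtX_inv %*% meat %*% XtX_inv

se_classical <- sqrt(diag(V_classical))
se_robust    <- sqrt(diag(V_robust))
cbind(estimate = as.numeric(beta_hat), se_classical, se_robust)
```

Under this DGP the robust standard error of the slope should be noticeably larger than the classical one, while the coefficient estimates are identical to those from `lm(y ~ x)`.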