16 Gender Wage Gap Analysis Using US Current Population Survey Data

Author

Xiang Ao

Published

November 17, 2025

Traditionally Oaxaca-Blinder decomposition is used to decompose the wage gap into explained and unexplained parts. The explained part is due to differences in observable characteristics, such as education, experience, occupation, etc. The unexplained part is due to discrimination or other unobserved factors.

It has a problem that when decomposing, using female as reference group or male as reference group will give different results.

Słoczyński (2015) and Słoczyński (2013) have shown that a regression adjustment type of analysis is equivalent to Oaxaca-Blinder decomposition in a nonstandard way.

16.1 Oaxaca-Blinder decomposition

Assume the regression model is different for male and female: \[ y_i = X_i \beta_a + \epsilon_a \]

where \(a\) is gender (female is 1), \(X\)’s are person characteristics.

\[ E[y_i | a_i = 1] - E[y_i | a_i = 0] = E[X_i | a_i=0] (\beta_1 - \beta_0) + (E[X_i | a_i =1 ] - E[X_i | a_i = 0]) \beta_1 \] These two equations are different forms of the Oaxaca-Blinder decomposition. The first element is the “unexplained component”, which is the difference in the coefficients, and the second element is the “explained component”, which is the difference in the means of the covariates.

These two equations can give different results for the two components, one has a baseline of male, and other one female.

We can further shows that

\[ \begin{align} E[y_i | a_i =1] - E[y_i | a_i=0] &= E[X_i | a_i=1] (\beta_1 - \beta_0) + (E[X_i | a_i =1 ] - E[X_i | a_i = 0]) \beta_0 \\ &= E[y_i(1) - y_i(0) | a_i=1] + (E[y_i(0) | a_i=1] - E[y_i(0) | a_i = 0]) \\ &= \tau_{PATT} + (E[y_i(0) | a_i=1] - E[y_i(0) | a_i=0]) \end{align} \]

That is, the observed difference is the sum of \(\tau_{PATT}\) and selection bias. It is also know in the OB decomposition literature as unexplained component and explained component. The former as “discrimination”.

We can simply use Regression Adjustment for \(\tau_{PATT}\). The easiest would be an OLS of \(y_i\) on demeaned \(X_i\) and \(a_i\) and interaction, then the coefficient on \(a_i\) is \(\tau_{PATT}\).

He aslo advocates the “population average gender effect” (PAGE):

\[ \tau_{PAGE} = E[\tau(X_i)] \]

which is basically ATE, or conditional average effect (CATE), then average over the \(X\)’s.

In this version, the two forms of OB decomposition can be \(\tau_{PAGM}\) and \(\tau_{PAGW}\), for men and women.

The nice thing about this estimator \(\tau_{PAGE}\) is that first, it is consistent with the treatment effect (causal inference) literature, and it is invariant to the choice of reference group.

The second advantage is that we also do not need to assume homogeneity of treatment effect, which is a strong assumption in the Oaxaca-Blinder decomposition.

The third advantage is that we can use nonparametric methods to estimate the treatment effect. Traditional OB decomposition is based on linear regression, which assumes a linear relationship between the covariates and the outcome.

16.2 Example

I am using the data from https://github.com/d2cml-ai/CausalAI-Course/tree/main/data.

“The data set we consider is from the 2015 March Supplement of the U.S. Current Population Survey. We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week for at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below 3.”

I am using this data to study gender wage gap (GWP).

We can model GWP using potential outcome framework. Here treatment is gender discrimination, and outcome is logged wage. Treatment is 1 if female, 0 if male.

The ATE would be

\[ \delta_{ATE} = E[Y(1) - Y(0)] \]

where \(Y(1)\) is the potential outcome if treated, and \(Y(0)\) is the potential outcome if not treated.

16.2.1 Regression adjustment

Since this is to estimate ATE, we can use regression adjustment to estimate the ATE. First let’s use de-meaned \(X\)’s. Then the coefficient on gender is the ATE.

We use These variables as \(X\)’s:

“shs”,“hsg”,“scl”,“clg”,“ad”,“ne”,“mw”,“so”,“we”,“exp1”, “occ”, “occ2”, “ind”, and “ind2”.

They are:

“Less then High School”,“High School Graduate”,“Some College”,“College Graduate”,“Advanced Degree”, “Northeast”,“Midwest”,“South”,“West”,“Experience”, “occupation”, “occupation 2”, “industry”, and “industry 2”.

I decide to use “occ2” and “ind2” as they are broader categories of occupation and industry. We don’t have enough data for finer categories “occ” and “ind”.

“exp1” seems to be the only variable that is continuous, let’s demean it first. Let’s also input “occ2” and “ind2” as factors.

Let’s include demeaned “exp1”, sohs”, “hsg”, “socl”, “clg”, “mw”, “sout”, and “we” first. I leave out “occ2” and “ind2” since there are too many levels to demean.

library(tidyverse)
library(broom)


data <- read_csv("wage2015_subsample_inference.csv") |>
  rename(socl = scl, sohs = shs, sout = so) |>
  mutate(exp_dm = exp1 - mean(exp1, na.rm = TRUE), female=sex, occ2 = factor(occ2), ind2 = factor(ind2)) |>
  mutate(sohs_dm = sohs - mean(sohs, na.rm = TRUE),
         hsg_dm = hsg - mean(hsg, na.rm = TRUE),
         socl_dm = socl - mean(socl, na.rm = TRUE),
         clg_dm = clg - mean(clg, na.rm = TRUE),
         mw_dm = mw - mean(mw, na.rm = TRUE),
         sout_dm = sout - mean(sout, na.rm = TRUE),
         we_dm = we - mean(we, na.rm = TRUE))

glimpse(data)

Rows: 5,150
Columns: 30
$ rownames <dbl> 10, 12, 15, 18, 19, 30, 43, 44, 47, 71, 73, 77, 84, 89, 96, 1…
$ wage     <dbl> 9.615385, 48.076923, 11.057692, 13.942308, 28.846154, 11.7307…
$ lwage    <dbl> 2.263364, 3.872802, 2.403126, 2.634928, 3.361977, 2.462215, 2…
$ sex      <dbl> 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1…
$ sohs     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hsg      <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0…
$ socl     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1…
$ clg      <dbl> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ ad       <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ mw       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ sout     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ we       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ne       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ exp1     <dbl> 7.0, 31.0, 18.0, 25.0, 22.0, 1.0, 42.0, 37.0, 31.0, 4.0, 7.0,…
$ exp2     <dbl> 0.4900, 9.6100, 3.2400, 6.2500, 4.8400, 0.0100, 17.6400, 13.6…
$ exp3     <dbl> 0.343000, 29.791000, 5.832000, 15.625000, 10.648000, 0.001000…
$ exp4     <dbl> 0.24010000, 92.35210000, 10.49760000, 39.06250000, 23.4256000…
$ occ      <dbl> 3600, 3050, 6260, 420, 2015, 1650, 5120, 5240, 4040, 3255, 40…
$ occ2     <fct> 11, 10, 19, 1, 6, 5, 17, 17, 13, 10, 13, 14, 11, 11, 1, 19, 1…
$ ind      <dbl> 8370, 5070, 770, 6990, 9470, 7460, 7280, 5680, 8590, 8190, 82…
$ ind2     <fct> 18, 9, 4, 12, 22, 14, 14, 9, 19, 18, 18, 18, 18, 18, 17, 4, 4…
$ exp_dm   <dbl> -6.760583, 17.239417, 4.239417, 11.239417, 8.239417, -12.7605…
$ female   <dbl> 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1…
$ sohs_dm  <dbl> -0.02330097, -0.02330097, -0.02330097, -0.02330097, -0.023300…
$ hsg_dm   <dbl> -0.2438835, -0.2438835, 0.7561165, -0.2438835, -0.2438835, -0…
$ socl_dm  <dbl> -0.2780583, -0.2780583, -0.2780583, -0.2780583, -0.2780583, -…
$ clg_dm   <dbl> 0.6823301, 0.6823301, -0.3176699, -0.3176699, 0.6823301, 0.68…
$ mw_dm    <dbl> -0.2596117, -0.2596117, -0.2596117, -0.2596117, -0.2596117, -…
$ sout_dm  <dbl> -0.2965049, -0.2965049, -0.2965049, -0.2965049, -0.2965049, -…
$ we_dm    <dbl> -0.2161165, -0.2161165, -0.2161165, -0.2161165, -0.2161165, -…

# construct matrices for estimation from the data

# 1. basic model
reg1 <- lm(lwage ~ female, data=data)
tidy(reg1)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   2.99      0.0107    280.    0     
2 female       -0.0383    0.0160     -2.40  0.0165

reg2 <- lm(lwage ~ female*(exp_dm + sohs_dm + hsg_dm + socl_dm + clg_dm + mw_dm + sout_dm + we_dm), data=data)
tidy(reg2)

# A tibble: 18 × 5
   term           estimate std.error statistic  p.value
   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)     3.02     0.00974    310.    0       
 2 female         -0.115    0.0147      -7.80  7.39e-15
 3 exp_dm          0.00801  0.000957     8.37  7.31e-17
 4 sohs_dm        -0.811    0.0620     -13.1   1.63e-38
 5 hsg_dm         -0.706    0.0350     -20.1   6.09e-87
 6 socl_dm        -0.553    0.0351     -15.7   1.45e-54
 7 clg_dm         -0.256    0.0345      -7.41  1.42e-13
 8 mw_dm           0.0337   0.0280       1.20  2.29e- 1
 9 sout_dm         0.0148   0.0271       0.548 5.84e- 1
10 we_dm           0.0555   0.0290       1.91  5.59e- 2
11 female:exp_dm   0.00212  0.00140      1.51  1.30e- 1
12 female:sohs_dm -0.0180   0.117       -0.154 8.78e- 1
13 female:hsg_dm   0.0398   0.0507       0.786 4.32e- 1
14 female:socl_dm  0.0466   0.0482       0.968 3.33e- 1
15 female:clg_dm   0.0992   0.0468       2.12  3.41e- 2
16 female:mw_dm   -0.124    0.0417      -2.98  2.93e- 3
17 female:sout_dm -0.0530   0.0403      -1.31  1.89e- 1
18 female:we_dm   -0.0655   0.0435      -1.51  1.32e- 1

We see the raw gap is 3.8%. Adjusting for some covariates we have 11% gap.

Or, we can use “marginaleffects” instead of demeaning.

reg3 <- lm(lwage ~ female*(exp1 + sohs + hsg+ socl + clg + mw + sout + we ), data=data)

library(marginaleffects)

avg_comparisons(reg3,
                variables = "female",
                vcov = ~subclass)


 Estimate Std. Error    z Pr(>|z|)    S  2.5 %  97.5 %
   -0.115     0.0147 -7.8   <0.001 47.2 -0.143 -0.0858

Term: female
Type: response
Comparison: 1 - 0

We get the same result as the demeaned regression.

If we want \(\tau_{ATT}\), then we can use the “newdata” argument to specify the treatment group,

avg_comparisons(reg3,
                variables = "female",
                vcov = ~subclass,
                newdata = subset(female == 1))


 Estimate Std. Error     z Pr(>|z|)    S  2.5 %  97.5 %
   -0.113     0.0148 -7.65   <0.001 45.5 -0.142 -0.0844

Term: female
Type: response
Comparison: 1 - 0

Next, we adjust for occupation and industry. We can use “occ2” and “ind2” as factors.

reg4 <- lm(lwage ~ female*(exp1 + sohs + hsg+ socl + clg + mw + sout + we + occ2 + ind2), data=data)
avg_comparisons(reg4,
                variables = "female",
                vcov = ~subclass,
                newdata = subset(female == 1))


 Estimate Std. Error     z Pr(>|z|)    S  2.5 %  97.5 %
   -0.075     0.0162 -4.62   <0.001 18.0 -0.107 -0.0432

Term: female
Type: response
Comparison: 1 - 0

So in the model with full interaction, we have a gap of 7.5%.

Now let’s try nonparametric estimation.

16.2.2 Nonparametric estimation

If we’d like to go for nonparametric, relaxing linearity assumption, then we can use “npcausal”. Note this may take a while, depending on which machine learning packages you decide to include. We can also use other nonparametric estimators, such as TMLE or doubleML.

library(npcausal)

#SL.library <- c("SL.earth","SL.glmnet","SL.glm.interaction", "SL.mean","SL.ranger", "SL.xgboost")


SL.library <- c("SL.earth","SL.glmnet","SL.mean","SL.ranger", "SL.xgboost")

# construct W from model matrix of reg3
reg3 <- lm(lwage ~ exp_dm + sohs + hsg+ socl + clg + mw + sout + we + occ2 + ind2, data=data)
W <- as.data.frame(model.matrix(reg3)[, -1]) # remove intercept

Y <- data$lwage
A <- data$female
set.seed(123)
aipw_ate <- ate(y=Y, a=A, x=W, nsplits=5, sl.lib=SL.library)


  |                                                                            
  |                                                                      |   0%


  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |=====================                                                 |  30%


  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |======================================================================| 100%
     parameter         est         se       ci.ll       ci.ul pval
1      E{Y(0)}  2.99695506 0.01093465  2.97552315  3.01838697    0
2      E{Y(1)}  2.93906408 0.01327342  2.91304817  2.96507999    0
3 E{Y(1)-Y(0)} -0.05789098 0.01597522 -0.08920241 -0.02657955    0

aipw_att <- att(y=Y, a=A, x=W, nsplits=5, sl.lib=SL.library)


  |                                                                            
  |                                                                      |   0%


  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==============                                                        |  20%


  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |======================================================================| 100%
      parameter         est         se       ci.ll       ci.ul pval
1      E(Y|A=1)  2.94948490 0.01160804  2.92673314  2.97223666    0
2   E{Y(0)|A=1}  3.01689509 0.01221863  2.99294658  3.04084360    0
3 E{Y-Y(0)|A=1} -0.06741019 0.01409116 -0.09502886 -0.03979152    0

We can see this effect is about -6.6%, similar to the regression adjustment result -7.5%.