24 What model to use for rare events

Published

October 26, 2017

24.1 Introduction

In empirical studies, people often worry about rare events: data with, for example, lots of 0’s and only a few 1’s, or vice versa. Do you run a standard logit model, or do you use a “rare event logit”? When should you use each approach? Or is there a third option?

Paul Allison wrote on his blog (https://statisticalhorizons.com/logistic-regression-for-rare-events):

“Prompted by a 2001 article by King and Zeng, many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare. Although King and Zeng accurately described the problem and proposed an appropriate solution, there are still a lot of misconceptions about this issue.

The problem is not specifically the rarity of events, but rather the possibility of a small number of cases on the rarer of the two outcomes. If you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you’re golden.”

In general I agree with him. However, when exactly should we use King and Zeng’s “relogit”?

Allison also mentioned two other methods. One is the Firth method, a penalized likelihood approach. The other is exact logistic regression, which is intended for small samples.
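As a reminder of what "penalized likelihood" means here, Firth's method maximizes the usual log-likelihood plus a penalty based on the Fisher information. The notation below is my own shorthand, not taken from Allison's post:

$$
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2} \log \lvert I(\beta) \rvert,
\qquad
I(\beta) = X^{\top} W X,
\quad
W = \operatorname{diag}\{\pi_i (1 - \pi_i)\}
$$

Here $\ell(\beta)$ is the ordinary logit log-likelihood and $\pi_i$ is the fitted event probability for case $i$. The penalty keeps the estimates finite even when the data are separated, which is part of why the method is attractive when one outcome has very few cases.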

In this simulation exercise, I compare three methods: the standard logit, bias-reduced logistic regression (using brglm2 with method="brglmFit" and type="AS_mean"), and the Firth model (using logistf).

24.2 Simulation

Here is some code that uses multiple cores to run these three models. The bias-reduced logistic regression is implemented in R by the brglm2 package, and the Firth method by logistf.

library(brglm2)
library(logistf)
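library(snowfall)
# Note: set.seed() seeds only the master process. The snowfall workers use
# their own random number streams, so results can differ between runs.
set.seed(666)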

# initialize parallel cores.
sfInit(parallel=TRUE, cpus=16)

gen.sim <- function(df){
  x <- rnorm(df['nobs'], 0, 1)

  # generate binary data
  p <- df['p']
  alpha <- -log((1-p)/p)
  z = alpha + 2*x
  pr = 1/(1+exp(-z))
  y = rbinom(df['nobs'], 1, pr)
  df = data.frame(y=y, x=x)

  # With small nobs and small p (e.g. nobs=10, p=.01), y can easily come out
  # all-zero, which leaves no variation in y from which to estimate a
  # coefficient on x. Guard against that rather than letting a single
  # degenerate draw disrupt the whole parallel simulation loop.
  if (var(y) == 0) {
    return(c(logit=NA, brglm2=NA, logistf=NA))
  }

  # logit
  m1 <- tryCatch(glm(y ~ x, family='binomial'), error = function(e) NULL)
  m1.x <- if (!is.null(m1)) summary(m1)$coefficients['x','Estimate'] - 2 else NA

  # bias-reduced logistic regression (brglm2)
  m2 <- tryCatch(
    glm(y ~ x, family = binomial(link = "logit"), data = df,
        method = "brglmFit", type = "AS_mean"),
    error = function(e) NULL
  )
  # [[ ]] drops the 'x' name so the result columns stay logit/brglm2/logistf
  m2.x <- if (!is.null(m2)) coef(m2)[['x']] - 2 else NA

  # logistf (Firth)
  m3 <- tryCatch(logistf(y ~ x, data=df), error = function(e) NULL)
  m3.x <- if (!is.null(m3)) coef(m3)[['x']] - 2 else NA

  return(c(logit=m1.x, brglm2=m2.x, logistf=m3.x))
}

# set parameter space
sim.grid = seq(1, 100, 1)
p.grid = c(.01, .05, .1)
nobs.grid = c(10, 30, 50, 100, 200, 500, 1000, 10000)

data.grid <- expand.grid(nobs.grid, sim.grid, p.grid)
names(data.grid) <- c('nobs', 'nsim', 'p')

# export functions and libraries to parallel workers
sfExport(list=list("gen.sim"))
sfLibrary(brglm2)
sfLibrary(logistf)

results <- data.frame(t(sfApply(data.grid, 1, gen.sim)))

# stop the cluster
sfStop()

forshiny <- cbind(data.grid, results)
write.csv(forshiny, 'results.csv')

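We run 100 simulations for each combination of sample size (from 10 to 10,000) and event probability (.01, .05, and .1). In the code, p sets the intercept, so it is the event probability at x = 0.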
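One detail of this design is worth checking: because x also shifts the probability (with a slope of 2), the overall share of 1's in a simulated sample is not the same as p. The sketch below estimates that marginal event rate by numerical integration over x ~ N(0, 1). The function name `marginal_rate` is my own and is not part of the original simulation code.

```r
# Marginal event rate implied by the design:
#   y ~ Bernoulli(plogis(alpha + slope * x)), x ~ N(0, 1), alpha = logit(p)
# marginal_rate() is a hypothetical helper, not part of the original post.
marginal_rate <- function(p, slope = 2) {
  alpha <- qlogis(p)  # identical to -log((1 - p) / p) in gen.sim
  integrand <- function(x) plogis(alpha + slope * x) * dnorm(x)
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

p.grid <- c(.01, .05, .1)
rates <- sapply(p.grid, marginal_rate)
data.frame(p = p.grid, marginal_rate = rates)
```

For p below .5 the marginal rate is larger than p, so the expected number of events in a cell is somewhat higher than nobs times p.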

Since there are many simulations, we used the snowfall library to speed things up.

(The original post plotted bias and MSE by sample size and event probability from the results.csv output above; those figures are not reproduced here — the findings are summarized in prose below.)

When the sample is small and the event is rare (for example, whenever the product of sample size and event probability is less than 5), none of the three models performs well. This is understandable: the rarer of the two groups is expected to have fewer than 5 observations. When the product is more than 50, there is not much difference between the three models. In between, when the product is greater than 5 but less than 50, the bias-reduced logistic regression (brglm2) and logistf perform better than logit. In most cases, logistf is the best.

In small-sample situations, exact logistic regression may be the better choice.
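The comparison above can be tabulated with a few lines of base R. This is only a sketch: it assumes a data frame shaped like the simulation output, with columns nobs and p plus one error column per method (estimate minus the true slope of 2), named logit, brglm2, and logistf. The helper `summarize_errors` and the toy numbers are made up for illustration and are not the post's actual results.

```r
# Bias and MSE of the slope estimate per (nobs, p) cell and method.
# summarize_errors() is a hypothetical helper; column names are assumptions.
summarize_errors <- function(res, methods = c("logit", "brglm2", "logistf")) {
  cells <- split(res, list(res$nobs, res$p), drop = TRUE)
  rows <- lapply(cells, function(cell) {
    do.call(rbind, lapply(methods, function(m) {
      err <- cell[[m]]
      data.frame(nobs = cell$nobs[1], p = cell$p[1],
                 expected_events = cell$nobs[1] * cell$p[1],
                 method = m,
                 bias = mean(err, na.rm = TRUE),
                 mse = mean(err^2, na.rm = TRUE),
                 n_failed = sum(is.na(err)))  # draws that returned NA
    }))
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  # Bands used in the text: product under 5, between 5 and 50, 50 or more
  out$band <- cut(out$expected_events, breaks = c(0, 5, 50, Inf),
                  labels = c("under 5", "5 to 50", "50 or more"),
                  right = FALSE)
  out
}

# Made-up input, only to show the shape of the output
toy <- data.frame(
  nobs    = c(100, 100, 1000, 1000),
  p       = c(.1, .1, .1, .1),
  logit   = c(0.5, -0.1, 0.02, -0.02),
  brglm2  = c(0.3, -0.1, 0.02, -0.02),
  logistf = c(0.3, -0.1, 0.01, -0.03)
)
summary_table <- summarize_errors(toy)
summary_table
```

On the real output, something like `summarize_errors(read.csv('results.csv'))` should work as long as the column names match the assumption above.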
