As a follow-up to an earlier post, I was pleasantly surprised to discover that the code to handle two-way cluster-robust standard errors in R that I blogged about earlier worked out of the box with the IV regression routine available in the AER package (ivreg).
# Tests of R function using Mitchell Petersen's test-data # Read the data test <- read.table( url(paste("http://www.kellogg.northwestern.edu/", "faculty/petersen/htm/papers/se/", "test_data.txt",sep="")), col.names=c("firmid", "year", "x", "y")) # Create a silly instrument (x plus noise) test <- within(test, z <- x + rnorm(length(x), mean=0, sd=sd(x)/3)) # The fitted model library(AER) fm <- ivreg(y ~ x | z, data=test) # Tests library(sandwich); library(lmtest) coeftest(fm) # OLS coeftest(fm, vcov=vcovHC(fm, type="HC0")) # White source("http://iangow.me/~igow/code/cluster2.R") coeftest.cluster(test,fm, cluster1="firmid") # Clustered by firm coeftest.cluster(test,fm, cluster1="year") # Clustered by year coeftest.cluster(test,fm, cluster1="firmid", cluster2="year") # Clustered by firm and year # Save the data to a Stata file for comparison library(foreign) write.dta(dataframe=test, file="~/Dropbox/WRDS/petersen.dta")
Here’s the output
> test <- within(test, z <- x + rnorm(length(x), mean=0, sd=sd(x)/3)) Error in within(test, z <- x + rnorm(length(x), mean = 0, sd = sd(x)/3)) : object 'test' not found > test <- read.table( + url(paste("http://www.kellogg.northwestern.edu/", + "faculty/petersen/htm/papers/se/", + "test_data.txt",sep="")), + col.names=c("firmid", "year", "x", "y")) > > test <- within(test, z <- x + rnorm(length(x), mean=0, sd=sd(x)/3)) > fm <- ivreg(y ~ x | z, data=test) Error: could not find function "ivreg" > library(AER) Loading required package: car Loading required package: MASS Loading required package: nnet Loading required package: survival Loading required package: splines Loading required package: Formula Loading required package: lmtest Loading required package: zoo Attaching package: ‘zoo’ The following object(s) are masked from ‘package:base’: as.Date, as.Date.numeric Loading required package: sandwich Loading required package: strucchange > fm <- ivreg(y ~ x | z, data=test) > library(sandwich); library(lmtest) > coeftest(fm) # OLS t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029685 0.028359 1.0468 0.2953 x 1.033802 0.030142 34.2982 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest(fm, vcov=vcovHC(fm, type="HC0")) # White t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029685 0.028354 1.0469 0.2952 x 1.033802 0.029851 34.6326 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest.cluster(test,fm, cluster1="firmid") # Clustered by firm Error: could not find function "coeftest.cluster" > coeftest.cluster(test,fm, cluster1="year") # Clustered by year Error: could not find function "coeftest.cluster" > source("~/Dropbox/AGL/Code/R/cluster2.R") > coeftest.cluster(test,fm, cluster1="firmid") # Clustered by firm t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029685 0.067011 0.443 0.6578 x 1.033802 0.051423 20.104 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest.cluster(test,fm, cluster1="year") # Clustered by year t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029685 0.023395 1.2688 0.2046 x 1.033802 0.036246 28.5219 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > coeftest.cluster(test,fm, cluster1="firmid", + cluster2="year") # Clustered by firm and year t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.029685 0.065065 0.4562 0.6482 x 1.033802 0.055377 18.6684 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > library(foreign) > write.dta(dataframe=test, file="~/Dropbox/WRDS/petersen.dta")
Now, compare with Stata. First, I use the in-built ivreg routine, then I consider the ivreg2 routine available here.
It seems that the R routine produces identical estimates of standard errors to Stata’s ivreg routine, which only handles one-way clustering, but both produce different estimates from the ivreg2 routine. My guess is that it’s some differences in the degrees-of-freedom correction used; the numbers are fairly close.
. use "~/Dropbox/WRDS/petersen.dta" (Written by R. ) . ivreg y (x=z) Instrumental variables (2SLS) regression Source | SS df MS Number of obs = 5000 -------------+------------------------------ F( 1, 4998) = 1176.37 Model | 5270.65875 1 5270.65875 Prob > F = 0.0000 Residual | 20097.6441 4998 4.02113727 R-squared = 0.2078 -------------+------------------------------ Adj R-squared = 0.2076 Total | 25368.3028 4999 5.0746755 Root MSE = 2.0053 ------------------------------------------------------------------------------ y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | 1.033802 .0301416 34.30 0.000 .9747116 1.092893 _cons | .0296853 .0283594 1.05 0.295 -.0259115 .0852821 ------------------------------------------------------------------------------ Instrumented: x Instruments: z ------------------------------------------------------------------------------ . ivreg y (x=z), cluster(firmid) Instrumental variables (2SLS) regression Number of obs = 5000 F( 1, 499) = 404.17 Prob > F = 0.0000 R-squared = 0.2078 Root MSE = 2.0053 (Std. Err. adjusted for 500 clusters in firmid) ------------------------------------------------------------------------------ | Robust y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | 1.033802 .0514227 20.10 0.000 .9327707 1.134834 _cons | .0296853 .0670107 0.44 0.658 -.1019726 .1613431 ------------------------------------------------------------------------------ Instrumented: x Instruments: z ------------------------------------------------------------------------------ . ivreg y (x=z), cluster(year) Instrumental variables (2SLS) regression Number of obs = 5000 F( 1, 9) = 813.50 Prob > F = 0.0000 R-squared = 0.2078 Root MSE = 2.0053 (Std. Err. adjusted for 10 clusters in year) ------------------------------------------------------------------------------ | Robust y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | 1.033802 .0362459 28.52 0.000 .9518085 1.115796 _cons | .0296853 .0233954 1.27 0.236 -.0232389 .0826094 ------------------------------------------------------------------------------ Instrumented: x Instruments: z ------------------------------------------------------------------------------ . ivreg2 y (x=z), cluster(year) IV (2SLS) estimation -------------------- Estimates efficient for homoskedasticity only Statistics robust to heteroskedasticity and clustering on year Number of clusters (year) = 10 Number of obs = 5000 F( 1, 9) = 813.50 Prob > F = 0.0000 Total (centered) SS = 25368.30284 Centered R2 = 0.2078 Total (uncentered) SS = 25374.51146 Uncentered R2 = 0.2080 Residual SS = 20097.64409 Root MSE = 2.005 ------------------------------------------------------------------------------ | Robust y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | 1.033802 .0343824 30.07 0.000 .9664141 1.101191 _cons | .0296853 .0221926 1.34 0.181 -.0138115 .073182 ------------------------------------------------------------------------------ Underidentification test (Kleibergen-Paap rk LM statistic): 9.987 Chi-sq(1) P-val = 0.0016 ------------------------------------------------------------------------------ Weak identification test (Cragg-Donald Wald F statistic): 4.5e+04 (Kleibergen-Paap rk Wald F statistic): 6.1e+04 Stock-Yogo weak ID test critical values: 10% maximal IV size 16.38 15% maximal IV size 8.96 20% maximal IV size 6.66 25% maximal IV size 5.53 Source: Stock-Yogo (2005). Reproduced by permission. NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors. ------------------------------------------------------------------------------ Hansen J statistic (overidentification test of all instruments): 0.000 (equation exactly identified) ------------------------------------------------------------------------------ Instrumented: x Excluded instruments: z ------------------------------------------------------------------------------ . ivreg2 y (x=z), cluster(firmid) IV (2SLS) estimation -------------------- Estimates efficient for homoskedasticity only Statistics robust to heteroskedasticity and clustering on firmid Number of clusters (firmid) = 500 Number of obs = 5000 F( 1, 499) = 404.17 Prob > F = 0.0000 Total (centered) SS = 25368.30284 Centered R2 = 0.2078 Total (uncentered) SS = 25374.51146 Uncentered R2 = 0.2080 Residual SS = 20097.64409 Root MSE = 2.005 ------------------------------------------------------------------------------ | Robust y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | 1.033802 .0513661 20.13 0.000 .9331267 1.134478 _cons | .0296853 .0669369 0.44 0.657 -.1015087 .1608792 ------------------------------------------------------------------------------ Underidentification test (Kleibergen-Paap rk LM statistic): 308.025 Chi-sq(1) P-val = 0.0000 ------------------------------------------------------------------------------ Weak identification test (Cragg-Donald Wald F statistic): 4.5e+04 (Kleibergen-Paap rk Wald F statistic): 3.2e+04 Stock-Yogo weak ID test critical values: 10% maximal IV size 16.38 15% maximal IV size 8.96 20% maximal IV size 6.66 25% maximal IV size 5.53 Source: Stock-Yogo (2005). Reproduced by permission. NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors. ------------------------------------------------------------------------------ Hansen J statistic (overidentification test of all instruments): 0.000 (equation exactly identified) ------------------------------------------------------------------------------ Instrumented: x Excluded instruments: z ------------------------------------------------------------------------------ . ivreg2 y (x=z), cluster(firmid fyear) variable fyear not found in option cluster() r(111); . ivreg2 y (x=z), cluster(firmid year) IV (2SLS) estimation -------------------- Estimates efficient for homoskedasticity only Statistics robust to heteroskedasticity and clustering on firmid and year Number of clusters (firmid) = 500 Number of obs = 5000 Number of clusters (year) = 10 F( 1, 9) = 328.27 Prob > F = 0.0000 Total (centered) SS = 25368.30284 Centered R2 = 0.2078 Total (uncentered) SS = 25374.51146 Uncentered R2 = 0.2080 Residual SS = 20097.64409 Root MSE = 2.005 ------------------------------------------------------------------------------ | Robust y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | 1.033802 .0541255 19.10 0.000 .9277184 1.139886 _cons | .0296853 .0645686 0.46 0.646 -.096867 .1562375 ------------------------------------------------------------------------------ Underidentification test (Kleibergen-Paap rk LM statistic): 9.731 Chi-sq(1) P-val = 0.0018 ------------------------------------------------------------------------------ Weak identification test (Cragg-Donald Wald F statistic): 4.5e+04 (Kleibergen-Paap rk Wald F statistic): 4.0e+04 Stock-Yogo weak ID test critical values: 10% maximal IV size 16.38 15% maximal IV size 8.96 20% maximal IV size 6.66 25% maximal IV size 5.53 Source: Stock-Yogo (2005). Reproduced by permission. NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors. ------------------------------------------------------------------------------ Hansen J statistic (overidentification test of all instruments): 0.000 (equation exactly identified) ------------------------------------------------------------------------------ Instrumented: x Excluded instruments: z ------------------------------------------------------------------------------
I don’t fully comprehend the R script, I must admit, but would you not need to correct the SE when using IV? When using OLS on real data, the script above give me almost identical SE as Stata, but when using AER ivreg there are rather big differences between Stata and R.
Here is a page on clustered errors in IV is SAS:
http://works.bepress.com/tbrachet/2/
Absolutely one may need to use cluster-robust standard errors with IV (OLS is simply a special case of IV with X=Z). I’ve found AER ivreg and Stata’s ivreg to give the same results. I believe this is the case for the example on the blog posting.
Perhaps, my question was unclear (english is my second language). In
general, should the coeftest.cluster() function give the correct
‘theoretical’ sample clustered SEs when using IV (say using AER’s
ivreg)?
For example, doing manual TSLS I’d need to correct my SEs ex post, and
I wondered whether the coeftest.cluster()/ivreg() are `smart’ enough
to account for this (i.e. does ivreg provides measures in such a way
so that you’d expect clustered SEs to correct sample estimates?).
As said, when using real data I do at times get somewhat different
t-stats, which could be just different coding, but could also be due
to coeftest.cluster() `assuming’ OLS when it is not appropriate.
Thanks again.
coeftest.cluster() takes a fitted model as an input and, if the fitted model comes from AER’s ivreg(), it produces standard errors equal to those given by Stata’s ivreg for the same data and clustering. So it won’t “assume OLS when it is not appropriate.”
Thanks for clearing that up Ian. I appreciate it. When I’ve got better time I will have to dig further in to core R functions 🙂
Thanks again
I got error ‘undefined column selected’ when my model includes factor variables. Can you make the function works with factor regressors? Thanks
Muhammad:
It must be something more than the inclusion of factor variables. If I modify the code to include
test$year <- as.factor(test$year)
it still works fine.
This should be useful: https://cran.r-project.org/web/packages/ivpack/ivpack.pdf