# CovarianceMatrices.jl

[Build Status](https://travis-ci.org/gragusa/CovarianceMatrices.jl)
[Package Evaluator](http://pkg.julialang.org/?pkg=CovarianceMatrices&ver=0.5)
[Coverage Status](https://coveralls.io/github/gragusa/CovarianceMatrices.jl?branch=master)
[codecov.io](http://codecov.io/github/gragusa/CovarianceMatrices.jl?branch=master)

Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation for Julia.

## Installation

The package is registered in [METADATA.jl](https://github.com/JuliaLang/METADATA.jl), so installing it is as simple as issuing

```julia
Pkg.add("CovarianceMatrices")
```

## Introduction

This package provides types and methods for obtaining consistent estimates of the long-run covariance matrix of a random process.

Three classes of estimators are considered:

1. **HAC** - heteroskedasticity and autocorrelation consistent (Andrews, 1991; Newey and West, 1994)
2. **HC** - heteroskedasticity consistent (White, 1980)
3. **CRVE** - cluster robust variance estimators (Arellano, 1987)

The typical application of these estimators is inference about the parameters of a model. Accordingly, this package extends methods defined in [StatsBase.jl](http://github.com/JuliaStats/StatsBase.jl) and [GLM.jl](http://github.com/JuliaStats/GLM.jl) to make it easy to obtain inference that is robust to heteroskedasticity and/or autocorrelation, and to the presence of cluster components.

# Quick tour

## HAC (Heteroskedasticity and Autocorrelation Consistent)

Available kernel types are:

- `TruncatedKernel`
- `BartlettKernel`
- `ParzenKernel`
- `TukeyHanningKernel`
- `QuadraticSpectralKernel`

For example, `ParzenKernel(NeweyWest)` returns an instance of `ParzenKernel` parametrized by `NeweyWest`, the type that corresponds to the optimal bandwidth calculated following Newey and West (1994). Similarly, `ParzenKernel(Andrews)` corresponds to the optimal bandwidth obtained in Andrews (1991). If the bandwidth is known, it can be passed directly, e.g. `TruncatedKernel(2)`.
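
For instance, the following sketch constructs kernels with fixed and automatically selected bandwidths (the variable names and the bandwidth value `2.0` are arbitrary illustrations):

```julia
k_fixed   = ParzenKernel(2.0)                   # fixed, user-supplied bandwidth
k_andrews = ParzenKernel(Andrews)               # bandwidth selected following Andrews (1991)
k_nw      = QuadraticSpectralKernel(NeweyWest)  # bandwidth selected following Newey and West (1994)
```

Any of these instances can then be passed to `vcov` or `stderr`, as shown below.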

The examples below should clarify the API, which is in any case fairly simple to use.

### Long run variance of the regression coefficient

In the regression context, the function `vcov` does all the work. Its signature is
```julia
vcov(::DataFrameRegressionModel, ::HAC; prewhite = true)
```

The example below shows a typical call on a regression model fit to generated data (a regression with an autoregressive error component):
```julia
using CovarianceMatrices
using DataFrames
using GLM

# Simulate a regression with an AR(1) error component,
# discarding the first n draws of u as burn-in
srand(1)
n = 500
x = randn(n, 5)
u = Array{Float64}(2*n)
u[1] = rand()
for j in 2:2*n
    u[j] = 0.78*u[j-1] + randn()
end
u = u[n+1:2*n]
y = 0.1 + x*[0.2, 0.3, 0.0, 0.0, 0.5] + u

# Collect the response and the regressors into a DataFrame
df = DataFrame()
df[:y] = y
for j in enumerate([:x1, :x2, :x3, :x4, :x5])
    df[j[2]] = x[:, j[1]]
end
```
Using the data in `df`, the coefficients of the regression can be estimated using `GLM`:

```julia
lm1 = glm(y~x1+x2+x3+x4+x5, df, Normal(), IdentityLink())
```

To get a consistent estimate of the long-run variance of the estimated coefficients using a Quadratic Spectral kernel with automatic bandwidth selection à la Andrews:
```julia
vcov(lm1, QuadraticSpectralKernel(Andrews), prewhite = false)
```
If one wants to estimate the long-run variance using the same kernel, but with a bandwidth selected à la Newey-West:
```julia
vcov(lm1, QuadraticSpectralKernel(NeweyWest), prewhite = false)
```
The standard errors can be obtained with the `stderr` function, whose signature is
```julia
stderr(::DataFrameRegressionModel, ::HAC; prewhite = true)
```
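
As a check, the standard errors are, in effect, the square roots of the diagonal of the corresponding `vcov` estimate (a sketch reusing `lm1` from above):

```julia
sqrt.(diag(vcov(lm1, QuadraticSpectralKernel(Andrews), prewhite = false)))
```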
Sometimes it is useful to access the automatically selected bandwidth. This can be done with the `optimalbw` function:
```julia
optimalbw(NeweyWest, QuadraticSpectralKernel, lm1; prewhite = false)
optimalbw(Andrews, QuadraticSpectralKernel, lm1; prewhite = false)
```

### Long run variance of the average of the process

Sometimes it is of interest to estimate the long-run variance of the average of the process. At the moment, this can be done by carrying out a regression on a constant (the sample mean of the realization of the process) and using `vcov` or `stderr` to obtain a consistent estimate of the variance.

```julia
df[:u] = u   # the error process must be a column of df for the formula interface
lm2 = glm(u~1, df, Normal(), IdentityLink())
vcov(lm2, ParzenKernel(NeweyWest), prewhite = false)
stderr(lm2, ParzenKernel(NeweyWest), prewhite = false)
```

## HC (Heteroskedasticity consistent)

As in the HAC case, `vcov` and `stderr` are the main functions. They now take as argument the type of robust variance being sought:
```julia
vcov(::DataFrameRegressionModel, ::HC)
```
where `HC` is an abstract type with the following concrete subtypes:

- `HC0`
- `HC1` (this is `HC0` with the degrees-of-freedom adjustment)
- `HC2`
- `HC3`
- `HC4`
- `HC4m`
- `HC5`
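
For reference, a minimal sketch of the sandwich formula that `HC0` implements for a linear model, assuming `X` is the design matrix and `e` the vector of residuals (this mirrors the textbook formula, not necessarily the package's internal code; `hc0` is a hypothetical helper name):

```julia
# HC0: inv(X'X) * X' * diag(e.^2) * X * inv(X'X)
function hc0(X, e)
    bread = inv(X'X)
    meat  = X' * diagm(e.^2) * X   # outer products weighted by squared residuals
    bread * meat * bread
end
```

`HC1` rescales this estimate by `n/(n-k)`, where `n` is the number of observations and `k` the number of regressors.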

```julia
using CovarianceMatrices
using DataFrames
using GLM

# A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)
# The weights are added just to test the interface and are not part
# of the original data
clotting = DataFrame(
    u    = log([5,10,15,20,30,40,60,80,100]),
    lot1 = [118,58,42,35,27,25,21,19,18],
    lot2 = [69,35,26,21,18,16,13,12,12],
    w    = 9*[1/8, 1/9, 1/25, 1/6, 1/14, 1/25, 1/15, 1/13, 0.3022039]
)
wOLS = fit(GeneralizedLinearModel, lot1~u, clotting, Normal(), wts = array(clotting[:w]))

vcov(wOLS, HC0)
vcov(wOLS, HC1)
vcov(wOLS, HC2)
vcov(wOLS, HC3)
vcov(wOLS, HC4)
vcov(wOLS, HC4m)
vcov(wOLS, HC5)
```

## CRHC (Cluster robust heteroskedasticity consistent)

The API of this class of variance estimators is subject to change, so please use it with care. In particular, since the `CRHC` type needs to know which variable is used for clustering, a clean way to pass it still has to be worked out. For the moment, the following approach works, as long as no missing values are present in the original DataFrame.

```julia
using RDatasets
df = dataset("plm", "Grunfeld")
lm = glm(Inv~Value+Capital, df, Normal(), IdentityLink())
# Pass the clustering variable (here the firm identifier) to CRHC1
vcov(lm, CRHC1(convert(Array, df[:Firm])))
stderr(lm, CRHC1(convert(Array, df[:Firm])))
```