By Benjamin Avanzi, Guillaume Boglioni Beaulieu, Pierre Lafaye de Micheaux, Ho Ming Lee, Bernard Wong, and Rui Zhou.
This repository contains the code for the paper "Beyond pairwise correlation: capturing nonlinear and higher-order dependence with distance statistics", submitted to the 2026 All Actuaries Summit. The paper introduces distance-based dependence statistics for testing and modelling dependence between random variables and/or random vectors, as tools that complement the correlation coefficient. We illustrate these statistics on a range of real-world and synthetic datasets. The paper can be found here.
The code for visualising the motivating examples, together with illustrations of the computation of the distance-based dependence statistics, is available here.
We also provide the datasets used for illustration in this repository. They can be found in the data folder.
The datasets used in this paper are:
- World demographics data (CIA, 2020): This dataset contains the birth and death rates from different countries in the first trimester of 2020. It can be loaded in R using the HellCor package via:

      library(HellCor)
      data("worlddemographics")

- Multi-line insurance data: This dataset is a synthetic dataset generated to illustrate the pitfalls of correlation. It represents an insurance portfolio with two lines of business: motor insurance, with three components (vehicle repair cost, bodily injury liability, and claims handling cost), and another line containing medical claim cost. Details of the data generation can be found on this page.
- pg15training (Dutang & Charpentier, 2026): This dataset contains 100,000 third-party liability policies for a private motor insurance product in France. It is one of the datasets in the R package CASdatasets:

      library(CASdatasets)
      data("pg15training")

- S&P-500 data: This dataset contains the monthly stock returns downloaded from Yahoo Finance over the period 1926-01-01 to 2026-03-20. It was downloaded using the following code in R:

      library(quantmod)  # provides getSymbols(), Cl() and monthlyReturn()
      getSymbols("^GSPC", src = "yahoo", from = "1926-01-01")
      price <- Cl(GSPC)                          # closing prices
      ret <- monthlyReturn(price, type = "log")  # monthly log returns

- LA mortality data (Shumway et al., 1988): This dataset contains weekly data on cardiovascular mortality, temperature, and pollutant particulates for Los Angeles County over the period 1970–1979. It can be loaded in R using the following code:

      library(dCovTS)
      data(MortTempPart)

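The S&P-500 item above works with log returns. As a minimal Python sketch of what that means (this is not the quantmod implementation, and the price values below are hypothetical), the log return between two consecutive prices is log(P_t / P_{t-1}):

```python
import math


def log_returns(prices):
    """Return log(P_t / P_{t-1}) for each pair of consecutive prices."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive")
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical month-end prices, for illustration only.
month_end_prices = [100.0, 105.0, 99.75, 110.0]
returns = log_returns(month_end_prices)

# Log returns are additive: their sum is the log of the total growth factor.
total_log_return = sum(returns)
print(returns)
print(total_log_return, math.log(month_end_prices[-1] / month_end_prices[0]))
```

Additivity over time is one common reason for preferring log returns to simple returns when working with long series.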
For those interested in running the code themselves, the main files are r_illust.qmd and jdcov.qmd. The file r_illust.qmd is written in R and contains the illustrations for the Hellinger correlation (Geenens & Lafaye de Micheaux, 2022), distance covariance (Székely & Rizzo, 2009), and the auto-distance correlation function (Zhou, 2012).
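For readers who want a feel for these statistics before opening the notebooks, the following is a minimal pure-Python sketch of the sample distance correlation in its V-statistic form (Székely & Rizzo, 2009) for univariate samples, with a simple lagged version in the spirit of the auto-distance correlation function. It is not the code in r_illust.qmd; all function names and the toy data are assumptions made for illustration, and packaged implementations may use different normalisations or bias corrections.

```python
import math


def _centred_distances(x):
    """Double-centred matrix of pairwise absolute distances of a 1-D sample."""
    n = len(x)
    d = [[abs(a - b) for b in x] for a in x]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so the column means equal the row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_covariance_sq(x, y):
    """Squared sample distance covariance (V-statistic form)."""
    if len(x) != len(y):
        raise ValueError("samples must have the same length")
    n = len(x)
    a, b = _centred_distances(x), _centred_distances(y)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2


def distance_correlation(x, y):
    """Sample distance correlation, a value in [0, 1]."""
    dcov_sq = distance_covariance_sq(x, y)
    denom = math.sqrt(distance_covariance_sq(x, x) * distance_covariance_sq(y, y))
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov_sq, 0.0) / denom)


def auto_distance_correlation(series, lag):
    """Distance correlation between a series and itself shifted by `lag`.

    Assumption: the lagged pairs (X_t, X_{t+lag}) are treated as an ordinary
    paired sample; this is a simplification for illustration.
    """
    if not 0 < lag < len(series) - 1:
        raise ValueError("lag must satisfy 0 < lag < len(series) - 1")
    return distance_correlation(series[:-lag], series[lag:])


def pearson(x, y):
    """Plain Pearson correlation coefficient, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data: y is a deterministic but non-monotone function of x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]
print("Pearson correlation: ", pearson(x, y))
print("Distance correlation:", distance_correlation(x, y))
```

On this toy sample the Pearson correlation is zero even though y is fully determined by x, while the distance correlation is strictly positive. The sketch is quadratic in the sample size, so it is intended for small examples only.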
For the joint distance covariance (Chakraborty & Zhang, 2019), you can run jdcov.qmd, which is written in Python. The code was executed in Quarto with Python 3.14.3. Required packages are listed in requirements.txt and can be installed with:

    pip install -r requirements.txt

Central Intelligence Agency. (2020). The world factbook.
Chakraborty, S., & Zhang, X. (2019). Distance metrics for measuring joint dependence with application to causal inference. Journal of the American Statistical Association, 114(528), 1638–1650. https://doi.org/10.1080/01621459.2018.1513364
Dutang, C., & Charpentier, A. (2026). CASdatasets: Insurance datasets [R package version 1.2-1]. https://doi.org/10.57745/P0KHAG
Geenens, G., & Lafaye de Micheaux, P. (2022). The Hellinger correlation. Journal of the American Statistical Association, 117(538), 639–653. https://doi.org/10.1080/01621459.2020.1791132
Shumway, R. H., Azari, A. S., & Pawitan, Y. (1988). Modeling mortality fluctuations in Los Angeles as functions of pollution and weather effects. Environmental Research, 45(2), 224–241. https://doi.org/10.1016/S0013-9351(88)80049-5
Székely, G. J., & Rizzo, M. L. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3(4), 1236–1265. https://doi.org/10.1214/09-AOAS312
Zhou, Z. (2012). Measuring nonlinear dependence in time-series, a distance correlation approach. Journal of Time Series Analysis, 33(3), 438–457. https://doi.org/10.1111/j.1467-9892.2011.00780.x