This repository was archived by the owner on May 4, 2019. It is now read-only.

Port to Nulls.jl #288

Merged
merged 19 commits into from
Oct 19, 2017

22 changes: 7 additions & 15 deletions README.md
@@ -11,30 +11,22 @@ Documentation:
[![](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaStats.github.io/DataArrays.jl/stable)
[![](https://img.shields.io/badge/docs-latest-blue.svg)](https://JuliaStats.github.io/DataArrays.jl/latest)

The DataArrays package provides array types for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
In particular, it provides the following:

The DataArrays package extends Julia by introducing data structures that can contain missing data. In particular, the package introduces three new data types to Julia:

* `NA`: A singleton type that represents a single missing value.
* `DataArray{T}`: An array-like data structure that can contain values of type `T`, but can also contain missing values.
* `PooledDataArray{T}`: A variant of `DataArray{T}` optimized for representing arrays that contain many repetitions of a small number of unique values -- as commonly occurs when working with categorical data.

# The `NA` Value

Many languages represent missing values using a reserved value like `NULL` or `NA`. A missing integer value, for example, might be represented as a `NULL` value in SQL or as an `NA` value in R.

Julia takes its conception of `NA` from R, where `NA` denotes missingness based on lack of information. If, for example, we were to measure people's heights as integers, an `NA` might reflect our ignorance of a specific person's height.

Conceptualizing the use of `NA` as a signal of uncertainty will help you understand how `NA` interacts with other values. For example, it explains why `NA + 1` is `NA`, but `NA & false` is `false`. In general, `NA` corrupts any computation whose results cannot be determined without knowledge of the value that is `NA`.
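
The propagation rule described above can be checked directly. The following is a minimal sketch that assumes Julia 1.0 or later and uses Base's `missing` as a stand-in for `NA`/`null`; it does not use the DataArrays or Nulls API.

```julia
# Assumption: Base's `missing` (Julia >= 1.0) stands in for `NA`/`null` here.

# Arithmetic cannot be resolved without knowing the missing value.
@assert ismissing(missing + 1)

# `x & false` is false whatever `x` is, so the result is known.
@assert (missing & false) === false

# `x & true` depends on `x`, so the result stays missing.
@assert ismissing(missing & true)

# `x | true` is true whatever `x` is.
@assert (missing | true) === true
```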

# DataArray's

Most Julian arrays cannot contain `NA` values: only `Array{NAtype}` and heterogeneous Arrays can contain `NA` values. Of these, only heterogeneous arrays could contain values of any type other than `NAtype`.
Most Julian arrays cannot contain `null` values: only `Array{Union{T, Null}}` and more generally `Array{>:Null}` can contain `null` values.

The generic use of heterogeneous Arrays is discouraged in Julia because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `NA` values.
The generic use of heterogeneous `Array` is discouraged in Julia versions below 0.7 because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `null` values.

For example, a `DataArray{Int}` can contain integers and NA values. We can construct one as follows:
For example, a `DataArray{Int}` can contain integers and `null` values. We can construct one as follows:

da = @data([1, 2, NA, 4])
da = @data([1, 2, null, 4])
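
For comparison, on Julia 0.7 and later a plain `Array` with a `Union` element type can hold missing values directly. The sketch below assumes Julia 1.0 or later and uses Base's `missing` and `skipmissing` rather than `null` and the DataArrays API, so treat it as an illustration of the idea, not of this package.

```julia
# Assumption: Julia >= 1.0; `missing` plays the role of `null`.
v = Union{Int, Missing}[1, 2, missing, 4]

@assert eltype(v) == Union{Int, Missing}   # tightly typed, yet allows missing
@assert count(ismissing, v) == 1           # exactly one missing entry
@assert ismissing(sum(v))                  # reductions propagate missingness
@assert sum(skipmissing(v)) == 7           # skip missing entries explicitly
```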

# PooledDataArray's

1 change: 1 addition & 0 deletions REQUIRE
@@ -1,4 +1,5 @@
julia 0.6
Nulls 0.1.2
StatsBase 0.15.0
Reexport
SpecialFunctions
8 changes: 4 additions & 4 deletions benchmark/operators.jl
@@ -6,11 +6,11 @@ srand(1776)

const TEST_NAMES = [
"Vector",
"DataVector No NA",
"DataVector Half NA",
"DataVector No null",
"DataVector Half null",
"Matrix",
"DataMatrix No NA",
"DataMatrix Half NA"
"DataMatrix No null",
"DataMatrix Half null"
]

function make_test_types(genfunc, sz)
12 changes: 6 additions & 6 deletions benchmark/reduce.jl
@@ -6,10 +6,10 @@ srand(1776)

const TEST_NAMES = [
"Vector",
"DataVector No NA skipna=false",
"DataVector No NA skipna=true",
"DataVector Half NA skipna=false",
"DataVector Half NA skipna=true"
"DataVector No null skipnull=false",
"DataVector No null skipnull=true",
"DataVector Half null skipnull=false",
"DataVector Half null skipnull=true"
]

function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, replications)
println($fn)
fns = [()->$fn(Data[1]),
()->$fn(Data[2]),
()->$fn(Data[2]; skipna=true),
()->$fn(Data[2]; skipnull=true),
()->$fn(Data[3]),
()->$fn(Data[3]; skipna=true)]
()->$fn(Data[3]; skipnull=true)]
gc_disable()
df = compare(fns, $replications)
gc_enable()
12 changes: 6 additions & 6 deletions benchmark/reducedim.jl
@@ -6,10 +6,10 @@ srand(1776)

const TEST_NAMES = [
"Matrix",
"DataMatrix No NA skipna=false",
"DataMatrix No NA skipna=true",
"DataMatrix Half NA skipna=false",
"DataMatrix Half NA skipna=true"
"DataMatrix No null skipnull=false",
"DataMatrix No null skipnull=true",
"DataMatrix Half null skipnull=false",
"DataMatrix Half null skipnull=true"
]

function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, dim, replications)
println($fn, " (region = ", $dim, ")")
fns = [()->$fn(Data[1], $dim),
()->$fn(Data[2], $dim),
()->$fn(Data[2], $dim; skipna=true),
()->$fn(Data[2], $dim; skipnull=true),
()->$fn(Data[3], $dim),
()->$fn(Data[3], $dim; skipna=true)]
()->$fn(Data[3], $dim; skipnull=true)]
gc_disable()
df = compare(fns, $replications)
gc_enable()
11 changes: 1 addition & 10 deletions docs/src/da.md
@@ -1,14 +1,7 @@
# Representing missing data

```@meta
CurrentModule = DataArrays
```

```@docs
NA
NAtype
```

## Arrays with possibly missing data

```@docs
@@ -19,9 +12,7 @@ DataArray
DataVector
DataMatrix
@data
isna
dropna
padna
padnull
levels
```

7 changes: 3 additions & 4 deletions docs/src/index.md
@@ -1,11 +1,10 @@
# DataArrays.jl

This package provides functionality for working with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia.
This package provides array types for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
In particular, it provides the following:

* `NA`: A singleton representing a missing value
* `DataArray{T}`: An array type that can house both values of type `T` and missing values
* `DataArray{T}`: An array type that can house both values of type `T` and missing values (of type `Null`)
* `PooledDataArray{T}`: An array type akin to `DataArray` but optimized for arrays with a smaller set of unique
values, as commonly occurs with categorical data

27 changes: 10 additions & 17 deletions spec/literals.md
@@ -19,51 +19,44 @@ Julia's parser rewrites both of these literals as calls to the `vcat`
function. The `vcat` function computes the tightest type that would
enclose all of the values in the literal array. (REVISE)

Because of the strange place occupied by `NAtype` in Julia's type
hierarchy, the tightest type that would enclose any literal array
containing a single `NA` would be `Any`, which is not very useful.
As such, the DataArrays package needs to provide an alternative
tool for writing out literal DataArray's.

This is accomplished by using two macros, `@data` and `@pdata`,
which rewrite array literals into a form that will allow proper
typing.
Two macros, `@data` and `@pdata`, rewrite array literals into a form
that will allow direct construction of `DataArray`s and `PooledDataArray`s.

# Basic Principle

The basic mechanism that powers the `@data` and `@pdata` macros is the
rewriting of array literals as a call to DataArray or PooledDataArray
with a rewritten array literal and a Boolean mask that specifies where
`NA` occurred in the original literal.
`null` occurred in the original literal.

For example,

@data [1, 2, NA, 4]
@data [1, 2, null, 4]

will be rewritten as,

DataArray([1, 2, 1, 4], [false, false, true, false])

Note the added `1` created during the rewriting of the array literal.
This value is called a `stub` and is always the first value found
in the literal array that is not `NA`. The use of stubs explains two
in the literal array that is not `null`. The use of stubs explains two
important properties of the `@data` and `@pdata` macros:

* If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure function may be called more times than expected.
* It is not possible to specify a literal DataArray that contains only `NA` values.
* None of the variables used in a literal array can be called `NA`. This is just good style anyway, so it is not much of a limitation.
* It is not possible to specify a literal DataArray that contains only `null` values.
* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.
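
The stub-and-mask rewrite described above can be sketched as an ordinary function. This is an illustration only: `stub_and_mask` is a hypothetical helper, not part of DataArrays; it works on an already-evaluated vector instead of rewriting a literal as the macros do, and it assumes Julia 1.0 or later with Base's `missing` in place of `null`.

```julia
# Hypothetical helper (not DataArrays API): builds the stubbed values and the
# Boolean mask that a literal such as `@data [1, 2, null, 4]` is rewritten into.
# Assumption: Base's `missing` (Julia >= 1.0) stands in for `null`.
function stub_and_mask(xs::AbstractVector)
    mask = map(ismissing, xs)          # true where the entry is missing
    i = findfirst(!, mask)             # index of the first non-missing entry
    # With no non-missing entry there is no stub, which mirrors the
    # "only null values" limitation listed above.
    i === nothing && throw(ArgumentError("no non-missing value to use as a stub"))
    stub = xs[i]
    vals = [m ? stub : x for (x, m) in zip(xs, mask)]
    return vals, mask
end

vals, mask = stub_and_mask([1, 2, missing, 4])
@assert vals == [1, 2, 1, 4]                    # the stub `1` fills the gap
@assert mask == [false, false, true, false]     # marks where the gap was
```

The pair `(vals, mask)` corresponds to the two arguments of the `DataArray([1, 2, 1, 4], [false, false, true, false])` call shown earlier.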

# Limitations

We restate the limitations noted above:

* If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure function may be called more times than expected.
* It is not possible to specify a literal DataArray that contains only `NA` values.
* None of the variables used in a literal array can be called `NA`. This is just good style anyway, so it is not much of a limitation.
* It is not possible to specify a literal DataArray that contains only `null` values.
* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.


Note that the latter limitation is not very important, because a DataArray
with only `NA` values is already problematic because it has no well-defined
with only `null` values is already problematic because it has no well-defined
type in Julia.

One final limitation is that the rewriting rules are not able to
21 changes: 3 additions & 18 deletions src/DataArrays.jl
@@ -4,6 +4,7 @@ module DataArrays
using Base: promote_op
using Base.Cartesian, Reexport
@reexport using StatsBase
@reexport using Nulls
using SpecialFunctions

const DEFAULT_POOLED_REF_TYPE = UInt32
@@ -25,23 +26,10 @@ module DataArrays
DataArray,
DataMatrix,
DataVector,
dropna,
each_failna,
each_dropna,
each_replacena,
EachFailNA,
EachDropNA,
EachReplaceNA,
FastPerm,
getpoolidx,
gl,
head,
isna,
levels,
NA,
NAException,
NAtype,
padna,
padnull,
Member Author

I've left this method here, but we could define it in Nulls if that's really useful. No hurry, though.

pdata,
PooledDataArray,
PooledDataMatrix,
@@ -51,11 +39,9 @@ module DataArrays
rep,
replace!,
setlevels!,
setlevels,
tail
setlevels

include("utils.jl")
include("natype.jl")
include("abstractdataarray.jl")
include("dataarray.jl")
include("pooleddataarray.jl")
@@ -71,7 +57,6 @@ module DataArrays
include("extras.jl")
include("grouping.jl")
include("statistics.jl")
include("predicates.jl")
include("literals.jl")
include("deprecated.jl")
end