This repository was archived by the owner on May 4, 2019. It is now read-only.

Commit 8a7003b

Port to Nulls.jl (#288)
Replace NA with Nulls.null and NAtype with Nulls.Null. Use Nulls.levels instead of defining our own function. Rename all functions and arguments to use "null" instead of "na", with deprecations. Rounding and transpose operations have been moved to Nulls, but functions from SpecialFunctions are dropped, as we don't want Nulls to depend on SpecialFunctions and keeping them in DataArrays would be type piracy. Deprecate dropnull(x) in favor of an efficient specialization of collect(Nulls.drop(x)). Unexport all iterators, which are an implementation detail and should be used via the similar Nulls functions. Stop exporting the nonexistent head() and tail() functions. Remove a method redundant with ==(::AbstractArray{>:Null}, ::AbstractArray{>:Null}).
1 parent 8b9e896 commit 8a7003b
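
The renames described in the commit message can be sketched as follows. This is a minimal illustration, not code from the commit: it assumes Julia 0.6 with DataArrays at or after this commit, that `isnull` is the replacement for the unexported `isna`, that `sum` accepts the `skipnull` keyword the way the benchmark files call reductions, and that `Nulls.drop` exists as named in the message.

```julia
# Sketch only: assumes Julia 0.6 and DataArrays at (or after) commit 8a7003b.
using DataArrays                  # the diff adds `@reexport using Nulls`

da = @data([1, 2, null, 4])       # before this commit: @data([1, 2, NA, 4])

third_is_null = isnull(da[3])     # assumed replacement for isna(da[3])
total = sum(da; skipnull=true)    # before this commit: sum(da; skipna=true)
kept = collect(Nulls.drop(da))    # named in the commit message as the successor of dropnull(x)
```

Code that still uses the old "na" names should keep working with deprecation warnings, since the message states that the renames come "with deprecations".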

53 files changed, with 1269 additions and 1426 deletions.

README.md

Lines changed: 7 additions & 15 deletions
@@ -11,30 +11,22 @@ Documentation:
 [![](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaStats.github.io/DataArrays.jl/stable)
 [![](https://img.shields.io/badge/docs-latest-blue.svg)](https://JuliaStats.github.io/DataArrays.jl/latest)

+The DataArrays package provides array types for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
+in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
+In particular, it provides the following:

-The DataArrays package extends Julia by introducing data structures that can contain missing data. In particular, the package introduces three new data types to Julia:
-
-* `NA`: A singleton type that represents a single missing value.
 * `DataArray{T}`: An array-like data structure that can contain values of type `T`, but can also contain missing values.
 * `PooledDataArray{T}`: A variant of `DataArray{T}` optimized for representing arrays that contain many repetitions of a small number of unique values -- as commonly occurs when working with categorical data.

-# The `NA` Value
-
-Many languages represent missing values using a reserved value like `NULL` or `NA`. A missing integer value, for example, might be represented as a `NULL` value in SQL or as an `NA` value in R.
-
-Julia takes its conception of `NA` from R, where `NA` denotes missingness based on lack of information. If, for example, we were to measure people's heights as integers, an `NA` might reflect our ignorance of a specific person's height.
-
-Conceptualizing the use of `NA` as a signal of uncertainty will help you understand how `NA` interacts with other values. For example, it explains why `NA + 1` is `NA`, but `NA & false` is `false`. In general, `NA` corrupts any computation whose results cannot be determined without knowledge of the value that is `NA`.
-
 # DataArray's

-Most Julian arrays cannot contain `NA` values: only `Array{NAtype}` and heterogeneous Arrays can contain `NA` values. Of these, only heterogeneous arrays could contain values of any type other than `NAtype`.
+Most Julian arrays cannot contain `null` values: only `Array{Union{T, Null}}` and more generally `Array{>:Null}` can contain `null` values.

-The generic use of heterogeneous Arrays is discouraged in Julia because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `NA` values.
+The generic use of heterogeneous `Array` is discouraged in Julia versions below 0.7 because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `null` values.

-For example, a `DataArray{Int}` can contain integers and NA values. We can construct one as follows:
+For example, a `DataArray{Int}` can contain integers and `null` values. We can construct one as follows:

-da = @data([1, 2, NA, 4])
+da = @data([1, 2, null, 4])

 # PooledDataArray's
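
The updated README text contrasts a plain `Array{Union{T, Null}}` with a `DataArray{T}`. The sketch below puts the two side by side. It is an illustration under assumptions rather than code from the commit: it assumes Julia 0.6 with this version of DataArrays installed, and the variable names are made up for the example.

```julia
# Sketch only: assumes Julia 0.6 and DataArrays at (or after) this commit.
using DataArrays                        # reexports Nulls, so `null` and `Null` are in scope

# A plain Array whose element type admits `null` (the Array{>:Null} case).
plain = Union{Int, Null}[1, 2, null, 4]

# The DataArray alternative, built with the macro shown in the README.
da = @data([1, 2, null, 4])

plain_has_null = plain[3] === null      # `null` is a singleton, so `===` is safe
da_has_null = da[3] === null
```

Both containers hold the same logical values. According to the README, the difference is in storage: on Julia versions below 0.7 the `DataArray` avoids the pointer dereference that the README attributes to heterogeneous arrays.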

REQUIRE

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 julia 0.6
+Nulls 0.1.2
 StatsBase 0.15.0
 Reexport
 SpecialFunctions

benchmark/operators.jl

Lines changed: 4 additions & 4 deletions
@@ -6,11 +6,11 @@ srand(1776)

 const TEST_NAMES = [
 "Vector",
-"DataVector No NA",
-"DataVector Half NA",
+"DataVector No null",
+"DataVector Half null",
 "Matrix",
-"DataMatrix No NA",
-"DataMatrix Half NA"
+"DataMatrix No null",
+"DataMatrix Half null"
 ]

 function make_test_types(genfunc, sz)

benchmark/reduce.jl

Lines changed: 6 additions & 6 deletions
@@ -6,10 +6,10 @@ srand(1776)

 const TEST_NAMES = [
 "Vector",
-"DataVector No NA skipna=false",
-"DataVector No NA skipna=true",
-"DataVector Half NA skipna=false",
-"DataVector Half NA skipna=true"
+"DataVector No null skipnull=false",
+"DataVector No null skipnull=true",
+"DataVector Half null skipnull=false",
+"DataVector Half null skipnull=true"
 ]

 function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, replications)
 println($fn)
 fns = [()->$fn(Data[1]),
 ()->$fn(Data[2]),
-()->$fn(Data[2]; skipna=true),
+()->$fn(Data[2]; skipnull=true),
 ()->$fn(Data[3]),
-()->$fn(Data[3]; skipna=true)]
+()->$fn(Data[3]; skipnull=true)]
 gc_disable()
 df = compare(fns, $replications)
 gc_enable()

benchmark/reducedim.jl

Lines changed: 6 additions & 6 deletions
@@ -6,10 +6,10 @@ srand(1776)

 const TEST_NAMES = [
 "Matrix",
-"DataMatrix No NA skipna=false",
-"DataMatrix No NA skipna=true",
-"DataMatrix Half NA skipna=false",
-"DataMatrix Half NA skipna=true"
+"DataMatrix No null skipnull=false",
+"DataMatrix No null skipnull=true",
+"DataMatrix Half null skipnull=false",
+"DataMatrix Half null skipnull=true"
 ]

 function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, dim, replications)
 println($fn, " (region = ", $dim, ")")
 fns = [()->$fn(Data[1], $dim),
 ()->$fn(Data[2], $dim),
-()->$fn(Data[2], $dim; skipna=true),
+()->$fn(Data[2], $dim; skipnull=true),
 ()->$fn(Data[3], $dim),
-()->$fn(Data[3], $dim; skipna=true)]
+()->$fn(Data[3], $dim; skipnull=true)]
 gc_disable()
 df = compare(fns, $replications)
 gc_enable()

docs/src/da.md

Lines changed: 1 addition & 10 deletions
@@ -1,14 +1,7 @@
-# Representing missing data
-
 ```@meta
 CurrentModule = DataArrays
 ```

-```@docs
-NA
-NAtype
-```
-
 ## Arrays with possibly missing data

 ```@docs
@@ -19,9 +12,7 @@ DataArray
 DataVector
 DataMatrix
 @data
-isna
-dropna
-padna
+padnull
 levels
 ```

docs/src/index.md

Lines changed: 3 additions & 4 deletions
@@ -1,11 +1,10 @@
 # DataArrays.jl

-This package provides functionality for working with [missing data](https://en.wikipedia.org/wiki/Missing_data)
-in Julia.
+This package provides array types for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
+in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
 In particular, it provides the following:

-* `NA`: A singleton representing a missing value
-* `DataArray{T}`: An array type that can house both values of type `T` and missing values
+* `DataArray{T}`: An array type that can house both values of type `T` and missing values (of type `Null`)
 * `PooledDataArray{T}`: An array type akin to `DataArray` but optimized for arrays with a smaller set of unique
 values, as commonly occurs with categorical data
1110

spec/literals.md

Lines changed: 10 additions & 17 deletions
@@ -19,51 +19,44 @@ Julia's parser rewrites both of these literals as calls to the `vcat`
 function. The `vcat` function computes the tightest type that would
 enclose all of the values in the literal array. (REVISE)

-Because of the strange place occupied by `NAtype` in Julia's type
-hierarchy, the tightest type that would enclose any literal array
-containing a single `NA` would be `Any`, which is not very useful.
-As such, the DataArrays package needs to provide an alternative
-tool for writing out literal DataArray's.
-
-This is accomplished by using two macros, `@data` and `@pdata`,
-which rewrite array literals into a form that will allow proper
-typing.
+Two macros, `@data` and `@pdata`, rewrite array literals into a form
+that will allow direct construction of `DataArray`s and `PooledDataArray`s.

 # Basic Principle

 The basic mechanism that powers the `@data` and `@pdata` macros is the
 rewriting of array literals as a call to DataArray or PooledDataArray
 with a rewritten array literal and a Boolean mask that specifies where
-`NA` occurred in the original literal.
+`null` occurred in the original literal.

 For example,

-@data [1, 2, NA, 4]
+@data [1, 2, null, 4]

 will be rewritten as,

 DataArray([1, 2, 1, 4], [false, false, true, false])

 Note the added `1` created during the rewriting of the array literal.
 This value is called a `stub` and is always the first value found
-in the literal array that is not `NA`. The use of stubs explains two
+in the literal array that is not `null`. The use of stubs explains two
 important properties of the `@data` and `@pdata` macros:

 * If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure funcion may be called more times than expected.
-* It is not possible to specify a literal DataArray that contains only `NA` values.
-* None of the variables used in a literal array can be called `NA`. This is just good style anyway, so it is not much of a limitation.
+* It is not possible to specify a literal DataArray that contains only `null` values.
+* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.

 # Limitations

 We restate the limitations noted above:

 * If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure funcion may be called more times than expected.
-* It is not possible to specify a literal DataArray that contains only `NA` values.
-* None of the variables used in a literal array can be called `NA`. This is just good style anyway, so it is not much of a limitation.
+* It is not possible to specify a literal DataArray that contains only `null` values.
+* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.


 Note that the latter limitation is not very important, because a DataArray
-with only `NA` values is already problematic because it has no well-defined
+with only `null` values is already problematic because it has no well-defined
 type in Julia.

 One final limitation is that the rewriting rules are not able to
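
The stub mechanism described in the spec/literals.md hunk can be imitated at run time to make the rewriting concrete. The function below is a hypothetical helper written for illustration and is not part of DataArrays; the real `@data` macro performs the rewrite on the literal's syntax, not on values. The sketch assumes Julia 0.6 with this version of DataArrays, and uses only the two-argument `DataArray(values, mask)` call that the spec itself shows.

```julia
# Sketch only: assumes Julia 0.6 and DataArrays at (or after) this commit.
using DataArrays                       # reexports Nulls, so `null` is in scope

# Hypothetical helper (not a DataArrays function): replace every `null`
# entry with the first non-null entry (the "stub") and record a Boolean
# mask marking where `null` occurred.
function stub_and_mask(entries::AbstractVector)
    mask = Bool[x === null for x in entries]
    # Mirrors the documented limitation: an all-null literal has no stub.
    all(mask) && error("no non-null entry available to use as a stub")
    stub = entries[findfirst(x -> x !== null, entries)]
    vals = [m ? stub : x for (x, m) in zip(entries, mask)]
    return vals, mask
end

vals, mask = stub_and_mask(Any[1, 2, null, 4])
da = DataArray(vals, mask)             # same call shape as in the spec
```

For the input above, `vals` is `[1, 2, 1, 4]` and `mask` is `[false, false, true, false]`, which matches the rewritten call quoted in the spec. Because the stub expression is reused in place of each `null`, the sketch also shows why the spec asks that entries written as function calls be pure.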

src/DataArrays.jl

Lines changed: 3 additions & 18 deletions
@@ -4,6 +4,7 @@ module DataArrays
 using Base: promote_op
 using Base.Cartesian, Reexport
 @reexport using StatsBase
+@reexport using Nulls
 using SpecialFunctions

 const DEFAULT_POOLED_REF_TYPE = UInt32
@@ -25,23 +26,10 @@ module DataArrays
 DataArray,
 DataMatrix,
 DataVector,
-dropna,
-each_failna,
-each_dropna,
-each_replacena,
-EachFailNA,
-EachDropNA,
-EachReplaceNA,
 FastPerm,
 getpoolidx,
 gl,
-head,
-isna,
-levels,
-NA,
-NAException,
-NAtype,
-padna,
+padnull,
 pdata,
 PooledDataArray,
 PooledDataMatrix,
@@ -51,11 +39,9 @@ module DataArrays
 rep,
 replace!,
 setlevels!,
-setlevels,
-tail
+setlevels

 include("utils.jl")
-include("natype.jl")
 include("abstractdataarray.jl")
 include("dataarray.jl")
 include("pooleddataarray.jl")
@@ -71,7 +57,6 @@ module DataArrays
 include("extras.jl")
 include("grouping.jl")
 include("statistics.jl")
-include("predicates.jl")
 include("literals.jl")
 include("deprecated.jl")
 end
