I am not sure how to handle NA within Julia DataFrames.
For example, with the following DataFrame:
> import DataFrames
> a = DataFrames.@data([1, 2, 3, 4, 5]);
> b = DataFrames.@data([3, 4, 5, 6, NA]);
> ndf = DataFrames.DataFrame(a=a, b=b)
I can successfully execute the following operation on column :a
> ndf[ndf[:a] .== 4, :]
but if I try the same operation on :b, I get the error NAException("cannot index an array with a DataArray containing NA values"):
> ndf[ndf[:b] .== 4, :]
NAException("cannot index an array with a DataArray containing NA values")
while loading In[108], in expression starting on line 1
in to_index at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:85
in getindex at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:210
in getindex at /Users/abisen/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:268
This happens because of the NA value in column :b.
My question is: how should DataFrames containing NA typically be handled? I can understand that a > or < comparison against NA would be undefined, but == should work (no?).
Regarding this question, which I asked before: you can change this NA behavior directly in the module's source code if you want. In the file indexing.jl there is a function named Base.to_index(A::DataArray), beginning at line 75, where you can alter the code to set NAs in the boolean array to false. For example, you can do the following:
Ignoring NAs with isna() makes the source code less readable and, in big formulas, causes a performance loss:
In many cases you want to treat NA as separate instances, i.e. assume that everything that is NA is "equal" and everything else is different.
If this is the behaviour you want, the current DataFrames API doesn't help you much, as both (NA == NA) and (NA == 1) return NA instead of the expected boolean results. This makes DataFrame filters written as loops extremely tedious:
function filter(df, c)
    for r in eachrow(df)
        if (isna(c) && isna(r[:c])) || (!isna(r[:c]) && r[:c] == c)
            ...
and breaks select-like functionality in DataFramesMeta.jl and Query.jl when NA values are present or requested. One workaround is to use isequal(a,b) in place of a==b.
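A minimal sketch of the isequal workaround, assuming the DataArrays-era API from the question (NA and @data, made available here through `using DataFrames` on an old release). The names mask, rows_equal_4 and rows_na are only illustrative. Building the mask element by element gives a plain Bool array, so no NA ever reaches the indexing step:

```julia
using DataFrames  # assumption: an old DataFrames release that still provides NA and @data

ndf = DataFrame(a = @data([1, 2, 3, 4, 5]), b = @data([3, 4, 5, 6, NA]))

# isequal always returns a Bool, so NA simply counts as "not equal to 4"
# instead of propagating NA into the mask.
mask = Bool[isequal(x, 4) for x in ndf[:b]]
rows_equal_4 = ndf[mask, :]

# The same idea selects the NA rows themselves, since isequal(NA, NA) is true.
rows_na = ndf[Bool[isequal(x, NA) for x in ndf[:b]], :]
```

The comprehension costs an extra pass over the column, but the resulting mask can be used anywhere a Bool array is accepted.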
What's your desired behavior here? If you want to do selections like this, you can make the condition (not NA) AND (equal to 4). If the first test fails, the second one never happens.
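The "(not NA) AND (equal to 4)" condition could be sketched as follows, again assuming the old NA-based API from the question; mask and result are illustrative names. The && operator short-circuits, so the comparison with 4 is only evaluated for elements that are not NA:

```julia
using DataFrames  # assumption: an old DataFrames release that still provides NA, @data and isna

ndf = DataFrame(a = @data([1, 2, 3, 4, 5]), b = @data([3, 4, 5, 6, NA]))

# For an NA element, !isna(x) is false and x == 4 is never evaluated,
# so every entry of the mask is a plain Bool.
mask = Bool[!isna(x) && x == 4 for x in ndf[:b]]

result = ndf[mask, :]
```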
In some cases you might just want to drop all rows with NAs in certain columns.
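One way to drop the rows that have NA in a given column, sketched under the same assumption about the old NA-based API (keep and ndf_clean are illustrative names), is to filter on isna first and run the comparison on the cleaned frame:

```julia
using DataFrames  # assumption: an old DataFrames release that still provides NA, @data and isna

ndf = DataFrame(a = @data([1, 2, 3, 4, 5]), b = @data([3, 4, 5, 6, NA]))

# Keep only the rows whose :b value is not NA.
keep = Bool[!isna(x) for x in ndf[:b]]
ndf_clean = ndf[keep, :]

# With no NA left in :b, the comparison from the question can be
# evaluated element by element without hitting NA.
hits = ndf_clean[Bool[x == 4 for x in ndf_clean[:b]], :]
```

This discards the NA rows for good, so it only fits when those rows are not needed later.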