Julia DataFrames.jl - filter data with NA's (N

I am not sure how to handle NA within Julia DataFrames.

For example with the following DataFrame:

> import DataFrames
> a = DataFrames.@data([1, 2, 3, 4, 5]);
> b = DataFrames.@data([3, 4, 5, 6, NA]);
> ndf = DataFrames.DataFrame(a=a, b=b)

I can successfully execute the following operation on column :a

> ndf[ndf[:a] .== 4, :]

but if I try the same operation on :b I get an error NAException("cannot index an array with a DataArray containing NA values").

> ndf[ndf[:b] .== 4, :]

NAException("cannot index an array with a DataArray containing NA values")
while loading In[108], in expression starting on line 1

in to_index at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:85
in getindex at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:210
in getindex at /Users/abisen/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:268

This happens because column :b contains an NA value.

My question is: how should DataFrames with NA typically be handled? I can understand that a > or < operation against NA would be undefined, but == should work (no?).

Tags: julia

3 Answers

孤傲高冷的网名

#2 · 2019-04-08 02:11

Regarding this question, which I asked before: you can change this NA behavior directly in the module's source code if you want. In the file indexing.jl there is a function named Base.to_index(A::DataArray), beginning at line 75, which you can alter so that NAs in the boolean array are set to false. For example:

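# Original behaviour: indexing with NA throws an error.
# Modified here so that NA entries count as false.
function Base.to_index(A::DataArray)
    A[A.na] = false  # overwrite NA entries with false (this mutates A)
    # Kept from the original; not expected to fire once the NAs are overwritten above
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end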

Filtering out NAs with isna() makes the source code less readable and, in big formulas, costs performance:

@timeit ndf[(!isna(ndf[:b])) & (ndf[:b] .== 4),:]  #3.68 µs per loop
@timeit ndf[ndf[:b] .== 4, :]  #2.32 µs per loop

## 71x179 2D Array
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1  #14.55 µs per loop  
@timeit dm[dm .< 3] = 1  #754.79 ns per loop


你好瞎i

#3 · 2019-04-08 02:12

In many cases you want to treat NA as separate instances, i.e. assume that everything that is NA is "equal" and everything else is different.

If this is the behaviour you want, the current DataFrames API doesn't help you much, as both (NA == NA) and (NA == 1) return NA instead of the expected boolean results.

This makes DataFrame filters written as loops extremely tedious:

function filter(df,c)
    for r in eachrow(df)
        if (isna(c) && isna(r[:c])) || ( !isna(r[:c]) && r[:c] == c )
            ...

and it breaks select-like functionality in DataFramesMeta.jl and Query.jl when NA values are present or requested.

One workaround is to use isequal(a,b) in place of a==b:

test = @where(df, isequal.(:a,"cc"), isequal.(:b,NA) ) #from DataFramesMeta.jl
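The difference between == and isequal can be sketched without any package. This sketch assumes a current Julia (1.x), where `missing` plays the role that NA from DataArrays played when this thread was written; the vector `b` and the mask names are illustrative only, not part of the DataFrames API.

```julia
# Assumption: Julia 1.x, where `missing` replaced DataArrays' NA.
b = [3, 4, 5, 6, missing]

# `==` propagates missingness, so the last comparison yields `missing`, not a Bool.
eq_mask = b .== 4          # [false, true, false, false, missing]

# `isequal` always returns a Bool and treats missing as equal only to missing.
iseq_mask = isequal.(b, 4)          # Bool mask, safe to use as an index
na_mask   = isequal.(b, missing)    # marks the missing entries

rows = b[iseq_mask]                 # entries of b equal to 4
```

Because `iseq_mask` contains only Bool values, it can be used for indexing directly, which is why isequal avoids the error from the question.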


老娘就宠你

#4 · 2019-04-08 02:28

What's your desired behavior here? If you want to do selections like this, you can make the condition (not NA) AND (equal to 4). Wherever the first test is false, the combined result is false whatever the second test returns, so no NA ends up in the index.

using DataFrames
a = @data([1, 2, 3, 4, 5]);
b = @data([3, 4, 5, 6, NA]);
ndf = DataFrame(a=a, b=b)
ndf[(!isna(ndf[:b]))&(ndf[:b].==4),:]

In some cases you might just want to drop all rows with NAs in certain columns:

ndf = ndf[!isna(ndf[:b]),:]
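The same two patterns (mask then compare, and drop rows with NA) can be sketched in a current Julia. This assumes Julia 1.x with `missing` in place of NA and uses plain vectors as stand-ins for the columns :a and :b; the names `mask`, `mask2` and `keep` are made up for illustration.

```julia
# Assumption: Julia 1.x with `missing` instead of NA; plain vectors stand in for the columns.
a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, missing]

# (not missing) AND (equal to 4): `false & missing` is `false`, so every entry is a Bool.
mask = .!ismissing.(b) .& (b .== 4)

# Alternative: replace a missing comparison result with false.
mask2 = coalesce.(b .== 4, false)

selected_a = a[mask]          # values of `a` in the rows where b == 4

# Dropping every row whose `b` is missing.
keep = .!ismissing.(b)
a_kept, b_kept = a[keep], b[keep]
```

The `coalesce` form states the intent (treat an unknown comparison as false) in one call, at the cost of hiding the fact that missing values were present.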
