How to convert a mixed-type Matrix to DataFrame in

2019-03-05 22:51发布

One nice feature of DataFrames is that it can store columns with different types and it can "auto-recognise" them, e.g.:

using DataFrames, DataStructures

df1 = wsv"""
parName region  forType             value
vol     AL      broadL_highF        3.3055628012
vol     AL      con_highF           2.1360975151
vol     AQ      broadL_highF        5.81984502
vol     AQ      con_highF           8.1462998309
"""
typeof(df1[:parName])
DataArrays.DataArray{String,1}
typeof(df1[:value])
DataArrays.DataArray{Float64,1}

When I do try however to reach the same result starting from a Matrix (imported from spreadsheet) I "loose" that auto-conversion:

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]
h    = [Symbol(c) for c in dataMatrix[1,:]]
vals = dataMatrix[2:end, :]
df2  = convert(DataFrame,OrderedDict(zip(h,[vals[:,i] for i in 1:size(vals,2)])))

typeof(df2[:parName])  
DataArrays.DataArray{Any,1}
typeof(df2[:value])  
DataArrays.DataArray{Any,1}

There are several questions on S.O. on how to convert a Matrix to Dataframe (e.g. DataFrame from Array with Header, Convert Julia array to dataframe), but none of the answer there deal with the conversion of a mixed-type matrix.

How could I create a DataFrame from a matrix auto-recognising the type of the columns ?

EDIT: I did benchmark the three solutions: (1) convert the df (using the dictionary or matrix constructor.. first one is faster) and then apply try-catch for type conversion (my original answer); (2) convert to string and then use df.inlinetable (Dan Getz answer); (3) check the type of each element and their column-wise consistency (Alexander Morley answer).

These are the results:

# second time for compilation.. further times ~ results
@time toDf1(m) # 0.000946 seconds (336 allocations: 19.811 KiB)
@time toDf2(m) # 0.000194 seconds (306 allocations: 17.406 KiB)
@time toDf3(m) # 0.001820 seconds (445 allocations: 35.297 KiB)

So, crazy it is, the most efficient solution seems to "pour out the water" and reduce the problem to an already solved one ;-)

Thank you for all the answers.

4条回答
等我变得足够好
2楼-- · 2019-03-05 23:07

Another method would be reuse the working solution i.e. convert the matrix into a string appropriate for DataFrames to consume. In code, this is:

using DataFrames

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]

s = join(
  [join([dataMatrix[i,j] for j in indices(dataMatrix, 2)]
  , '\t') for i in indices(dataMatrix, 1)], '\n')

df = DataFrames.inlinetable(s; separator='\t', header=true)

The resulting df has its column types guessed by DataFrame.

Unrelated, but this answer reminds me of the how a mathematician boils water joke.

查看更多
欢心
3楼-- · 2019-03-05 23:09
mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))

Seems to work and is faster than @dan-getz's answer (at least for this data matrix) :)

using DataFrames, BenchmarkTools

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]

mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))

function mat2dfDan(mat)
    s = join([join([dataMatrix[i,j] for j in indices(dataMatrix, 2)], '\t') 
                for i in indices(dataMatrix, 1)],'\n')

    DataFrames.inlinetable(s; separator='\t', header=true)
end

-

julia> @benchmark mat2df(dataMatrix)

BenchmarkTools.Trial: 
  memory estimate:  5.05 KiB
  allocs estimate:  75
  --------------
  minimum time:     18.601 μs (0.00% GC)
  median time:      21.318 μs (0.00% GC)
  mean time:        31.773 μs (2.50% GC)
  maximum time:     4.287 ms (95.32% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mat2dfDan(dataMatrix)

BenchmarkTools.Trial: 
  memory estimate:  17.55 KiB
  allocs estimate:  318
  --------------
  minimum time:     69.183 μs (0.00% GC)
  median time:      81.326 μs (0.00% GC)
  mean time:        90.284 μs (2.97% GC)
  maximum time:     5.565 ms (93.72% GC)
  --------------
  samples:          10000
  evals/sample:     1
查看更多
爱情/是我丢掉的垃圾
4楼-- · 2019-03-05 23:13

While I didn't find a complete solution, a partial one is to try to convert the individual columns ex-post:

"""
    convertDf!(df)

Try to convert each column of the converted df from Any to In64, Float64 or String (in that order).    
"""
function convertDf!(df)
    for c in names(df)
        try
          df[c] = convert(DataArrays.DataArray{Int64,1},df[c])
        catch
            try
              df[c] = convert(DataArrays.DataArray{Float64,1},df[c])
            catch
                try
                  df[c] = convert(DataArrays.DataArray{String,1},df[c])
                catch
                end
            end
        end
    end
end 

While surely incomplete, it is enough for my needs.

查看更多
Animai°情兽
5楼-- · 2019-03-05 23:16

While I think there may be a better way to go about the whole thing this should do what you want.

df = DataFrame()
for (ind,s) in enumerate(Symbol.(dataMatrix[1,:])) # convert first row to symbols and iterate through them.
    # check all types the same else assign to Any
    T = typeof(dataMatrix[2,ind])
    T = all(typeof.(dataMatrix[2:end,ind]).==T) ? T : Any
    # convert to type of second element then add to data frame
    df[s] = T.(dataMatrix[2:end,ind])
end
查看更多
登录 后发表回答