One nice feature of DataFrames is that it can store columns of different types and "auto-recognise" those types, e.g.:
using DataFrames, DataStructures
df1 = wsv"""
parName region forType value
vol AL broadL_highF 3.3055628012
vol AL con_highF 2.1360975151
vol AQ broadL_highF 5.81984502
vol AQ con_highF 8.1462998309
"""
typeof(df1[:parName])
DataArrays.DataArray{String,1}
typeof(df1[:value])
DataArrays.DataArray{Float64,1}
However, when I try to reach the same result starting from a Matrix (imported from a spreadsheet), I lose that auto-conversion:
dataMatrix = [
"parName" "region" "forType" "value";
"vol" "AL" "broadL_highF" 3.3055628012;
"vol" "AL" "con_highF" 2.1360975151;
"vol" "AQ" "broadL_highF" 5.81984502;
"vol" "AQ" "con_highF" 8.1462998309;
]
h = [Symbol(c) for c in dataMatrix[1,:]]
vals = dataMatrix[2:end, :]
df2 = convert(DataFrame,OrderedDict(zip(h,[vals[:,i] for i in 1:size(vals,2)])))
typeof(df2[:parName])
DataArrays.DataArray{Any,1}
typeof(df2[:value])
DataArrays.DataArray{Any,1}
There are several questions on S.O. on how to convert a Matrix to a DataFrame (e.g. DataFrame from Array with Header, Convert Julia array to dataframe), but none of the answers there deals with the conversion of a mixed-type matrix.
How can I create a DataFrame from a matrix so that the column types are auto-recognised?
EDIT: I benchmarked the three solutions: (1) convert to a DataFrame (using the dictionary or the matrix constructor; the former is faster) and then apply try-catch for the type conversion (my original answer); (2) convert to a string and then use df.inlinetable (Dan Getz's answer); (3) check the type of each element and their column-wise consistency (Alexander Morley's answer).
These are the results:
# timings from the second run (the first one includes compilation); further runs give similar results
@time toDf1(m) # 0.000946 seconds (336 allocations: 19.811 KiB)
@time toDf2(m) # 0.000194 seconds (306 allocations: 17.406 KiB)
@time toDf3(m) # 0.001820 seconds (445 allocations: 35.297 KiB)
So, crazy as it is, the most efficient solution seems to be to "pour out the water" and reduce the problem to an already solved one ;-)
Thank you for all the answers.
Another method would be to reuse the working solution, i.e. convert the matrix into a string in a form that DataFrames can consume. In code, this is:
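A minimal sketch of the serialisation step, assuming Julia 1.x and only Base functions. `matrix_to_text` is a hypothetical helper name, not part of DataFrames, and the final parsing call is left as a comment because its exact name and keywords depend on the DataFrames version installed:

```julia
# Hypothetical helper: serialise a matrix (header row included) into
# delimited text, one text line per matrix row.
function matrix_to_text(m::AbstractMatrix; sep::Char = '\t')
    rows = [join(m[i, :], sep) for i in axes(m, 1)]
    return join(rows, '\n')
end

dataMatrix = [
    "parName" "region" "forType"      "value";
    "vol"     "AL"     "broadL_highF" 3.3055628012;
    "vol"     "AL"     "con_highF"    2.1360975151;
    "vol"     "AQ"     "broadL_highF" 5.81984502;
    "vol"     "AQ"     "con_highF"    8.1462998309
]

s = matrix_to_text(dataMatrix)
# `s` can now be handed to a text parser that guesses column types, such as
# the inline-table reader mentioned in the question's benchmark
# (assumption: check its signature in your DataFrames version).
```

A tab is used as the separator here so that values containing spaces would not be split; any character that never occurs in the data would do.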
The resulting df has its column types guessed by DataFrames.

Unrelated, but this answer reminds me of the joke about how a mathematician boils water.
Seems to work and is faster than @dan-getz's answer (at least for this data matrix) :)
While I didn't find a complete solution, a partial one is to try to convert the individual columns ex-post:
While surely incomplete, it is enough for my needs.
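The ex-post conversion idea can be sketched as follows, assuming Julia 1.x and only Base functions. `narrow_column` and the candidate type list are hypothetical choices for illustration; the sketch works on plain vectors, so applying it to the columns of an actual DataFrame is left to the reader:

```julia
# Hypothetical helper: try a few concrete element types in turn and keep the
# first one that every element converts to; otherwise return the column as is.
function narrow_column(col::AbstractVector)
    for T in (Int, Float64, String)
        try
            return convert(Vector{T}, col)
        catch
            # at least one element does not convert to T; try the next type
        end
    end
    return col
end

vals = Any["vol" 3.3055628012; "vol" 2.1360975151]
cols = [narrow_column(vals[:, j]) for j in axes(vals, 2)]
```

Note that trying `Int` first means a column of whole-valued floats (e.g. `1.0, 2.0`) would be stored as integers; reorder the candidate types if that is not wanted.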
While I think there may be a better way to go about the whole thing, this should do what you want.
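One way to check element types for column-wise consistency, sketched under the assumption of Julia 1.x and Base functions only (`infer_column` is a hypothetical name, and this is not necessarily the code the answer originally used):

```julia
# Hypothetical helper: compute the narrowest type shared by all elements of a
# column and rebuild the column with that element type.
function infer_column(col::AbstractVector)
    isempty(col) && return col              # nothing to infer from
    T = mapreduce(typeof, typejoin, col)    # common supertype of all elements
    return convert(Vector{T}, col)
end

names_col  = infer_column(Any["vol", "vol"])
values_col = infer_column(Any[3.3055628012, 2.1360975151])
mixed_col  = infer_column(Any["vol", 2.1360975151])
```

A column whose elements all share one concrete type gets that type, while a genuinely mixed column stays `Any`, so no data is altered or lost.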