I am trying to learn more about the FSharp.Data project by using it for reading a CSV file. The CSV file is a simplified version of the data from the digit recognizer competition on Kaggle.
When I read the CSV file, which contains 785 columns and 113 rows (including the header row), the following two lines of code execute really slowly:
type trainingSet = CsvProvider<"Data/trainSmall.csv", ",", CacheRows=false>
let data = trainingSet.Load("Data/trainSmall.csv")
When I send the first line to F# Interactive, it returns in about 10 seconds, whereas when I send the second line, it takes more than 5 minutes before the interactive prompt comes back.
I am running the code on my MacBook Pro from 2013 with a 2.6 GHz i5 processor and 16 GB of RAM, using F# 3.0 and Xamarin Studio. I have tried the same experiment with Windows 7 / VS2013 running in a VM on the same hardware. The results are comparable. When I use the same machine and do the exact same thing with R, it is so fast that I cannot time it with an ordinary watch.
Please advise me on the proper usage of the CSV type provider from FSharp.Data!
Hmm, the second line is supposed to do almost nothing, as the rows are read on demand. Something is wrong there. Can you please submit an issue on GitHub with a repro file?
I recommend that you don't use CsvProvider for this. You're loading a matrix, so you won't get any benefit from having the type of each column inferred, as they are all the same. You can still use the CSV parser of F# Data by using CsvFile. CsvProvider is optimized for files with few columns but potentially many rows. For your example, the generated code will try to build a tuple with 785 elements, which just won't work.
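As a rough sketch of the CsvFile approach: the untyped parser gives you each row as an array of strings, which you can convert yourself. This assumes FSharp.Data is already referenced in your script or project, that the first column is the label, and that every remaining column is an integer pixel value; the name toSamples is just a hypothetical helper for illustration.

```fsharp
// Assumes the FSharp.Data assembly is already referenced.
open FSharp.Data

// Hypothetical helper: turns every data row into (label, pixels).
// Assumption: column 0 is the label and all other columns are integers.
let toSamples (csv: CsvFile) =
    csv.Rows
    |> Seq.map (fun row ->
        // row.Columns holds the raw string values of the row
        let values = row.Columns |> Array.map int
        values.[0], values.[1..])
    |> Seq.toArray

// Usage with the file from the question:
// let samples = toSamples (CsvFile.Load("Data/trainSmall.csv"))
```

No type is generated per column here, so the number of columns should not matter. If the file contains non-integer cells, the int conversion will throw, so you may need a different conversion for your data.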