Wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF, i.e. an aggregate function that returns a data type guess.
Does Spark have something like this already built in? It would be very useful for exploring new wide datasets. It would be helpful for ML too, e.g. to decide whether variables are categorical or numerical.
How do you normally determine data types in Spark?
P.S. Frameworks like h2o automatically determine data types by scanning a sample of the data, or the whole dataset. So one can then decide, e.g., whether a variable should be categorical or numerical.
P.P.S. Another use case is if you get an arbitrary dataset (we get them quite often) and want to save it as a Parquet table. Providing correct data types makes Parquet more space-efficient (and probably more query-time performant, e.g. better Parquet bloom filters than just storing everything as string/varchar).
Partially. There are some tools in the Spark ecosystem which perform schema inference, like spark-csv or pyspark-csv, and category inference (categorical vs. numerical), like VectorIndexer.
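For example, spark-csv can infer column types at load time instead of defaulting everything to string. A minimal sketch, assuming Spark 1.x with the spark-csv package on the classpath, an existing SparkContext sc, and a hypothetical file path:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// inferSchema tells spark-csv to scan the data and guess column types;
// without it every column comes back as string.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")  // hypothetical path

df.printSchema()  // inferred types, e.g. IntegerType, DoubleType, StringType
```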
So far so good. The problem is that schema inference has limited applicability, is not an easy task in general, can introduce hard-to-diagnose problems, and can be quite expensive:
Depending on the data representation, it can be impossible to determine the correct data type, or the inferred type can lead to information loss (for example, zip codes with leading zeros inferred as integers, or fixed-precision decimals inferred as floats).
Automatic schema inference can mask various problems with the input data, and it can be dangerous if it is not supported by additional tools which can highlight possible issues. Moreover, any mistakes made during data loading and cleaning can propagate through the complete data processing pipeline.
Arguably, we should develop a good understanding of the input data before we even start to think about its possible representation and encoding.
Schema inference and / or category inference may require a full data scan and / or large lookup tables. Both can be expensive, or even infeasible, on large datasets, as the sketch below illustrates.
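To make that cost concrete, here is a minimal VectorIndexer sketch. It assumes a DataFrame df with hypothetical numeric columns x1 and x2; note that fit has to pass over the data to build per-feature value maps:

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// VectorIndexer operates on a vector column, so assemble raw columns first.
val assembled = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))  // hypothetical column names
  .setOutputCol("features")
  .transform(df)

// Any feature with at most maxCategories distinct values is treated as
// categorical; the rest remain numerical.
val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)
  .fit(assembled)  // scans the data to collect distinct values per feature

// Indices of the features that were judged categorical.
println(indexerModel.categoryMaps.keys.toSeq.sorted)
```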
Edit:
It looks like schema inference capabilities on CSV files have been added directly to Spark SQL. See CSVInferSchema.
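So in Spark 2.x the external package is no longer needed for the basic case. A sketch against the built-in CSV reader, assuming an existing SparkSession and a hypothetical path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// inferSchema is now handled by the built-in CSV data source
// (CSVInferSchema does the type guessing under the hood).
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")  // hypothetical path

df.printSchema()
```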