I have a Spark time series DataFrame. I would like to split it 80/20 (train/test). Since this is time series data, I don't want a random split. How do I split it so that the first DataFrame becomes the train set and the second the test set?
You can use `pyspark.sql.functions.percent_rank()` to get the percentile ranking of your DataFrame ordered by the timestamp/date column. Then take all the rows with a rank <= 0.8 as your training set and the rest as your test set.

For example, if you had the following DataFrame:
You'd want the first 4 rows in your training set and the last one in your test set. First add a `rank` column:
Now use `rank` to split your data into `train` and `test`: