How can I select an exact number of random rows from a DataFrame efficiently? The data contains an index column that can be used. If I have to use maximum size, what is more efficient, count() or max() on the index column?
A possible approach is to calculate the number of rows using `.count()`, then use `sample()` from Python's `random` library to generate a random selection of distinct index values from that range. Lastly, use the resulting list of numbers `vals` to subset your index column. Example: