I have a large unsorted CSV file (>4M records). Each record has a category, which is described in the first three columns. The rest of the record is address data which may or may not be unique.
A, 1, c, address1 # the category for this record is A1t
A, 1, c, address2
C, 3, e, address3 # the category for this record is C3e
B, 2, a, address4
I would like to pull a random sample of unique records within each category (so 5 unique records in category A1t
, 5 unique records from C3e
, etc.). I put together a partial solution using sort
. However, it only pulls one non-random record in each category:
sort -u -t, -k1,3
Is there a way to pull several random sample records within each category?
I think there must be a way to do this by using a combination of pipes, uniq
, awk
or shuf
, but haven't been able to figure it out. I would prefer a command-line solution since I'm interested in knowing if this is possible using only bash.
Inspired by the use of
sort -R
in the answer by jm666. This is a GNU extension tosort
, so it may not work on non-Gnu systems.Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order with random order of the following fields.
From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple
awk
program.If your
sort
doesn't randomise, then the random sample can be extracted withawk
:It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
If i understand right - simple, not very effective bash solution
decomposition
cut -d, -f1-3 < "$csvfile"
- filter out all "categories" (first 3 fields)sort -u
- get sorted unique categorieswhile read...
)grep "^$cat" "$csvfile"
find all lines from this categorysort -uR
- sort them randomly by hash (note, the duplicates has the same hash, take unique)head -5
print the first 5 records (from the randomly sorted list)