Pull random unique samples within sorted categories

Posted 2019-07-28 05:28

I have a large unsorted CSV file (>4M records). Each record has a category, which is described in the first three columns. The rest of the record is address data which may or may not be unique.

A, 1, c, address1  # the category for this record is A1c
A, 1, c, address2
C, 3, e, address3  # the category for this record is C3e
B, 2, a, address4

I would like to pull a random sample of unique records within each category (so 5 unique records from category A1c, 5 unique records from C3e, etc.). I put together a partial solution using sort. However, it only pulls one non-random record from each category:

sort -u -t, -k1,3
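
For example, on the four sample lines above it keeps just one (arbitrary) record per category, something like:

A, 1, c, address1
B, 2, a, address4
C, 3, e, address3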

Is there a way to pull several random sample records within each category?

I think there must be a way to do this using a combination of pipes, uniq, awk or shuf, but I haven't been able to figure it out. I would prefer a command-line solution, since I'm interested in knowing whether this is possible using only bash.

2 Answers

一纸荒年 Trace · 2019-07-28 06:20

Inspired by the use of sort -R in the answer by jm666. -R is a GNU extension to sort, so it may not work on non-GNU systems.

Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order with random order of the following fields.

From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple awk program.

sort -ut, -k1,3 -k4R "$csvfile" | awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5'
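
If the sample size needs to be configurable, the hard-coded 5 can be passed into awk as a variable. A minimal, untested sketch (n_samples is just an illustrative name, not part of the original answer):

n_samples=5
sort -ut, -k1,3 -k4R "$csvfile" |
  awk -F, -v k="$n_samples" 'a != $1$2$3 {a = $1$2$3; n = 0} ++n <= k'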

If your sort doesn't randomise, then the random sample can be extracted with awk:

# Warning! Only slightly tested :)
sort -ut, "$csvfile" | awk -F, '
      BEGIN{srand()}                           # seed so repeated runs give different samples
      # sample(): keep at most 5 random entries from the v[] buffer, print them, empty it
      function sample(){
        for(;n>5;--n)v[int(n*rand())+1]=v[n];  # overwrite a random slot, dropping one entry
        for(;n;--n)print v[n]
      }
      a!=$1$2$3{a=$1$2$3;sample()}             # category changed: flush the previous category
      {v[++n]=$0}                              # buffer the current record
      END      {sample()}'                     # flush the final category

It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
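
For completeness, a rough sketch of that keep-everything-in-awk approach (my assumption of what it might look like, reusing the same discard-at-random idea; not part of the original answer):

awk -F, '
  BEGIN { srand() }                    # seed so repeated runs differ
  !seen[$0]++ {                        # first time this exact record appears
    k = $1 SUBSEP $2 SUBSEP $3         # category key from the first three fields
    v[k, ++cnt[k]] = $0                # buffer every unique record per category
  }
  END {
    for (k in cnt) {                   # categories come out in arbitrary order
      n = cnt[k]
      while (n > 5) {                  # discard random entries until 5 remain
        v[k, int(n * rand()) + 1] = v[k, n]
        n--
      }
      for (; n; n--) print v[k, n]
    }
  }' "$csvfile"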

等我变得足够好 · 2019-07-28 06:21

If I understand right, here is a simple, not very efficient bash solution:

csvfile="./ca.txt"
while read -r cat
do
    grep "^$cat," "$csvfile" | sort -uR | head -5
done < <(cut -d, -f1-3 < "$csvfile" | sort -u)

Decomposition:

  • cut -d, -f1-3 < "$csvfile" - extract the "category" (first 3 fields) from every record
  • sort -u - reduce them to sorted unique categories
  • for each unique category (the while read ... loop):
  • grep "^$cat," "$csvfile" - find all lines belonging to this category
  • sort -uR - sort them randomly by hash (note: duplicate lines get the same hash, so -u keeps only unique records; a shuf-based variant is sketched after this list)
  • head -5 - print the first 5 records from the randomly ordered list
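
Since the question also mentions shuf, the random-ordering step can be swapped for it where GNU coreutils is available. A sketch of the same loop (my variation, not part of the original answer):

while read -r cat
do
    # de-duplicate first, then let shuf pick 5 random lines
    grep "^$cat," "$csvfile" | sort -u | shuf -n 5
done < <(cut -d, -f1-3 < "$csvfile" | sort -u)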