Pull random unique samples within sorted categories

I have a large unsorted CSV file (>4M records). Each record has a category, which is described in the first three columns. The rest of the record is address data which may or may not be unique.

A, 1, c, address1  # the category for this record is A1c
A, 1, c, address2
C, 3, e, address3  # the category for this record is C3e
B, 2, a, address4

I would like to pull a random sample of unique records within each category (so 5 unique records from category A1c, 5 unique records from C3e, etc.). I put together a partial solution using sort. However, it only pulls one non-random record from each category:

sort -u -t, -k1,3

Is there a way to pull several random sample records within each category?

I think there must be a way to do this by using a combination of pipes, uniq, awk or shuf, but haven't been able to figure it out. I would prefer a command-line solution since I'm interested in knowing if this is possible using only bash.

Tags: bash sorting unix random command-line

2 answers

一纸荒年 Trace。

#2 · 2019-07-28 06:20

Inspired by the use of sort -R in the answer by jm666. This is a GNU extension to sort, so it may not work on non-GNU systems.

Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order with random order of the following fields.

From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple awk program.

sort -ut, -k1,3 -k4R "$csvfile" | awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5'
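
For readability, the awk stage of that one-liner can be spelled out as a small shell function. This is only a sketch: the name first_n_per_category and the limit parameter are made up for illustration, and it assumes the input is already grouped by its first three comma-separated fields (which the sort stage provides).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer): keep the first N
# lines of each category. Input must already be grouped by fields 1-3.
first_n_per_category() {
    local limit="${1:-5}"
    awk -F, -v limit="$limit" '
        {
            key = $1 FS $2 FS $3           # category key, separators kept
            if (NR == 1 || key != prev) {  # a new category starts here
                prev = key
                count = 0
            }
            if (++count <= limit) print    # emit only the first "limit" lines
        }'
}

# Usage (GNU sort assumed for the R key modifier):
#   sort -ut, -k1,3 -k4R "$csvfile" | first_n_per_category 5
```

Joining the fields with the separator, rather than writing $1$2$3, avoids treating categories such as "A,1c" and "A1,c" as the same key.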

If your sort doesn't randomise, then the random sample can be extracted with awk:

# Warning! Only slightly tested :)
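sort -ut, "$csvfile" | awk -F, '
      BEGIN    {srand()}   # seed rand() so each run draws a different sample
      function sample(){
        # discard random entries until at most 5 remain, then print the rest
        for(;n>5;--n)v[int(n*rand())+1]=v[n];
        for(;n;--n)print v[n]
      }
      a!=$1$2$3{a=$1$2$3;sample()}
      {v[++n]=$0}
      END      {sample()}'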

It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
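
A rough sketch of what such an awk-only version might look like is below. It is an illustration rather than a tested solution: the function name sample_per_category and the size parameter are invented here, and it uses per-category reservoir sampling, a technique not mentioned above, so only the retained samples are stored per category. The seen array still holds every distinct line, which is where the memory cost comes from.

```shell
#!/usr/bin/env bash
# Hypothetical awk-only sampler (no sort): dedupe whole lines, then keep a
# uniform random sample of up to "size" records per category.
sample_per_category() {
    local size="${1:-5}"
    awk -F, -v size="$size" '
        BEGIN { srand() }
        seen[$0]++ { next }              # skip exact duplicate records
        {
            key = $1 FS $2 FS $3         # category = first three fields
            k = ++total[key]             # unique records so far in category
            if (k <= size) {
                pick[key, k] = $0        # reservoir not yet full
            } else {
                j = int(rand() * k) + 1  # uniform integer in 1..k
                if (j <= size) pick[key, j] = $0
            }
        }
        END {
            for (key in total) {         # category order is unspecified
                m = (total[key] < size) ? total[key] : size
                for (i = 1; i <= m; i++) print pick[key, i]
            }
        }'
}

# Usage:
#   sample_per_category 5 < "$csvfile"
```

The output is not in category order; pipe it through sort if that matters.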


等我变得足够好

#3 · 2019-07-28 06:21

If I understand correctly, here is a simple, though not very efficient, bash solution:

csvfile="./ca.txt"
while read -r cat
do
    grep "^$cat," "$csvfile" | sort -uR | head -5
done < <(cut -d, -f1-3 < "$csvfile" |sort -u)

Breakdown:

cut -d, -f1-3 < "$csvfile" - extract the "categories" (the first 3 fields) from every line
sort -u - reduce them to a sorted list of unique categories
for each unique category (while read ...):
grep "^$cat," "$csvfile" - find all lines belonging to this category
sort -uR - sort them randomly by hash (duplicates have the same hash, so -u keeps only one copy of each)
head -5 - print the first 5 records of the randomly sorted list
