As an awk beginner, I am able to split data into one file per unique value with
awk -F, '{print >> $1".csv"; close($1".csv")}' myfile.csv
But I would like to split a large CSV file on an additional condition: the number of unique values seen in a specific column.
Specifically, with input
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
I would like the output files to be
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
and
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
each of which contains three (in this case) unique values in the first column: 111, 222, 333 and 444, 555, 666 respectively.
Any help would be appreciated.
This one-liner would help:
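The one-liner itself was not preserved in this copy; a sketch matching the description that follows (a u=3 knob, output files 1.csv, 2.csv, ...) could look like this. The filename myfile.csv and the setup line recreating the question's input are assumptions made only to keep the example runnable:

```shell
# Assumption: input file is named myfile.csv, as in the question.
printf '%s\n' 111,1,0,1 111,1,1,1 222,1,1,1 333,1,0,0 333,1,1,1 \
              444,1,1,1 444,0,0,0 555,1,1,1 666,1,0,0 > myfile.csv

# Bump counter n on every new value in column 1; each block of u unique
# values goes to the next numbered file (1.csv, 2.csv, ...).
awk -F, -v u=3 '$1 != prev { prev = $1; n++ }
                { print > ((int((n - 1) / u) + 1) ".csv") }' myfile.csv
```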
You can change the
u=3
value to x to get x unique values per file. If you run this line with your input file, you should get
1.csv and 2.csv
Edit (add some test output):
This will do the trick and I find it pretty readable and easy to understand:
We start with our count at 0 and our filename at 1. We then count each unique value we get from the first column, and whenever it's the fourth one, we reset our count and move on to the next filename.
Here's some sample data I used, which is just yours with some additional lines.
And running the awk like so:
We see the following output files and content:
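The script, sample data, and run output did not survive in this copy; a hypothetical reconstruction matching the description above (count starts at 0, filename at 1, reset on the fourth unique value) might be the following. It uses only the question's original rows, since the answer's additional test lines were not preserved:

```shell
# Assumption: input file is named myfile.csv; only the question's rows
# are recreated here.
printf '%s\n' 111,1,0,1 111,1,1,1 222,1,1,1 333,1,0,0 333,1,1,1 \
              444,1,1,1 444,0,0,0 555,1,1,1 666,1,0,0 > myfile.csv

awk -F, '
BEGIN { count = 0; fname = 1 }                 # count from 0, file from 1
$1 != prev {                                   # a new unique value in column 1
    prev = $1
    count++
    if (count == 4) { count = 1; fname++ }     # fourth unique value: next file
}
{ print > (fname ".csv") }
' myfile.csv
```

With this input, 1.csv ends up holding the five 111/222/333 rows and 2.csv the four 444/555/666 rows.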