I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python but I'm wondering if there's any kind of ninja way to do this using bash and unix utils (as opposed to manually looping and counting / partitioning lines).

标签： bash file unix

10条回答

孤独寂梦人

2楼-- · 2019-01-01 03:11

How about the split command?

split -l 200000 mybigfile.txt

0人赞添加讨论(0) 举报

不再属于我。

3楼-- · 2019-01-01 03:12

Use:

sed -n '1,100p' filename > output.txt

Here, 1 and 100 are the line numbers which you will capture in output.txt.

0人赞添加讨论(0) 举报

旧时光的记忆

4楼-- · 2019-01-01 03:14

split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, split -n 4 input output. will generate four files (output.a{a,b,c,d}) with the same amount of bytes, but lines might be broken in the middle.

If we want to preserve full lines (i.e. split by lines), then this should work:

split -n l/4 input output.

Related answer: https://stackoverflow.com/a/19031247

0人赞添加讨论(0) 举报

梦寄多情

5楼-- · 2019-01-01 03:15

split the file "file.txt" into 10000 lines files:

split -l 10000 file.txt

0人赞添加讨论(0) 举报

弹指情弦暗扣

6楼-- · 2019-01-01 03:16

HDFS getmerge small file and spilt into property size.

This method will cause line break

split -b 125m compact.file -d -a 3 compact_prefix

I try to getmerge and split into about 128MB every file.

split into 128m ,judge sizeunit is M or G ,please test before use.

begainsize=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $1}' `
sizeunit=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $2}' `
if [ $sizeunit = "G" ];then
    res=$(printf "%.f" `echo "scale=5;$begainsize*8 "|bc`)
else
    res=$(printf "%.f" `echo "scale=5;$begainsize/128 "|bc`)  # celling ref http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo $res
# split into $res files with number suffix.  ref  http://blog.csdn.net/microzone/article/details/52839598
compact_file_name=$compact_file"_"
echo "compact_file_name :"$compact_file_name
split -n l/$res $basedir/$compact_file -d -a 3 $basedir/${compact_file_name}

0人赞添加讨论(0) 举报

余生无你

7楼-- · 2019-01-01 03:17

Have you looked at the split command?

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:

split -l 200000 filename

which will create files each with 200000 lines named xaa xab xac ...

Another option, split by size of output file (still splits on line breaks):

 split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix01 output_prefix02 output_prefix03 ... each of max size 20 megabytes.

0人赞添加讨论(0) 举报

1 2 下一页

How to split a large text file into smaller files

split -b 125m compact.file -d -a 3 compact_prefix

split into 128m ,judge sizeunit is M or G ,please test before use.

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间