Question:
I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).
I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using bash and Unix utils (as opposed to manually looping and counting / partitioning lines).
Answer 1:
Have you looked at the split command?
$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just
before each output file is opened
--help display this help and exit
--version output version information and exit
You could do something like this:
split -l 200000 filename
which will create files of 200000 lines each, named xaa, xab, xac, ...
Another option is to split by output file size (pieces still end on line breaks):
split -C 20m --numeric-suffixes input_filename output_prefix
This creates files like output_prefix00 output_prefix01 output_prefix02 ..., each at most 20 MiB in size.
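To see the line-based split on a small scale, here is a minimal sketch. The file name sample.txt and the prefix part_ are made up for illustration, and -d assumes GNU split.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# sample.txt is a hypothetical 25-line stand-in for the large input.
seq 1 25 > sample.txt

# 10 lines per piece; -d (GNU split) gives numeric suffixes: part_00, part_01, part_02.
split -l 10 -d sample.txt part_

# Report the line count of every piece; the last one holds the remainder.
wc -l part_*

# Concatenating the pieces in name order should reproduce the input byte for byte.
cat part_* | cmp - sample.txt && echo "pieces reassemble to the original"
```

The same command with -l 200000 on the real file should behave the same way, only with more and larger pieces.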
Answer 2:
How about the split command?
split -l 200000 mybigfile.txt
Answer 3:
Yes, there is a split command. It will split a file by lines or bytes.
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
Answer 4:
Use split.
It splits a file into fixed-size pieces, creating output files that contain consecutive sections of INPUT (standard input if none is given or INPUT is `-').
Syntax
split [options] [INPUT [PREFIX]]
http://ss64.com/bash/split.html
Answer 5:
Use:
sed -n '1,100p' filename > output.txt
Here, 1 and 100 are the first and last line numbers of the range that is captured in output.txt.
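A single sed call extracts only one range. A rough sketch of repeating it over consecutive ranges follows; the 250-line input, the chunk size of 100, and the output_N.txt names are assumptions for illustration. Every pass rereads the file from the top, so this is likely to be much slower than split on a large input.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"
seq 1 250 > filename          # hypothetical 250-line input

lines_per_file=100            # hypothetical chunk size
total=$(wc -l < filename)
start=1
part=1
while [ "$start" -le "$total" ]; do
    end=$(( start + lines_per_file - 1 ))
    # Print lines start..end, then quit at line end so sed stops reading early.
    sed -n "${start},${end}p;${end}q" filename > "output_${part}.txt"
    start=$(( end + 1 ))
    part=$(( part + 1 ))
done

# Show the line count of each extracted range.
wc -l output_*.txt
```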
Answer 6:
You can also use awk:
awk -v c=1 '{print > (c ".txt")} NR%200000==0{close(c ".txt"); ++c}' largefile
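The awk approach can be tried on a small input. This is a sketch under assumptions: the variable n, the chunk_NNN.txt naming, and the 7-line input are made up for illustration. Each output file is closed before the next one is opened, which should keep awk from running out of file descriptors when there are many pieces.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"
seq 1 7 > largefile           # hypothetical 7-line input

# n is the number of lines per output file (3 here to keep the demo small).
awk -v n=3 '
  # At lines 1, n+1, 2n+1, ... close the previous file and pick the next name.
  (NR - 1) % n == 0 {
      if (out != "") close(out)
      out = sprintf("chunk_%03d.txt", ++c)
  }
  # Every line goes to the currently selected output file.
  { print > out }
' largefile

wc -l chunk_*.txt
```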
Answer 7:
Split the file "file.txt" into files of 10000 lines each:
split -l 10000 file.txt
Answer 8:
If you just want to split into files of x lines each, the answers above about split are fine. But I am curious why no one paid attention to the requirements:
- "without having to count them" -> using wc
- "having the remainder in extra file" -> split does this by default
I can't do that without counting the lines with wc, so this is what I use:
split -l $(( $(wc -l < "$filename") / $chunks )) "$filename"
This can easily be added as a function in your .bashrc, so you can invoke it with just the filename and the number of chunks:
split -l $(( $(wc -l < "$1") / $2 )) "$1"
If you want just x chunks with no remainder in an extra file, round the lines-per-file count up by adding (chunks - 1) to the line count before dividing. I use this approach because I usually want x files rather than x lines per file:
split -l $(( ($(wc -l < "$1") + $2 - 1) / $2 )) "$1"
You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)
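As a sketch of how such a function might look, the helper below uses ceiling division so that at most the requested number of files is produced. The name split_into_chunks, its argument order, and the sample file are hypothetical. An empty input yields a line count of 0, which split rejects, so a real script would probably need to guard against that.

```shell
#!/bin/sh
set -eu

# split_into_chunks FILE CHUNKS [PREFIX]  (hypothetical helper name)
# The body is wrapped in ( ) so it runs in a subshell and its variables stay local.
split_into_chunks() (
    file=$1
    chunks=$2
    prefix=${3:-x}
    total=$(wc -l < "$file")
    # Ceiling division: round lines-per-file up so no extra remainder file appears.
    per_file=$(( (total + chunks - 1) / chunks ))
    split -l "$per_file" "$file" "$prefix"
)

workdir=$(mktemp -d)
cd "$workdir"
seq 1 10 > sample.txt            # hypothetical 10-line input
split_into_chunks sample.txt 3 piece_
wc -l piece_*
```

With 10 lines and 3 chunks the lines-per-file value becomes 4, so the pieces hold 4, 4, and 2 lines. Some combinations give fewer files than requested (10 lines in 6 chunks gives 5 files of 2 lines), which is why the description says "at most".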
Answer 9:
split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines/records
l/K/N output Kth of N to stdout without splitting lines/records
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Thus,
split -n 4 input output.
will generate four files (output.a{a,b,c,d}) with the same number of bytes, but lines might be broken in the middle.
If we want to preserve full lines (i.e. split by lines), then this should work:
split -n l/4 input output.
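The difference between the two forms can be checked on a small file. This is a sketch assuming GNU split 8.8 or later; the prefixes bytes. and lines. and the file third.txt are made up for illustration. Note that l/N still sizes the pieces by bytes, so the line counts of the pieces are not necessarily equal.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"
seq 1 100 > input                # hypothetical 100-line input

# Byte-based: four pieces of near-equal size; a line may be cut in two.
split -n 4 input bytes.

# Line-aware: four pieces, each ending on a line boundary.
split -n l/4 input lines.

# Compare line and byte counts of all pieces.
wc -l -c bytes.* lines.*

# l/K/N writes only the K-th of N line-aware chunks to stdout, creating no files.
split -n l/3/4 input > third.txt
```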
Related answer: https://stackoverflow.com/a/19031247
Answer 10:
Merge small HDFS files with getmerge, then split the result into pieces of a proper size.
This method can break lines in the middle:
split -b 125m compact.file -d -a 3 compact_prefix
I use getmerge and then split the result into files of about 128 MB each.
To split into pieces of about 128 MB, check whether the size unit is M or G. Please test before use.
begainsize=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $1 }')
sizeunit=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $2 }')
# Number of pieces of about 128 MB: 1 G = 8 * 128 M; any other unit is treated as M.
if [ "$sizeunit" = "G" ]; then
    res=$(printf "%.f" "$(echo "scale=5;$begainsize*8" | bc)")
else
    res=$(printf "%.f" "$(echo "scale=5;$begainsize/128" | bc)") # rounded to the nearest integer; ref http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo "$res"
# split into $res files with numeric suffixes. ref http://blog.csdn.net/microzone/article/details/52839598
compact_file_name="${compact_file}_"
echo "compact_file_name: $compact_file_name"
split -n "l/$res" "$basedir/$compact_file" -d -a 3 "$basedir/${compact_file_name}"
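If the merged file is already on local disk, the number of pieces can also be derived from its exact byte size instead of the human-readable du output. The sketch below assumes that situation; target_bytes is a hypothetical variable, set to 1024 so the demo runs on a tiny file (for pieces of about 128 MB it would be 134217728). Because l/N keeps lines whole, individual pieces can end up slightly above or below the target.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"
seq 1 1000 > compact.file        # hypothetical stand-in for the merged file

target_bytes=1024                # demo value; 134217728 would mean 128 MiB
size_bytes=$(wc -c < compact.file)

# Ceiling division in integer arithmetic: enough pieces to cover the whole file.
res=$(( (size_bytes + target_bytes - 1) / target_bytes ))
echo "$res"

# Split into $res line-aligned pieces with 3-digit numeric suffixes (GNU split).
split -n "l/$res" -d -a 3 compact.file compact_prefix_
ls compact_prefix_*
```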