Splitting text files in bulk every n lines

Posted 2019-04-30 00:00

Question:

I have a folder that contains multiple text files. I'm trying to split every text file into pieces of 10,000 lines each while keeping the base file name, i.e. if filename1.txt contains 20,000 lines, the output should be filename1-1.txt (10,000 lines) and filename1-2.txt (10,000 lines).

I tried split -10000 filename1.txt, but this does not keep the base file name, and I would have to repeat the command for each text file in the folder. I also tried for f in *.txt; do split -10000 $f.txt; done. That didn't work either.

Any idea how I can do this? Thanks.

Answer 1:

for f in filename*.txt; do split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"; done

Or, written over multiple lines:

for f in filename*.txt
do
    split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"
done

How it works:

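  • -d tells split to use numeric suffixes (0, 1, 2, ...) instead of the default alphabetic ones.

  • -a1 tells split to use a suffix exactly one character long, so each input file can be split into at most ten pieces (0 through 9). Use -a2 or more for larger files.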

  • -l10000 tells split to split every 10,000 lines.

  • --additional-suffix=.txt tells split to add .txt to the end of the names of the new files.

  • "$f" tells split the name of the file to split.

  • "${f%.txt}-" tells split the prefix name to use for the split files.
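
To see how these options fit together, here is a minimal sketch on throwaway data. It assumes GNU split (for --additional-suffix) and seq are available. The scratch directory and the 25-line sample file are made up for illustration, and the chunk size is reduced to 10 lines so the result is easy to inspect.

```shell
# Work in a scratch directory so no real files are touched (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

f=filename1.txt
prefix="${f%.txt}-"     # remove the trailing ".txt", then append "-"
echo "$prefix"          # the prefix handed to split

# 25 lines at 10 lines per piece should give pieces of 10, 10 and 5 lines.
seq 1 25 > "$f"
split -d -a1 -l10 --additional-suffix=.txt "$f" "$prefix"

# Show the line count of every piece.
wc -l "$prefix"*.txt
```

The original file is left in place; only the new filename1-0.txt, filename1-1.txt and filename1-2.txt are created next to it.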

Example

Suppose that we start with these files:

$ ls
filename1.txt  filename2.txt

Then we run our command:

$ for f in filename*.txt; do split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"; done

When this is done, we now have the original files and the new split files:

$ ls
filename1-0.txt  filename1-1.txt  filename1.txt  filename2-0.txt  filename2-1.txt  filename2.txt
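
Note that the numbering above starts at -0, while the question asked for -1 and -2. With GNU split, replacing -d with the long option --numeric-suffixes=1 should start the numbering at 1 instead. The following is a sketch under that assumption (a reasonably recent GNU coreutils), run in a made-up scratch directory with a generated 20,000-line file:

```shell
# Scratch directory and sample data are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 20000 > filename1.txt

for f in filename*.txt
do
    # --numeric-suffixes=1 numbers the pieces 1, 2, 3, ... instead of 0, 1, 2, ...
    # With -a1 that leaves room for at most nine pieces per input file.
    split --numeric-suffixes=1 -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"
done

ls
```

The starting value has to be attached to the long option with =; the short form -d does not take an argument.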

Using older, less featureful forms of split

If your split does not offer --additional-suffix, then consider:

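for f in filename*.txt
do
    # Split without the extension first: filename1-0, filename1-1, ...
    split -d -a1 -l10000 "$f" "${f%.txt}-"
    # Then append .txt to each piece. The single ? matches the one-character
    # suffix from -a1, so pieces already renamed on an earlier run are skipped.
    for g in "${f%.txt}-"?
    do
        mv "$g" "$g.txt"
    done
done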
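
If your split lacks -d as well, the same rename idea should still work with the options that POSIX specifies (-l and -a), at the cost of alphabetic suffixes such as aa, ab, ac. A minimal sketch, using a made-up scratch directory, a generated 25-line file and 10 lines per piece:

```shell
# Scratch directory and sample data are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 25; i++) print i }' > filename1.txt

for f in filename*.txt
do
    base="${f%.txt}"
    # Only -l and -a are used here; pieces are named filename1-aa, filename1-ab, ...
    split -l 10 -a 2 "$f" "$base-"
    # ?? matches exactly the two-letter suffix, so renamed pieces are not picked up again.
    for g in "$base-"??
    do
        mv "$g" "$g.txt"
    done
done

ls
```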


Answer 2:

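No need for shell loops; a single awk command handles all the files:

awk 'FNR%10000==1{if(FNR==1)c=0; close(out); out=FILENAME; sub(/\.txt$/,"-"(++c)".txt",out)} {print > out}' *.txt

Here FNR is the line number within the current input file, c counts the pieces of that file, and sub() rewrites the copy of FILENAME held in out, so filename1.txt yields filename1-1.txt, filename1-2.txt, and so on. Passing out as the third argument to sub() matters: without it, sub() would modify the current line instead, and the output would be written over the input file.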
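
An awk approach of this kind is easy to try on throwaway data before running it on real files. The sketch below is one way to do that, not necessarily the answerer's exact command: the scratch directory, the sample files and the variable n (lines per piece, set to 10 here) are assumptions made for illustration.

```shell
# Scratch directory and sample data are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 25 > filename1.txt
seq 1 12 > filename2.txt

awk -v n=10 '
    (FNR - 1) % n == 0 {                       # first line of each piece
        if (FNR == 1) c = 0                    # new input file: restart the piece counter
        if (out != "") close(out)              # finish the previous piece
        out = FILENAME
        sub(/\.txt$/, "-" (++c) ".txt", out)   # filename1.txt becomes filename1-1.txt, ...
    }
    { print > out }                            # append the current line to the current piece
' filename*.txt

wc -l filename*-*.txt
```

The shell expands filename*.txt before awk starts, so the newly written pieces are not read back in as input during the same run. Running the command a second time in the same directory would pick them up, though, so a narrower glob or a separate output directory may be safer.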