Is it possible to split a huge text file (based on number of lines)?

Posted 2019-09-04 04:43

Question:

I have a .tar.gz archive containing a single 20 GB text file with 20.5 million lines. I cannot extract the file as a whole and save it to disk. I need to do one of the following:

  1. Specify the number of lines per output file (say, 1 million) and get 21 files. This would be the preferred option.
  2. Extract a part of the file based on line numbers, e.g. lines 1000001 to 2000000, to get a file with 1M lines. I would have to repeat this step 21 times with different parameters, which is very bad.

Is it possible at all?

This answer - bash: extract only part of tar.gz archive - describes a different problem.

Answer 1:

To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:

tar Oxzf f.tar.gz | split -l1000000

The above names the output files using split's default scheme (xaa, xab, and so on). If you prefer the output files to be named prefix.nn, where nn is a sequence number, then use:

tar Oxzf f.tar.gz | split -dl1000000 - prefix.

Under this approach:

  • The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split which divides it up into pieces before writing the pieces to disk.

  • The .tar.gz file is read only once.

  • split, through its many options, has a great deal of flexibility.

Explanation

For the tar command:

  • O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.

  • x tells tar to extract the file (as opposed to, say, creating an archive).

  • z tells tar that the archive is in gzip format. On modern tars this flag is optional when extracting, since the compression type is detected automatically.

  • f tells tar to use, as input, the file name specified.

For the split command:

  • -l tells split to split files limited by number of lines (as opposed to, say, bytes).

  • -d tells split to use numeric suffixes for the output files.

  • - tells split to get its input from stdin.
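As a concrete illustration of the naming, here is a small self-contained demo (the /tmp/split_demo path and the chunk size of 4 are made up for the example; GNU coreutils split is assumed):

```shell
# Demo: split 10 numbered lines into chunks of at most 4 lines.
mkdir -p /tmp/split_demo
cd /tmp/split_demo
seq 10 > input.txt

# Default naming: xaa (lines 1-4), xab (5-8), xac (9-10)
split -l4 input.txt

# Numeric suffixes and a custom prefix: prefix.00, prefix.01, prefix.02
split -dl4 input.txt prefix.
```

The same flags apply unchanged when the input arrives on stdin from tar, as in the commands above.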



Answer 2:

You can use tar's --to-stdout (or -O) option to send the extracted file to stdout, then use sed to select the range of lines you want:

#!/bin/bash
# Extract file-to-extract.txt from the archive and write it out in
# 1,000,000-line chunks named part1.txt, part2.txt, ...
l=1
inc=1000000
p=1
while test $l -lt 21000000; do
  e=$(($l + $inc - 1))
  # -f must be immediately followed by the archive name, so use -xzf
  tar -xzf myfile.tar.gz --to-stdout file-to-extract.txt |
      sed -n -e "$l,$e p" -e "${e}q" > part$p.txt
  l=$(($l + $inc))
  p=$(($p + 1))
done
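Note that this loop re-reads and re-decompresses the whole archive once per chunk, which is why the question calls this approach very bad. The sed range selection itself can be tried standalone; a minimal sketch (file paths here are illustrative):

```shell
# Print only lines 11-20 of a 100-line example file.
seq 100 > /tmp/sed_demo.txt
l=11
e=20
# -n suppresses default printing; "11,20 p" prints the range,
# and "20q" makes sed quit there instead of scanning to end of file.
sed -n -e "$l,$e p" -e "${e}q" /tmp/sed_demo.txt > /tmp/sed_part.txt
```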


Answer 3:

Here's a pure Bash solution for option #1, automatically splitting lines into multiple output files.

#!/usr/bin/env bash

set -eu

mkdir -p out
filenum=1
chunksize=1000000
ii=0
: > out/file.$filenum            # start with an empty first chunk
while IFS= read -r line          # -r and IFS= preserve backslashes and whitespace
do
  if [ $ii -ge $chunksize ]
  then
    ii=0
    filenum=$(($filenum + 1))
    : > out/file.$filenum        # truncate the next chunk file
  fi

  printf '%s\n' "$line" >> out/file.$filenum
  ii=$(($ii + 1))
done

This reads lines from stdin and creates files like out/file.1 with the first million lines, out/file.2 with the second million lines, and so on. All you need to do is feed the input to the script, like this:

tar Oxzf big.tar.gz | ./split.sh

This never saves an intermediate file on disk, or even more than one line in memory; it is an entirely streaming solution. It is slow compared to split, but very efficient in terms of space, and quite portable: it should work in POSIX shells other than Bash and on ancient systems with little change.
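For comparison, the same streaming chunking can be sketched as a single awk pass, which avoids the per-line shell loop (the chunk size of 3, the seq input, and the out/ paths are only for the demo; in practice you would use chunk=1000000 and pipe tar's output in):

```shell
mkdir -p out
# Line number NR goes to out/file.N where N = int((NR - 1) / chunk) + 1.
seq 7 | awk -v chunk=3 '{
  file = "out/file." (int((NR - 1) / chunk) + 1)
  print > file
}'
# With chunk=3, out/file.1 holds lines 1-3, file.2 lines 4-6, file.3 line 7
```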



Answer 4:

You can use

sed -n 1,20p /Your/file/Path

where 1 and 20 are the first and last line numbers of the range you want. For example, to save that range to a file:

sed -n 1,20p /Your/file/Path > file1

You can also put the start and end line numbers into variables and use them accordingly.
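A minimal sketch of the variable-driven version (the input file and range here are made-up examples):

```shell
# Select lines $start through $end of an example file.
seq 50 > /tmp/sed_range_input.txt
start=21
end=40
sed -n "${start},${end}p" /tmp/sed_range_input.txt > /tmp/file1
```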