Bash tool to get nth line from a file

Posted 2019-01-01 03:01

Question:

Is there a "canonical" way of doing that? I've been using head -n | tail -1, which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.

By "canonical" I mean a program whose main function is doing that.

Answer 1:

Piping head into tail will be slow for a huge file. I would suggest sed like this:

sed 'NUMq;d' file

Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file prints the 10th line of file.

Explanation:

NUMq will quit immediately when the line number is NUM.

d will delete the line instead of printing it; this is inhibited on the last line because the q causes the rest of the script to be skipped when quitting.

If you have NUM in a variable, you will want to use double quotes instead of single:

sed "${NUM}q;d" file
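Since the double-quoted form interpolates the variable straight into the sed script, it may be worth checking the value first. Below is a minimal sketch, assuming a hypothetical helper name nth_line (not a standard tool) that refuses anything other than a string of digits before handing it to sed:

```shell
# nth_line is a hypothetical helper name, not a standard tool.
# Usage: nth_line NUM FILE
nth_line() {
    local num=$1 file=$2
    # Reject empty values, values containing a non-digit, and a literal 0,
    # so the argument cannot be interpreted as extra sed commands.
    case $num in
        ''|*[!0-9]*|0)
            echo "nth_line: NUM must be a positive integer" >&2
            return 2
            ;;
    esac
    # Same technique as above: quit at line NUM, delete every earlier line.
    sed "${num}q;d" "$file"
}
```

Called as nth_line 10 file, it behaves like sed '10q;d' file.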


Answer 2:

sed -n '2p' < file.txt

will print the 2nd line

sed -n '2011p' < file.txt

the 2011th line

sed -n '10,33p' < file.txt

lines 10 through 33

sed -n '1p;3p' < file.txt

the 1st and 3rd lines

and so on...
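One caveat with the plain range form is that sed keeps reading to the end of the file after the last wanted line. A rough sketch of a range printer that also quits early, assuming a hypothetical function name sed_range:

```shell
# sed_range is a hypothetical helper name.
# Usage: sed_range FIRST LAST FILE
sed_range() {
    local first=$1 last=$2 file=$3
    # -n suppresses automatic printing; "FIRST,LASTp" prints the range,
    # and "LASTq" stops reading once the last wanted line has been printed.
    sed -n "${first},${last}p;${last}q" "$file"
}
```

For example, sed_range 10 33 file.txt prints the same lines as sed -n '10,33p' < file.txt, but stops at line 33.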

For adding lines with sed, you can check this:

sed: insert a line in a certain position



Answer 3:

I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.

Set Up

I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.

Because the file has so many rows:

  • I need to extract only a subset of the rows to do anything useful with the data.
  • Reading through every row leading up to the values I care about is going to take a long time.
  • If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.

My best-case scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.

For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).

I will be using the time built-in to benchmark each command.

Baseline

First let's see how the head | tail solution performs:

$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0

real    1m15.321s

The baseline for row 50 million is 00:01:15.321; if I'd gone straight for row 500 million it would probably have taken ~12.5 minutes.

cut

I'm dubious of this one, but it's worth a shot:

$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0

real    5m12.156s

This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.

AWK

I only ran the solution with the exit because I wasn't going to wait for the full file to run:

$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0

real    1m16.583s

This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate, if the exit command had been excluded it would probably have taken ~76 minutes to read the entire file!

Perl

I ran the existing Perl solution as well:

$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0

real    1m13.146s

This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.

sed

This is the top answer on the board; here's my result:

$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0

real    1m12.705s

This code ran in 00:01:12.705, which is ~3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would probably have taken ~12 minutes.

mapfile

I have bash 3.1 and therefore cannot test the mapfile solution.

Conclusion

It looks like, for the most part, it's difficult to improve upon the head | tail solution. At best the sed solution provides a ~3% increase in efficiency.

(percentages calculated with the formula % = (runtime/baseline - 1) * 100)

Row 50,000,000

  1. 00:01:12.705 (-00:00:02.616 = -3.47%) sed
  2. 00:01:13.146 (-00:00:02.175 = -2.89%) perl
  3. 00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
  4. 00:01:16.583 (+00:00:01.262 = +1.68%) awk
  5. 00:05:12.156 (+00:03:56.835 = +314.43%) cut

Row 500,000,000

  1. 00:12:07.050 (-00:00:26.160) sed
  2. 00:12:11.460 (-00:00:21.750) perl
  3. 00:12:33.210 (+00:00:00.000) head|tail
  4. 00:12:45.830 (+00:00:12.620) awk
  5. 00:52:01.560 (+00:40:31.650) cut

Row 3,338,559,320

  1. 01:20:54.599 (-00:03:05.327) sed
  2. 01:21:24.045 (-00:02:25.227) perl
  3. 01:23:49.273 (+00:00:00.000) head|tail
  4. 01:25:13.548 (+00:02:35.735) awk
  5. 05:47:23.026 (+04:24:26.246) cut
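
The percentage formula quoted above the tables can be checked with a couple of lines of awk. This is only a sketch; the function name pct_vs_baseline is made up for illustration, and the run times are converted to seconds by hand:

```shell
# pct_vs_baseline is a hypothetical helper name.
# Usage: pct_vs_baseline RUNTIME_SECONDS BASELINE_SECONDS
pct_vs_baseline() {
    # % = (runtime/baseline - 1) * 100, printed with an explicit sign.
    # LC_ALL=C keeps the decimal point a "." regardless of locale.
    LC_ALL=C awk -v r="$1" -v b="$2" \
        'BEGIN { printf "%+.2f%%\n", (r / b - 1) * 100 }'
}

# sed at row 50,000,000: 1m12.705s against the 1m15.321s head|tail baseline
pct_vs_baseline 72.705 75.321     # prints -3.47%
```

The same call with 312.156 (cut) as the run time gives the +314.43% shown in the first table.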


Answer 4:

With awk it is pretty fast:

awk 'NR == num_line' file

When this is true, the default behaviour of awk is performed: {print $0}.


Alternative versions

If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time.

awk 'NR == num_line {print; exit}' file

If you want to give the line number from a bash variable you can use:

awk 'NR == n' n="$num" file
awk -v n="$num" 'NR == n' file   # equivalent


Answer 5:

Wow, all the possibilities!

Try this:

sed -n "${lineNum}p" "$file"

or one of these depending upon your version of Awk:

awk -vlineNum=$lineNum 'NR == lineNum {print $0}' "$file"
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' "$file"
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum "$file"

(You may have to try the nawk or gawk command).

Is there a tool that does nothing but print that particular line? Not among the standard tools. However, sed is probably the closest and simplest to use.



Answer 6:

# print line number 52
sed '52!d' file

Useful one-line scripts for sed



Answer 7:

This question being tagged Bash, here's the Bash (≥4) way of doing it: use mapfile with the -s (skip) and -n (count) options.

If you need to get the 42nd line of a file named file:

mapfile -s 41 -n 1 ary < file

At this point, you'll have an array ary whose elements contain the lines of file (including the trailing newline), where we have skipped the first 41 lines (-s 41) and stopped after reading one line (-n 1). So that's really the 42nd line. To print it out:

printf '%s' "${ary[0]}"

If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and print them on stdout:

mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[@]}"

If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the -t option (trim):

mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[@]}"

You can have a function do that for you:

print_file_range() {
    # $1-$2 is the range of file $3 to be printed to stdout
    local ary
    mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
    printf '%s' "${ary[@]}"
}

No external commands, only Bash builtins!



Answer 8:

You may also use sed's print and quit:

sed -n '10{p;q;}' file   # print line 10


Answer 9:

According to my tests, in terms of performance and readability my recommendation is:

tail -n+N | head -1

N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.

tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.


The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:

head -7 input.txt | tail -1

When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by tail | head (from above) when the files become huge.

The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution, and it is also slower than tail/head.

In my tests, both tail/head versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tail/head was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.

To get an idea about the performance differences, these are the numbers that I get for a huge file (9.3G):

  • tail -n+N | head -1: 3.7 sec
  • head -N | tail -1: 4.6 sec
  • sed Nq;d: 18.8 sec

Results may differ, but the performance of head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).

To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:

#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3

seq 1 "$size" > "$file"
echo "*** head -N | tail -1 ***"
for i in $(seq 1 "$retries") ; do
    time head "-$pos" "$file" | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo

seq 1 "$size" > "$file"
ls -alhg "$file"
for i in $(seq 1 "$retries") ; do
    time tail -n+"$pos" "$file" | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo

seq 1 "$size" > "$file"
ls -alhg "$file"
for i in $(seq 1 "$retries") ; do
    time sed "${pos}q;d" "$file"
done
/bin/rm "$file"

Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:

*** head -N | tail -1 ***
500000000

real    0m9,800s
user    0m7,328s
sys     0m4,081s
500000000

real    0m4,231s
user    0m5,415s
sys     0m2,789s
500000000

real    0m4,636s
user    0m5,935s
sys     0m2,684s
-------------------------

*** tail -n+N | head -1 ***

-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000

real    0m6,452s
user    0m3,367s
sys     0m1,498s
500000000

real    0m3,890s
user    0m2,921s
sys     0m0,952s
500000000

real    0m3,763s
user    0m3,004s
sys     0m0,760s
-------------------------

*** sed Nq;d ***

-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000

real    0m23,675s
user    0m21,557s
sys     0m1,523s
500000000

real    0m20,328s
user    0m18,971s
sys     0m1,308s
500000000

real    0m19,835s
user    0m18,830s
sys     0m1,004s


Answer 10:

You can also use Perl for this:

perl -wnl -e '$.== NUM && print && exit;' some.file


Answer 11:

The fastest solution for big files is always tail|head, provided that the two distances:

  • from the start of the file to the starting line; let's call it S
  • from the last line to the end of the file; call it E

are known. Then, we could use this:

mycount="$E"; (( E > S )) && mycount="+$S"
howmany="$(( endline - startline + 1 ))"
tail -n "$mycount" | head -n "$howmany"

howmany is just the count of lines required.

Some more detail in https://unix.stackexchange.com/a/216614/79743
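
Here is one way the idea could be spelled out as a function. This is only a sketch of my reading of the snippet above: it assumes the total line count is already known (passed in as TOTAL), takes S to be the starting line number and E to be the number of lines from the starting line to the end of the file, and uses a made-up function name:

```shell
# range_from_nearest_end is a hypothetical helper name.
# Usage: range_from_nearest_end FILE STARTLINE ENDLINE TOTAL
# TOTAL is the file's line count, assumed to be known in advance.
range_from_nearest_end() {
    local file=$1 startline=$2 endline=$3 total=$4
    local howmany=$(( endline - startline + 1 ))
    # Number of lines from STARTLINE through the end of the file.
    local from_end=$(( total - startline + 1 ))
    if (( from_end < startline )); then
        # Closer to the end: ask tail for the last from_end lines.
        tail -n "$from_end" "$file" | head -n "$howmany"
    else
        # Closer to the start: ask tail to begin at STARTLINE.
        tail -n "+$startline" "$file" | head -n "$howmany"
    fi
}
```

Either branch prints the same lines; the point of choosing is only which end of the file tail has to work from.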



Answer 12:

As a follow-up to CaffeineConnoisseur's very helpful benchmarking answer... I was curious as to how fast the 'mapfile' method was compared to the others (as that wasn't tested), so I tried a quick-and-dirty speed comparison myself, as I do have bash 4 handy. I threw in a test of the "tail | head" method (rather than head | tail) mentioned in one of the comments on the top answer while I was at it, as folks are singing its praises. I don't have anything nearly the size of the test file used; the best I could find on short notice was a 14M pedigree file (long lines that are whitespace-separated, just under 12000 lines).

Short version: mapfile appears faster than the cut method, but slower than everything else, so I'd call it a dud. tail | head, OTOH, looks like it could be the fastest, although with a file this size the difference is not all that substantial compared to sed.

$ time head -11000 [filename] | tail -1
[output redacted]

real    0m0.117s

$ time cut -f11000 -d$'\n' [filename]
[output redacted]

real    0m1.081s

$ time awk 'NR == 11000 {print; exit}' [filename]
[output redacted]

real    0m0.058s

$ time perl -wnl -e '$.== 11000 && print && exit;' [filename]
[output redacted]

real    0m0.085s

$ time sed "11000q;d" [filename]
[output redacted]

real    0m0.031s

$ time (mapfile -s 11000 -n 1 ary < [filename]; echo ${ary[0]})
[output redacted]

real    0m0.309s

$ time tail -n+11000 [filename] | head -n1
[output redacted]

real    0m0.028s

Hope this helps!



Answer 13:

If you have multiple lines delimited by \n (normally a newline), you can use 'cut' as well:

echo "$data" | cut -f2 -d$'\n'

You will get the 2nd line of the input. -f3 gives you the 3rd line.



Answer 14:

All the above answers directly answer the question. But here's a less direct solution, and a potentially more important idea, to provoke thought.

Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.

The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.

e.g. you might create a list of the character positions at which each line begins:

awk 'BEGIN{c=0;print(c)}{c+=length()+1;print(c+1)}' file.txt > file.idx

then read with tail, which actually seeks directly to the appropriate point in the file!

e.g. to get line 1000:

tail -c +$(awk 'NR==1000' file.idx) file.txt | head -1

  • This may not work with 2-byte / multibyte characters, since awk is "character-aware" but tail is not.
  • I haven't tested this against a large file.
  • Also see this answer.
  • Alternatively - split your file into smaller files!
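
A slightly fuller sketch of the index idea follows. The function names build_index and indexed_line are made up, the index is written next to the data as FILE.idx, and the file is assumed to use plain \n line endings. Running awk under LC_ALL=C is meant to make length() count bytes, which is what tail -c expects:

```shell
# build_index and indexed_line are hypothetical helper names.

# Write the 1-based byte offset at which each line of FILE starts,
# one offset per row, into FILE.idx.
build_index() {
    # LC_ALL=C so that awk's length() counts bytes, not characters.
    LC_ALL=C awk 'BEGIN { off = 1 } { print off; off += length($0) + 1 }' \
        "$1" > "$1.idx"
}

# Print line N of FILE by jumping to its recorded byte offset.
indexed_line() {
    local file=$1 n=$2 off
    off=$(sed "${n}q;d" "$file.idx")
    [ -n "$off" ] || return 1        # N is past the end of the index
    tail -c "+$off" "$file" | head -n 1
}
```

Note that looking up the offset is still a linear scan, just over the index rather than the data; an index with fixed-width entries would allow seeking there as well.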


Answer 15:

One possible way:

sed -n 'NUM{p;q}'

Note that without the q command, if the file is large, sed continues to work, which slows down the computation.



Answer 16:

Lots of good answers already. I personally go with awk. For convenience, if you use bash, just add the below to your ~/.bash_profile. The next time you log in (or if you source your .bash_profile after this update), you will have a nifty new "nth" function available to pipe your files through.

Execute this or put it in your ~/.bash_profile (if using bash) and reopen bash (or execute source ~/.bash_profile)

# print just the nth piped in line
nth () { awk -vlnum=${1} 'NR==lnum {print; exit}'; }

Then, to use it, simply pipe through it. E.g.:

$ yes line | cat -n | nth 5
     5	line



Answer 17:

To print the nth line using sed with a variable as the line number:

a=4
sed -e "${a}q;d" file

Here the '-e' flag adds the script that follows to the commands to be executed.



Answer 18:

Using what others mentioned, I wanted this to be a quick & dandy function in my bash shell.

Create a file: ~/.functions

Add to it the contents:

getline() {
    local line=$1
    sed "${line}q;d" "$2"
}

Then add this to your ~/.bash_profile:

source ~/.functions

Now when you open a new bash window, you can just call the function as so:

getline 441 myfile.txt