Is there a "canonical" way of doing that? I've been using head -n | tail -1
which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.
By "canonical" I mean a program whose main function is doing that.
The fastest solution for big files is always tail|head, provided that the two distances are known: S, the distance from the start of the file to the starting line (i.e. its line number), and E, the distance from that line to the end of the file (the number of lines from it through the last line). Then, we could use this:
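A minimal sketch of that idea, with S and E defined as above (file.txt is a placeholder file name):

    # S       = line number of the first wanted line, counted from the top
    # E       = number of lines from that line through the end of the file
    # howmany = number of lines to extract
    if (( S < E )); then
        # the wanted block is closer to the top: count from the beginning
        tail -n "+$S" file.txt | head -n "$howmany"
    else
        # the wanted block is closer to the bottom: count from the end
        tail -n "$E" file.txt | head -n "$howmany"
    fi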
howmany is just the count of lines required.
Some more detail in https://unix.stackexchange.com/a/216614/79743
I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.
Set Up
I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.
Because the file has so many rows, my best-case scenario is a solution that extracts only a single line from the file without reading any of the other rows, but I can't think of how I would accomplish this in Bash.
For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).
I will be using the time built-in to benchmark each command.
Baseline
First, let's see how the head | tail solution performs.
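A sketch of the head | tail command being timed (data.txt is a placeholder for my actual data file):

    # read the first 50M lines with head and keep only the last one with tail
    $ time head -n 50000000 data.txt | tail -n 1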
The baseline for row 50 million is 00:01:15.321; if I'd gone straight for row 500 million it would probably have taken ~12.5 minutes.
cut
I'm dubious of this one, but it's worth a shot:
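A sketch of this approach, following the cut answer further down the page (data.txt is a placeholder; this relies on GNU cut accepting a newline delimiter):

    # treat newline as the field delimiter and ask for "field" 50,000,000
    $ time cut -d $'\n' -f 50000000 data.txt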
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.
AWK
I only ran the awk solution that includes the exit statement,
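A sketch of the awk one-liner being tested (data.txt is a placeholder):

    # print line 50,000,000 and stop reading immediately afterwards
    $ time awk 'NR == 50000000 { print; exit }' data.txt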
because I wasn't going to wait for the full file to run. This code ran in 00:01:16.583, which is only ~1 second slower than the baseline, but still not an improvement. At this rate, if the exit command had been excluded, it would probably have taken around ~76 minutes to read the entire file!
Perl
I ran the existing Perl solution as well:
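A sketch of the Perl one-liner form being tested (data.txt is a placeholder):

    # $. is the current line number; print the line and exit once we reach it
    $ time perl -ne '$. == 50000000 && print && exit' data.txt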
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.
sed
This is the top answer on the board, and here's my result:
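A sketch of the Nq;d sed idiom being tested (data.txt is a placeholder):

    # d suppresses every line; q quits (and prints) at line 50,000,000
    $ time sed '50000000q;d' data.txt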
This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.
mapfile
I have bash 3.1 and therefore cannot test the mapfile solution.
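For reference, a mapfile-based approach would look roughly like this sketch, though I cannot run it on bash 3.1 (mapfile needs bash 4+; data.txt is a placeholder):

    # -s skips the first 49,999,999 lines, -n reads just one line into the array
    mapfile -s 49999999 -n 1 line < data.txt
    printf '%s' "${line[0]}"    # the element keeps its trailing newline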
Conclusion
It looks like, for the most part, it's difficult to improve upon the head | tail solution. At best, the sed solution provides a ~3% increase in efficiency.
(Percentages calculated with the formula % = (runtime/baseline - 1) * 100.)
Row 50,000,000 (measured): sed 00:01:12.705; perl 00:01:13.146; head|tail 00:01:15.321; awk 00:01:16.583; cut 00:05:12.156
Row 500,000,000 (extrapolated): sed ~12 min; perl ~12 min; head|tail ~12.5 min
Row 3,339,550,320 (extrapolated): awk ~76 min without the exit
All the above answers directly answer the question. But here's a less direct solution, and a potentially more important idea, to provoke thought.
Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.
The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.
e.g. you might create a list of character positions for newlines:
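A sketch of building such an index with awk (big.txt and big.idx are placeholder names; this assumes single-byte characters and \n line endings):

    # write, for each line, the 1-based byte offset at which it starts
    awk 'BEGIN { off = 0 } { print off + 1; off += length($0) + 1 }' big.txt > big.idx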
then read with tail -c, which actually seeks directly to the appropriate point in the file! E.g., to get line 1000:
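A sketch of the lookup step, using the index built above (file names are placeholders):

    # look up the byte offset of line 1000 in the index, then jump straight to it
    offset=$(sed -n '1000p' big.idx)
    tail -c "+$offset" big.txt | head -n 1

The index file itself is still read line by line here, but it is far smaller than the data file.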
sed -n '2p' file.txt        # will print the 2nd line
sed -n '2011p' file.txt     # the 2011th line
sed -n '10,33p' file.txt    # lines 10 up to 33
sed -n '1p;3p' file.txt     # the 1st and 3rd lines
and so on...
For adding lines with sed, you can check this:
sed: insert a line in a certain position
Lots of good answers already. I personally go with awk. For convenience, if you use bash, just add the below to your ~/.bash_profile. The next time you log in (or if you source your .bash_profile after this update), you will have a nifty new "nth" function available to pipe your files through.

Execute this or put it in your ~/.bash_profile (if using bash) and reopen bash (or execute source ~/.bash_profile):

    # print just the nth piped in line
    nth () { awk -vlnum=${1} 'NR==lnum {print; exit}'; }
Then, to use it, simply pipe through it. For example:
$ yes line | cat -n | nth 5
     5  line
If your file has multiple lines delimited by \n (normally a newline), you can use cut as well:
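A sketch of the invocation this describes (file.txt is a placeholder; this relies on cut accepting a newline delimiter, as GNU cut does):

    # newline as the field delimiter, so field 2 = line 2
    cut -d $'\n' -f 2 file.txt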
You will get the 2nd line from the file.
-f3 gives you the 3rd line.