Question:
Are there any alternatives to the code below:
startFromLine = 141978 # or whatever line I need to jump to
urlsfile = open(filename, "rb", 0)
linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1
If I'm processing a huge text file (~15MB) with lines of unknown but varying length, and need to jump to a particular line whose number I know in advance? I feel bad about processing them one by one when I know I could ignore at least the first half of the file. Looking for a more elegant solution if there is any.
Answer 1:
linecache:

The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
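For example, a minimal sketch using the line number from the question:

import linecache

# getline() uses 1-based line numbers and caches the file contents,
# so repeated lookups against the same file are cheap.
line = linecache.getline(filename, 141978)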
Answer 2:
You can't jump ahead without reading the file at least once, since you don't know where the line breaks are. You could do something like:
# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
Answer 3:
You don't really have that many options if the lines are of different length... you sadly need to process the line-ending characters to know when you've progressed to the next line.
You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something other than 0.
0 means the file reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8k, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only reads a bit at a time, discarding each buffered chunk after it's processed.
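A hedged sketch of this buffering advice, with a hypothetical process_line() standing in for the question's DoSomethingWithThisLine():

# Open with an 8 KiB read buffer instead of unbuffered (0).
with open(filename, "r", 8192) as f:
    for line in f:
        process_line(line)  # hypothetical per-line handler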
Answer 4:
I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
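For instance (a minimal sketch; note the resulting list is zero-indexed, so the question's line 141978 sits at index 141977):

with open(filename) as f:
    lines = f.readlines()   # the whole ~15MB file as a list of lines
print(lines[141977])        # line number 141978, one-based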
Answer 5:
Since there is no way to determine the length of all the lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge then you might want to use a generator-based approach:

from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r", 0), 141978):
    DoSomethingWithThisLine(line)

Note: the index is zero-based in this approach.
Answer 6:
I am surprised no one mentioned islice:

import itertools

line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line

or if you want the whole rest of the file:

rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print line

or if you want every other line from the file:

rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print odd_line
Answer 7:
If you know in advance the position in the file (rather than the line number), you can use file.seek() to go to that position.
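A minimal sketch, where byte_position is a hypothetical known byte offset (not a line number):

with open(filename, "rb") as f:
    f.seek(byte_position)   # jump straight to the known byte offset
    line = f.readline()     # read from there to the end of that line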
Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as Python itself might want to do to print a traceback) but not good for a 15MB file.
Answer 8:
If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.
Of course it all depends on what you're trying to do, and how often you will jump across the file.
For instance, if you're going to be jumping to lines many times in the same file, and you know that the file does not change while you work with it, you can do this:
First, pass through the whole file and record the "seek-location" of some key line numbers (say, every 1000 lines).
Then, if you want line 12005, jump to the position of 12000 (which you've recorded), read 5 lines, and you'll know you're on line 12005.
And so on.
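A hedged sketch of this checkpoint scheme (build_checkpoints and get_line are illustrative names, not from the answer):

def build_checkpoints(f, every=1000):
    # Pass 1: record the byte offset of every `every`-th line (zero-based).
    checkpoints = {}                  # line number -> byte offset
    lineno = 0
    while True:
        if lineno % every == 0:
            checkpoints[lineno] = f.tell()
        if not f.readline():          # EOF
            break
        lineno += 1
    return checkpoints

def get_line(f, checkpoints, n, every=1000):
    # Seek to the nearest checkpoint at or before line n, then read forward.
    f.seek(checkpoints[n - n % every])
    for _ in range(n % every):
        f.readline()
    return f.readline()

f = open(filename, "rb")
checkpoints = build_checkpoints(f)
line = get_line(f, checkpoints, 12005)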
Answer 9:
What generates the file you want to process? If it is something under your control, you could generate an index (recording which line is at which position) at the time the file is appended to. The index file can have a fixed line size (space-padded or zero-padded numbers) and will definitely be smaller, and can thus be read and processed quickly.
- Which line do you want?
- Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
- Use seek or whatever to jump directly to that line of the index file.
- Parse it to get the byte offset of the corresponding line in the actual file.
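A minimal sketch under the answer's assumptions, using a hypothetical index file whose records are 10-digit zero-padded byte offsets, one per line, so each record is exactly 11 bytes:

RECORD_SIZE = 11   # 10 digits + newline, fixed for every index line

def offset_from_index(index_path, lineno):
    # Fixed-size records make the index seekable by simple arithmetic.
    with open(index_path, "rb") as idx:
        idx.seek(lineno * RECORD_SIZE)
        return int(idx.read(RECORD_SIZE))

def get_line(data_path, index_path, lineno):
    with open(data_path, "rb") as f:
        f.seek(offset_from_index(index_path, lineno))
        return f.readline()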
Answer 10:
I have had the same problem (needing to retrieve a specific line from a huge file).
Of course, I could run through all the records in the file each time and stop when the counter equals the target line, but that does not work effectively when you want to obtain several specific rows. That leaves the main issue to be resolved: how to get directly to the necessary place in the file.
I came up with the following solution:
First, I fill a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).
t = open(file, 'r')
dict_pos = {}
kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1
Finally, the lookup function:

def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line

t.seek(dict_pos.get(line_number)) moves the file pointer to the start of the target line, so the subsequent readline() returns exactly that line.
Using this approach I have saved a significant amount of time.
Answer 11:
Do the lines themselves contain any index information? If the content of each line were something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
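A hedged sketch of that binary search, assuming each line is formatted as "<index>:data" with indices strictly increasing (find_line is an illustrative name, not from the answer):

def find_line(path, target):
    with open(path, "rb") as f:
        f.seek(0, 2)                 # jump to the end to learn the file size
        lo, hi = 0, f.tell()
        while hi - lo > 1:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()             # discard the partial line we landed in
            line = f.readline()      # first complete line after mid
            if line and int(line.split(b":", 1)[0]) <= target:
                lo = mid             # target starts at or after this point
            else:
                hi = mid             # overshot (or hit EOF): search earlier
        f.seek(lo)
        if lo:
            f.readline()             # discard the partial line unless at start
        for line in f:               # short linear scan to the exact line
            idx = int(line.split(b":", 1)[0])
            if idx == target:
                return line
            if idx > target:
                break
    return None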
Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().
Answer 12:
Here's an example using 'readlines(sizehint)' to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line-oriented.

def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        # readlines(sizehint) returns roughly sizehint bytes' worth of
        # complete lines per call, instead of the whole file at once
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno - lines_read - 1]
        lines_read += len(lines)

print getlineno("nci_09425001_09450000.smi", 12000)
Answer 13:
You may use mmap to find the offset of the lines. mmap seems to be the fastest way to process a file.
Example:

import mmap

with open('input_file', "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1
    for line in iter(mapped.readline, ""):
        i += 1
        if i == Line_I_want_to_jump:
            # after reading line i-1, tell() is the byte offset at which
            # line i starts
            offsets = mapped.tell()

then use f.seek(offsets) to move to the line you need.
Answer 14:
If you're dealing with a text file on a Linux system, you can use Linux commands.
For me, this worked well!

import commands  # Python 2 only; removed in Python 3

def read_line(path, line=1):
    # head -N prints the first N lines; tail -1 keeps only the last of them
    return commands.getoutput('head -%s %s | tail -1' % (line, path))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)
Answer 15:
You can use this function to return line n:

def skipton(infile, n):
    with open(infile, 'r') as fi:
        for i in range(n - 1):
            next(fi)       # skip the first n-1 lines
        return next(fi)    # line n (one-based)