I have two files with tens of thousands of lines each, output1.txt and output2.txt. I want to iterate through both files and return the line number (and content) of each line that differs between the two. They're mostly the same, which is why I can't find the differences (filecmp.cmp returns False).
Answer 1:
You can do something like this:
import difflib, sys

tl = 100000  # large number of lines

# create two test files (Unix directories...)
with open('/tmp/f1.txt', 'w') as f:
    for x in range(tl):
        f.write('line {}\n'.format(x))

with open('/tmp/f2.txt', 'w') as f:
    for x in range(tl + 10):  # add 10 lines
        if x in (500, 505, 1000, tl - 2):
            continue  # skip these lines
        f.write('line {}\n'.format(x))

with open('/tmp/f1.txt', 'r') as f1, open('/tmp/f2.txt', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            sys.stdout.write(line)
        elif line.startswith('+'):
            sys.stdout.write('\t\t' + line)
Prints (in 400 ms):
- line 500
- line 505
- line 1000
- line 99998
		+ line 100000
		+ line 100001
		+ line 100002
		+ line 100003
		+ line 100004
		+ line 100005
		+ line 100006
		+ line 100007
		+ line 100008
		+ line 100009
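If you want a running count for each reported difference, use enumerate (the count is the position in the diff output, not the line number in either file):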
with open('/tmp/f1.txt', 'r') as f1, open('/tmp/f2.txt', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for i, line in enumerate(diff):
        if line.startswith(' '):
            continue
        sys.stdout.write('My count: {}, text: {}'.format(i, line))
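Since enumerate numbers the entries of the diff output rather than the lines of either file, the actual file line numbers can be recovered by keeping one counter per file while walking the ndiff output. The following is a minimal sketch under that assumption; the function name diff_with_line_numbers is made up for illustration and is not part of difflib.

```python
import difflib


def diff_with_line_numbers(lines_a, lines_b):
    """Return (marker, line_number, text) for every line unique to one input.

    marker is '-' for lines only in lines_a and '+' for lines only in lines_b;
    line_number is 1-based and refers to the file the line came from.
    """
    results = []
    a_no = b_no = 0
    for entry in difflib.ndiff(lines_a, lines_b):
        marker, text = entry[:2], entry[2:].rstrip('\n')
        if marker == '  ':      # line present in both inputs
            a_no += 1
            b_no += 1
        elif marker == '- ':    # line only in the first input
            a_no += 1
            results.append(('-', a_no, text))
        elif marker == '+ ':    # line only in the second input
            b_no += 1
            results.append(('+', b_no, text))
        # '? ' hint lines belong to neither input and are skipped
    return results
```

It can be fed the same f1.readlines() and f2.readlines() lists used above, and each returned tuple can then be printed in whatever format is convenient.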
Answer 2:
7.4. difflib — Helpers for computing deltas
New in version 2.1.
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.
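As one illustration of the unified format mentioned in that description, difflib.unified_diff can be applied to two lists of lines. This is only a sketch: the wrapper name unified is hypothetical, and the default labels are simply the file names from the question.

```python
import difflib


def unified(lines_a, lines_b, name_a='output1.txt', name_b='output2.txt'):
    """Return the unified diff of two lists of newline-terminated lines."""
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=name_a, tofile=name_b))
```

The result is empty when the inputs are identical. Otherwise it starts with the two header lines, followed by hunks whose "@@" headers give the affected line ranges in each file.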
Answer 3:
As long as you don't care about order you could use:
with open('file1') as f:
    t1 = f.read().splitlines()
    t1s = set(t1)

with open('file2') as f:
    t2 = f.read().splitlines()
    t2s = set(t2)

# in file1 but not file2 (the index printed is zero-based)
print("Only in file1")
for diff in t1s - t2s:
    print('{} {}'.format(t1.index(diff), diff))

# in file2 but not file1
print("Only in file2")
for diff in t2s - t1s:
    print('{} {}'.format(t2.index(diff), diff))
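The set comparison above calls list.index once per differing line, which rescans the list each time and reports only the first occurrence of a repeated line. A possible variation, sketched here with a hypothetical helper named only_in_first, records the first 1-based line number of each distinct line in a dictionary so every lookup is a single dictionary access.

```python
def only_in_first(lines_a, lines_b):
    """Return sorted (line_number, text) pairs for lines of lines_a absent from lines_b."""
    first_seen = {}
    for number, text in enumerate(lines_a, start=1):
        # setdefault keeps the number of the first occurrence of a repeated line
        first_seen.setdefault(text, number)
    missing = set(first_seen) - set(lines_b)
    return sorted((first_seen[text], text) for text in missing)
```

Calling it as only_in_first(t1, t2) and then only_in_first(t2, t1) would cover both directions. Like the set version, it ignores ordering and duplicate counts.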
Edit:
If you do care about order and they're mostly the same, then why not just use the diff command?