I'm in a little over my head on this one, so please pardon my terminology in advance.
I'm running this using Python 2.7 on Windows XP.
I found some Python code that reads a log file, does some stuff, then displays something.
What, that's not enough detail? Ok, here's a simplified version:
#!/usr/bin/python
import re
import sys


class NotSupportedTOCError(Exception):
    pass


def filter_toc_entries(lines):
    # Skip ahead to the TOC header row (five columns separated by "|").
    while True:
        line = lines.next()
        if re.match(r""" \s*
                   .+\s+ \| (?#track)
                   \s+.+\s+ \| (?#start)
                   \s+.+\s+ \| (?#length)
                   \s+.+\s+ \| (?#start sec)
                   \s+.+\s*$ (?#end sec)
                   """, line, re.X):
            lines.next()  # skip the line that follows the header
            break

    # Yield one dict per track row until a line no longer looks like a row.
    while True:
        line = lines.next()
        m = re.match(r"""
            ^\s*
            (?P<num>\d+)
            \s*\|\s*
            (?P<start_time>[0-9:.]+)
            \s*\|\s*
            (?P<length_time>[0-9:.]+)
            \s*\|\s*
            (?P<start_sector>\d+)
            \s*\|\s*
            (?P<end_sector>\d+)
            \s*$
            """, line, re.X)
        if not m:
            break
        yield m.groupdict()


def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)

    tracknums = [int(e['num']) for e in eac]
    if range(1, num_tracks + 1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s" % (tracknums,))

    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets


f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
The code works fine as long as the log file is "simple" text. I'm tempted to say ASCII, although that may not be precise; Notepad++ reports the encoding as ANSI.
However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").
I get the following error:
Traceback (most recent call last):
  File "simple.py", line 55, in <module>
    mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
  File "simple.py", line 49, in calculate_mb_toc_numbers
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range
This log works
This log breaks
I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:
type ascii.log > scrubbed.log
and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).
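To illustrate why I suspect the encoding (a sketch of my understanding, not a confirmed diagnosis): in UTF-16 little endian, every ASCII character is followed by a NUL byte, so a pattern written for plain text would no longer match the raw bytes. The sample row and the two-column `ROW` pattern below are made up for illustration; they only resemble the script's track-row regex.

```python
import re

# Simplified, hypothetical stand-in for the script's track-row pattern.
ROW = re.compile(r"^\s*(?P<num>\d+)\s*\|\s*(?P<start_sector>\d+)\s*$")

text = u" 1 | 0 "                 # made-up TOC row
raw = text.encode("utf-16-le")    # the bytes a UCS-2 LE file would contain

# Treat each byte as one character, which is what a plain open() in
# Python 2 effectively hands to the regex.
matches_raw = ROW.match(raw.decode("latin-1")) is not None

# Decode with the right codec first, then match.
matches_decoded = ROW.match(raw.decode("utf-16-le")) is not None
```

If that is what happens in the real script, no track rows would ever match, the list of entries would be empty, and `eac[-1]` would raise the `IndexError` shown above.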
One workaround would be to "scrub" the log file before passing it to Python (e.g. by redirecting the output of type to a temporary file as above, then running the script on that), but I would like to have Python "ignore" the encoding if that's possible. I'm also not sure how to detect which type of log file the script is reading so that I can act appropriately.
I'm reading this and this, but my head is still spinning, so while that may be my longer-term strategy, I'm wondering if there's an interim hack I could use.
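One possible interim hack, sketched under assumptions: check the first bytes of the file for a byte order mark (BOM) and open the file as text with a matching codec. The helper name `open_log` is hypothetical, the fallback to `cp1252` assumes that "ANSI" means the Western Windows code page, and UTF-32 or BOM-less UTF-16 files are not handled.

```python
import codecs
import io


def open_log(path):
    """Open a log as text, choosing the encoding from a BOM if one is present."""
    with open(path, "rb") as fh:
        head = fh.read(4)

    if head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"   # strips the UTF-8 BOM while decoding
    elif head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        encoding = "utf-16"      # this codec reads the BOM to pick the byte order
    else:
        encoding = "cp1252"      # assumption: "ANSI" here means Windows-1252

    # errors="replace" keeps the script running if a byte cannot be decoded.
    return io.open(path, encoding=encoding, errors="replace")
```

In the script, this would mean writing `f = open_log(sys.argv[1])` instead of `f = open(sys.argv[1])`. The lines then come back as unicode text, which the existing regular expressions should accept unchanged in Python 2.7, though I have not verified this against the real logs.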