I am having trouble with varying text encodings when opening text files to search their contents for a match.

I am writing a script that scans the file system for log files with specific contents in order to copy them to an archive. The file names are often changed, so the contents are the only reliable way to identify them. I need to identify *.txt files and find within their contents a string that is unique to these particular log files.

I have the code below, which mostly works. The problem is that the logs may have their encoding changed if they are opened and edited. In that case, Python won't match the search term to the contents, because the contents are garbled when Python uses the wrong encoding to open the file.
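To be concrete, here is what the garbling looks like in an interpreter session: ASCII text stored as UTF-16-LE has NUL bytes between the characters, so a plain substring test never matches.

>>> data = u"Base Data".encode('utf_16_le')
>>> data
'B\x00a\x00s\x00e\x00 \x00D\x00a\x00t\x00a\x00'
>>> "Base Data" in data
False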
import os
import codecs

# Filepaths to search
FILEPATH = "SomeDrive:\\SomeDirs\\"
# Text to match in file names
MATCH_CONDITION = ".txt"
# Text to match in file contents
MATCH_CONTENT = "--------Base Data Details:--------------------"

for root, dirs, files in os.walk(FILEPATH):
    for f in files:
        if MATCH_CONDITION in f:
            print "Searching: " + os.path.join(root, f)

            # ATTEMPT A -
            # matches only text files re-encoded as ANSI,
            # UTF-8, UTF-8 no BOM
            #search_file = open(os.path.join(root, f), 'r')

            # ATTEMPT B -
            # matches text files output from Trimble software as
            # "UCS-2 LE w/o BOM", also "UCS-2 Little Endian"
            # (same file resaved using Windows Notepad)
            search_file = codecs.open(os.path.join(root, f), 'r', 'utf_16_le')

            file_data = search_file.read()
            if MATCH_CONTENT in file_data:
                print "CONTENTS MATCHED: " + f
            search_file.close()
I can open the files in Notepad++, which detects the encoding. Python's built-in open() does not detect the encoding automatically. I can use codecs.open and specify the encoding to catch a single encoding, but then have to write excess code to catch the rest. I've read the Python codecs module documentation and it seems to be devoid of any automatic detection.
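To illustrate what I mean by "excess code", the best I have come up with so far is an untested sketch that simply tries each encoding I expect in turn (using cp1252 as my assumption for what Notepad++ calls "ANSI"):

import codecs

CANDIDATE_ENCODINGS = ['utf_16_le', 'utf_8', 'cp1252']  # cp1252 ~ Windows "ANSI"

def file_contains(path, needle, encodings=CANDIDATE_ENCODINGS):
    # A wrong guess either raises UnicodeDecodeError or yields garbled
    # text that the needle cannot match, so just try the next candidate.
    for encoding in encodings:
        try:
            with codecs.open(path, 'r', encoding) as search_file:
                if needle in search_file.read():
                    return True
        except UnicodeDecodeError:
            pass
    return False

It works for the encodings I listed, but it reads and decodes each file up to three times, which hardly seems concise or robust.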
What options do I have to concisely and robustly search any text file with any encoding?
I've read about the chardet module, which seems good but I really need to avoid installing modules. Anyway, there must be a simpler way to interact with the ancient and venerable text file. Surely as a newb I am making this too complicated, right?
Python 2.7.2, Windows 7 64-bit. Probably not necessary, but here is a sample log file.
EDIT: As far as I know the files will almost surely be in one of the encodings in the code comments: ANSI, UTF-8, UTF_16_LE (as UCS-2 LE w/o BOM; UCS-2 Little Endian). There is always the potential for someone to find a way around my expectations...
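Given that short list, one idea I'm considering (again an untested sketch) is to skip decoding entirely: my marker string is pure ASCII, so its ANSI and UTF-8 byte sequences are identical, and only the UTF-16-LE form needs a second pattern. I could then search the raw bytes:

MARKER = u"--------Base Data Details:--------------------"
# Pure ASCII, so the ANSI and UTF-8 encodings are the same bytes;
# UTF-16-LE interleaves NUL bytes and needs its own pattern.
NEEDLES = [MARKER.encode('ascii'), MARKER.encode('utf_16_le')]

def raw_bytes_match(path):
    with open(path, 'rb') as f:
        data = f.read()
    return any(needle in data for needle in NEEDLES)

That sidesteps detection for the match itself, though I'd still need the real encoding if I ever have to read the rest of the file.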
EDIT: While using an external library is certainly the sound approach, I've taken a stab at writing some amateurish code to guess the encoding and solicited feedback in another question -> Pitfalls in my code for detecting text file encoding with Python?
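For reference, the rough shape of the heuristic I'm asking about there is something like this (a naive sketch, not the exact code from that question): sniff for a BOM, then fall back on the fact that mostly-ASCII UTF-16-LE text is full of NUL bytes.

import codecs

def guess_encoding(data):
    # Naive guess from the first bytes of the file; assumes the logs
    # are mostly ASCII text.
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'utf_16_le'
    if data.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'   # UTF-8 with BOM
    if '\x00' in data[:200]:
        return 'utf_16_le'   # BOM-less UCS-2/UTF-16-LE: NUL every other byte
    return 'utf_8'           # also decodes the pure-ASCII "ANSI" files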