Searching text files' contents with various encodings

Posted 2019-05-25 01:42

I am having trouble with variable text encoding when opening text files to find a match in the files' contents.

I am writing a script to scan the file system for log files with specific contents in order to copy them to an archive. The names are often changed, so the contents are the only way to identify them. I need to identify *.txt files and find within their contents a string that is unique to these particular log files.

I have the code below that mostly works. The problem is the logs may have their encoding changed if they are opened and edited. In this case, Python won't match the search term to the contents because the contents are garbled when Python uses the wrong encoding to open the file.

import os
import codecs

#Filepaths to search
FILEPATH = "SomeDrive:\\SomeDirs\\"

#Text to match in file names
MATCH_CONDITION = ".txt"

#Text to match in file contents
MATCH_CONTENT = "--------Base Data Details:--------------------"

for root, dirs, files in os.walk(FILEPATH):
    for f in files:
        if MATCH_CONDITION in f:
            print "Searching: "  + os.path.join(root,f)

            #ATTEMPT A -
            #matches only text file re-encoded as ANSI,
            #UTF-8, UTF-8 no BOM

            #search_file = open(os.path.join(root,f), 'r')

            #ATTEMPT B -
            #matches text files output from Trimble software
            #"UCS-2 LE w/o BOM", also "UCS-2 Little Endian" -
            #(same file resaved using Windows Notepad),

            search_file = codecs.open(os.path.join(root,f), 'r', 'utf_16_le')


            file_data = search_file.read()

            if MATCH_CONTENT in file_data:
                print "CONTENTS MATCHED: " + f

            search_file.close()

I can open the files in Notepad++, which detects the encoding. Python's regular open() call does not detect the encoding automatically. I can use codecs.open() and specify an encoding to handle any one encoding, but then I have to write excess code to catch the rest. I've read the Python codecs module documentation and it seems to be devoid of any automatic detection.
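
To illustrate what I mean by excess code, the brute-force version I can think of just tries each candidate codec in turn and accepts whichever one reveals the marker string. It would look something like the sketch below; the list of candidate encodings is only my guess at what might show up:

import codecs

#Candidate encodings - only my guesses at what might turn up
CANDIDATE_ENCODINGS = ['utf_16_le', 'utf_8', 'cp1252']

def contents_match(path, needle, encodings=CANDIDATE_ENCODINGS):
    """Return True if needle appears in the file under any candidate codec."""
    for encoding in encodings:
        try:
            with codecs.open(path, 'r', encoding) as search_file:
                if needle in search_file.read():
                    return True
        except UnicodeError:
            #Wrong codec for this file - try the next candidate
            continue
    return False

This works even when a wrong codec decodes the file without an error, because the garbled result simply never contains the marker, but it can read the same file several times over, which is the kind of bloat I'd like to avoid.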

What options do I have to concisely and robustly search any text file with any encoding?

I've read about the chardet module, which seems good but I really need to avoid installing modules. Anyway, there must be a simpler way to interact with the ancient and venerable text file. Surely as a newb I am making this too complicated, right?

Python 2.7.2, Windows 7 64-bit. Probably not necessary, but here is a sample log file.

EDIT: As far as I know the files will almost surely be in one of the encodings in the code comments: ANSI, UTF-8, UTF_16_LE (as UCS-2 LE w/o BOM; UCS-2 Little Endian). There is always the potential for someone to find a way around my expectations...

EDIT: While using an external library is certainly the sound approach, I've taken a stab at writing some amateurish code to guess the encoding and solicited feedback in another question -> Pitfalls in my code for detecting text file encoding with Python?

1 Answer
一纸荒年 Trace
Answered 2019-05-25 02:42

The chardet package exists for a reason (and was ported from some older Netscape code, for a similar reason): detecting the encoding of an arbitrary text file is tricky.

There are two basic alternatives:

  1. Use some hard-coded rules to determine whether a file has a certain encoding. For example, you could look for the UTF byte-order marker at the beginning of the file. This breaks for encodings that overlap significantly in their use of different bytes, or for files that don't happen to use the "marker" bytes that your detection rules use.

  2. Take a database of files in known encodings and count up the distributions of different bytes (and byte pairs, triplets, etc.) in each of the encodings. Then, when you have a file of unknown encoding, take a sample of its bytes and see which pattern of byte usage is the best match. This breaks when you have short test files (which makes the frequency estimates inaccurate), or when the usage of the bytes in your test file doesn't match the usage in the file database you used to build up your frequency data. (A toy sketch of this approach follows the list.)
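
To give a feel for what option 2 looks like in miniature, here is a deliberately toy sketch (nothing like chardet's real models). It assumes you have already read a few sample files of known encodings into byte strings, and every name and the scoring rule here are illustrative only:

from collections import Counter

def byte_profile(data):
    """Relative frequency of each byte value in a byte string."""
    counts = Counter(data)
    total = float(len(data)) if data else 1.0
    return dict((byte, n / total) for byte, n in counts.items())

def mismatch(profile_a, profile_b):
    """Lower is better: total absolute frequency difference over all byte values."""
    keys = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) for k in keys)

def best_guess(unknown_bytes, known_profiles):
    """Pick the known encoding whose byte profile is closest to the sample."""
    unknown = byte_profile(unknown_bytes)
    return min(known_profiles, key=lambda enc: mismatch(unknown, known_profiles[enc]))

# known_profiles would be built once from trusted samples, e.g.
#     {'utf_16_le': byte_profile(open('known_utf16.txt', 'rb').read()), ...}

It breaks down exactly as described above when the sample is short or unrepresentative; chardet's models are far richer than single-byte counts, which is why I'd still reach for it when installing a package is an option.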

The reason Notepad++ can detect encodings (as can web browsers, word processors, etc.) is that these programs all have one or both of these methods built in. Python doesn't build this into its interpreter -- it's a general-purpose programming language, not a text editor -- but that is exactly what the chardet package does.

I would say that because you know some things about the text files you're handling, you might be able to take a few shortcuts. For example, are your log files all in either encoding A or encoding B? If so, then your decision is much simpler, and either the frequency-based or the rule-based approach above would be pretty straightforward to implement on your own (a rule-based sketch tailored to your three encodings follows). But if you need to detect arbitrary character sets, I'd highly recommend building on the shoulders of giants.
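
Since your edit narrows the field to ANSI, UTF-8 and UTF-16 LE (with or without a BOM), the rule-based route can stay very small. The sketch below is only a starting point under those assumptions: I'm treating "ANSI" as Windows cp1252, and the NUL-byte heuristic only holds for mostly-ASCII log text.

import codecs

def guess_encoding(raw):
    """Guess among cp1252, UTF-8 and UTF-16 LE for a raw byte string."""
    # Rule 1: an explicit byte-order mark settles it immediately.
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'  # the utf-16 codec consumes the BOM itself
    # Rule 2: BOM-less UTF-16 LE of mostly-ASCII text puts a NUL in nearly
    # every odd byte position, which cp1252 and UTF-8 text never do.
    if len(raw) >= 2 and raw[1::2].count('\x00') > len(raw) // 4:
        return 'utf-16-le'
    # Rule 3: strict UTF-8 decoding fails loudly on typical cp1252 bytes,
    # so try it and fall back to the Windows "ANSI" code page.
    try:
        raw.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'cp1252'

# Hypothetical use inside the question's os.walk() loop:
#     raw = open(os.path.join(root, f), 'rb').read()
#     if MATCH_CONTENT in raw.decode(guess_encoding(raw)):
#         print "CONTENTS MATCHED: " + f

Anything outside those assumptions (a UTF-16 BE file without a BOM, say, or a genuinely different code page) will be misidentified, which is the general hazard of hand-rolled rules.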
