How to read Chinese files?

Published 2019-06-09 06:43

Question:

I'm stuck with all this confusing encoding stuff. I have a file containing Chinese subtitles. I actually believe it is UTF-8, because opening it as UTF-8 in Notepad++ gives me a very good result. If I set the encoding to gb2312, the Chinese part is still fine, but I see some UTF-8 sequences that are not converted.

The goal is to loop through the text in the file and count how many times each different character comes up.

import os
import re
import codecs

character_dict = {}
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if "srt" in filename:
            f = codecs.open(filename, 'r', 'gb2312', errors='ignore')
            s = f.read()

            # deleting {}
            s = re.sub('{[^}]+}', '', s)
            # deleting every line that does not start with a chinese char
            s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
            # delete non chinese chars
            s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
            #print s
            s = s.encode('gb2312')
            print s
            for c in s:
                #print c
                pass

This actually gives me the complete Chinese text. But when I print inside the loop at the bottom, I just get question marks instead of the individual characters.

Also note I said it is UTF-8, but I have to use gb2312 for decoding and as the setting in my gnome-terminal. If I set it to UTF-8 in the code I just get garbage, no matter whether I set my terminal to UTF-8 or gb2312. So maybe this file is not UTF-8 after all!?

In any case, s contains the full Chinese text. Why can't I loop over it?

Please help me understand this. It is very confusing for me and the docs are getting me nowhere. Google just leads me to similar problems that somebody solved, but so far no explanation has helped me understand this.

Answer 1:

gb2312 is a multi-byte encoding. If you iterate over a bytestring encoded with it, you will be iterating over the bytes, not over the characters you want to count (or print). You probably want to iterate over the unicode string before encoding it. If necessary, you can encode the individual codepoints (characters) to their own bytestrings for output:

# don't do s = s.encode('gb2312')
for c in s:      # iterate over the unicode codepoints
    print c.encode('gb2312')  # encode them individually for output, if necessary
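For the original goal of counting how often each character occurs, iterating over the decoded Unicode string is all you need. A minimal Python 3 sketch (in Python 3, str is Unicode by default, so no explicit decode step is shown; the sample string is a placeholder, not text from the question's file):

```python
from collections import Counter

# Placeholder subtitle text; in practice this comes from the decoded file
text = u"谢谢你谢谢"

# Counter iterates over codepoints, so each Chinese character is one key
counts = Counter(text)
for char, n in counts.most_common():
    print(char, n)
```

Counter does the same bookkeeping as a hand-rolled dict of counts, but with sorting by frequency built in via most_common().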


Answer 2:

You are printing individual bytes. GB2312 is a multi-byte encoding; each Chinese character takes 2 bytes. Printing those bytes individually won't produce valid output.
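To make that concrete, here is a small Python 3 sketch (using Python 3 because its str type is Unicode, making the text/bytes split explicit) showing that one Chinese character becomes two bytes under GB2312:

```python
# One Unicode codepoint...
char = "中"
# ...becomes two bytes when encoded with GB2312
encoded = char.encode('gb2312')

print(len(char))     # number of codepoints: 1
print(len(encoded))  # number of bytes: 2
```

Iterating over the encoded bytestring therefore visits each half of a character separately, which is exactly what the question's loop was doing.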

The solution is to not encode from Unicode to bytes when printing. Loop over the Unicode string instead:

# deleting {}
s = re.sub('{[^}]+}', '', s)
# deleting every line that does not start with a chinese char
s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
# delete non chinese chars
s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
#print s

# No `s.encode()`!
for char in s:
    print char

You could also encode each character individually:

for char in s:
    print char.encode('gb2312')

But if you have your console / IDE / terminal configured correctly, you should be able to print directly without errors, especially since your print s.encode('gb2312') call produces correct output.

You also appear to be confusing UTF-8 (an encoding) with the Unicode standard. UTF-8 can be used to represent Unicode text as bytes. GB2312 is an encoding too, and can be used to represent a (subset of) Unicode text as bytes.
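To illustrate the distinction: the same Unicode text can be serialized with either encoding, producing different byte sequences that both decode back to the same string. A Python 3 sketch:

```python
text = "中文"                        # Unicode text (str in Python 3)

utf8_bytes = text.encode('utf-8')    # 3 bytes per CJK character here
gb_bytes = text.encode('gb2312')     # 2 bytes per character

# Different lengths, different byte values...
print(len(utf8_bytes), len(gb_bytes))  # 6 4

# ...but both round-trip to the same Unicode string
print(utf8_bytes.decode('utf-8') == gb_bytes.decode('gb2312'))  # True
```

This is why "the file is UTF-8" and "I must decode with gb2312" cannot both be true for the same bytes: each encoding is a distinct mapping between Unicode text and bytes.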

You may want to read up on Python and Unicode:

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky