I'm stuck with all this confusing encoding stuff. I have a file containing Chinese subs. I actually believe it is UTF-8 because using this in Notepad++ gives me a very good result. If I set gb2312 the Chinese part is still fine, but I will see some UTF8 code not being converted.
The goal is to loop through the text in the file and count how many times the different chars come up.
import os
import re
import io
character_dict = {}
for dirname, dirnames, filenames in os.walk('.'):
for filename in filenames:
if "srt" in filename:
import codecs
f = codecs.open(filename, 'r', 'gb2312', errors='ignore')
s = f.read()
# deleting {}
s = re.sub('{[^}]+}', '', s)
# deleting every line that does not start with a chinese char
s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
# delete non chinese chars
s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
#print s
s = s.encode('gb2312')
print s
for c in s:
#print c
pass
This will actually give me the complete Chinese text. But when I print out the loop on the bottom I just get questionmarks instead of the single chars.
Also note I said it is UTF8, but I have to use gb2312 for encoding and as the setting in my gnome-terminal. If I set it to UTF8 in the code i just get trash no matter if I set my terminal to UTF8 or gb2312. So maybe this file is not UTF8 after all!?
In any case s contains the full Chinese text. Why can't I loop it?
Please help me to understand this. It is very confusing for me and the docs are getting me nowhere. And google just leads me to similar problems that somebody solves, but there is no explanation so far that helped me understand this.