I used a scraper to get comments from Facebook. Unfortunately, it converted the German umlauts "Ä", "Ü", and "Ö" to UTF-8 byte escapes such as "\xc3\xb6". I have tried several approaches to convert the files back, but none of them were successful.
import csv, glob

for file in glob.glob("Comments/*.csv"):
    rawfile = csv.reader(open(file, "rU", encoding="ISO-8859-1"))
    new_tablename = file + "converted"
    new_table = csv.writer(open("%s.csv" % new_tablename, "w"))
    for row in rawfile:
        for w in row:
            a = str(w)
            b = a.encode('latin-1').decode('utf-8')
            print(b)
        new_table.writerow(row)
Another approach was to create a dictionary mapping the escape literals to the German characters, but that did not work either.
import csv, glob, re

print("Start")
converter_table = csv.reader(open("LiteralConvert.csv", "rU"))
converterdic = {}
for line in converter_table:
    charToFind = line[2]
    charForReplace = line[1]
    print(charToFind + " will be replaced by " + charForReplace)
    converterdic[charToFind] = charForReplace
print(converterdic)

for file in glob.glob("Comments/*.csv"):
    rawfile = csv.reader(open(file, "rU", encoding="ISO-8859-1"))
    print("opening: " + file)
    new_tablename = file + "converted"
    new_table = csv.writer(open("%s.csv" % new_tablename, "w"))
    print("created clean file: " + new_tablename)
    for row in rawfile:
        for w in row:
            # print(w)
            try:
                w.translate(converterdic)
            except KeyError:
                continue
        new_table.writerow(row)
However, the first solution works fine if I just do:
s="N\xc3\xb6 kein Schnee von gestern doch der beweis daf\xc3\xbcr das L\xc3\xbcgenpresse existiert."
b = s.encode('latin-1').decode('utf-8')
print(b)
But not when I read the string in from a file.
I have been through all the comments and the other answer trying to understand WHERE the problem is and WHAT its core is. Here is my conclusion after much thought about it:
A frequent core of problems with encoding/decoding strings is confusing what you *see* with what you actually *have*. In this context it is VERY IMPORTANT to understand that:

If you have a string or text in Python (or in a file), you are never, ever able to see it 'as it is' — some encoding/decoding scheme is always applied first.

In other words, you ALWAYS look at the data through the filter of a given encoding, and if the encoding changes, what you see changes without any change in what you are looking at.
Let's say the same once again, in yet other words:

To look at a string or at text in a file, you MUST use some tool for its VISUALIZATION ... AND ... that tool USES some information about the ENCODING (implicitly, by assuming a default, or explicitly, by asking you to specify one), so without encoding/decoding there is no visualization. Understanding this has a huge impact on how you interpret what you see. It is like 3D glasses in a cinema: wearing them does not change what is on the screen, but it changes how you see it.

So if you have a UTF-8 encoded string with non-ASCII characters and look at it with a tool that decodes UTF-8, you see the German umlauts as they are. But if you look at the same string with a tool that visualizes raw bytes, it can show you neither the non-ASCII characters (it works byte by byte and has no knowledge of the encoding used) nor the UTF-8 interpretation (each umlaut spans two bytes, but the tool shows one byte at a time). Instead, it shows the escape form "\xc3\xb6" — BUT in the string/file there are NOT eight characters there; there are only TWO bytes, 0xC3 and 0xB6. This is how it comes that, e.g., printing a bytes object makes Python show those bytes as "\xc3\xb6".
I hope you now get the idea of what I am talking about (it is a kind of enlightenment experience that comes after long hours/days/months of confusion), don't you?
Here an excerpt from the UTF-8 table you can find the letter 'ö' in:
"""U+00F6 ö c3 b6 ö ö LATIN SMALL LETTER O WITH DIAERESIS"""
You are essentially doing b'\xc3\xb6'.decode('ISO-8859-1').encode('latin-1').decode('utf8')
when you do
rawfile = csv.reader(open(file,"rU", encoding = "ISO-8859-1"))
...
a = str(w)
b = a.encode('latin-1').decode('utf-8')
Skip the unnecessary .decode() and .encode() by opening the files with open(file, "r", encoding="utf8") instead.
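Putting that together, a corrected version of the loop from the question might look like the sketch below. The `Comments/*.csv` pattern and the output naming are taken from the question; treating both input and output as UTF-8 is an assumption about what the scraper wrote.

```python
import csv
import glob

def convert(pattern="Comments/*.csv"):
    """Copy each CSV matching pattern, reading and writing UTF-8 directly."""
    for file in glob.glob(pattern):
        # The scraper wrote UTF-8 bytes, so read them back as UTF-8;
        # no latin-1/utf-8 round trip is needed.
        with open(file, "r", encoding="utf8", newline="") as src, \
             open(file + "converted.csv", "w", encoding="utf8", newline="") as dst:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                writer.writerow(row)
```

Calling `convert()` with no arguments processes everything under `Comments/`, as in the question; `newline=""` is the mode the csv module's documentation recommends for file objects it reads or writes.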