I wrote this script, which removes all trailing whitespace characters and replaces all bad French characters with the right ones.
Removing the trailing whitespace works, but replacing the French characters does not.
The files to read and write are encoded in UTF-8, so I added the utf-8 declaration at the top of my script, but in the end every bad character (like \u00e9) is replaced by a little square.
Any idea why?
Script:
# --*-- encoding: utf-8 --*--
import fileinput
import sys
CRLF = "\r\n"
ACCENT_AIGU = "\\u00e9"
ACCENT_GRAVE = "\\u00e8"
C_CEDILLE = "\\u00e7"
A_ACCENTUE = "\\u00e0"
E_CIRCONFLEXE = "\\u00ea"
CURRENT_ENCODING = "utf-8"
#Getting filepath
print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
path = str(raw_input())
path.replace("\\", "/")
#removing trailing whitespace characters
for line in fileinput.FileInput(path, inplace=1):
    if line != CRLF:
        line = line.rstrip()
        print line
        print >>sys.stderr, line
    else:
        print CRLF
        print >>sys.stderr, CRLF
fileinput.close()
#Replacing bad characters
for line in fileinput.FileInput(path, inplace=1):
    line = line.decode(CURRENT_ENCODING)
    line = line.replace(ACCENT_AIGU, "é")
    line = line.replace(ACCENT_GRAVE, "è")
    line = line.replace(A_ACCENTUE, "à")
    line = line.replace(E_CIRCONFLEXE, "ê")
    line = line.replace(C_CEDILLE, "ç")
    line.encode(CURRENT_ENCODING)
    sys.stdout.write(line) #avoid CRLF added by print
    print >>sys.stderr, line
fileinput.close()
EDIT
The input file contains this kind of text:
* Cette m\u00e9thode permet d'appeller le service du module de tourn\u00e9e
* <code>rechercherTechnicien</code> et retourne la liste repr\u00e9sentant le num\u00e9ro
* de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien et la dur\u00e9e
* th\u00e9orique por se rendre au point d'intervention.
*
EDIT2
Final code, in case anyone is interested: the first part replaces the badly encoded characters, and the second part removes all trailing whitespace characters.
# --*-- encoding: iso-8859-1 --*--
import fileinput
import re
CRLF = "\r\n"
print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
path = str(raw_input())
path = path.replace("\\", "/")
def unicodize(seg):
    if re.match(r'\\u[0-9a-f]{4}', seg):
        return seg.decode('unicode-escape')
    return seg.decode('utf-8')
print "Replacing badly encoded characters"
with open(path,"r") as f:
    content = f.read()
replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))
with open(path, "w") as o:
    o.write(''.join(replaced).encode("utf-8"))
print "Removing trailing whitespace characters"
for line in fileinput.FileInput(path, inplace=1):
    if line != CRLF:
        line = line.rstrip()
        print line
    else:
        print CRLF
fileinput.close()
print "Done!"
Not so quick, and mostly dirty, but...
with open("enc.txt","r") as f:
    content = f.read()
import re
def unicodize(seg):
    if re.match(r'\\u[0-9a-f]{4}', seg):
        return seg.decode('unicode-escape')
    return seg.decode('utf-8')
replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))
print(''.join(replaced))
Given this input file (mixing Unicode escape sequences and properly encoded UTF-8 text):
* Cette m\u00e9thode permet d'appeller le service du module de
* tourn\u00e9e
* <code>rechercherTechnicien</code> et retourne la liste
* repr\u00e9sentant le num\u00e9ro
* de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien
* et la dur\u00e9e
* th\u00e9orique por se rendre au point d'intervention.
*
* S'il le désire le technicien peut dormir à l'hôtel
It produces this result:
* Cette méthode permet d'appeller le service du module de
* tournée
* <code>rechercherTechnicien</code> et retourne la liste
* représentant le numéro
* de la tournée ainsi que le nom et le prénom du technicien
* et la durée
* théorique por se rendre au point d'intervention.
*
* S'il le désire le technicien peut dormir à l'hôtel
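The split-and-decode idea above can also be sketched for Python 3, where text read from a file is already Unicode and `str` has no `.decode()` method. This is only a sketch under that assumption: the name `unescape_segments` is hypothetical, and `chr(int(..., 16))` is swapped in for `decode('unicode-escape')`.

```python
import re

# Matches a literal backslash-u escape such as \u00e9 (lowercase hex, as in the code above).
ESCAPE = re.compile(r'(\\u[0-9a-f]{4})')

def unescape_segments(text):
    """Decode literal backslash-u escapes in already-decoded text; leave the rest untouched."""
    parts = []
    # Because the pattern has a capturing group, split() keeps the escapes in the result.
    for seg in ESCAPE.split(text):
        if ESCAPE.fullmatch(seg):
            # The four hex digits after the backslash-u give the code point.
            parts.append(chr(int(seg[2:], 16)))
        else:
            parts.append(seg)
    return ''.join(parts)

sample = "la dur\\u00e9e th\\u00e9orique, à l'hôtel"
print(unescape_segments(sample))
```

Text that is already properly encoded passes through unchanged, so the mixed input shown above should be handled in a single pass.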
You are looking for s.decode('unicode_escape'):
>>> s = r"""
... * Cette m\u00e9thode permet d'appeller le service du module de tourn\u00e9e
... * <code>rechercherTechnicien</code> et retourne la liste repr\u00e9sentant le num\u00e9ro
... * de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien et la dur\u00e9e
... * th\u00e9orique por se rendre au point d'intervention.
... *
... """
>>> print(s.decode('unicode_escape'))
* Cette méthode permet d'appeller le service du module de tournée
* <code>rechercherTechnicien</code> et retourne la liste représentant le numéro
* de la tournée ainsi que le nom et le prénom du technicien et la durée
* théorique por se rendre au point d'intervention.
*
And don't forget to encode your string before writing it to a file (e.g. as UTF-8):
writable_s = s.decode('unicode_escape').encode('utf-8')
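As a rough illustration of the decode-then-encode round trip, here is a sketch in Python 3 syntax (an assumption; the session above is Python 2). The output path is a hypothetical temporary file. Note that the `unicode_escape` codec treats non-escaped bytes as Latin-1, so this form only suits input that is plain ASCII plus escapes.

```python
import codecs
import io
import os
import tempfile

raw = b"tourn\\u00e9e"                       # bytes containing a literal backslash-u escape
text = codecs.decode(raw, "unicode_escape")  # Unicode text with a real e-acute

# Hypothetical output path, created in a temporary directory for the demo.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(out_path, "w", encoding="utf-8") as out:
    out.write(text)                          # encoded to UTF-8 as it is written

# Read the raw bytes back to see what actually landed on disk.
with io.open(out_path, "rb") as check:
    data = check.read()
print(repr(data))
```

Opening the file with an explicit `encoding` lets the file object do the encoding, which avoids calling `.encode()` by hand before each write.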
To read a file encoded in UTF-8 that contains non-ASCII characters and that also literally contains \, u, 0, 0, e, 9 character sequences that you want to decode:
import codecs
import re
repl = lambda m: m.group().encode('ascii', 'strict').decode('unicode-escape')
with codecs.open(filename, encoding='utf-8') as file:
    text = re.sub(r'\\u[0-9a-f]{4}', repl, file.read())
Note: normally, non-ASCII characters and Unicode escapes (\uxxxx) should not be mixed in a single file. Use one or the other, but not both at once.
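The `re.sub` approach with a replacement function might look like this in Python 3 syntax. It is a sketch, not the answer's exact code: the sample file and the helper name `decode_escape` are hypothetical, `chr(int(..., 16))` replaces the encode/decode round trip, and the pattern is widened to accept uppercase hex digits as an assumption.

```python
import io
import os
import re
import tempfile

def decode_escape(match):
    # match.group() is a six-character escape; its last four hex digits are the code point.
    return chr(int(match.group()[2:], 16))

# Hypothetical sample file: UTF-8 text that also contains a literal backslash-u escape.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"le pr\\u00e9nom, à l'hôtel\n")

# Decode the file as UTF-8 first, then replace only the literal escapes.
with io.open(path, encoding="utf-8") as f:
    text = re.sub(r'\\u[0-9a-fA-F]{4}', decode_escape, f.read())
print(text)
```

Decoding the file as UTF-8 first and substituting afterwards keeps the real accented characters intact while still converting the escaped ones.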
The files to read and write are encoded in UTF-8, so I added the utf-8 declaration at the top of my script
The utf-8 declaration in your Python source affects only the character encoding of the source file itself, e.g., it allows you to use non-ASCII characters in bytestring and Unicode literals. It has no effect on the character encoding of the files that you read.
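To make that distinction concrete, here is a small sketch (Python 3 syntax, with a hypothetical temporary file) in which the same bytes are read with two different encodings. What decides the result is the `encoding` argument passed when opening the file, not anything declared at the top of the script.

```python
import io
import os
import tempfile

# Hypothetical file holding the two UTF-8 bytes for e-acute (0xC3 0xA9).
path = os.path.join(tempfile.mkdtemp(), "accent.txt")
with io.open(path, "wb") as f:
    f.write(b"\xc3\xa9")

# The encoding given here decides how the bytes on disk become text.
with io.open(path, encoding="utf-8") as f:
    as_utf8 = f.read()      # one character
with io.open(path, encoding="latin-1") as f:
    as_latin1 = f.read()    # two characters, the classic mojibake

print(len(as_utf8), len(as_latin1))
```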
but in the end every bad character (like \u00e9) is replaced by a little square.
The "little square" might be an artifact of printing to the console. Try this in a console to see whether squares are present:
>>> s = "\u00e9" # 6 bytes in a bytestring
>>> len(s)
6
>>> u = u"\u00e9" # unicode escape in a Unicode string
>>> len(u)
1
>>> print s
\u00e9
>>> print u
é
>>> b = "é" # non-ascii char in a bytestring
>>> len(b) # note: it is 2 bytes
2
>>> ub = u"é" # non-ascii char in a Unicode string
>>> len(ub)
1
>>> print b
é
>>> print ub
é
>>> se = u.encode('ascii', 'backslashreplace') # non-ascii chars are escaped
>>> len(se)
4
>>> (s.decode('unicode-escape') == u == b.decode('utf-8') == ub ==
...  se.decode('unicode-escape') == unichr(0xe9))
True
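For readers on Python 3, roughly the same comparison can be sketched as follows. This assumes Python 3 semantics, where `str` is always Unicode and `bytes` is a separate type, so the explicit `encode` and `decode` calls stand in for the Python 2 bytestring literals used above.

```python
s = "\\u00e9"                  # six characters: backslash, u, 0, 0, e, 9
u = "\u00e9"                   # one character: e-acute
b = u.encode("utf-8")          # two bytes: 0xC3 0xA9
se = u.encode("ascii", "backslashreplace")  # four bytes: backslash, x, e, 9

print(len(s), len(u), len(b), len(se))

# Every route back to text should yield the same single character.
same = (s.encode("ascii").decode("unicode_escape") == u
        == b.decode("utf-8") == se.decode("unicode_escape") == chr(0xe9))
print(same)
```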