How to convert UTF-8 fancy quotes to neutral quotes

Posted 2019-03-25 18:17

I'm writing a little Python script that parses Word docs and writes to a CSV file. However, some of the docs contain non-ASCII characters that my script can't process correctly.

Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with the neutral ascii-supported quotes, so I can just write line.encode('ascii') to the csv file?

I have tried to find the left quote and replace it:

val = line.find(u'\u201c')
if val >= 0: line[val] = '"'

But to no avail:

TypeError: 'unicode' object does not support item assignment

Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?

Thank you

Tags: python python-2.7 unicode encoding utf-8

2 Answers

Luminary・发光体

#2 · 2019-03-25 18:56

You can't assign to a position in a string: Python strings (including unicode objects) are immutable, so you have to build a new string instead.

You can, however, just use the re module, which might be the most flexible way to do this:

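import re
# The character class matches both the left (U+201C) and right (U+201D) curly double quote
newline = re.sub(u'[\u201c\u201d]', u'"', line)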
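For a small, fixed set of characters, plain string replacement also works without a regular expression. The following is only a minimal sketch: the function name `straighten_double_quotes` and the sample line are made up for illustration.

```python
def straighten_double_quotes(text):
    """Return a copy of text with curly double quotes replaced by '"'."""
    # replace() returns a new string each time; the original is never modified
    return text.replace(u'\u201c', u'"').replace(u'\u201d', u'"')


line = u'\u201cHello,\u201d she said.'
newline = straighten_double_quotes(line)

# With the curly quotes gone, this sample line is pure ASCII
ascii_bytes = newline.encode('ascii')
```

Any other non-ASCII character left in the line would still make `encode('ascii')` raise `UnicodeEncodeError`, so this only helps if quotes are the only problem.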


我命由我不由天

#3 · 2019-03-25 19:02

You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.

from unidecode import unidecode
line = unidecode(line)

This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
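If installing a package is not an option, a rough standard-library approximation is a translation table that covers only the punctuation you care about. This is a sketch under that assumption, not a substitute for Unidecode: the table name `ASCII_PUNCTUATION`, the helper `to_ascii_punctuation`, and the choice of replacements are all illustrative, and any character missing from the table passes through unchanged.

```python
# Map code points to ASCII replacements (illustrative, deliberately incomplete)
ASCII_PUNCTUATION = {
    0x2018: u"'",    # left single quotation mark
    0x2019: u"'",    # right single quotation mark / apostrophe
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
    0x2013: u'-',    # en dash
    0x2014: u'--',   # em dash
    0x2026: u'...',  # horizontal ellipsis
}


def to_ascii_punctuation(text):
    """Return a copy of text with the mapped punctuation replaced."""
    # translate() looks up each character's code point in the table
    return text.translate(ASCII_PUNCTUATION)


line = u'\u201cIt\u2019s done\u201d \u2014 she said\u2026'
cleaned = to_ascii_punctuation(line)

# Raises UnicodeEncodeError if an unmapped non-ASCII character remains
ascii_bytes = cleaned.encode('ascii')
```

Letting `encode('ascii')` fail loudly is a deliberate choice here: it points out characters the table does not yet cover instead of silently dropping them.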
