removing weird double quotes (from excel file) in

Posted 2020-05-06 23:26

I'm loading an Excel file into Python 3 using xlrd. The spreadsheet is basically lines of text, and some of these lines contain quotation marks. For example, one line can be:

She said, "My name is Jennifer."

When I read them into Python and turn them into strings, the double quotes come through as a weird double quote character that looks like a double quote in italics. I'm assuming that somewhere along the way, python read in the character as some foreign character rather than actual double quotes due to some encoding issue or something. So in the above example, if I assign that line to text, we'll have something like the following (not exactly, since I don't actually type out the line; imagine text was already assigned beforehand):

text = 'She said, “My name is Jennifer.”'
text[10] == '"'

The second line evaluates to False because the character is not recognized as a normal double quote. I'm working in the Mac terminal, if that makes a difference.

My questions are: 1. Is there a way to easily strip these weird double quotes? 2. Is there a way when I read in the file to get python to recognize them as double quotes properly?

1 answer
Root(大扎)
#2 · 2020-05-06 23:49

I'm assuming that somewhere along the way, python read in the character as some foreign character

Yes; it read that in because that's what the file data actually represents.

rather than actual double quotes due to some encoding issue or something.

There's no issue with the encoding. The actual character is not an "actual double quote".

Is there a way to easily strip these weird double quotes?

You can use the .replace method of strings as you would normally, to either replace them with an "actual double quote" or with nothing.
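A minimal sketch of that replacement, assuming the characters are the typographic quotes U+201C and U+201D (the sample string and the name QUOTE_TABLE are illustrative, not taken from the spreadsheet). It shows chained .replace calls and, as an alternative, a translation table that also covers the curly single quotes:

```python
# Sample line with typographic quotes, written as escapes so the code is unambiguous.
text = 'She said, \u201cMy name is Jennifer.\u201d'

# Option 1: turn each curly quote into a straight double quote.
straight = text.replace('\u201c', '"').replace('\u201d', '"')

# Option 2: remove the curly quotes entirely.
stripped = text.replace('\u201c', '').replace('\u201d', '')

# Option 3: one pass with a translation table (hypothetical name QUOTE_TABLE),
# which also maps curly single quotes to a straight apostrophe.
QUOTE_TABLE = str.maketrans({
    '\u201c': '"',
    '\u201d': '"',
    '\u2018': "'",
    '\u2019': "'",
})
normalized = text.translate(QUOTE_TABLE)

print(straight)
print(stripped)
print(normalized)
```

The translation table is handy when several characters need mapping at once, since the string is scanned only once.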

Is there a way when I read in the file to get python to recognize them as double quotes properly?

If you're looking for them, you can compare them to the character they actually are.

As noted in the comment, they are most likely U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK. They exist so that opening and closing quotes can look different (by curving in opposite directions), as polished typography normally does, whereas the straight " is simply more convenient for programmers. You represent them in Python with a Unicode escape, thus:

text[10] == '\u201c'
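As a small illustration of comparing against the real characters, the sketch below checks one position and then finds every curly double quote in the line. The sample string and the name CURLY_DOUBLE_QUOTES are assumptions made for the example:

```python
# Sample line with typographic quotes, written as escapes.
text = 'She said, \u201cMy name is Jennifer.\u201d'

# Hypothetical helper set holding both curly double quotes.
CURLY_DOUBLE_QUOTES = {'\u201c', '\u201d'}

# Compare a single character against the opening curly quote.
is_opening_quote = text[10] == '\u201c'

# Collect the index of every curly double quote in the string.
positions = [i for i, ch in enumerate(text) if ch in CURLY_DOUBLE_QUOTES]

print(is_opening_quote)
print(positions)
```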

You could also have directly asked Python for this info, by asking for text[10] at the Python command line (which would evaluate that and show you the representation), or explicitly in a script with e.g. print(repr(text[10])).
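A short sketch of that inspection, using the same assumed sample string. In Python 3, repr leaves printable non-ASCII characters as they are, so ascii() and the standard library's unicodedata.name tend to be more informative when the goal is to identify the code point:

```python
import unicodedata

# Sample line with typographic quotes, written as escapes.
text = 'She said, \u201cMy name is Jennifer.\u201d'
ch = text[10]

# repr() shows the character itself in Python 3, since it is printable.
print(repr(ch))

# ascii() escapes anything outside ASCII, revealing the code point.
escaped = ascii(ch)
print(escaped)

# unicodedata.name() gives the official Unicode character name.
opening_name = unicodedata.name(ch)
closing_name = unicodedata.name(text[-1])
print(opening_name)
print(closing_name)
```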
