Make list of unicode words that are in a file

My code is:

import codecs

f = codecs.open(r'C:\Users\Admin\Desktop\nepali.txt', 'r', 'UTF-8')
nepali = f.read().split()
for i in nepali:
    print i

This displays the words in the file:

यो
किताब
टेबुल
मा
छ
यो
एक
किताब
हो
केटा

But when I try to create a list of the words with this code:

import codecs

f = codecs.open(r'C:\Users\Admin\Desktop\nepali.txt', 'r', 'UTF-8')
nepali = list(f.read().split())
print nepali

The output is now displayed like this:

[u'\ufeff\u092f\u094b', u'\u0915\u093f\u0924\u093e\u092c', u'\u091f\u0947\u092c\u0941\u0932', u'\u092e\u093e', u'\u091b', u'\u092f\u094b', u'\u090f\u0915', u'\u0915\u093f\u0924\u093e\u092c', u'\u0939\u094b']

The output should look like:

[यो, किताब, टेबुल, मा, छ, यो, एक, किताब, हो]

Tags: python unicode utf-8

1 answer

成全新的幸福

Reply #2 · 2019-04-13 03:17

You are looking at the output of the repr() function, which containers always use to display their contents. The output is meant for debugging, not end-user display; any non-printable or non-ASCII codepoint is represented by an escape sequence (which can, depending on the codepoint, be a single-character escape like \t or \n, or use 2, 4, or 8 hex digits, like \xe5, \u2603 or \U0001f4e2).
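To make the escape forms concrete, here is a small sketch that runs the four example codepoints through the 'unicode_escape' codec. That codec is a stand-in chosen for illustration, not what repr() calls internally, but it produces the same style of escapes for these characters. The sample list is an assumption, and the snippet is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.

# One codepoint for each escape width mentioned above (assumed samples).
samples = [u'\t', u'\xe5', u'\u2603', u'\U0001f4e2']

# 'unicode_escape' turns each codepoint into an ASCII-only escape sequence,
# choosing the shortest form that can represent it.
escaped = [s.encode('unicode_escape') for s in samples]

for raw in escaped:
    print(raw)
```

The escape gets wider as the codepoint grows: a short named escape for the tab, two hex digits for \xe5, four for \u2603, and eight for \U0001f4e2.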

You'll have to produce the output manually:

print u'[{}]'.format(u', '.join(nepali))

This produces a unicode string formatted to look like a list object, but without using repr(), simply by adding square brackets around the strings, joined with ', ' (comma and space).

Demo:

>>> nepali = [u'\ufeff\u092f\u094b', u'\u0915\u093f\u0924\u093e\u092c', u'\u091f\u0947\u092c\u0941\u0932', u'\u092e\u093e', u'\u091b', u'\u092f\u094b', u'\u090f\u0915', u'\u0915\u093f\u0924\u093e\u092c', u'\u0939\u094b',]
>>> print u'[{}]'.format(u', '.join(nepali))
[यो, किताब, टेबुल, मा, छ, यो, एक, किताब, हो]
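Note that the first element in the question's list starts with \ufeff, a byte order mark that the 'UTF-8' codec leaves in the decoded text. A minimal sketch, assuming the file was saved with a BOM: opening it with the 'utf-8-sig' codec skips that mark. The helper names read_words and format_words and the temporary sample file are hypothetical, introduced only for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_words(path):
    """Return the whitespace-separated words of a UTF-8 file, without a BOM."""
    # 'utf-8-sig' decodes UTF-8 and skips a leading byte order mark if present.
    with io.open(path, 'r', encoding='utf-8-sig') as handle:
        return handle.read().split()


def format_words(words):
    """Build a list-like display string without going through repr()."""
    return u'[{}]'.format(u', '.join(words))


# Hypothetical stand-in for nepali.txt: two words, written with a BOM.
sample_path = os.path.join(tempfile.mkdtemp(), 'nepali_sample.txt')
with io.open(sample_path, 'w', encoding='utf-8-sig') as handle:
    handle.write(u'\u092f\u094b \u0915\u093f\u0924\u093e\u092c\n')

words = read_words(sample_path)
display = format_words(words)
```

Printing display then shows the words themselves, and the first word no longer carries the invisible \ufeff prefix.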

However, if you want to show this to an end-user, why use the square brackets at all?
