How to store Arabic text in a MySQL database using p

Posted 2019-05-26 20:08

I have an Arabic string, say

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

I want to write this Arabic text to a MySQL database. I tried using

txt = smart_str(txt)

or

txt = text.encode('utf-8') 

Neither of these worked, as they converted the string to

u'Arabic (\xd8\xa7\xd9\x84\xd8\xb7\xd9\x8a\xd8\xb1\xd8\xa7\xd9\x86)' 

Also, my database character set is already set to utf8:

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Because of these encoded bytes, my database displays the characters corresponding to the encoded text instead of the Arabic. Please help. I want my Arabic text to be retained.

Also, does a quick export of this Arabic text from the MySQL database write the same Arabic text to files, or will it convert it back to Unicode?

I used the following code to insert:

cur.execute("INSERT INTO tab1(id, username, text, created_at) VALUES (%s, %s, %s, %s)", (smart_str(id), smart_str(user_name), smart_str(text), date))

Before this, when I didn't use smart_str, it threw an error saying only 'latin-1' is allowed.

2 answers
爱情/是我丢掉的垃圾
#2 · 2019-05-26 20:50

To clarify a few things, because it will help you along in the future as well.

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

This is not an Arabic string. It is a unicode object containing Unicode code points. If you simply print it, and your terminal supports Arabic, you will get output like this:

>>> txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'
>>> print(txt)
Arabic (الطيران)

Now, to get the same output, Arabic (الطيران), in your database, you need to encode the string.

Encoding means taking these code points and converting them to bytes so that computers know what to do with them.

The most common encoding is utf-8, because it can represent every Unicode character: English, Arabic, and everything else. There are others too; for example, windows-1256 also supports Arabic. Some encodings have no byte representation for certain code points, and when you try to encode text that contains them, you'll get an error like this:

>>> print(txt.encode('latin-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-14: ordinal not in range(256)

What that is telling you is that some number in the unicode object does not exist in the table latin-1, so the program doesn't know how to convert it to bytes.
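As a rough illustration of the point above (assuming Python 3; the helper name can_encode is made up for this sketch), here is the same text encoded with two codecs that know Arabic, next to one that doesn't:

```python
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

# utf-8 uses two bytes for each Arabic letter:
# 16 characters become 23 bytes.
utf8_bytes = txt.encode('utf-8')

# windows-1256 (Python codec name 'cp1256') is a single-byte Arabic
# code page: 16 characters become 16 bytes.
cp1256_bytes = txt.encode('cp1256')

# Decoding with the same codec that encoded gives back the original text.
assert utf8_bytes.decode('utf-8') == txt
assert cp1256_bytes.decode('cp1256') == txt


def can_encode(text, codec):
    """Hypothetical helper: True if every code point in text exists in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False
```

With this, can_encode(txt, 'utf-8') is True while can_encode(txt, 'latin-1') is False, which is the same failure the traceback reports.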

Computers store bytes. So when storing or transmitting information you need to always encode/decode it correctly.

This encode/decode step is sometimes called the unicode sandwich - everything outside is bytes, everything inside is unicode.
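A minimal sketch of that sandwich, assuming Python 3 and using in-memory byte streams to stand in for a file, socket, or database connection (read_text and write_text are names invented for this example):

```python
import io


def read_text(stream, encoding='utf-8'):
    """Input edge: decode incoming bytes to unicode once."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding='utf-8'):
    """Output edge: encode unicode to bytes once, on the way out."""
    stream.write(text.encode(encoding))


# Bytes arrive from the outside world.
incoming = io.BytesIO(u'\u0627\u0644\u0637\u064a\u0631\u0627\u0646'.encode('utf-8'))

# Inside the program everything is unicode.
word = read_text(incoming)
label = u'Arabic (' + word + u')'

# Bytes leave the program.
outgoing = io.BytesIO()
write_text(outgoing, label)
```

The point of the design is that the codec name appears only at the two edges, so the code in the middle never has to care which encoding the bytes used.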


With that out of the way, you need to encode the data correctly before you send it to your database:

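import MySQLdb

q = u"""
    INSERT INTO
       tab1(id, username, text, created_at)
    VALUES (%s, %s, %s, %s)"""

# charset and SET NAMES make the connection itself use utf-8
conn = MySQLdb.connect(host="localhost",
                       user='root',
                       password='',
                       db='',
                       charset='utf8',
                       init_command='SET NAMES UTF8')
cur = conn.cursor()
# Encode each unicode value to utf-8 bytes before sending it
cur.execute(q, (id.encode('utf-8'),
                user_name.encode('utf-8'),
                text.encode('utf-8'), date))
conn.commit()  # MySQLdb does not autocommit by default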

To confirm that it is being inserted correctly, make sure you are using mysql from a terminal or application that supports Arabic. Otherwise, even if it's inserted correctly, you will see garbage characters when your program displays it.
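One common source of such garbage characters is a viewer that reads utf-8 bytes as latin-1. The sketch below (assuming Python 3) reproduces that effect without a database; the helper looks_like_utf8_read_as_latin1 is a hypothetical heuristic written for this example, not part of any library:

```python
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

# What a correctly configured utf-8 column would hold.
stored_bytes = txt.encode('utf-8')

# A viewer that assumes latin-1 turns every byte into its own character.
garbled = stored_bytes.decode('latin-1')


def looks_like_utf8_read_as_latin1(s):
    """Heuristic: True if s turns into different, valid text when its
    characters are mapped back to bytes and decoded as utf-8."""
    try:
        repaired = s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != s
```

If a value read back from the table passes this check, the stored bytes are probably fine and the display side is using the wrong encoding. Note that MySQL's own latin1 is closer to windows-1252 than to Python's 'latin-1', so treat this as an illustration only.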

爱情/是我丢掉的垃圾
#3 · 2019-05-26 20:54

Just execute SET NAMES utf8 before executing your INSERT:

cur.execute("set names utf8;")

cur.execute("INSERT INTO tab1(id, username, text, created_at) VALUES (%s, %s, %s, %s)", (smart_str(id), smart_str(user_name), smart_str(text), date))

Your question is very similar to this SO post, which you should read.
