I use the following code to scrape a table from a Chinese website. It works fine, but the contents I store in the list are not displayed properly.
import requests
from bs4 import BeautifulSoup
import pandas as pd

x = requests.get('http://www.sohu.com/a/79780904_126549')
bs = BeautifulSoup(x.text, 'lxml')
clg_list = []
for tr in bs.find_all('tr'):
    tds = tr.find_all('td')
    for i in range(len(tds)):
        clg_list.append(tds[i].text)
        print(tds[i].text)
When I print the text, it shows Chinese characters. But when I print the list, it shows escape sequences like '\u4e00\u671f\uff0834\u6240\uff09'. I am not sure whether I should change the encoding or whether something else is wrong.
There is nothing wrong in this case.
When you print a Python list, Python calls repr() on each of the list's elements. In Python 2, the repr() of a unicode string shows the Unicode code points of the characters that make up the string, which is why you see escapes such as \u4e00.
However, if you print the string itself, Python encodes the unicode string with a text encoding (for example, UTF-8) and your terminal displays the characters that those bytes represent.
Note that in Python 3, printing the list will show the Chinese characters as you expect, because of Python 3's better Unicode handling.
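To see the difference for yourself, here is a minimal Python 3 sketch. It does not scrape anything: the one-element list is an assumption, built from the escapes shown in the question. Because Python 3's repr() no longer escapes printable non-ASCII characters, the built-in ascii() is used here to approximate what Python 2 showed when printing the list.

```python
# Python 3 sketch. The sample element is taken from the escapes in the
# question; it stands in for the scraped cell text (an assumption).
clg_list = ['\u4e00\u671f\uff0834\u6240\uff09']

# ascii() escapes non-ASCII characters, roughly what Python 2's repr() did
escaped = ascii(clg_list)

# Python 3's repr() keeps printable non-ASCII characters as they are
readable = repr(clg_list)

# Joining the elements prints the strings themselves, not their repr
joined = ', '.join(clg_list)

print(escaped)   # escape sequences, as in the question
print(readable)  # Chinese characters inside the list brackets
print(joined)    # Chinese characters only
```

In Python 2 the same idea applies: the data in the list is intact, and printing the elements one by one (or joining them into one string first) shows the characters instead of the escapes.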