I am trying to read a dataset into a DataFrame called df1, but it does not work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a long traceback, but this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte.
Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’ (U+2019, right single quotation mark).
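As a minimal sketch of the decoding step, the snippet below uses a small in-memory sample rather than the real file. The sample bytes, column names and values are made up for illustration; only the 0x92 byte and the ";" separator are taken from the question.

```python
import io

import pandas as pd

# Hypothetical sample: plain ASCII apart from a single 0x92 byte,
# using ";" as the separator like the file in the question.
raw = b"Country;Life expectancy\nCote d\x92Ivoire;54.1\n"

# Decoding as UTF-8 fails, because 0x92 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In Windows codepage 1252, 0x92 maps to U+2019.
quote = b"\x92".decode("cp1252")

# Telling pandas the encoding lets it parse the same bytes cleanly.
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp1252")
print(df)
```

With the real file, the same encoding="cp1252" argument would be passed to pd.read_csv() alongside sep=";".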
I note, however, that Pandas seems to take the HTTP headers at face value too and produces mojibake when you load your data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is correctly decoded, but loading from the URL produces re-coded data.
This is a known bug in Pandas. You can work around it by using urllib.request to load the URL and passing the result to pd.read_csv() instead.
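A rough sketch of that workaround follows. The helper name load_csv_from_url is made up for this example, and reading the whole response into memory before parsing is my own simplification (the response object could probably be handed to pd.read_csv() directly); treat it as one way to do it, not the only one.

```python
import io
import urllib.request

import pandas as pd


def load_csv_from_url(url, sep=";", encoding="cp1252"):
    """Fetch the raw bytes ourselves, then let pandas decode them.

    Hypothetical helper: the name and default arguments are assumptions
    chosen to match the file in the question.
    """
    # urllib returns the body as undecoded bytes, so nothing is
    # re-coded before pandas sees the data.
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # Wrap the bytes in a file-like object and state the encoding explicitly.
    return pd.read_csv(io.BytesIO(data), sep=sep, encoding=encoding)
```

For the dataset in the question this would be called as df1 = load_csv_from_url(url), with url set to the raw.githubusercontent.com address used above.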