The key
field in an AWS S3 notification event, which denotes the filename, is URL escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have upload the following filename to S3:
my file řěąλλυ.txt
The notification is received as:
{
"Records": [
"s3": {
"object": {
"key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
}
}
]
}
I've tried to decode using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
but that yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
Just in case someone else comes here hoping for a JavaScript solution, here's what I ended up with:
With limited testing, it seems to work fine:
test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg
decodes totest image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt
decodes tomy file řěąλλυ.txt
tl;dr
You need to convert the URL encoded Unicode string to a bytes str before un-urlparsing it and decoding as UTF-8.
For example, for an S3 object with the filename:
my file řěąλλυ.txt
:Background
AWS have commited the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your
decode()
is:The value of
key
is a Unicode. In Python 2.x you could decode a Unicode, even though it doesn't make sense. In Python 2.x to decode a Unicode, Python first tries to encode it to a [byte] str first before decoding it using the given encoding. In Python 2.x the default encoding should be ASCII, which of course can't contain the characters used.Had you got the proper UnicodeEncodeError from Python, you may have found suitable answers. On Python 3, you wouldn't have been able to call
.decode()
at all.