Can someone confirm that Python 2.6 ftplib does NOT support Unicode file names? Or must Unicode file names be specially encoded in order to be used with the ftplib module?
The following email exchange seems to support my conclusion that the ftplib module only supports ASCII file names.
Should ftplib use UTF-8 instead of latin-1 encoding? http://mail.python.org/pipermail/python-dev/2009-January/085408.html
Any recommendations on a 3rd party Python FTP module that supports Unicode file names? I've googled this question without success [1], [2].
The official Python documentation does not mention Unicode file names [3].
Thank you, Malcolm
[1] ftputil wraps ftplib and inherits ftplib's apparent ASCII only support?
[2] Paramiko's SFTP library does support Unicode file names, however I'm looking specifically for ftp (vs. sftp) support relative to our current project.
[3] http://docs.python.org/library/ftplib.html
WORKAROUND:
The encodings.idna.ToASCII and .ToUnicode methods can be used to convert Unicode path names to an ASCII format. If you wrap all your remote path names and the output of the dir/nlst methods with these functions, then you can create a way to preserve Unicode path names using the standard ftplib (and also preserve Unicode file names on file systems that don't support Unicode paths). The downside to this technique is that other processes on the server will also have to use encodings.idna when referencing the files that you upload to the server. BTW: I understand that this is an abuse of the encodings.idna library.
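A minimal sketch of this workaround, assuming Python 3 and using the standard "idna" codec (which applies ToASCII/ToUnicode to each dot-separated label) instead of calling the two functions directly. The helper names to_wire_name and from_wire_name are hypothetical. Be aware that the conversion case-folds non-ASCII labels and rejects labels of 64 characters or more, so the round trip is not lossless for every possible name.

```python
# Sketch only: to_wire_name/from_wire_name are hypothetical helper names.

def to_wire_name(name):
    """Convert a Unicode file name to an ASCII-only name for ftplib."""
    return name.encode("idna").decode("ascii")


def from_wire_name(wire_name):
    """Convert a name produced by to_wire_name back to Unicode."""
    return wire_name.encode("ascii").decode("idna")


wire = to_wire_name("café.txt")      # ASCII-only form, safe to send
restored = from_wire_name(wire)      # what you would show to the user
plain = to_wire_name("plain.txt")    # pure ASCII names pass through as-is
```

You would call to_wire_name on every remote path before handing it to ftplib, and from_wire_name on every name returned by dir/nlst.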
Thank you Peter and Bob for your comments which I found very helpful.
We got UTF8 encoded filenames working for Python 2.7's ftplib.
Note 1: Here's some background that explains UTF8 and Unicode in simple terms: https://code.google.com/p/iqbox-ftp/wiki/ProgrammingGuide_UnicodeVsAscii
Note 2: You can take a look at the AGPL libraries we use for IQBox. You might be able to use those (or parts of those), and they support UTF8 over FTP. Look at filetransfer_abc.py
You do need to add code to (1) determine whether the server supports UTF8, (2) encode the unicode Python string as UTF8, and (3) decode the names when you get the file listings back (full code not shown since everyone gets file listings differently). For that last step you also need:
if UTF8_support:
    name = name.decode('utf-8')
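The three steps could be sketched roughly as follows. This is not the IQBox code: the function names are hypothetical, the snippet is written for Python 3, and it assumes the server advertises UTF8 in its FEAT reply (servers that do not implement FEAT answer with an error, which is treated here as "no UTF8 support").

```python
import ftplib


def feat_lists_utf8(feat_reply):
    """Step 1 (parsing): True if a FEAT reply lists the UTF8 feature."""
    for line in feat_reply.splitlines():
        words = line.strip().split()
        # Feature lines look like " UTF8"; compare the first word only.
        if words and words[0].upper() == "UTF8":
            return True
    return False


def detect_utf8(ftp):
    """Step 1 (query): ask a connected ftplib.FTP object for its features."""
    try:
        return feat_lists_utf8(ftp.sendcmd("FEAT"))
    except ftplib.error_perm:
        # The server rejected FEAT, so assume no UTF8 support.
        return False


def encode_name(name, utf8_support):
    """Step 2: turn a Unicode name into bytes for the server."""
    # Without UTF8 support only ASCII is safe; this raises otherwise.
    return name.encode("utf-8" if utf8_support else "ascii")


def decode_name(raw_name, utf8_support):
    """Step 3: decode a name taken from a file listing."""
    if utf8_support:
        return raw_name.decode("utf-8")
    return raw_name


sample_reply = "211-Features:\n MDTM\n UTF8\n SIZE\n211 End"
UTF8_support = feat_lists_utf8(sample_reply)
```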
ftplib has no knowledge of Unicode whatsoever. It is intended to be passed byte strings for filenames, and it'll return byte strings when asked for a directory list. Those are the exact strings of bytes passed to/returned from the server.
If you pass a Unicode string to ftplib in Python 2.x, it'll end up getting coerced to bytes when it's sent to the underlying socket object. This coercion uses Python's 'default' encoding, i.e. US-ASCII for safety, with exceptions generated for non-ASCII characters.
The python-dev message to which you linked is talking about ftplib in Python 3.x, where strings are Unicode by default. This leaves modules like ftplib in a tricky situation because although they now use Unicode strings at their front-end, the actual protocol behind it is byte-based. There therefore has to be an extra level of encoding/decoding involved, and without explicit intervention to specify what encoding is in use, there's a fair chance it'll choose wrong.
ftplib in 3.x chose to default to ISO-8859-1 in order to preserve each byte as a character inside the Unicode string. Unfortunately this will give unexpected results in the common case where the target server uses a UTF-8 encoding for filenames (whether or not the FTP daemon itself knows that filenames are UTF-8, which it commonly won't). There are a number of cases like this where the Python standard libraries have been brutally hacked to Unicode strings with negative consequences; Python 3's batteries-included are still leaking corrosive fluid IMO.
It doesn't.
It's debatable. UTF-8 is the preferred encoding as dictated by RFC 2640, but latin-1 is usually more forgiving of misbehaving implementations (either server or client). If the server includes "UTF8" in its FEAT response then you should definitely use UTF-8.
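The trade-off can be seen with plain Python 3 bytes and strings, no FTP connection needed. In this small sketch the file name is just an example: latin-1 can decode any byte sequence (so nothing ever fails, but UTF-8 names come out garbled), while UTF-8 decoding gives the right name from a UTF-8 server but raises on bytes from a latin-1 one.

```python
# Bytes a UTF-8 server would send for the name "café.txt".
raw = "café.txt".encode("utf-8")

# Decoding as latin-1 never fails, but the name is garbled.
as_latin1 = raw.decode("latin-1")

# Every byte was preserved, so the real name is still recoverable.
recovered = as_latin1.encode("latin-1").decode("utf-8")

# The same name as a latin-1 server would send it.
legacy = b"caf\xe9.txt"
try:
    legacy.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # Strict UTF-8 decoding rejects bytes that are not valid UTF-8.
    utf8_ok = False
```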
To support Unicode in Python 2.x you can adopt the following monkey-patched version of ftplib:
...and pass unicode strings when using the remaining API as in:
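One way such a subclass and its use might look is sketched below. This is an illustration under assumptions, not the poster's original code: the class name UnicodeFTP and the upload helper are hypothetical, the server is assumed to accept UTF-8 names, and the snippet is written so that it runs on Python 3 (where ftplib already defaults to UTF-8 from version 3.9, so the override mainly matters for older versions).

```python
import ftplib


class UnicodeFTP(ftplib.FTP):
    """Hypothetical ftplib.FTP subclass that sends commands as UTF-8."""

    # Python 3 only: also decode server replies as UTF-8
    # (already the default from Python 3.9 onwards).
    encoding = "utf-8"

    def putline(self, line):
        # putline writes one command line to the control socket.
        # Text is encoded explicitly instead of via the default encoding.
        if not isinstance(line, bytes):
            line = line.encode("utf-8")
        if b"\r" in line or b"\n" in line:
            raise ValueError("newline characters are not allowed in a command")
        self.sock.sendall(line + b"\r\n")


def upload(host, user, password, local_path, remote_name):
    """Usage sketch: remote_name may contain non-ASCII characters."""
    ftp = UnicodeFTP(host)
    ftp.login(user, password)
    with open(local_path, "rb") as fh:
        ftp.storbinary(u"STOR " + remote_name, fh)
    ftp.quit()
```

Names that come back from the server (for example from nlst) still need decoding on Python 2.x, since only the sending side is changed here.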
Personally I would be more worried about what is on the other side of the ftp connection than the support of the library. FTP is a brittle protocol at the best of times without trying to be creative with filenames.
from RFC 959:
To me that means that filenames should conform to the lowest common denominator. Nowadays the number of DOS servers, VAXes and IBM mainframes is negligible, and chances are you'll end up on a Windows or Unix box, so the common denominator is quite high. Even so, making assumptions about which code page the remote site will accept appears pretty risky to me.
To get around this, I used the following code
This assumes that the FTP server supports RFC 2640 http://www.ietf.org/rfc/rfc2640.txt which allows for utf-8 file names. In my case I used SwiFTP server for Android and it transfers the files with the proper names successfully.