Deal with EOFError while downloading files from se

Posted 2019-06-10 02:34

Question:

Use Case:

Download hundreds of thousands of XML files (ranging from a few bytes to 50 MB per file) with ftplib. The files are organized as /year-month/year-month-day/hours/files. I loop through each hour folder for a given day, store all the filenames returned by ftp.nlst(), then loop through each filename and download the corresponding file like this:

import ftplib
import os

try:
    with open(local_file, 'wb') as fhandle:
        ftp.retrbinary('RETR ' + filename, fhandle.write)
except EOFError:
    # The server dropped the connection: the with block has already closed
    # the file, so discard the partial download, reconnect and retry once.
    os.remove(local_file)
    ftp = ftplib.FTP()
    ftp.connect(self.remote_host, self.port, timeout=60)
    ftp.login(self.username, self.passwd, acct="")
    ftp.cwd(self.input_folder + '/' + subdir)
    try:
        with open(local_file, 'wb') as fhandle:
            ftp.retrbinary('RETR ' + filename, fhandle.write, 8192)
    except ftplib.all_errors:
        self.log.error('i give up !!!')

Expected:

For each day given as an input folder, download all of the corresponding XML files.

What I get:

EOFError

What i already tried:

  • I have gone through every post I could find on the subject, on Stack Overflow and on the net in general.
  • I have tried closing the connection and opening a new one for each subfolder in the hour folder.
  • It doesn't seem to be one specific file that is causing the problem, and it is definitely not the first one. I get this EOFError while downloading files with ftp.retrbinary(). It is related to the fact that I download hundreds of thousands of XML files: I tested the script with 2,000 files and didn't get any exceptions, but with around 287,000 files I always get it. What I don't understand is that the script downloads the same number of XML files each time, around 159,000, and it is always
  • I have tried changing the block size in

    ftp.retrbinary('RETR ' + filename, fhandle.write,4096)

Question:

Have I missed something? How can I handle this EOFError so that I can continue downloading all my files... without losing my sanity?

Answer 1:

I finally found a solution to my problem. Instead of opening a connection for each sub-folder, I now open a connection for each file to be downloaded. It is slower, but it gets me past the EOFError. I also found out that the FTP server I download the files from has restrictions, for example on the number of parallel connections and on how long a connection may last.
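For anyone who wants to try the same approach, here is a rough sketch of what "one connection per file" could look like. It is not my exact script: the names download_one and make_ftp, the retries parameter and the host ftp.example.com are made up for illustration. The idea is only that every file gets a fresh ftplib.FTP connection, and that a partial file is deleted before retrying.

```python
import ftplib
import os


def make_ftp(host='ftp.example.com', port=21, user='anonymous', passwd=''):
    """Open and log in to a new FTP connection (host and credentials are placeholders)."""
    ftp = ftplib.FTP()
    ftp.connect(host, port, timeout=60)
    ftp.login(user, passwd)
    return ftp


def download_one(connect, remote_dir, filename, local_path, retries=3):
    """Download a single file over its own connection.

    `connect` is any callable returning a logged-in FTP object, e.g. make_ftp.
    Returns True on success, False if every attempt failed.
    """
    for _ in range(retries):
        ftp = None
        try:
            ftp = connect()  # fresh connection for this file only
            ftp.cwd(remote_dir)
            with open(local_path, 'wb') as fhandle:
                ftp.retrbinary('RETR ' + filename, fhandle.write)
            return True
        except ftplib.all_errors:
            # all_errors covers EOFError, OSError and ftplib.Error.
            # Drop the partial file so a retry starts from scratch.
            if os.path.exists(local_path):
                os.remove(local_path)
        finally:
            if ftp is not None:
                try:
                    ftp.quit()  # polite goodbye
                except ftplib.all_errors:
                    ftp.close()  # connection already dead, just close the socket
    return False
```

You would call it once per name returned by ftp.nlst(), for example download_one(make_ftp, '/2019-06/2019-06-10/02', 'some.xml', 'some.xml'). Note that this sketch also retries permanent errors such as a missing file, so you may want to treat ftplib.error_perm separately.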