We have a ftp system setup to monitor/download from remote ftp servers that are not under our control. The script connects to the remote ftp, and grabs the file names of files on the server, we then check to see if its something that has already been downloaded. If it hasn't been downloaded then we download the file and add it to the list.
We recently ran into an issue, where someone on the remote ftp side, will copy in a massive single file(>1GB) then the script will wake up see a new file and begin downloading the file that is being copied in.
What is the best way to check this? I was thinking of grabbing the file size waiting a few seconds checking the file size again and see if it has increased, if it hasn't then we download it. But since time is of the concern, we can't wait a few seconds for every single file set and see if it's file size has increased.
What would be the best way to go about this, currently everything is done via pythons ftplib, how can we do this aside from using the aforementioned method.
Yet again let me reiterate this, we have 0 control over the remote ftp sites.
Thanks.
UPDATE1:
I was thinking what if i tried to rename it... since we have full permissions on the ftp, if the file upload is in progress would the rename command fail?
We don't have any real options here... do we?
UPDATE2: Well here's something interesting some of the ftps we tested on appear to automatically allocate the space once the transfer starts.
E.g. If i transfer a 200mb file to the ftp server. While the transfer is active if i connect to the ftp server and do a size while the upload is happening. It shows 200mb for the size. Even though the file is only like 10% complete.
Permissions also seem to be randomly set the FTP Server that comes with IIS sets the permissions AFTER the file is finished copying. While some of the other older ftp servers set it as soon as you send the file.
:'(
You can't know when the OS copy is done. It could slow down or wait.
For absolute certainty, you really need two files.
They can mess with the massive file all they want. But when they touch the trigger file, you're downloading both.
If you can't get a trigger, you have to balance the time required to poll vs. the time required to download.
Do this.
Get a listing. Check timestamps.
Check sizes vs. previous size of file. If size isn't even close, it's being copied right now. Wait; loop on this step until size is close to previous size.
While you're not done:
a. Get the file.
b. Get a listing AGAIN. Check the size of the new listing, previous listing and your file. If they agree: you're done. If they don't agree: file changed while you were downloading; you're not done.
If you are dealing with multiple files, you could get the list of all the sizes at once, wait ten seconds, and see which are the same. Whichever are still the same should be safe to download.
“Damn the torpedoes! Full speed ahead!”
Just download the file. If it is a large file then after the download completes wait as long as is reasonable for your scenario and continue the download from the point it stopped. Repeat until there is no more stuff to download.
As you say you have 0 control over the servers and can't make your clients post trigger files as suggested by S. Lott, you must deal with the imperfect solution and risk incomplete file transmission, perhaps by waiting for a while and compare file sizes before and after.
You can try to rename as you suggested, but as you have 0 control you can't be sure that the ftp-server-administrator (or their successor) doesn't change platforms or ftp servers or restricts your permissions.
Sorry.