Last modified time of downloaded file does not match its HTTP header

Posted 2019-05-31 15:08

Question:

I have a piece of Python code that (for better or worse) checks a local file against the same file on a web server. If the local file doesn't exist, it downloads it; if it does, it compares the os.stat last-modified time of the downloaded file with the Last-Modified HTTP header of the same file on the server.

Problem is, it seems these two numbers aren't equal even when they should be. Here's the code:

from urllib import urlretrieve
from urllib2 import Request, urlopen
from time import strftime, localtime, mktime, strptime
from os import stat, path

destFile = "logo3w.png"
srvFile = "http://www.google.com/images/srpr/logo3w.png"

if path.exists(destFile):
    localLastModified = stat(destFile).st_mtime
    req = Request(srvFile)
    url_handle = urlopen(req)
    headers = url_handle.info()                        
    srvLastModified = headers.getheader("Last-Modified")
    srvLastModified = mktime(strptime(srvLastModified,
      "%a, %d %b %Y %H:%M:%S GMT"))
    print localLastModified, srvLastModified

else:
    urlretrieve(srvFile, destFile)

The output of the print statement (if you run the code twice) is 1334527395.26 1333350817.0.

Seems to me those two should be the same, but they're wildly different. The date modified of the file downloaded locally is in fact the date it was downloaded to the local machine, not the last modified date on the server.

Essentially all I'm trying to do is keep a local cache of the file (would be a lot of files in the actual application), downloading it if necessary. I'm half aware that web proxies should do this by default, and I'm running a basic WAMP server where these files are stored, but I'm not sure how to apply this to my PyQt application. There are potentially dozens of files that would need to be downloaded and cached, and about half of them will rarely ever change, so I'm trying to determine the fastest way to check and grab these files.

Perhaps this isn't even the right way to go about it, so I'm all ears if there are (far better/numerous other) ways to do this.

Answer 1:

urllib.urlretrieve just downloads the file; it does not copy the modification date. You must manually do so using os.utime:

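import os

# ... the "if path.exists(destFile):" branch from the question stays as is ...
else:
    # urlretrieve returns (filename, headers); keep the headers
    headers = urlretrieve(srvFile, destFile)[1]
    lmStr = headers.getheader("Last-Modified")
    # Same conversion as in the question's check, so the two values are comparable
    srvLastModified = mktime(strptime(lmStr, "%a, %d %b %Y %H:%M:%S GMT"))
    # Set both access and modification time to the server's timestamp
    os.utime(destFile, (srvLastModified, srvLastModified))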
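
One caveat about the conversion: mktime treats the parsed time as local time, while the Last-Modified header is in GMT. The comparison still works because both sides use the same conversion, but the timestamp stored on the file is off by your UTC offset. Here is a minimal Python 3 sketch of the same idea with a UTC-correct conversion. The helper names (http_date_to_epoch, apply_server_mtime, is_stale) are made up for illustration, and the header string is assumed to have been read from the response already:

```python
import os
from calendar import timegm
from email.utils import parsedate_to_datetime


def http_date_to_epoch(http_date):
    """Convert an HTTP date like 'Sun, 09 Sep 2001 01:46:40 GMT' to UTC epoch seconds."""
    dt = parsedate_to_datetime(http_date)
    # utctimetuple() normalizes to UTC; timegm() is the UTC counterpart of mktime()
    return timegm(dt.utctimetuple())


def apply_server_mtime(local_path, last_modified_header):
    """Stamp the local file with the server's Last-Modified time (hypothetical helper)."""
    server_mtime = http_date_to_epoch(last_modified_header)
    # os.utime takes (access_time, modification_time) in seconds since the epoch
    os.utime(local_path, (server_mtime, server_mtime))
    return server_mtime


def is_stale(local_path, last_modified_header):
    """Return True if the server copy is newer than the local file (hypothetical helper)."""
    local_mtime = int(os.stat(local_path).st_mtime)
    return http_date_to_epoch(last_modified_header) > local_mtime
```

After each download you would call apply_server_mtime(destFile, header_value); on later runs, is_stale(destFile, header_value) tells you whether to fetch the file again. Comparing with ">" rather than "==" avoids a needless re-download if the local file somehow ends up with a newer timestamp than the server's.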