I have a script that downloads from Amazon S3. The script works 99.9% of the time, but occasionally I get the following error (socket.error: [Errno 104] Connection reset by peer). Once I restart the code, the error seems to go away. Since it's hard to recreate the error, I'm hoping the snippet of code below will fix it. Specifically, I'm hoping that if the error comes up, the code will retry downloading the file. I'm wondering if this will work, and if there is anything else I should add. I'm thinking an error counter might be good, so that if the error does keep coming up, the script will eventually move on. (I'm not exactly sure how to add a counter.)
import socket

files = []  # list of files to download

for file in files:
    for keys in bucket.list(prefix=file):
        while True:
            try:
                # get_contents_to_filename() needs a local path to write to
                keys.get_contents_to_filename(keys.name)
            except socket.error:
                continue
            break
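For what it's worth, here is roughly what I have in mind for the counter; just a sketch, with the five-attempt cap and the five-second pause picked arbitrarily:

import socket
import time

MAX_RETRIES = 5  # arbitrary cap, tune as needed

for file in files:
    for keys in bucket.list(prefix=file):
        for attempt in range(MAX_RETRIES):
            try:
                keys.get_contents_to_filename(keys.name)
                break  # success, move on to the next key
            except socket.error:
                time.sleep(5)  # brief pause before retrying
        else:
            # the loop finished without a break: every attempt failed
            print('giving up on %s after %d attempts' % (keys.name, MAX_RETRIES))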
I had exactly the same problem. If you search boto on GitHub, you will see that we are not alone.
There's also a known, accepted issue: https://github.com/boto/boto/issues/2207
Reaching performance limits of AWS S3
The truth is that we have become so used to boto and the AWS S3 service that we forget these are really distributed systems, which might break in some cases.
I was archiving (download, tar, upload) a huge number of files (about 3 years' worth, with around 15 feeds each having about 1440 versions a day) and using Celery to do this faster. And I have to say that I was sometimes getting these errors more often, probably because I was reaching the performance limits of AWS S3. The errors often appeared in chunks (in my case I was uploading at about 60 Mbps for a couple of hours).
Training S3 performance
While I was measuring performance, it seemed to get "trained": after an hour or so, the responsiveness of the S3 bucket jumped up. AWS had probably detected the higher load and spun up some more instances to serve it.
Try the latest stable version of boto
Another thing is that boto tries to retry in many cases, so many failures are hidden from our calls. Sometimes things got a bit better after upgrading to the latest stable version.
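If I remember right, boto 2.x can also be told to retry more times internally through its config; a minimal sketch, assuming the num_retries option in the [Boto] section (worth double-checking against your boto version's docs):

import boto

# Assumption: boto 2.x reads 'num_retries' from the [Boto] config section
# and uses it when retrying transient HTTP/socket failures.
if not boto.config.has_section('Boto'):
    boto.config.add_section('Boto')
boto.config.set('Boto', 'num_retries', '10')

conn = boto.connect_s3()  # subsequent requests retry more times than the default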
My conclusions are:
- try upgrading to the latest stable boto
- when the error rate grows, lower the pressure
- accept the fact that AWS S3 is a distributed service with rare performance problems
In your code, I would definitely recommend adding some sleep (at least 5 s, though 30 s would seem fine to me); otherwise you are just pushing harder and harder on a system that might be in a shaky state at that moment.
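A minimal sketch of what I mean, with a pause that grows on each failed attempt (the five-attempt cap and the 30 s step are just my assumptions):

import socket
import time

def get_with_backoff(key, filename, max_attempts=5):
    # Retry the download with an increasing pause between attempts,
    # so we stop hammering S3 while it is struggling.
    for attempt in range(max_attempts):
        try:
            key.get_contents_to_filename(filename)
            return True
        except socket.error:
            time.sleep(30 * (attempt + 1))  # 30 s, 60 s, 90 s, ...
    return False  # still failing after max_attempts; let the caller decide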
Well, it appeared that time.sleep() worked for a while. But now that the files are bigger, even that doesn't do the trick. It seems like I need to restart the loop to get it working again. This modification seems to be working:
import socket
import time

def download(filesToDownload):
    temp = []
    for sFile in filesToDownload:
        for keys in bucket.list(prefix='<bucket>%s' % (sFile)):
            while True:
                try:
                    keys.get_contents_to_filename('%s%s' % (downloadRoot, sFile))
                    temp.append(sFile)
                except socket.error:
                    # wait, then restart the whole job with whatever is left
                    time.sleep(30)
                    x = set(filesToDownload) - set(temp)
                    download(x)
                break
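If the recursion bothers anyone, the same "retry whatever is left" idea can be written iteratively; a sketch under the same assumptions (bucket and downloadRoot defined elsewhere, '<bucket>' standing in for the real prefix):

import socket
import time

def download(filesToDownload):
    # Keep retrying the files that have not been fetched yet, instead of
    # recursing from inside the except block.
    remaining = set(filesToDownload)
    while remaining:
        done = set()
        for sFile in remaining:
            try:
                for keys in bucket.list(prefix='<bucket>%s' % (sFile)):
                    keys.get_contents_to_filename('%s%s' % (downloadRoot, sFile))
                done.add(sFile)
            except socket.error:
                time.sleep(30)  # give S3 a breather before the next pass
        remaining -= done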
I once had this problem, and what fixed it was creating a new access key, because the old one had been compromised.