I have a model with a FileField, which holds user-uploaded files. Since I want to save space, I would like to avoid duplicates.
What I'd like to achieve:
1. Calculate the uploaded file's MD5 checksum
2. Store the file under a name based on its MD5 sum
3. If a file with that name is already there (the new file is a duplicate), discard the uploaded file and use the existing file instead
Steps 1 and 2 are already working, but how would I discard an uploaded duplicate and use the existing file instead?
Note that I'd like to keep the existing file and not overwrite it (mainly to keep the modified time the same - better for backup).
Notes:
- I'm using Django 1.5
- The upload handler is django.core.files.uploadhandler.TemporaryFileUploadHandler
Code:
    import hashlib
    import os

    from django.db import models


    def media_file_name(instance, filename):
        h = instance.md5sum
        basename, ext = os.path.splitext(filename)
        # Shard into two directory levels using the first two hex digits.
        return os.path.join('mediafiles', h[0:1], h[1:2], h + ext.lower())


    class Media(models.Model):
        orig_file = models.FileField(upload_to=media_file_name)
        md5sum = models.CharField(max_length=36)
        ...

        def save(self, *args, **kwargs):
            if not self.pk:  # file is new
                md5 = hashlib.md5()
                for chunk in self.orig_file.chunks():
                    md5.update(chunk)
                self.md5sum = md5.hexdigest()
            super(Media, self).save(*args, **kwargs)
Any help is appreciated!
This answer helped me solve the problem where I wanted to raise an exception if the file being uploaded already existed. This version raises an exception if a file with the same name already exists in the upload location.
Thanks to alTus's answer, I was able to figure out that writing a custom storage class is the key, and it was easier than expected.
- I skip calling the superclass's _save method to write the file if it is already there, and I just return the name.
- I override get_available_name, to avoid getting numbers appended to the file name if a file with the same name already exists.
I don't know if this is the proper way of doing it, but it works fine so far.
Hope this is useful!
Here's the complete sample code:
AFAIK you can't easily implement this using the save/delete methods, because files are handled quite specifically.
But you could try something like this.
First, my simple md5 file hash function:
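A minimal sketch of such a helper, assuming it receives an iterable of byte chunks (for example what a Django file's `chunks()` method yields). The name `md5_for_file` is illustrative.

```python
import hashlib


def md5_for_file(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    md5 = hashlib.md5()
    for data in chunks:
        # Feeding the hash chunk by chunk avoids loading the whole file.
        md5.update(data)
    return md5.hexdigest()
```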
Next, simple_upload_to is something like your media_file_name function, and it is used for the field's upload_to argument in the same way. Of course, it's just an example, so the path generation logic can vary.
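As a sketch of how such a helper could look, assuming it is a factory that takes the field name and returns the actual `upload_to` callable. The `path` default and the one-level directory sharding are assumptions for illustration only.

```python
import hashlib
import os


def simple_upload_to(field_name, path="files"):
    """Build an upload_to callable that names files after their MD5 digest."""
    def upload_to(instance, filename):
        md5 = hashlib.md5()
        # Hash the uploaded content chunk by chunk.
        for chunk in getattr(instance, field_name).chunks():
            md5.update(chunk)
        digest = md5.hexdigest()
        ext = os.path.splitext(filename)[1].lower()
        # Shard by the first two hex digits to avoid one huge directory.
        return os.path.join(path, digest[:2], digest + ext)
    return upload_to
```

It would be wired up as `upload_to=simple_upload_to('orig_file')` on the FileField. On Django versions that have migrations, a nested function like this cannot be serialized, so a module-level function would be needed there.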
And the most important part:
As you can see, this custom storage deletes the existing file before saving and then saves the new one under the same name. So this is where you can implement your own logic if NOT deleting (and thus not updating) files is important.
You can find more about storages here: https://docs.djangoproject.com/en/1.5/ref/files/storage/
I had the same issue and found this SO question. As this is nothing too uncommon, I searched the web and found the following Python package, which seems to do exactly what you want:
https://pypi.python.org/pypi/django-hashedfilenamestorage
If SHA1 hashes are out of the question, I think a pull request to add MD5 hashing support would be a great idea.