Detecting if the user drops the same file twice on a browser window

Published 2020-06-17 14:31

Question:

I want to allow users to drag images from their desktop onto a browser window and then upload those images to a server. I want to upload each file only once, even if it is dropped on the window several times. For security reasons, the information from the File object that is accessible to JavaScript is limited. According to msdn.microsoft.com, only the following properties can be read:

  • name
  • lastModifiedDate

(Safari also exposes size and type).

The user can drop two images with the same name and last modified date from different folders onto the browser window. There is a very small but finite chance that these two images are in fact different.

I've created a script that reads in the raw dataURL of each image file and compares it to the dataURLs of files that were previously dropped on the window. One advantage of this is that it can detect identical files with different names.

This works, but it seems overkill. It also requires a huge amount of data to be stored. I could improve this (and add to the overkill) by making a hash of the dataURL, and storing that instead.
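
A sketch of that hashing idea, using djb2 purely as an illustration (a simple, non-cryptographic string hash; the function name is made up):

function hashDataURL(str) {
  var hash = 5381
  for (var ii = 0; ii < str.length; ii++) {
    hash = ((hash << 5) + hash + str.charCodeAt(ii)) | 0 // hash * 33 + char
  }
  return hash >>> 0 // normalize to an unsigned 32-bit integer
}

// Store the small number instead of the multi-megabyte string:
// imageData.push(hashDataURL(dataURL), file.name)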

I'm hoping that there may be a more elegant way of achieving my goal. What can you suggest?

<!DOCTYPE html>
<html>
<head>
  <title>Detect duplicate drops</title>
  <style>
    html, body {
      width: 100%;
      height: 100%;
      margin: 0;
      background: #000;
    }
  </style>
  <script>
var body
var imageData = [] // flat list of [dataURL, fileName] pairs


document.addEventListener('DOMContentLoaded', function ready() {
  body = document.getElementsByTagName("body")[0]
  body.addEventListener("dragover", swallowEvent, false)
  body.addEventListener("drop", treatDrop, false)
}, false)


function swallowEvent(event) {
  // Prevent browser from loading the dropped image in an empty page
  event.preventDefault()
  event.stopPropagation()
}


function treatDrop(event) {
  swallowEvent(event)

  for (var ii = 0, file; file = event.dataTransfer.files[ii]; ii++) {
    importImage(file)
  }
}


function importImage(file) {
  var reader = new FileReader()

  reader.onload = function fileImported(event) {
    var dataURL = event.target.result
    var index = imageData.indexOf(dataURL)
    var img, message

    if (index < 0) {
      // First time this exact image content has been dropped
      index = imageData.length
      imageData.push(dataURL, file.name)
      message = "Image " + file.name + " imported"
    } else {
      // Same content seen before; imageData[index + 1] holds its name
      message = "Image " + file.name + " already imported as " + imageData[index + 1]
    }

    img = document.createElement("img")
    img.src = imageData[index] // copy or reference?
    body.appendChild(img)

    console.log(message)
  }

  reader.readAsDataURL(file)
}
  </script>
</head>
<body>
</body>
</html>

Answer 1:

Here is a suggestion (one that isn't mentioned in your question):

Create a Blob URL for each File object in the FileList, which registers the file in the browser's URL Store, and save the resulting URL string.
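
A minimal sketch of that first step (URL.createObjectURL is the standard API for minting Blob URLs):

// Mint a Blob URL for each dropped File and keep only the string
var blobUrl = URL.createObjectURL(file) // e.g. "blob:https://yoursite/..."

// When a file is no longer needed, release its URL Store entry:
// URL.revokeObjectURL(blobUrl)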

Then pass each URL string to a web worker (a separate thread), which uses a FileReader to read the file (accessed via its Blob URL) in chunked sections, re-using one fixed-size buffer (almost like a circular buffer) to calculate the file's hash. Simple, fast hashes such as CRC-32 can be carried over from chunk to chunk, and can often be combined with a vertical and horizontal checksum in the same loop (also carry-able over chunks). A sketch of such a worker follows below.
You might speed up the process by reading 32-bit (unsigned) values instead of 8-bit values through an appropriate buffer view (roughly four times fewer loop iterations). System endianness is not important here, so don't waste resources on it!
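
A sketch of such a worker, assuming the main thread posts { blobUrl } messages; the 1 MiB chunk size and the byte-wise CRC-32 are illustrative choices (the 32-bit read trick is omitted for brevity):

// worker.js — build the standard CRC-32 (IEEE polynomial) lookup table once
var CRC_TABLE = (function () {
  var table = new Uint32Array(256)
  for (var n = 0; n < 256; n++) {
    var c = n
    for (var k = 0; k < 8; k++) {
      c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1)
    }
    table[n] = c
  }
  return table
})()

// Update a running CRC with one chunk — this is the "carry-able" property:
// the value returned for one chunk seeds the call for the next chunk
function crc32Update(crc, bytes) {
  for (var ii = 0; ii < bytes.length; ii++) {
    crc = CRC_TABLE[(crc ^ bytes[ii]) & 0xFF] ^ (crc >>> 8)
  }
  return crc
}

self.onmessage = function (event) {
  var blobUrl = event.data.blobUrl
  var CHUNK_SIZE = 1024 * 1024 // fixed-size window: 1 MiB per read

  // fetch() resolves a Blob URL back to its Blob inside the worker
  fetch(blobUrl).then(function (response) {
    return response.blob()
  }).then(function (blob) {
    var reader = new FileReaderSync() // synchronous FileReader, worker-only
    var crc = 0xFFFFFFFF
    for (var offset = 0; offset < blob.size; offset += CHUNK_SIZE) {
      var chunk = blob.slice(offset, offset + CHUNK_SIZE)
      crc = crc32Update(crc, new Uint8Array(reader.readAsArrayBuffer(chunk)))
    }
    self.postMessage({ blobUrl: blobUrl, fhash: (crc ^ 0xFFFFFFFF) >>> 0 })
  })
}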

Upon completion, the web worker passes the file's hash back to the main thread/app, which then simply performs your matrix comparison of [[fname, fsize, blobUrl, fhash] /* , etc. */].
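
And a sketch of the main-thread side, assuming the worker above is saved as worker.js (all names here are illustrative):

var hasher = new Worker('worker.js')
var known = []   // rows of [fname, fsize, blobUrl, fhash]
var pending = {} // blobUrl -> metadata for files still being hashed

function registerFile(file) {
  var blobUrl = URL.createObjectURL(file)
  pending[blobUrl] = { fname: file.name, fsize: file.size }
  hasher.postMessage({ blobUrl: blobUrl })
}

hasher.onmessage = function (event) {
  var meta = pending[event.data.blobUrl]
  delete pending[event.data.blobUrl]

  // Compare on size and hash; a matching name alone proves nothing
  var duplicate = known.some(function (row) {
    return row[1] === meta.fsize && row[3] === event.data.fhash
  })

  if (!duplicate) {
    known.push([meta.fname, meta.fsize, event.data.blobUrl, event.data.fhash])
    // ...safe to upload the file here...
  }
}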

Pro
The re-used fixed-size buffer keeps memory usage down (to whatever level you specify), and the web worker improves performance by running on an extra thread (one that doesn't block the browser's main thread).

Con
You'd still need a server-side fallback for browsers with JavaScript disabled (you might add a hidden field to the form and set its value using JavaScript as a JavaScript-enabled check, to lower the server-side load; a sketch follows below). However, even then you'd still need server-side validation to safeguard against malicious input.
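
The hidden-field check might look like this (the field name "js_enabled" is hypothetical):

// Runs only when JavaScript is enabled, so the server can treat a
// missing or empty "js_enabled" value as "no JavaScript available"
document.addEventListener('DOMContentLoaded', function () {
  var field = document.querySelector('input[name="js_enabled"]')
  if (field) field.value = '1'
})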

Usefulness
So... no net gain? Well, if there is a reasonable chance that the user will upload duplicate files (or re-use them in a web-based app), then you have saved wasted bandwidth just by performing the check. That is quite an (ecological/financial) win in my book.


Extra
Hashes are prone to collisions, period. To lower the (realistic) chance of a collision you'd select a more advanced hash algorithm (most are still easy to carry over chunks). The obvious trade-off for a more advanced hash is larger code size and lower speed (higher CPU usage).
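
For instance, modern browsers expose SHA-256 through the Web Crypto API. Note that crypto.subtle.digest is one-shot rather than chunked, so this sketch trades the fixed-buffer memory bound described above for collision resistance:

// A sketch of a stronger hash via Web Crypto (reads the whole file at once)
function sha256Hex(file) {
  return file.arrayBuffer().then(function (buf) {
    return crypto.subtle.digest('SHA-256', buf)
  }).then(function (digest) {
    // Render the 32-byte digest as a lowercase hex string
    return Array.from(new Uint8Array(digest)).map(function (b) {
      return b.toString(16).padStart(2, '0')
    }).join('')
  })
}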