What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?

Posted 2019-06-17 22:23

Files uploaded to Amazon S3 that are smaller than 5GB have an ETag that is simply the MD5 hash of the file, which makes it easy to check whether your local file is the same as what you put on S3.

But if your file is larger than 5GB, then Amazon computes the ETag differently.

For example, I did a multipart upload of a 5,970,150,664-byte file in 380 parts. Now S3 shows it has an ETag of 6bcf86bed8807b8e78f0fc6e0a53079d-380. My local file has an MD5 hash of 702242d3703818ddefe6bf7da2bed757. I think the number after the dash is the number of parts in the multipart upload.

I also suspect that the new ETag (before the dash) is still an MD5 hash, but with some metadata from the multipart upload included along the way somehow.

Does anyone know how to compute the ETag using the same algorithm as Amazon S3?

Answer 1:

Just verified one. Hats off to Amazon for making it simple enough to be guessable.

Say you uploaded a 14MB file and your part size is 5MB. Calculate three MD5 checksums corresponding to each part, i.e. the checksums of the first 5MB, the second 5MB, and the last 4MB. Then take the MD5 of their concatenation. Since MD5 checksums are hex representations of binary data, make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that's done, add a hyphen and the number of parts to get the ETag.

Here are the commands to do it from the console on Mac OS X:

$ dd bs=1m count=5 skip=0 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019611 secs (267345449 bytes/sec)
$ dd bs=1m count=5 skip=5 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019182 secs (273323380 bytes/sec)
$ dd bs=1m count=5 skip=10 if=someFile | md5 >>checksums.txt
2+1 records in
2+1 records out
2599812 bytes transferred in 0.011112 secs (233964895 bytes/sec)

At this point all the checksums are in checksums.txt. To concatenate them, decode the hex, and take the MD5 of the lot, just use:

$ xxd -r -p checksums.txt | md5

And now add "-3" to get the ETag, since there were 3 parts.
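The same computation can be cross-checked programmatically. Here is a minimal Python sketch, assuming the same someFile and 5MB parts as above:

import hashlib

part_size = 5 * 1024 * 1024
md5s = []
with open('someFile', 'rb') as f:
    while True:
        part = f.read(part_size)
        if not part:
            break
        md5s.append(hashlib.md5(part).digest())  # binary digest, not hex

# MD5 of the concatenated binary digests, plus "-<number of parts>"
etag = hashlib.md5(b''.join(md5s)).hexdigest() + '-' + str(len(md5s))
print(etag)  # for the 14MB example this ends in "-3"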

It's worth noting that md5 on Mac OS X just writes out the checksum, but md5sum on Linux also outputs the filename. You'll need to strip that, but I'm sure there's an option to output only the checksums. You don't need to worry about whitespace, because xxd will ignore it.

Note: If you uploaded with the aws-cli via aws s3 cp, then you most likely have an 8MB chunk size. According to the documentation, that is the default.

Update: I was told about an implementation of this at https://github.com/Teachnova/s3md5, which doesn't work on OS X. Here's a Gist I wrote with a working script for OS X.



Answer 2:

The same algorithm, Java version (BaseEncoding, Hasher, Hashing, etc. come from the Guava library):

import java.util.List;

import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
import com.google.common.io.BaseEncoding;

/**
 * Generates the checksum for an object that came from a multipart upload.
 *
 * AWS S3 spec: Entity tag that identifies the newly created object's data. Objects with different object data will have different entity tags. The entity tag is an opaque string. The entity tag may or may not be an MD5 digest of the object data. If the entity tag is not an MD5 digest of the object data, it will contain one or more nonhexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.
 *
 * Algorithm follows the AWS S3 implementation: https://github.com/Teachnova/s3md5
 */
private static String calculateChecksumForMultipartUpload(List<String> md5s) {      
    StringBuilder stringBuilder = new StringBuilder();
    for (String md5:md5s) {
        stringBuilder.append(md5);
    }

    String hex = stringBuilder.toString();
    byte[] raw = BaseEncoding.base16().decode(hex.toUpperCase());
    Hasher hasher = Hashing.md5().newHasher();
    hasher.putBytes(raw);
    String digest = hasher.hash().toString();

    return digest + "-" + md5s.size();
}


Answer 3:

Bash implementation

Python implementation

The algorithm literally is (copied from the README of the Python implementation):

  1. md5 the chunks
  2. glob the md5 strings together
  3. convert the glob to binary
  4. md5 the binary of the globbed chunk md5s
  5. append "-Number_of_chunks" to the end of the md5 string of the binary


Answer 4:

Not sure if this can help:

We're currently doing an ugly (but so far useful) hack to fix those wrong ETags on multipart-uploaded files, which consists of applying a change to the file in the bucket; that triggers an MD5 recalculation by Amazon that changes the ETag to match the actual MD5 signature.

In our case:

File: bucket/Foo.mpg.gpg

  1. ETag obtained: "3f92dffef0a11d175e60fb8b958b4e6e-2"
  2. Do something with the file (rename it, add metadata like a fake header, etc.)
  3. ETag obtained: "c1d903ca1bb6dc68778ef21e74cc15b0"

We don't know the algorithm, but since we can "fix" the ETag we don't need to worry about it either.
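For reference, here is a sketch of that hack using boto3; the bucket name is a placeholder, and it assumes the object is within the 5GB limit for a single CopyObject call. Replacing the metadata lets S3 accept the in-place self-copy and recompute the ETag:

import boto3

s3 = boto3.client('s3')
s3.copy_object(
    Bucket='my-bucket',
    Key='Foo.mpg.gpg',
    CopySource={'Bucket': 'my-bucket', 'Key': 'Foo.mpg.gpg'},
    Metadata={'etag-fixed': 'true'},  # any metadata change will do
    MetadataDirective='REPLACE',      # required for an in-place copy
)
# The ETag should now be a plain MD5 digest, without the "-<parts>" suffix.
print(s3.head_object(Bucket='my-bucket', Key='Foo.mpg.gpg')['ETag'])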



Answer 5:

Based on answers here, I wrote a Python implementation which correctly calculates both multipart and single-part file ETags.

import hashlib

def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []

    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())

    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))

The default chunk_size is 8 MB, which is what the official aws cli tool uses, and it does multipart upload for 2+ chunks. It should work under both Python 2 and 3.
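A hypothetical usage example, comparing the result against the ETag that S3 reports (bucket, key, and path are placeholders). Note that the function returns the value wrapped in double quotes, which matches the ETag value as boto3 returns it:

import boto3

s3 = boto3.client('s3')
remote_etag = s3.head_object(Bucket='my-bucket', Key='my-key')['ETag']
if calculate_s3_etag('/path/to/local/file') == remote_etag:
    print('local file matches the S3 object')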



Answer 6:

In an answer above, somebody asked whether there is a way to get the MD5 of files larger than 5GB.

The answer I can give for getting the MD5 value (for files larger than 5GB) is to either add it manually to the metadata, or use a program that does your uploads and adds that information for you.

For example, I used s3cmd to upload a file, and it added the following metadata:

$ aws s3api head-object --bucket xxxxxxx --key noarch/epel-release-6-8.noarch.rpm 
{
  "AcceptRanges": "bytes", 
  "ContentType": "binary/octet-stream", 
  "LastModified": "Sat, 19 Sep 2015 03:27:25 GMT", 
  "ContentLength": 14540, 
  "ETag": "\"2cd0ae668a585a14e07c2ea4f264d79b\"", 
  "Metadata": {
    "s3cmd-attrs": "uid:502/gname:staff/uname:xxxxxx/gid:20/mode:33188/mtime:1352129496/atime:1441758431/md5:2cd0ae668a585a14e07c2ea4f264d79b/ctime:1441385182"
  }
}

This isn't a direct solution using the ETag, but it is a way to populate the metadata you want (the MD5) so you can access it later. It will still fail if someone uploads the file without that metadata, though.
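If you go this route, a small boto3 sketch like the following can pull the MD5 back out; it assumes the slash- and colon-separated s3cmd-attrs format shown above:

import boto3

s3 = boto3.client('s3')
head = s3.head_object(Bucket='xxxxxxx', Key='noarch/epel-release-6-8.noarch.rpm')
# 'uid:502/gname:staff/.../md5:2cd0.../ctime:...' -> {'uid': '502', ...}
attrs = dict(item.split(':', 1) for item in head['Metadata']['s3cmd-attrs'].split('/'))
print(attrs['md5'])  # 2cd0ae668a585a14e07c2ea4f264d79b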



Answer 7:

According to the AWS documentation, the ETag isn't an MD5 hash for a multipart upload, nor for an encrypted object: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.

Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.

If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
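A practical consequence: before comparing a local MD5 against an ETag, it is worth checking whether the ETag can be a plain MD5 digest at all. A minimal sketch of such a check (my own helper, not from the linked docs):

import re

def etag_is_plain_md5(etag):
    # Per the documentation quoted above, a plain-MD5 ETag is exactly 32 hex
    # digits; multipart and SSE-C/SSE-KMS ETags either carry a "-<parts>"
    # suffix or contain non-hexadecimal characters.
    return re.fullmatch(r'[0-9a-f]{32}', etag.strip('"').lower()) is not None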



Answer 8:

Here is a PHP version for calculating the ETag:

function calculate_aws_etag($filename, $chunksize) {
    /*
    DESCRIPTION:
    - calculate Amazon AWS ETag used on the S3 service
    INPUT:
    - $filename : path to file to check
    - $chunksize : chunk size in Megabytes
    OUTPUT:
    - ETag (string)
    */
    $chunkbytes = $chunksize*1024*1024;
    if (filesize($filename) < $chunkbytes) {
        return md5_file($filename);
    } else {
        $md5s = array();
        $handle = fopen($filename, 'rb');
        if ($handle === false) {
            return false;
        }
        while (!feof($handle)) {
            $buffer = fread($handle, $chunkbytes);
            $md5s[] = md5($buffer);
            unset($buffer);
        }
        fclose($handle);

        $concat = '';
        foreach ($md5s as $indx => $md5) {
            $concat .= hex2bin($md5);
        }
        return md5($concat) .'-'. count($md5s);
    }
}

$etag = calculate_aws_etag('path/to/myfile.ext', 8);

And here is an enhanced version that can verify against an expected ETag, and even guess the chunksize if you don't know it!

function calculate_etag($filename, $chunksize, $expected = false) {
    /*
    DESCRIPTION:
    - calculate Amazon AWS ETag used on the S3 service
    INPUT:
    - $filename : path to file to check
    - $chunksize : chunk size in Megabytes
    - $expected : verify calculated etag against this specified etag and return true or false instead
        - if you make chunksize negative (eg. -8 instead of 8) the function will guess the chunksize by checking all possible sizes given the number of parts mentioned in $expected
    OUTPUT:
    - ETag (string)
    - or boolean true|false if $expected is set
    */
    if ($chunksize < 0) {
        $do_guess = true;
        $chunksize = 0 - $chunksize;
    } else {
        $do_guess = false;
    }

    $chunkbytes = $chunksize*1024*1024;
    $filesize = filesize($filename);
    if ($filesize < $chunkbytes && (!$expected || !preg_match("/^\\w{32}-\\w+$/", $expected))) {
        $return = md5_file($filename);
        if ($expected) {
            $expected = strtolower($expected);
            return ($expected === $return ? true : false);
        } else {
            return $return;
        }
    } else {
        $md5s = array();
        $handle = fopen($filename, 'rb');
        if ($handle === false) {
            return false;
        }
        while (!feof($handle)) {
            $buffer = fread($handle, $chunkbytes);
            $md5s[] = md5($buffer);
            unset($buffer);
        }
        fclose($handle);

        $concat = '';
        foreach ($md5s as $indx => $md5) {
            $concat .= hex2bin($md5);
        }
        $return = md5($concat) .'-'. count($md5s);
        if ($expected) {
            $expected = strtolower($expected);
            $matches = ($expected === $return ? true : false);
            if ($matches || $do_guess == false || strlen($expected) == 32) {
                return $matches;
            } else {
                // Guess the chunk size
                preg_match("/-(\\d+)$/", $expected, $match);
                $parts = $match[1];
                $min_chunk = ceil($filesize / $parts /1024/1024);
                $max_chunk =  floor($filesize / ($parts-1) /1024/1024);
                $found_match = false;
                for ($i = $min_chunk; $i <= $max_chunk; $i++) {
                    if (calculate_aws_etag($filename, $i) === $expected) {
                        $found_match = true;
                        break;
                    }
                }
                return $found_match;
            }
        } else {
            return $return;
        }
    }
}
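For example, if S3 reports an ETag like "3f92dffef0a11d175e60fb8b958b4e6e-2" and you don't know which part size was used, a call like calculate_etag('path/to/myfile.ext', -8, $etag_from_s3) will read the part count from the expected ETag, try every plausible chunk size for that many parts, and return true on a match.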


Answer 9:

Here is the algorithm in Ruby...

require 'digest'

# PART_SIZE should match the chosen part size of the multipart upload
# Set here as 10MB
PART_SIZE = 1024*1024*10 

class File
  def each_part(part_size = PART_SIZE)
    yield read(part_size) until eof?
  end
end

file = File.new('<path_to_file>')

hashes = []

file.each_part do |part|
  hashes << Digest::MD5.hexdigest(part)
end

multipart_hash = Digest::MD5.hexdigest([hashes.join].pack('H*'))
multipart_etag = "#{multipart_hash}-#{hashes.count}"

With thanks to "Shortest hex2bin in Ruby" and "Multipart uploads to S3"...



Answer 10:

The algorithm proposed in this answer is accurate. Namely, you take the 128-bit binary MD5 digest of each part, concatenate them into one document, and hash that document.

Something else worth mentioning about the algorithm: if you copy, or do an in-place copy of your completed multipart-uploaded object (aka PUT-COPY), S3 will recompute the ETag and use the simple version of the algorithm, i.e. the destination object will have an ETag without a hyphen.

You may have already considered this, but if your files are less than 5GB, you already know their MD5s, and upload parallelization provides little benefit (e.g. you are streaming the upload from a slow network, or uploading from a slow disk), then you might also consider using a simple PUT instead of a multipart PUT, and passing your known Content-MD5 in your request headers; Amazon will fail the upload if they don't match. Keep in mind that you get charged for each UploadPart.

Furthermore, in some clients, passing a known MD5 for the input of a PUT operation will save the client from re-computing the MD5 during the transfer. In boto3 (Python), you would use the ContentMD5 parameter of the client.put_object() method, for example. If you omit the parameter and you already knew the MD5, then the client would be wasting cycles computing it again before the transfer.
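For instance, a minimal boto3 sketch of such a single PUT with a known MD5 (bucket, key, and path are placeholders); note that ContentMD5 expects the base64-encoded binary digest, not the hex string:

import base64
import hashlib

import boto3

with open('/path/to/file', 'rb') as f:
    body = f.read()  # fine for files comfortably below the 5GB PUT limit

# ContentMD5 is the base64 of the 16-byte binary digest, not the hex string.
content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode('ascii')

s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='my-key', Body=body, ContentMD5=content_md5)
# A single-part PUT like this yields an ETag equal to the hex MD5 of the body.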



Answer 11:

I have a solution for iOS and macOS that doesn't use external helpers like dd and xxd. I have just found it, so I'm reporting it as it is, planning to improve it at a later stage. For the moment, it relies on both Objective-C and Swift code. First of all, create this helper class in Objective-C:

AWS3MD5Hash.h

#import <Foundation/Foundation.h>

NS_ASSUME_NONNULL_BEGIN

@interface AWS3MD5Hash : NSObject

- (NSData *)dataFromFile:(FILE *)theFile startingOnByte:(UInt64)startByte length:(UInt64)length filePath:(NSString *)path singlePartSize:(NSUInteger)partSizeInMb;

- (NSData *)dataFromBigData:(NSData *)theData startingOnByte:(UInt64)startByte length:(UInt64)length;

- (NSData *)dataFromHexString:(NSString *)sourceString;

@end

NS_ASSUME_NONNULL_END

AWS3MD5Hash.m

#import "AWS3MD5Hash.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 256

@implementation AWS3MD5Hash


- (NSData *)dataFromFile:(FILE *)theFile startingOnByte:(UInt64)startByte length:(UInt64)length filePath:(NSString *)path singlePartSize:(NSUInteger)partSizeInMb {


   char *buffer = malloc(length);


   NSURL *fileURL = [NSURL fileURLWithPath:path];
   NSNumber *fileSizeValue = nil;
   NSError *fileSizeError = nil;
   [fileURL getResourceValue:&fileSizeValue
                           forKey:NSURLFileSizeKey
                            error:&fileSizeError];

   NSInteger __unused result = fseek(theFile,startByte,SEEK_SET);

   if (result != 0) {
      free(buffer);
      return nil;
   }

   NSInteger result2 = fread(buffer, length, 1, theFile);

   NSUInteger difference = fileSizeValue.integerValue - startByte;

   NSData *toReturn;

   if (result2 == 0) {
       toReturn = [NSData dataWithBytes:buffer length:difference];
    } else {
       toReturn = [NSData dataWithBytes:buffer length:result2 * length];
    }

     free(buffer);

     return toReturn;
 }

 - (NSData *)dataFromBigData:(NSData *)theData startingOnByte:  (UInt64)startByte length:(UInt64)length {

   NSUInteger fileSizeValue = theData.length;
   NSData *subData;

   if (startByte + length > fileSizeValue) {
        subData = [theData subdataWithRange:NSMakeRange(startByte, fileSizeValue - startByte)];
    } else {
       subData = [theData subdataWithRange:NSMakeRange(startByte, length)];
    }

        return subData;
    }

- (NSData *)dataFromHexString:(NSString *)string {
    string = [string lowercaseString];
    NSMutableData *data= [NSMutableData new];
    unsigned char whole_byte;
    char byte_chars[3] = {'\0','\0','\0'};
    NSInteger i = 0;
    NSInteger length = string.length;
    while (i < length-1) {
       char c = [string characterAtIndex:i++];
       if (c < '0' || (c > '9' && c < 'a') || c > 'f')
           continue;
       byte_chars[0] = c;
       byte_chars[1] = [string characterAtIndex:i++];
       whole_byte = strtol(byte_chars, NULL, 16);
       [data appendBytes:&whole_byte length:1];
    }

        return data;
}


@end

Now create a plain Swift file:

AWS Extensions.swift

import UIKit
import CommonCrypto

extension URL {

func calculateAWSS3MD5Hash(_ numberOfParts: UInt64) -> String? {


    do {

        var fileSize: UInt64!
        var calculatedPartSize: UInt64!

        let attr:NSDictionary? = try FileManager.default.attributesOfItem(atPath: self.path) as NSDictionary
        if let _attr = attr {
            fileSize = _attr.fileSize();
            if numberOfParts != 0 {



                let partSize = Double(fileSize / numberOfParts)

                var partSizeInMegabytes = Double(partSize / (1024.0 * 1024.0))



                partSizeInMegabytes = ceil(partSizeInMegabytes)

                calculatedPartSize = UInt64(partSizeInMegabytes)

                if calculatedPartSize % 2 != 0 {
                    calculatedPartSize += 1
                }

                if numberOfParts == 2 || numberOfParts == 3 { // Very important when there are 2 or 3 parts, in the majority of times
                                                              // the calculatedPartSize is already 8. In the remaining cases we force it.
                    calculatedPartSize = 8
                }


                if mainLogToggling {
                    print("The calculated part size is \(calculatedPartSize!) Megabytes")
                }

            }

        }

        if numberOfParts == 0 {

            let string = self.memoryFriendlyMd5Hash()
            return string

        }




        let hasher = AWS3MD5Hash.init()
        let file = fopen(self.path, "r")
        defer { let result = fclose(file)}


        var index: UInt64 = 0
        var bigString: String! = ""
        var data: Data!

        while autoreleasepool(invoking: {

                if index == (numberOfParts-1) {
                    if mainLogToggling {
                        //print("Siamo all'ultima linea.")
                    }
                }

                data = hasher.data(from: file!, startingOnByte: index * calculatedPartSize * 1024 * 1024, length: calculatedPartSize * 1024 * 1024, filePath: self.path, singlePartSize: UInt(calculatedPartSize))

                bigString = bigString + MD5.get(data: data) + "\n"

                index += 1

                if index == numberOfParts {
                    return false
                }
                return true

        }) {}

        let final = MD5.get(data :hasher.data(fromHexString: bigString)) + "-\(numberOfParts)"

        return final

    } catch {

    }

    return nil
}

   func memoryFriendlyMd5Hash() -> String? {

    let bufferSize = 1024 * 1024

    do {
        // Open file for reading:
        let file = try FileHandle(forReadingFrom: self)
        defer {
            file.closeFile()
        }

        // Create and initialize MD5 context:
        var context = CC_MD5_CTX()
        CC_MD5_Init(&context)

        // Read up to `bufferSize` bytes, until EOF is reached, and update MD5 context:
        while autoreleasepool(invoking: {
            let data = file.readData(ofLength: bufferSize)
            if data.count > 0 {
                data.withUnsafeBytes {
                    _ = CC_MD5_Update(&context, $0, numericCast(data.count))
                }
                return true // Continue
            } else {
                return false // End of file
            }
        }) { }

        // Compute the MD5 digest:
        var digest = Data(count: Int(CC_MD5_DIGEST_LENGTH))
        digest.withUnsafeMutableBytes {
            _ = CC_MD5_Final($0, &context)
        }
        let hexDigest = digest.map { String(format: "%02hhx", $0) }.joined()
        return hexDigest

    } catch {
        print("Cannot open file:", error.localizedDescription)
        return nil
    }
}

struct MD5 {

    static func get(data: Data) -> String {
        var digest = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))

        let _ = data.withUnsafeBytes { bytes in
            CC_MD5(bytes, CC_LONG(data.count), &digest)
        }
        var digestHex = ""
        for index in 0..<Int(CC_MD5_DIGEST_LENGTH) {
            digestHex += String(format: "%02x", digest[index])
        }

        return digestHex
    }
    // The following is a memory friendly version
    static func get2(data: Data) -> String {

    var currentIndex = 0
    let bufferSize = 1024 * 1024
    //var digest = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))

    // Create and initialize MD5 context:
    var context = CC_MD5_CTX()
    CC_MD5_Init(&context)


    while autoreleasepool(invoking: {
        var subData: Data!
        if (currentIndex + bufferSize) < data.count {
            subData = data.subdata(in: Range.init(NSMakeRange(currentIndex, bufferSize))!)
            currentIndex = currentIndex + bufferSize
        } else {
            subData = data.subdata(in: Range.init(NSMakeRange(currentIndex, data.count - currentIndex))!)
            currentIndex = currentIndex + (data.count - currentIndex)
        }
        if subData.count > 0 {
            subData.withUnsafeBytes {
                _ = CC_MD5_Update(&context, $0, numericCast(subData.count))
            }
            return true
        } else {
            return false
        }

    }) { }

    // Compute the MD5 digest:
    var digest = Data(count: Int(CC_MD5_DIGEST_LENGTH))
    digest.withUnsafeMutableBytes {
        _ = CC_MD5_Final($0, &context)
    }

    var digestHex = ""
    for index in 0..<Int(CC_MD5_DIGEST_LENGTH) {
        digestHex += String(format: "%02x", digest[index])
    }

    return digestHex

}
}

Now add:

#import "AWS3MD5Hash.h"

to your Objective-C bridging header. You should be OK with this setup.

Example usage

To test this setup, you could call the following method inside an object that is in charge of handling the AWS connections:

func getMd5HashForFile() {


    let credentialProvider = AWSCognitoCredentialsProvider(regionType: AWSRegionType.USEast2, identityPoolId: "<INSERT_POOL_ID>")
    let configuration = AWSServiceConfiguration(region: AWSRegionType.APSoutheast2, credentialsProvider: credentialProvider)
    configuration?.timeoutIntervalForRequest = 3.0
    configuration?.timeoutIntervalForResource = 3.0

    AWSServiceManager.default().defaultServiceConfiguration = configuration

    AWSS3.register(with: configuration!, forKey: "defaultKey")
    let s3 = AWSS3.s3(forKey: "defaultKey")


    let headObjectRequest = AWSS3HeadObjectRequest()!
    headObjectRequest.bucket = "<NAME_OF_YOUR_BUCKET>"
    headObjectRequest.key = self.latestMapOnServer.key




    let _: AWSTask? = s3.headObject(headObjectRequest).continueOnSuccessWith { (awstask) -> Any? in

        let headObjectOutput: AWSS3HeadObjectOutput? = awstask.result

        var ETag = headObjectOutput?.eTag!
        // Here you should parse the returned Etag and extract the number of parts to provide to the helper function. Etags end with a "-" followed by the number of parts. If you don't see this format, then pass 0 as the number of parts.
        ETag = ETag!.replacingOccurrences(of: "\"", with: "")

        print("headObjectOutput.ETag \(ETag!)")

        let mapOnDiskUrl = self.getMapsDirectory().appendingPathComponent(self.latestMapOnDisk!)

        let hash = mapOnDiskUrl.calculateAWSS3MD5Hash(<Take the number of parts from the ETag returned by the server>)

        if hash == ETag {
            print("They are the same.")
        }

        print ("\(hash!)")

        return nil
    }



}

If the ETag returned by the server does not have a "-" at the end, just pass 0 to calculateAWSS3MD5Hash. Please comment if you run into any problems. I'm working on a Swift-only solution and will update this answer as soon as I finish it. Thanks.



Answer 12:

No,

up to now there is no solution to match a normal-file ETag and a multipart-file ETag with the MD5 of a local file.



Source: What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?