Does aws-cli confirm checksums when uploading a file?

Posted 2020-07-06 06:45

If I'm uploading data to S3 using the aws-cli (i.e. using aws s3 cp), does aws-cli do any work to confirm that the resulting file in S3 matches the original file, or do I somehow need to manage that myself?

Based on this answer and the Java API documentation for putObject(), it looks like it's possible to verify the MD5 checksum after upload. However, I can't find a definitive answer on whether aws-cli actually does that.

It matters to me because I'm intending to upload GPG-encrypted files from a backup process, and I'd like some confidence that what's been stored in S3 actually matches the original.

2 Answers
叛逆
#2 · 2020-07-06 07:25

According to the FAQ in the aws-cli GitHub repository, checksums are verified in most cases during both upload and download.

Key points for uploads:

  • The AWS CLI calculates the Content-MD5 header for both standard and multipart uploads.
  • If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and will instead return an error message to the AWS CLI.
  • The AWS CLI will retry this error up to 5 times before giving up and exiting with a nonzero exit code.
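Since the CLI signals a failed upload through its exit code, a backup script can stop as soon as an upload fails. The following is a minimal sketch under that assumption; the function name upload_checked and the file and bucket names in the usage comment are placeholders, not part of the AWS CLI.

```bash
#!/bin/bash
# Sketch: treat a nonzero exit code from `aws s3 cp` as a failed upload.
# upload_checked is an illustrative helper name, not an AWS CLI feature.

upload_checked() {
    local src="$1" dest="$2" status=0
    # Capture the exit code instead of letting the failure pass silently.
    aws s3 cp "$src" "$dest" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "upload failed (exit code $status): $src" >&2
        return "$status"
    fi
    echo "upload succeeded: $dest"
}

# Example usage (placeholder names):
# upload_checked backup.tar.gz.gpg s3://my-bucket/backups/backup.tar.gz.gpg || exit 1
```

Returning the original exit code rather than a fixed value keeps the CLI's own error signal available to whatever calls the backup script.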
再贱就再见
#3 · 2020-07-06 07:25

The AWS support page "How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3?" describes how to achieve this.

First, determine the base64-encoded MD5 digest of the file you wish to upload:

$ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"

Then use the s3api to upload the file:

$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64"

Note the --content-md5 flag. The help text for this flag states:

--content-md5  (string)  The  base64-encoded  128-bit MD5 digest of the part data.

This does not say much about why to use the flag, but the API documentation for PutObject explains it:

To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

Using this flag causes S3 to verify server-side that the hash of the received data matches the specified value. If the hashes match, S3 returns the ETag:

{
    "ETag": "\"599393a2c526c680119d84155d90f1e5\""
}

The ETag value will usually be the hexadecimal MD5 digest of the object (see this question for scenarios, such as multipart uploads, where this may not be the case).
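Because Content-MD5 is base64 while the ETag is hexadecimal, the two cannot be compared as strings directly. The sketch below converts one form to the other and compares them, using the digest values from this answer. The helper names are made up for illustration, and treating any ETag that contains a hyphen as a multipart ETag is an assumption about the usual "<hex>-<part count>" format.

```bash
#!/bin/bash
# Sketch: compare a base64 MD5 digest (as sent in Content-MD5) with an S3 ETag.
# Helper names are illustrative only.

# Convert a base64-encoded digest to lowercase hex.
# `base64 -d` is the GNU coreutils spelling; other platforms may differ.
md5_base64_to_hex() {
    printf '%s' "$1" | base64 -d | od -An -tx1 | tr -d ' \n'
}

# Return 0 if the ETag equals the hex form of the base64 digest,
# 1 if it differs, and 2 if the ETag looks like a multipart ETag
# (assumed format "<hex>-<parts>"), which is not a plain MD5 of the object.
etag_matches_md5() {
    local etag md5_base64="$2"
    # S3 wraps the ETag in literal double quotes; strip them first.
    etag="$(printf '%s' "$1" | tr -d '"')"
    case "$etag" in
        *-*) return 2 ;;
    esac
    [ "$etag" = "$(md5_base64_to_hex "$md5_base64")" ]
}

# Example usage with the values shown in this answer:
# etag_matches_md5 '"599393a2c526c680119d84155d90f1e5"' 'WZOTosUmxoARnYQVXZDx5Q==' && echo "match"
```

A separate return code for the multipart case keeps "cannot be compared" distinct from "compared and different", which matters when the result decides whether a backup is considered good.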

If the hash does not match the one you specified, you get an error:

A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.

You can also store the file's MD5 digest in the object metadata as an additional check:

$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"

After the upload, you can issue the head-object command to check the values:

$ aws s3api head-object --bucket my-bucket --key my-file
{
    "AcceptRanges": "bytes",
    "ContentType": "binary/octet-stream",
    "LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
    "ContentLength": 605,
    "ETag": "\"599393a2c526c680119d84155d90f1e5\"",
    "Metadata": {    
        "md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="    
    }    
}
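If jq is not available, both values can probably be read in a single call with the CLI's global --query and --output text options. This is a sketch only: fetch_s3_checksums is a hypothetical helper name, and it assumes the text output puts the two selected fields on one line separated by whitespace.

```bash
#!/bin/bash
# Sketch: read the ETag and the md5chksum metadata with one head-object call.
# fetch_s3_checksums is an illustrative helper name, not an AWS CLI feature.

# Prints "<etag-hex> <md5chksum>" for the given bucket and key.
fetch_s3_checksums() {
    local bucket="$1" key="$2" out
    # --query selects the two fields; --output text prints them without JSON.
    out="$(aws s3api head-object --bucket "$bucket" --key "$key" \
        --query '[ETag, Metadata.md5chksum]' --output text)" || return 1
    # Drop the literal quotes around the ETag and normalise tabs to spaces.
    printf '%s\n' "$out" | tr -d '"' | tr '\t' ' '
}

# Example usage (placeholder names):
# read -r s3_hex s3_base64 <<< "$(fetch_s3_checksums my-bucket my-file)"
```

One call instead of two means both values come from the same response, so the comparison cannot straddle an overwrite of the object.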

Here is a bash script that uploads with Content-MD5, adds the metadata, and then verifies that the values returned by S3 match the local hashes:

#!/bin/bash

set -euf -o pipefail

# assumes you have aws cli, jq installed

# change these if required
tmp_dir="$HOME/tmp"
s3_dir="foo"
s3_bucket="stack-overflow-example"
aws_region="ap-southeast-2"
aws_profile="my-profile"

test_dir="$tmp_dir/s3-md5sum-test"
file_name="MailHog_linux_amd64"
test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
s3_key="$s3_dir/$file_name"
return_dir="$( pwd )"

# -p creates $tmp_dir if needed and does not fail when the directory exists
mkdir -p "$test_dir"
cd "$test_dir" || exit

wget "$test_file_url"

md5_sum_hex="$( md5sum "$file_name" | awk '{ print $1 }' )"
md5_sum_base64="$( openssl md5 -binary "$file_name" | base64 )"

echo "$file_name hex    = $md5_sum_hex"
echo "$file_name base64 = $md5_sum_base64"

echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api put-object \
--bucket "$s3_bucket" \
--key "$s3_key" \
--body "$file_name" \
--metadata md5chksum="$md5_sum_base64" \
--content-md5 "$md5_sum_base64"

echo "Verifying sums match"

# the ETag is wrapped in literal double quotes, so strip them before comparing
s3_md5_sum_hex=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.ETag' | tr -d '"' )
s3_md5_sum_base64=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.Metadata.md5chksum' )

if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
    echo "checksums match"
else
    echo "something is wrong checksums do not match:"

    cat <<EOM | column -t -s ' '
$file_name file hex:    $md5_sum_hex    s3 hex:    $s3_md5_sum_hex
$file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
EOM

fi

echo "Cleaning up"
cd "$return_dir"
rm -rf "$test_dir"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api delete-object \
--bucket "$s3_bucket" \
--key "$s3_key"