AWS S3: Checksum Mismatch — When S3 Uploads Don’t Match Your Files
A checksum mismatch isn’t just an error — it’s a warning that your data and your assumptions are out of alignment.
Problem
You upload a large file to S3 — maybe through the AWS CLI, an SDK, or a data pipeline — and everything looks fine. The CLI confirms success. The object appears in the console.
But when you download it later, the file is corrupted or flagged with a checksum mismatch.
It’s one of those unsettling moments where you realize: the upload succeeded, but the data didn’t survive the trip unchanged.
Clarifying the Issue
A checksum mismatch happens when the file that S3 stored doesn’t perfectly match the one you sent.
S3 verifies integrity using a value called an ETag, which for simple single-part uploads (SSE-KMS-encrypted objects are an exception) is just the MD5 checksum of your file. However, for multipart uploads — files split into multiple chunks for faster upload — the ETag becomes a composite hash that can't be directly compared to your local MD5.
Typical causes include:
- Uploading large files with multipart upload, then comparing their composite ETag to a simple MD5.
- Client-side hashing errors, often caused by comparing the wrong algorithm (MD5 vs SHA).
- Network interruptions or retries during upload that alter part of the data stream.
- Middleware compression or encryption that changes file bytes in transit.
In other words, your file reached S3, but it didn’t arrive exactly as it left your system.
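A quick way to see the distinction in practice: a composite ETag carries a part-count suffix after a dash, so comparing an ETag against a local MD5 is only meaningful when that suffix is absent. Here is a minimal Python sketch (the helper name is illustrative, not an AWS API):
def etag_comparable_to_md5(etag: str) -> bool:
    # S3 returns ETags wrapped in double quotes; strip them first.
    etag = etag.strip('"')
    # A dash marks a multipart (composite) ETag, which cannot be
    # compared to a single MD5 hash of the whole file.
    return '-' not in etag

print(etag_comparable_to_md5('"d41d8cd98f00b204e9800998ecf8427e"'))     # True
print(etag_comparable_to_md5('"d41d8cd98f00b204e9800998ecf8427e-10"'))  # False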
Why It Matters
Checksum mismatches erode trust in your storage layer — and in large data workflows, trust is everything.
If your ETL job, backup, or ML pipeline depends on data integrity, a single corrupted upload can quietly contaminate everything downstream.
You may not discover it until hours later, when your data validation or model training fails.
When you can’t trust that a successful upload means a correct file, your system isn’t resilient — it’s fragile.
Key Terms
- Checksum: A mathematical fingerprint of a file’s contents used to verify integrity (e.g., MD5, SHA-256).
- ETag: The entity tag S3 returns with each object, which often (but not always) represents the MD5 checksum of the object.
- Multipart Upload: A method for uploading large files in pieces. Each part has its own checksum, and the final ETag combines them.
- Integrity Verification: The process of ensuring that what you upload is exactly what S3 stores.
- Checksum Algorithm: The specific method (e.g., MD5, SHA-1, SHA-256) used to calculate and compare integrity hashes.
Steps at a Glance
- Identify the ETag mismatch in your S3 object metadata.
- Determine whether the object used multipart upload.
- Calculate your local file checksum manually.
- Compare it to S3’s ETag or checksum metadata.
- Use S3’s checksum-aware verification for multipart files.
- Re-upload with integrity validation enabled.
Detailed Steps
Step 1 — Identify the Mismatch
List the object and note its ETag:
aws s3api head-object --bucket your-bucket --key yourfile.bin
Example output:
"ETag": "\"d41d8cd98f00b204e9800998ecf8427e-10\""
If the ETag doesn't match your local MD5, you've either got a mismatch or a multipart ETag; Step 2 shows how to tell the difference.
You can also view the ETag and checksum fields directly in the AWS Management Console under the Properties tab for your object.
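If you're scripting rather than using the CLI, the same lookup in boto3 looks like this (a minimal sketch; the bucket and key names are placeholders):
import boto3

s3 = boto3.client('s3')
response = s3.head_object(Bucket='your-bucket', Key='yourfile.bin')
# boto3 returns the ETag wrapped in double quotes, e.g. '"d41d8...-10"'
print(response['ETag'])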
Step 2 — Check for Multipart Upload
Multipart uploads use a composite ETag, recognizable by a trailing dash and part count (e.g., "d41d8cd98f00b204e9800998ecf8427e-10").
The “-10” means the file was uploaded in 10 parts — not that your data is broken.
You just can’t compare this directly to a single MD5 hash.
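If you know the part size used for the upload (the AWS CLI defaults to 8 MB chunks), you can reproduce the composite ETag locally. This is a hedged sketch of the common checksum-of-checksums technique, not an official AWS utility:
import hashlib

def composite_etag(path, part_size=8 * 1024 * 1024):
    # MD5 each part, then MD5 the concatenation of the raw digests.
    digests = []
    with open(path, 'rb') as f:
        while chunk := f.read(part_size):
            digests.append(hashlib.md5(chunk).digest())
    if len(digests) == 1:
        return digests[0].hex()  # single part: plain MD5
    combined = hashlib.md5(b''.join(digests))
    return f'{combined.hexdigest()}-{len(digests)}'

print(composite_etag('yourfile.bin'))
If the result matches the S3 ETag, your data survived the trip intact even though a naive MD5 comparison failed.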
Step 3 — Calculate the Local Checksum
Run a local checksum to verify the original file’s integrity:
md5sum yourfile.bin
# or
shasum -a 256 yourfile.bin
Example output:
d41d8cd98f00b204e9800998ecf8427e  yourfile.bin
If this doesn’t match the ETag and your upload wasn’t multipart, corruption may have occurred during transfer.
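If your pipeline is Python-based, you can compute the same digests without shelling out. A minimal streaming sketch that keeps memory flat on large files:
import hashlib

def file_digest(path, algorithm='sha256', chunk_size=1024 * 1024):
    # Read the file in chunks so large files don't exhaust memory.
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

print(file_digest('yourfile.bin', 'md5'))
print(file_digest('yourfile.bin', 'sha256'))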
Step 4 — Validate with S3’s New Checksum APIs
AWS now supports explicit checksum verification for several algorithms (CRC32, CRC32C, SHA-1, and SHA-256):
aws s3api put-object \
--bucket your-bucket \
--key yourfile.bin \
--body yourfile.bin \
--checksum-algorithm SHA256
After uploading, you can confirm integrity. Note that head-object only returns checksum fields when you request them with --checksum-mode ENABLED:
aws s3api head-object \
--bucket your-bucket \
--key yourfile.bin \
--checksum-mode ENABLED \
--query "ChecksumSHA256"
If the returned value matches your local SHA-256, the upload is verified. Keep in mind that S3 reports checksums base64-encoded, while shasum prints hex.
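That hex-versus-base64 difference trips up a lot of comparisons. A small sketch of the conversion so the local value lines up with what head-object returns:
import base64
import hashlib

h = hashlib.sha256()
with open('yourfile.bin', 'rb') as f:
    while chunk := f.read(1024 * 1024):
        h.update(chunk)

# S3 reports ChecksumSHA256 as a base64-encoded digest, not hex,
# so convert before comparing with the head-object output above.
print(base64.b64encode(h.digest()).decode())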
Step 5 — Re-Upload with Validation
If you find corruption or mismatch, re-upload with integrity verification built in.
For example, in Python with boto3:
import boto3

s3 = boto3.client('s3')
s3.upload_file(
    'yourfile.bin',
    'your-bucket',
    'yourfile.bin',
    ExtraArgs={'ChecksumAlgorithm': 'SHA256'}
)
This ensures S3 computes and stores the checksum for each part during upload.
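To confirm what S3 actually stored, you can read the checksum back with GetObjectAttributes (a minimal boto3 sketch; multipart uploads return a composite value here):
import boto3

s3 = boto3.client('s3')
attrs = s3.get_object_attributes(
    Bucket='your-bucket',
    Key='yourfile.bin',
    ObjectAttributes=['Checksum', 'ObjectParts', 'ETag']
)
print(attrs.get('Checksum'))     # e.g. {'ChecksumSHA256': '...base64...'}
print(attrs.get('ObjectParts'))  # present only for multipart uploads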
Step 6 — Automate Verification in Pipelines
Integrate checksum validation into your CI/CD or data lake pipelines.
AWS S3 Inventory reports can include each object's ETag and checksum algorithm, letting you audit object integrity in bulk on a schedule.
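As a pipeline gate, the pieces above combine into a single pass/fail check. A hedged sketch (verify_object is an illustrative name, and it assumes the object was uploaded with SHA-256 checksums as in Step 5):
import base64
import hashlib
import boto3

def verify_object(bucket, key, local_path):
    # Recompute the local SHA-256 and compare it to what S3 stored.
    h = hashlib.sha256()
    with open(local_path, 'rb') as f:
        while chunk := f.read(1024 * 1024):
            h.update(chunk)
    local = base64.b64encode(h.digest()).decode()

    s3 = boto3.client('s3')
    head = s3.head_object(Bucket=bucket, Key=key, ChecksumMode='ENABLED')
    # Multipart uploads report a composite value with a '-N' suffix,
    # which will not equal a whole-file digest.
    return head.get('ChecksumSHA256') == local

if not verify_object('your-bucket', 'yourfile.bin', 'yourfile.bin'):
    raise RuntimeError('S3 object failed integrity verification')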
ProTip — ETag Isn’t Always MD5
Don’t rely on ETag equality checks for multipart or encrypted uploads.
Instead, use S3's newer checksum fields, like ChecksumCRC32, ChecksumSHA1, and ChecksumSHA256, which are explicit about the algorithm used no matter how the object was uploaded. Just remember that multipart uploads still report a composite checksum-of-checksums with a part-count suffix, so compare part-aware values against part-aware values.
This ensures accurate verification whether your file is small, massive, or encrypted.
Conclusion
A checksum mismatch isn’t just an error — it’s a warning that your data and your assumptions are out of alignment.
By moving beyond ETag comparisons and using S3’s modern checksum APIs, you build resilience into your storage architecture.
Every upload becomes provable, every download verifiable, and every byte accounted for.
In the cloud, integrity isn’t automatic — it’s intentional.
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.