When Partial Writes Sneak Through: DynamoDB Consistency After S3 Failures (S3 → Lambda → DynamoDB)

 


When the cloud fails halfway, DynamoDB remembers — even when the rest of your system forgets.





Problem

Your event-driven pipeline looks airtight.

S3 receives files, Lambda processes them, and DynamoDB stores clean results for your analytics layer.

Then, one day, a batch of uploads silently breaks the pattern.

Half the files show up in DynamoDB. The other half vanish into thin air.

CloudWatch shows no alarms. Lambda retried a few events. Everything looks fine.

Until your analyst says:

“Why are my totals off by 14 records?”

And there it is — the ghost of a failure that never fully committed.


Clarifying the Issue

This is the half-success problem — where Lambda partially succeeds, DynamoDB commits some writes, and then a retry or timeout leaves the system inconsistent.

Let’s unpack what happens under the hood:

  1. S3 triggers multiple Lambda events — one per object.
  2. Lambda processes the event, parses the file, and writes a record to DynamoDB.
  3. A transient failure occurs — such as a ProvisionedThroughputExceededException, a network timeout, or a Lambda timeout mid-batch.
  4. Lambda retries — but by this time, some writes have succeeded while others never occurred.

The key point:

DynamoDB’s writes are atomic per request, not per batch of events.

That means if your Lambda handles multiple S3 events in one invocation, some writes can succeed while others fail silently during the retry window.

Result: partial persistence.
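
The difference is easy to see without touching AWS. In this sketch a fake in-memory table stands in for DynamoDB, and a crash after the second put leaves the first two writes committed (the class and function names here are illustrative, not part of any SDK):

```python
class FakeTable:
    def __init__(self):
        self.items = {}

    def put_item(self, Item):
        self.items[Item['id']] = Item

def process_batch(table, keys, crash_after=None):
    # Each put_item is its own independent request, mirroring DynamoDB:
    # there is no batch-level rollback if a later iteration fails.
    for i, key in enumerate(keys):
        table.put_item(Item={'id': key, 'filename': key})
        if crash_after is not None and i + 1 == crash_after:
            raise RuntimeError("Simulated mid-invocation failure")

table = FakeTable()
try:
    process_batch(table, ['file1', 'file2', 'file3', 'file4'], crash_after=2)
except RuntimeError:
    pass

print(len(table.items))  # 2 of 4 intended writes persisted
```

Two items survive the crash, two never arrive: exactly the drift the rest of this article hunts down.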


Why It Matters

This isn’t a data science problem — it’s an architectural debt problem.

Every partial write breaks the chain of truth between S3 and DynamoDB.

That drift propagates:

  • Dashboards misreport counts and aggregates.
  • Downstream Lambdas double-process or skip entries.
  • “Idempotent” systems lose their guarantees.

In distributed systems, consistency errors are like rust — invisible at first, devastating later.


Key Terms

  • Partial Write: A subset of intended writes succeeds while others fail, usually due to a transient Lambda or network failure.
  • Idempotency Token: A unique identifier that ensures the same logical write doesn’t happen twice.
  • Compensating Write: A corrective transaction or update to restore state after failure.
  • Transactional Write: A DynamoDB operation (TransactWriteItems) that commits or rolls back as a single atomic unit.
  • Reconciliation Job: A background Lambda or script that scans and repairs drifted records.
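
To make the idempotency token concrete, here is a minimal sketch in which a plain dict stands in for the table and `idempotent_put` is a made-up helper; the membership check plays the role of a conditional write:

```python
import hashlib

store = {}

def idempotent_put(store, key):
    # Deterministic token per logical write (md5 as a hash, not for security)
    token = hashlib.md5(key.encode()).hexdigest()
    if token in store:           # stands in for attribute_not_exists(id)
        return False             # duplicate write rejected
    store[token] = {'filename': key}
    return True

first = idempotent_put(store, 'file1.txt')   # original write
retry = idempotent_put(store, 'file1.txt')   # simulated Lambda retry
print(first, retry, len(store))  # True False 1
```

However many times the retry fires, the store ends up with exactly one record per logical write.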

Steps at a Glance

  1. Build the baseline pipeline: S3 → Lambda → DynamoDB (single table).
  2. Smoke test: Upload multiple files and verify consistent writes.
  3. Introduce the failure: Force a mid-invocation crash to simulate a partial write.
  4. Detect drift: Query DynamoDB and confirm missing or duplicated items.
  5. Apply the fix: Use DynamoDB TransactWriteItems or a compensating reconciliation function.

Step 1 – Build the Baseline Pipeline

Create a minimal 3-service stack:

  • An S3 bucket to trigger events.
  • A Lambda function to write to DynamoDB.
  • A DynamoDB table with a simple key schema.
aws s3api create-bucket \
  --bucket partial-write-demo \
  --region us-east-1
aws dynamodb create-table \
  --table-name partial-write-table \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

✅ Confirm the table is active before proceeding:

aws dynamodb describe-table \
  --table-name partial-write-table \
  --query "Table.TableStatus"

✅ Output:

"ACTIVE"

Step 2 – Smoke Test the Happy Path

Create a simple Lambda function that writes every new file name to DynamoDB.

cat > lambda_handler.py <<'EOF'
import boto3, hashlib

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('partial-write-table')

def lambda_handler(event, context):
    for record in event['Records']:
        key = record['s3']['object']['key']
        # Deterministic id per object key (md5 as a hash, not for security)
        event_id = hashlib.md5(key.encode()).hexdigest()
        table.put_item(Item={'id': event_id, 'filename': key})
    print("Processed all events successfully.")
EOF

✅ Upload and test:

echo "file1" > file1.txt
echo "file2" > file2.txt
aws s3 cp file1.txt s3://partial-write-demo/
aws s3 cp file2.txt s3://partial-write-demo/

✅ Check the table:

aws dynamodb scan --table-name partial-write-table --query "Count"

✅ Output:

2

Everything looks good — for now.
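
Before wiring up real uploads, you can also sanity-check the handler logic locally by replaying a fake S3 notification event against a stub table (the event shape mirrors real S3 records; `StubTable` is an invented stand-in, not a boto3 type):

```python
import hashlib

class StubTable:
    """Stands in for the boto3 Table resource; stores items in a dict."""
    def __init__(self):
        self.items = {}
    def put_item(self, Item):
        self.items[Item['id']] = Item

table = StubTable()

def lambda_handler(event, context):
    # Same loop as the deployed handler, pointed at the stub
    for record in event['Records']:
        key = record['s3']['object']['key']
        event_id = hashlib.md5(key.encode()).hexdigest()
        table.put_item(Item={'id': event_id, 'filename': key})

fake_event = {'Records': [
    {'s3': {'object': {'key': 'file1.txt'}}},
    {'s3': {'object': {'key': 'file2.txt'}}},
]}
lambda_handler(fake_event, None)
print(len(table.items))  # 2, matching the scan count
```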


Step 3 – Introduce the Failure

Now let’s simulate a mid-flight failure that causes partial persistence.

Edit the Lambda:

cat > lambda_handler.py <<'EOF'
import boto3, hashlib, random

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('partial-write-table')

def lambda_handler(event, context):
    for record in event['Records']:
        key = record['s3']['object']['key']
        event_id = hashlib.md5(key.encode()).hexdigest()
        table.put_item(Item={'id': event_id, 'filename': key})
        # Randomly crash partway through the batch, leaving earlier
        # put_item calls already committed
        if random.random() < 0.5:
            raise Exception("Simulated mid-invocation failure")
EOF

✅ Upload several files at once: some writes will succeed, while others never reach DynamoDB.


Step 4 – Detect Drift

Run a scan again:

aws dynamodb scan --table-name partial-write-table

❌ Output shows missing records — drift confirmed.

At this point, S3 shows all uploaded files, but DynamoDB only holds a subset.

The system’s perceived truth diverges.
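
Drift detection reduces to a set difference between the keys S3 holds and the ids DynamoDB recorded. Here is a local sketch with in-memory stand-ins (in production you would page through the bucket listing and a table scan; `find_drift` is a hypothetical helper):

```python
import hashlib

def find_drift(s3_keys, dynamo_ids):
    # Map each S3 key to the id the handler would have written for it
    expected = {hashlib.md5(k.encode()).hexdigest(): k for k in s3_keys}
    missing = set(expected) - set(dynamo_ids)
    return sorted(expected[i] for i in missing)

s3_keys = ['file1.txt', 'file2.txt', 'file3.txt']
dynamo_ids = [hashlib.md5(k.encode()).hexdigest()
              for k in ['file1.txt', 'file3.txt']]
print(find_drift(s3_keys, dynamo_ids))  # ['file2.txt']
```

The returned keys are the files whose writes were dropped: the raw material for a reconciliation job.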


Step 5 – Apply the Fix

Use a transactional write or compensating logic to ensure all-or-nothing behavior. The fix applies within a single Lambda invocation: every S3 record handled in one run is either fully committed or not written at all.

Here’s a simple transactional rewrite:

cat > lambda_handler.py <<'EOF'
import boto3, hashlib

client = boto3.client('dynamodb')

def lambda_handler(event, context):
    items = []
    for record in event['Records']:
        key = record['s3']['object']['key']
        event_id = hashlib.md5(key.encode()).hexdigest()
        items.append({
            'Put': {
                'TableName': 'partial-write-table',
                'Item': {
                    'id': {'S': event_id},
                    'filename': {'S': key}
                },
                'ConditionExpression': 'attribute_not_exists(id)'
            }
        })

    try:
        client.transact_write_items(TransactItems=items)
        print("All items written atomically.")
    except client.exceptions.TransactionCanceledException:
        # If any item fails its condition, DynamoDB rolls back the
        # entire transaction and nothing is written
        print("Transaction canceled: duplicate or conflicting write detected; no items persisted.")
EOF

✅ Test again with multiple uploads.

Every invocation now commits as a single atomic transaction — either all records persist or none do.
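
The conditional transaction is also retry-safe: because every item carries attribute_not_exists(id), a retried invocation cancels as a unit rather than double-writing part of a batch. A toy in-memory model of that behavior (illustrative only, not the DynamoDB API):

```python
class FakeTxnTable:
    def __init__(self):
        self.items = {}

    def transact_put(self, batch):
        # All-or-nothing: if any id already exists, reject the whole batch
        if any(item['id'] in self.items for item in batch):
            raise RuntimeError("TransactionCanceled")
        for item in batch:
            self.items[item['id']] = item

table = FakeTxnTable()
batch = [{'id': 'a', 'filename': 'file1'}, {'id': 'b', 'filename': 'file2'}]
table.transact_put(batch)          # first invocation commits both items
try:
    table.transact_put(batch)      # simulated Lambda retry of the same event
except RuntimeError:
    pass
print(len(table.items))  # still 2: the retry wrote nothing extra
```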


Pro Tips

  • Use TransactWriteItems carefully: It’s limited to 100 items per transaction (raised from the original 25) — batch accordingly.
  • Add a reconciliation job: Periodically compare S3 object keys to DynamoDB ids to catch residual drift.
  • Log both success and rollback states: Use structured CloudWatch logs to track exactly when and why a transaction rolled back.
  • Don’t ignore conditional checks: They prevent duplicates from retries.
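
Since a single TransactWriteItems call is capped at 100 items at the time of writing, larger S3 batches need chunking, with each chunk committing atomically on its own (a small helper sketch, not part of boto3):

```python
def chunk(items, size=100):
    # Yield successive slices no larger than the per-transaction cap
    for i in range(0, len(items), size):
        yield items[i:i + size]

batches = list(chunk(list(range(250))))
print([len(b) for b in batches])  # [100, 100, 50]
```

Note that atomicity then holds per chunk, not across the whole invocation, so a periodic reconciliation job still earns its keep.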

Conclusion

Partial writes are the silent killers of event-driven systems.

They don’t crash your code — they corrupt your data.

By pairing transactional writes with idempotent logic and periodic reconciliation, you harden your pipeline against the messy realities of distributed timing and failure.

In cloud architecture, success isn’t binary — it’s atomic.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
