The Secret Life of AWS: The Orchestrator (AWS Step Functions)
How to manage distributed transactions and automated rollbacks using the Saga Pattern
#AWS #StepFunctions #SagaPattern #Serverless
Margaret is a senior software engineer. Timothy is her junior colleague. They work in a grand Victorian library in London — the kind of place where code quality is the unspoken objective, and craftsmanship is the only thing that matters.
Episode 74
Timothy was staring at a customer support ticket on his secondary monitor, shaking his head in disbelief. He had spent months ensuring his decoupled, event-driven architecture was highly available, cost-efficient, and chaos-tested. Yet, a severe logical bug had just surfaced.
Margaret walked into the Victorian studio and set her notebook on the table. "You have that look again, Timothy. What did the architecture do this time?"
"It did exactly what I told it to do, which is the problem," Timothy admitted. "A customer placed an order for a mechanical keyboard. Our EventBridge bus fired off the events perfectly. The Payment microservice charged their credit card. A fraction of a second later, the Inventory microservice tried to reserve the item, but our warehouse API returned an 'Out of Stock' error."
"So the customer was charged for an item we do not have," Margaret summarized.
"Exactly," Timothy said. "Because the architecture is fully decoupled, the Payment service has no idea the Inventory service failed. I am having to write custom 'cleanup' scripts to listen for inventory failures and manually trigger refunds. It is turning into distributed spaghetti code."
"You have spent 73 episodes successfully decoupling this architecture," Margaret explained, pulling up a chair. "But you have hit the ceiling of event-driven choreography. EventBridge is brilliant for broadcasting state changes, but it is terrible at managing strict, multi-step workflows. When a distributed transaction fails halfway through, you cannot just leave the database in an inconsistent state. We need to implement the Saga Pattern using AWS Step Functions."
Choreography vs. Orchestration
Margaret took the keyboard and opened the AWS Console.
"In a choreographed system, every microservice is an independent dancer reacting to the music," Margaret said. "But for a checkout flow, we don't want independent dancers. We want an orchestra with a single conductor dictating exactly what happens, step by step, and what to do if an instrument breaks."
She opened Timothy's AWS Cloud Development Kit (CDK) stack to define a State Machine.
const { StateMachine, DefinitionBody } = require('aws-cdk-lib/aws-stepfunctions');
const { LambdaInvoke } = require('aws-cdk-lib/aws-stepfunctions-tasks');

// Assumes paymentLambda, inventoryLambda, and refundLambda are defined elsewhere in this stack.

// 1. Define the individual tasks
const processPayment = new LambdaInvoke(this, 'Process Payment', { lambdaFunction: paymentLambda });
const reserveInventory = new LambdaInvoke(this, 'Reserve Inventory', { lambdaFunction: inventoryLambda });
const refundPayment = new LambdaInvoke(this, 'Refund Payment (Compensating Transaction)', { lambdaFunction: refundLambda });

// 2. Implement the Saga Pattern (The Rollback)
// If inventory fails, catch the error and route the workflow to the refund task.
// resultPath keeps the original input and adds the error under $.errorDetails.
reserveInventory.addCatch(refundPayment, {
  errors: ['InventoryDepletedException'],
  resultPath: '$.errorDetails'
});

// 3. Chain the successful workflow together
const checkoutDefinition = processPayment.next(reserveInventory);

// 4. Create the Step Functions State Machine
// definitionBody replaces the deprecated `definition` property.
const checkoutStateMachine = new StateMachine(this, 'CheckoutOrchestrator', {
  definitionBody: DefinitionBody.fromChainable(checkoutDefinition),
  tracingEnabled: true
});
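For the catch rule above to fire, the Inventory Lambda has to fail with an error whose name matches the string in `errors`. The following is a minimal sketch of what that handler might look like, not Timothy's actual code: the `InventoryDepletedException` class body, the `stockLevels` map standing in for the warehouse API, and the `reserveItem` helper are all assumptions made for illustration.

```javascript
// Hypothetical error class. Step Functions matches on the error's `name`,
// so it must equal the string listed in `errors: [...]` on the catch rule.
class InventoryDepletedException extends Error {
  constructor(sku) {
    super(`SKU ${sku} is out of stock`);
    this.name = 'InventoryDepletedException';
  }
}

// Hypothetical in-memory stand-in for the warehouse API.
const stockLevels = new Map([
  ['keyboard-01', 0],
  ['mouse-02', 3],
]);

// Reserve one unit, or throw the named error when nothing is left.
function reserveItem(sku) {
  const available = stockLevels.get(sku) ?? 0;
  if (available < 1) {
    throw new InventoryDepletedException(sku);
  }
  stockLevels.set(sku, available - 1);
  return { sku, reserved: true };
}

// Lambda entry point (export it however your packaging requires).
// An uncaught error fails the task, which lets the state machine's catch rule take over.
const handler = async (event) => reserveItem(event.sku);
```

Throwing a named error, rather than returning a failure payload, is what lets the state machine route on it; a handler that returns `{ success: false }` would look like a successful task to Step Functions.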
The Saga Pattern
Timothy studied the CDK code. "So instead of API Gateway dropping an event onto a bus and hoping for the best, it triggers this CheckoutOrchestrator State Machine. The State Machine explicitly calls the Payment Lambda, waits for success, and then explicitly calls the Inventory Lambda."
"Precisely," Margaret said. "And look at the .addCatch block attached to the Inventory task. This is the implementation of the Saga Pattern."
"In a monolithic database, if a multi-step transaction fails, you just execute an SQL ROLLBACK," Timothy noted. "But we are in a distributed serverless environment. The payment is already processed in a completely different database."
"Which is why we use Compensating Transactions," Margaret nodded. "If the Reserve Inventory task throws an InventoryDepletedException, AWS Step Functions instantly catches it and routes the workflow to the Refund Payment task. It actively reverses the previous step to bring the system back to a consistent state. And just like we established in Episode 61, that compensating refund task must be idempotent, ensuring we never accidentally refund a customer twice if the network blips."
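One way to picture the idempotency Margaret describes is to key each refund by its order ID and return the stored receipt on any repeat call. This is only a sketch under stated assumptions: `processedRefunds` is an in-memory Map standing in for durable storage, and `refundOnce` and the `issueRefund` callback are hypothetical names, not part of any AWS or payment API.

```javascript
// Hypothetical store of completed refunds, keyed by order ID.
// A real Lambda would need durable storage; a Map is lost between cold starts.
const processedRefunds = new Map();

// Issue a refund at most once per order. `issueRefund` is a hypothetical
// callback that talks to the payment provider and returns a receipt object.
function refundOnce(orderId, amountCents, issueRefund) {
  if (processedRefunds.has(orderId)) {
    // Repeat delivery: return the original receipt instead of refunding again.
    return { ...processedRefunds.get(orderId), duplicate: true };
  }
  const receipt = issueRefund(orderId, amountCents);
  processedRefunds.set(orderId, receipt);
  return { ...receipt, duplicate: false };
}
```

The check and the write here are two separate steps, which is fine for a single-threaded sketch but not for concurrent invocations; a production version would need an atomic "insert if absent" operation in whatever store it uses.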
"And if it fails, I don't have to go digging through CloudWatch logs to figure out where the transaction died," Timothy realized.
"Correct," Margaret smiled. "Every Step Functions execution leaves a visual audit trail. You can look at the console and see exactly where the saga succeeded, where it failed, and how the rollback executed."
Timothy updated his architecture diagram. He removed the tangled web of EventBridge rules trying to manage his checkout flow and replaced it with a single, clean AWS Step Functions icon. His distributed transactions were no longer a chaotic dance; they were a tightly orchestrated symphony, complete with explicit, auditable rollbacks.
Key Concepts Introduced
Choreography vs. Orchestration
Two patterns for managing microservices.
- Choreography (using Amazon EventBridge or SNS) relies on services independently reacting to events without centralized control; it is highly decoupled but difficult to track.
- Orchestration (using AWS Step Functions) uses a centralized controller to manage the execution order, data passing, and error handling of a workflow.
AWS Step Functions
A visual workflow service that helps developers build distributed applications and orchestrate microservices. It provides a visual execution history, making it trivial to audit exact success/failure paths in complex workflows.
Standard vs. Express Workflows
Step Functions offers two distinct execution modes. Standard Workflows are designed for long-running, auditable business processes (like order fulfillment) and can run for up to a year. Express Workflows are optimized for high-volume, short-duration tasks (like IoT data ingestion) that finish within five minutes, where visual audit trails per execution are less critical and cost efficiency is paramount.
The Saga Pattern
A failure management pattern used in distributed architectures where standard database transactions cannot span multiple microservices. A business workflow is broken into smaller, local transactions, each paired with a compensating action that undoes it if a later step fails.
Compensating Transactions
The mechanism that makes the Saga Pattern work. If a local transaction fails halfway through a distributed workflow, the orchestrator triggers a series of idempotent compensating transactions (e.g., issuing a refund) to actively undo the work of the previous successful steps.
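Stripped of the AWS specifics, the idea can be sketched in a few lines of plain JavaScript. This is an illustration of the pattern rather than how Step Functions is implemented: `runSaga`, the step shape (`name`, `action`, `compensate`), and the sample checkout steps are all hypothetical.

```javascript
// Run steps in order. If one throws, undo the steps that already succeeded,
// newest first, and report which step failed.
function runSaga(steps, context) {
  const completed = [];
  for (const step of steps) {
    try {
      step.action(context);
      completed.push(step);
    } catch (error) {
      for (const done of completed.reverse()) {
        done.compensate(context);
      }
      return { status: 'ROLLED_BACK', failedStep: step.name, error: error.message };
    }
  }
  return { status: 'COMPLETED' };
}

// Hypothetical checkout saga mirroring the episode: payment succeeds, inventory fails.
const checkoutLog = [];
const checkoutSteps = [
  {
    name: 'ProcessPayment',
    action: () => checkoutLog.push('charged card'),
    compensate: () => checkoutLog.push('refunded card'),
  },
  {
    name: 'ReserveInventory',
    action: () => { throw new Error('Out of Stock'); },
    compensate: () => checkoutLog.push('released stock'),
  },
];

const checkoutResult = runSaga(checkoutSteps, {});
console.log(checkoutResult.status, checkoutLog);
```

Only steps that completed are compensated, so the failed inventory step never has its own compensation invoked. The sketch also assumes compensations themselves succeed; in practice they need retries and idempotency for exactly that reason.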
Aaron Rose is a software engineer and technology writer at tech-reader.blog.
