The Secret Life of AWS: Distributed Tracing (AWS X-Ray)

How to track a single user request across multiple microservices.

#AWS #XRay #Observability #Microservices

Part 50 of The Secret Life of AWS

Timothy had six different browser tabs open on his monitor. He was switching between CloudWatch logs and metrics for the API Gateway, the Checkout Lambda function, the Amazon SQS queue, and the Inventory Lambda function.

"I have a user report indicating their checkout process took fourteen seconds to complete," Timothy explained as Margaret pulled up a chair next to him. "I am trying to match the timestamps across all these different log groups to find out which specific service caused the delay. It is taking quite a bit of time."

Margaret smiled sympathetically. "That is incredibly difficult to do manually, Timothy. We have built roughly fifty different architectural components together over the past year. You did an excellent job decoupling them using event-driven patterns. However, the direct result of a decoupled architecture is that a single user request now spans multiple independent compute environments. Matching timestamps across isolated systems is unreliable."

"How do we track the request if the services do not share a compute environment?" Timothy asked.

"We implement Distributed Tracing," Margaret said, opening the AWS Management Console. "CloudWatch Logs provides the detailed, text-based output of a single compute process. We need a system that maps the chronological relationship between all of those separate compute processes. We must assign a unique identifier to the request the moment it enters our cloud environment, and we must ensure that identifier is passed to every downstream service. We will use AWS X-Ray."

The Trace Identifier

Margaret navigated to the API Gateway configuration and enabled X-Ray tracing.

"When the user's HTTP request hits the API Gateway, X-Ray generates a unique tracking string called a Trace ID," she explained. "It travels in a standard HTTP header named X-Amzn-Trace-Id. The API Gateway attaches this header to the request before forwarding it to your Checkout Lambda function."
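To make the header concrete, here is a minimal sketch of its layout, using only the Python standard library. The sample value follows the format shown in the X-Ray documentation (a Root trace ID made of a version, the epoch time in hex, and 96 random bits, plus optional Parent and Sampled fields). The function names are hypothetical and exist only for this illustration; in practice the services and the X-Ray SDK generate and parse the header for you.

```python
import os
import re
import time

# Version "1", 8 hex digits of epoch seconds, 24 hex digits of randomness.
TRACE_ID_PATTERN = re.compile(r"^1-[0-9a-f]{8}-[0-9a-f]{24}$")


def new_trace_id(now=None):
    """Build an ID in the X-Ray layout (illustrative helper, not an AWS API)."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def parse_trace_header(header):
    """Split 'Root=...;Parent=...;Sampled=1' into a dict of its fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
fields = parse_trace_header(header)
print(fields["Root"])      # the Trace ID shared by every service in the chain
print(fields["Sampled"])   # "1" means this request is being recorded
print(new_trace_id())      # a freshly generated ID in the same layout
```

The Root value is what stays constant across services; the Parent value changes at each hop to point at the segment that made the call.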

"So the Checkout Lambda function logs that Trace ID?" Timothy asked.

"Yes, but it does more than just log it," Margaret replied. "It performs Trace Propagation. When the Checkout Lambda function sends a message to the SQS queue, the instrumented AWS SDK client automatically carries that same trace header along with the message, in a system attribute named AWSTraceHeader. When the downstream Inventory Lambda function pulls the message from the queue, it reads the Trace ID and continues the chain. Every service reports its execution time and status back to the X-Ray API using that shared ID."
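As a rough illustration of where the propagated header ends up, the sketch below reads the AWSTraceHeader system attribute from a Lambda SQS event. The event is a trimmed, made-up sample, and both helper functions are hypothetical. Lambda and the X-Ray SDK normally link the traces without any code like this; the point is only to show that the ID physically rides along with each message.

```python
def trace_headers_from_sqs_event(event):
    """Map each record's messageId to its trace header (None if untraced)."""
    headers = {}
    for record in event.get("Records", []):
        attributes = record.get("attributes", {})
        headers[record["messageId"]] = attributes.get("AWSTraceHeader")
    return headers


def root_trace_id(header):
    """Return the Root field of a trace header, or None if it is absent."""
    if not header:
        return None
    for part in header.split(";"):
        key, _, value = part.strip().partition("=")
        if key == "Root":
            return value
    return None


# Trimmed sample event; the IDs and body are invented for this example.
sample_event = {
    "Records": [
        {
            "messageId": "msg-1",
            "body": '{"order_id": 42}',
            "attributes": {
                "AWSTraceHeader": "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
            },
        },
        {"messageId": "msg-2", "body": "{}", "attributes": {}},
    ]
}

for message_id, header in trace_headers_from_sqs_event(sample_event).items():
    print(message_id, root_trace_id(header))
```

A message without the attribute (like msg-2 here) simply was not sent by a traced caller, so the downstream function would start a new trace for it.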

The Service Map

"Let me show you the primary benefit of this configuration," Margaret said. She opened the AWS X-Ray console and clicked on the Service Map.

Instead of lines of text, the screen displayed a visual, interactive diagram of their exact architecture, generated entirely from the trace data.

"X-Ray aggregates the data reported under each Trace ID and visualizes it," Margaret explained. "Every traced service or resource is represented as a node. The lines connecting them show the path the user requests took. We no longer have to guess the architecture; the system maps it for us based on the requests it actually traced."

Timothy looked at the map. The node representing the Checkout database was green, indicating successful calls, and its latency was low. However, the node representing the Inventory database was outlined in red, indicating faults, and the connection line showed a high latency metric.

"Look at the Inventory database node," Timothy pointed out. "X-Ray shows the average response time for that specific database query is twelve seconds."

"Exactly," Margaret said encouragingly. "You found the exact source of the fourteen-second delay in less than a minute. You do not need to search through separate log groups. The Service Map highlights latency, errors, and faults visually. You can click directly on the red node to open the traces behind it, and when the database calls are instrumented, the trace details can show the specific SQL query that caused the delay."
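The aggregation behind a map like this can be sketched in a few lines. The example below is a simplified model, not X-Ray's actual implementation: the "service" and "caller" fields are hypothetical stand-ins for the name and parent relationships in real segment data, and the timings are invented to mirror the story.

```python
from collections import defaultdict


def build_service_map(segments):
    """Aggregate per-call records into nodes (averages) and caller->callee edges."""
    totals = defaultdict(lambda: {"count": 0, "seconds": 0.0, "faults": 0})
    edges = set()
    for seg in segments:
        node = totals[seg["service"]]
        node["count"] += 1
        node["seconds"] += seg["end_time"] - seg["start_time"]
        node["faults"] += 1 if seg.get("fault") else 0
        if seg.get("caller"):
            edges.add((seg["caller"], seg["service"]))
    nodes = {
        name: {
            "avg_seconds": t["seconds"] / t["count"],
            "fault_rate": t["faults"] / t["count"],
        }
        for name, t in totals.items()
    }
    return nodes, sorted(edges)


# Invented sample data: two slow Inventory DB calls and one fast checkout call.
segments = [
    {"service": "Checkout Lambda", "caller": "API Gateway",
     "start_time": 0.0, "end_time": 1.5, "fault": False},
    {"service": "Inventory DB", "caller": "Inventory Lambda",
     "start_time": 2.0, "end_time": 14.0, "fault": True},
    {"service": "Inventory DB", "caller": "Inventory Lambda",
     "start_time": 0.0, "end_time": 12.0, "fault": False},
]

nodes, edges = build_service_map(segments)
slowest = max(nodes, key=lambda name: nodes[name]["avg_seconds"])
print(slowest, nodes[slowest])
print(edges)
```

Because every record carries both its own timing and a pointer to its caller, the topology and the latency figures fall out of the same data, which is why no one has to draw the diagram by hand.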

Instrumentation, Sampling, and Annotations

"This is incredibly efficient," Timothy said. "Do I need to rewrite all of our microservices to send data to X-Ray? And will recording every single request drastically increase our AWS bill?"

"To manage costs, X-Ray relies on Sampling Rules," Margaret answered. "By default, it only records the first request each second, plus five percent of any additional requests. This provides enough statistical data to identify bottlenecks without paying to trace every identical, successful transaction."
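The default rule Margaret describes can be simulated in a few lines. This is a single-process sketch under stated assumptions: the class name is hypothetical, and the real service coordinates the per-second reservoir across all instrumented hosts rather than inside one object.

```python
import random


class DefaultRuleSampler:
    """Reservoir of N traces per second, then a fixed percentage of the rest."""

    def __init__(self, reservoir=1, rate=0.05, rng=random.random):
        self.reservoir = reservoir  # traces always recorded each second
        self.rate = rate            # share of the remaining requests to record
        self.rng = rng              # injectable so the behavior can be tested
        self._second = None
        self._taken = 0

    def should_sample(self, now):
        second = int(now)
        if second != self._second:
            # A new second has started, so the reservoir refills.
            self._second = second
            self._taken = 0
        if self._taken < self.reservoir:
            self._taken += 1
            return True
        return self.rng() < self.rate


# Simulate 100 requests per second for 10 seconds with a seeded generator.
sampler = DefaultRuleSampler(rng=random.Random(42).random)
decisions = [sampler.should_sample(now=i / 100) for i in range(1000)]
print(f"recorded {sum(decisions)} of {len(decisions)} requests")
```

At this traffic level the reservoir contributes ten traces and the fixed rate contributes roughly five percent of the other 990 requests, so the recorded total stays a small fraction of the traffic.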

"As for your code," she continued, "managed services like API Gateway and Step Functions only require a simple configuration toggle in the console. For your custom application code running in Lambda, you enable active tracing on the function, then import the AWS X-Ray SDK and wrap your AWS SDK clients with it. It requires only a few lines of code. Once the supported HTTP and database libraries are patched, the SDK records those downstream calls and adds the tracing header to outgoing HTTP requests. You can even add Annotations, which are indexed key-value pairs like customer_tier: premium. This allows you to filter your traces later to only show performance data for your highest-paying users."
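To show what an annotation looks like once it is attached to trace data, here is a standard-library sketch that builds a minimal segment document and a matching filter expression. The helper functions are hypothetical. In real Lambda code the X-Ray SDK for Python would typically do this work through calls such as patch_all() and xray_recorder.put_annotation(); the sketch avoids the SDK so it runs anywhere.

```python
import json
import os
import re

# Annotation keys are limited to letters, digits, and underscores.
ANNOTATION_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def new_segment(name, trace_id, start_time, end_time):
    """Build a minimal segment document (illustrative helper)."""
    return {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits identifying this segment
        "trace_id": trace_id,
        "start_time": start_time,
        "end_time": end_time,
    }


def add_annotation(segment, key, value):
    """Attach an indexed key-value pair, rejecting keys filters cannot use."""
    if not ANNOTATION_KEY.match(key):
        raise ValueError(f"invalid annotation key: {key!r}")
    if not isinstance(value, (str, int, float, bool)):
        raise TypeError("annotation values must be str, number, or bool")
    segment.setdefault("annotations", {})[key] = value
    return segment


def filter_expression(key, value):
    """Build a console filter expression for a string-valued annotation."""
    return f'annotation.{key} = "{value}"'


segment = new_segment(
    "Checkout", "1-5759e988-bd862e3fe1be46a994272793", 1465510280.0, 1465510281.5
)
add_annotation(segment, "customer_tier", "premium")
print(json.dumps(segment, indent=2))
print(filter_expression("customer_tier", "premium"))
```

Annotations are indexed, which is what makes the filter expression work; unindexed details that are only needed for debugging belong in metadata instead.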

Timothy closed his multiple log tabs. He opened the Inventory service repository to review the specific database query X-Ray had identified. The isolated log files were gone, replaced by a complete, end-to-end view of the system.

Key Concepts Introduced:

Distributed Tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. It tracks a single request as it progresses through various services, network boundaries, and databases, allowing engineers to pinpoint performance bottlenecks and errors that are otherwise difficult to isolate in decentralized systems.

AWS X-Ray is the managed service that collects data about requests that your application serves. It relies on a unique identifier called a Trace ID, which is generated at the entry point of the architecture and carried in the X-Amzn-Trace-Id HTTP header. To manage data volume and cost, X-Ray uses Sampling Rules, capturing a representative subset of traffic (by default, the first request each second and five percent of any additional requests) rather than recording every single transaction.

Trace Propagation is the process of ensuring that the Trace ID is passed successfully from one service to the next. AWS X-Ray uses the resulting trace data to automatically generate a Service Map, a visual representation of your architecture showing the relationships between services, their latency metrics, and their error rates based on traced requests. Engineers can also enrich this data using Annotations, which are indexed key-value pairs added to traces to enable business-specific filtering and search capabilities within the X-Ray console.

Aaron Rose is a software engineer and technology writer at tech-reader.blog. For explainer videos and podcasts, check out the Tech-Reader YouTube channel.
