How to Get "Total Data Scanned" for a Redshift Query via the Data API
Aaron Rose
Software Engineer & Technology Writer
Problem
You can see "data scanned" metrics for queries in the Redshift console, but you need to access these numbers programmatically from Lambda functions, ETL jobs, or CI/CD pipelines that use the Redshift Data API. Without programmatic access to scan metrics, you can't implement cost monitoring, fail-fast logic, or automated performance telemetry.
Clarifying the Issue
This isn't about querying historical performance data or setting up dashboards — it's about getting the exact "total data scanned" number for any specific query executed via the Data API. The challenge is mapping Data API statement IDs to Redshift's internal query system and extracting scan metrics programmatically.
Why It Matters
Programmatic access to scan metrics enables critical operational capabilities like automated cost controls that fail jobs exceeding data scan thresholds, performance telemetry that tracks resource usage across all queries, and budget monitoring that provides real-time cost visibility. Organizations processing terabytes daily need these capabilities to prevent runaway costs and maintain predictable performance.
For Redshift Serverless users, heavier scans translate into more compute time and therefore higher RPU (Redshift Processing Unit) charges, making this monitoring even more important for cost management and capacity planning.
Key Terms
• Data API Statement ID – Unique identifier returned by ExecuteStatement calls, used to track query execution
• Redshift Query ID – Internal integer identifier used in Redshift system tables and views
• SVL_QUERY_METRICS_SUMMARY – System view containing aggregated per-query performance metrics including scan bytes
• Scan Bytes – Total amount of data read from storage during query execution, including both local and Spectrum data
Steps at a Glance
- Execute your query using the Data API and capture the statement ID
- Poll until the query completes and extract the Redshift query ID
- Query the system metrics table using the internal query ID
- Extract scan bytes and convert to meaningful units for monitoring
Detailed Steps
Step 1: Execute your query using the Data API and capture the statement ID
Use the Redshift Data API to execute your business query and capture the returned statement identifier. This ID becomes your handle for tracking the query through completion.
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Execute your business query
stmt = rsd.execute_statement(
    Database="dev",
    WorkgroupName="my-serverless-workgroup",  # or ClusterIdentifier + DbUser
    SecretArn="arn:aws:secretsmanager:region:account:secret:name",
    Sql="SELECT COUNT(*) FROM large_table WHERE date_col >= '2024-01-01'"
)
statement_id = stmt["Id"]
Connection Configuration:
Data API Flow
├── Serverless → WorkgroupName + Database + SecretArn
├── Provisioned → ClusterIdentifier + DbUser + Database
└── Authentication → SecretArn OR database user credentials
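If you run a provisioned cluster instead of Serverless, the same call swaps WorkgroupName for ClusterIdentifier plus DbUser (or a SecretArn). The identifiers below are placeholders:
# Provisioned-cluster variant (cluster and user names are placeholders)
stmt = rsd.execute_statement(
    Database="dev",
    ClusterIdentifier="my-provisioned-cluster",
    DbUser="etl_user",  # or pass SecretArn instead of DbUser
    Sql="SELECT COUNT(*) FROM large_table WHERE date_col >= '2024-01-01'"
)
statement_id = stmt["Id"]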
Step 2: Poll until the query completes and extract the Redshift query ID
Monitor the statement status until completion and retrieve the internal Redshift query ID that connects to the system metrics tables.
import time

def wait_for_completion(statement_id):
    while True:
        desc = rsd.describe_statement(Id=statement_id)
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            return desc
        time.sleep(1)  # Add exponential backoff for production

desc = wait_for_completion(statement_id)
if desc["Status"] != "FINISHED":
    raise RuntimeError(f"Query failed: {desc.get('Error')}")
redshift_query_id = desc["RedshiftQueryId"]  # Key bridge to system tables
Troubleshooting: If RedshiftQueryId is missing, confirm that your IAM role has sufficient permissions and that the query actually executed (it wasn't just parsed or validated).
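The one-second sleep above is fine for a demo; for production polling, a minimal sketch with exponential backoff and a timeout (reusing the rsd client and time module from the snippet above; the cap and timeout values are arbitrary) might look like this:
def wait_for_completion_backoff(statement_id, timeout_seconds=300):
    # Poll DescribeStatement with exponential backoff, capped at 10 seconds
    delay = 0.5
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        desc = rsd.describe_statement(Id=statement_id)
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            return desc
        time.sleep(delay)
        delay = min(delay * 2, 10)
    raise TimeoutError(f"Statement {statement_id} did not finish within {timeout_seconds}s")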
Step 3: Query the system metrics table using the internal query ID
Use the Redshift query ID to extract scan metrics from the system views. This query aggregates segment-level data to provide total bytes scanned.
-- Core query for extracting scan bytes
SELECT COALESCE(SUM(metric_value)::bigint, 0) AS scan_bytes
FROM svl_query_metrics_summary
WHERE query = :qid
AND metric_name = 'scan_bytes';
Execute this metrics query via the Data API:
metrics_stmt = rsd.execute_statement(
    Database="dev",
    WorkgroupName="my-serverless-workgroup",
    SecretArn="arn:aws:secretsmanager:region:account:secret:name",
    Sql="""
        SELECT COALESCE(SUM(metric_value)::bigint, 0) AS scan_bytes
        FROM svl_query_metrics_summary
        WHERE query = :qid AND metric_name = 'scan_bytes'
    """,
    Parameters=[{"name": "qid", "value": str(redshift_query_id)}]
)
System Table Insights:
SVL_QUERY_METRICS_SUMMARY Structure
├── query → Redshift internal query ID
├── metric_name → 'scan_bytes', 'cpu_time', 'rows_returned'
├── metric_value → Actual measured value
└── Aggregation → SUM() combines all query segments
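If you want to see everything the view reports for a query, not just scan bytes, you can reuse the same Data API call with a broader filter. This sketch assumes the metric_name/metric_value layout described above:
# Dump all summary metrics for one query (assumes the layout described above)
all_metrics_stmt = rsd.execute_statement(
    Database="dev",
    WorkgroupName="my-serverless-workgroup",
    SecretArn="arn:aws:secretsmanager:region:account:secret:name",
    Sql="""
        SELECT metric_name, SUM(metric_value)::bigint AS metric_value
        FROM svl_query_metrics_summary
        WHERE query = :qid
        GROUP BY metric_name
    """,
    Parameters=[{"name": "qid", "value": str(redshift_query_id)}]
)
wait_for_completion(all_metrics_stmt["Id"])
for name_field, value_field in rsd.get_statement_result(Id=all_metrics_stmt["Id"])["Records"]:
    print(name_field["stringValue"], value_field["longValue"])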
Step 4: Extract scan bytes and convert to meaningful units for monitoring
Retrieve the results and format them for logging, alerting, or cost analysis.
# Get metrics query results
metrics_desc = wait_for_completion(metrics_stmt["Id"])
result = rsd.get_statement_result(Id=metrics_stmt["Id"])
scan_bytes = int(result["Records"][0][0]["longValue"])

# Format for human readability
def format_bytes(bytes_val):
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if bytes_val < 1024 or unit == "PB":
            return f"{bytes_val:.2f} {unit}"
        bytes_val /= 1024

print(f"Query {redshift_query_id} scanned: {format_bytes(scan_bytes)}")
# Calculate approximate cost (example rate: ~$5.37 per TB scanned)
PRICE_PER_TB = 5.37
cost_estimate = (scan_bytes / 1024**4) * PRICE_PER_TB
print(f"Estimated cost: ${cost_estimate:.4f}")
Production Implementation Patterns:
Cost Guardrails:
# Fail fast on expensive queries
MAX_SCAN_GB = 100
if scan_bytes > MAX_SCAN_GB * 1024**3:
    raise RuntimeError(f"Query exceeded scan limit: {format_bytes(scan_bytes)}")
Performance Telemetry:
# Log structured metrics for analysis (logger and duration come from your application context)
logger.info({
    "redshift_query_id": redshift_query_id,
    "scan_bytes": scan_bytes,
    "scan_gb": scan_bytes / 1024**3,
    "duration_seconds": duration,
    "cost_estimate": cost_estimate
})
Complete Implementation: For full working examples in Python and Node.js, including error handling, retry logic, and Lambda deployment templates, see the Redshift Query Cost Monitoring Toolkit.
Conclusion
Getting programmatic access to Redshift scan metrics requires mapping Data API statements to internal query IDs and querying system tables. This pattern enables powerful cost monitoring, automated guardrails, and performance telemetry that scales with your data operations.
The key insight is that svl_query_metrics_summary provides reliable, aggregated metrics for any query executed through Redshift, regardless of whether it originated from the console, the Data API, or a traditional database connection. By implementing this pattern in your data pipelines, you gain the visibility needed to optimize both performance and costs proactively.
Next Steps:
- Implement basic scan monitoring in your ETL jobs
- Set up CloudWatch alarms for unexpected cost spikes (see the sketch after this list)
- Build dashboards showing scan trends across teams and applications
- Consider automated query optimization recommendations based on scan patterns
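As a starting point for the CloudWatch alarm idea above, you could publish scan bytes as a custom metric and alarm on it. This is only a sketch; the namespace and dimension names are arbitrary choices, not an established convention:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish scan bytes as a custom metric; create the alarm on this metric in CloudWatch
cloudwatch.put_metric_data(
    Namespace="RedshiftScanMonitoring",  # arbitrary example namespace
    MetricData=[{
        "MetricName": "QueryScanBytes",
        "Dimensions": [{"Name": "Workgroup", "Value": "my-serverless-workgroup"}],
        "Value": float(scan_bytes),
        "Unit": "Bytes"
    }]
)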
This article is part of the comprehensive Redshift optimization series. Start with query optimization fundamentals for maximum impact.
Aaron Rose is a software engineer and technology writer at tech-reader.blog.