How to Get "Total Data Scanned" for a Redshift Query via the Data API
Aaron Rose
Software Engineer & Technology Writer
Problem
You can see "data scanned" metrics for queries in the Redshift console, but you need to access these numbers programmatically from Lambda functions, ETL jobs, or CI/CD pipelines that use the Redshift Data API. Without programmatic access to scan metrics, you can't implement cost monitoring, fail-fast logic, or automated performance telemetry.
Clarifying the Issue
This isn't about querying historical performance data or setting up dashboards — it's about getting the exact "total data scanned" number for any specific query executed via the Data API. The challenge is mapping Data API statement IDs to Redshift's internal query system and extracting scan metrics programmatically.
Why It Matters
Programmatic access to scan metrics enables critical operational capabilities like automated cost controls that fail jobs exceeding data scan thresholds, performance telemetry that tracks resource usage across all queries, and budget monitoring that provides real-time cost visibility. Organizations processing terabytes daily need these capabilities to prevent runaway costs and maintain predictable performance.
For Redshift Serverless users, heavier scans translate into more compute time and therefore higher RPU (Redshift Processing Unit) charges, making this monitoring even more important for cost management and capacity planning.
Key Terms
• Data API Statement ID – Unique identifier returned by ExecuteStatement calls, used to track query execution
• Redshift Query ID – Internal integer identifier used in Redshift system tables and views
• SVL_QUERY_METRICS_SUMMARY – System view containing aggregated per-query performance metrics including scan bytes
• Scan Bytes – Total amount of data read from storage during query execution, including both local and Spectrum data
Steps at a Glance
- Execute your query using the Data API and capture the statement ID
- Poll until the query completes and extract the Redshift query ID
- Query the system metrics table using the internal query ID
- Extract scan bytes and convert to meaningful units for monitoring
Detailed Steps
Step 1: Execute your query using the Data API and capture the statement ID
Use the Redshift Data API to execute your business query and capture the returned statement identifier. This ID becomes your handle for tracking the query through completion.
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Execute your business query
stmt = rsd.execute_statement(
    Database="dev",
    WorkgroupName="my-serverless-workgroup",  # or ClusterIdentifier + DbUser
    SecretArn="arn:aws:secretsmanager:region:account:secret:name",
    Sql="SELECT COUNT(*) FROM large_table WHERE date_col >= '2024-01-01'"
)
statement_id = stmt["Id"]
Connection Configuration:
Data API Flow
├── Serverless → WorkgroupName + Database + SecretArn
├── Provisioned → ClusterIdentifier + DbUser + Database
└── Authentication → SecretArn OR database user credentials
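If you run a provisioned cluster instead of Serverless, the same call swaps WorkgroupName for ClusterIdentifier plus DbUser (or a SecretArn). The identifiers below are placeholders:
# Provisioned-cluster variant (cluster and user names are placeholders)
stmt = rsd.execute_statement(
    Database="dev",
    ClusterIdentifier="my-provisioned-cluster",
    DbUser="etl_user",  # or pass SecretArn instead of DbUser
    Sql="SELECT COUNT(*) FROM large_table WHERE date_col >= '2024-01-01'"
)
statement_id = stmt["Id"]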
Step 2: Poll until the query completes and extract the Redshift query ID
Monitor the statement status until completion and retrieve the internal Redshift query ID that connects to the system metrics tables.
import time

def wait_for_completion(statement_id):
    while True:
        desc = rsd.describe_statement(Id=statement_id)
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            return desc
        time.sleep(1)  # Add exponential backoff for production

desc = wait_for_completion(statement_id)
if desc["Status"] != "FINISHED":
    raise RuntimeError(f"Query failed: {desc.get('Error')}")
redshift_query_id = desc["RedshiftQueryId"]  # Key bridge to system tables
Troubleshooting: If RedshiftQueryId is missing, confirm that your IAM role has sufficient permissions and that the query actually executed (it wasn't just parsed or validated).
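The one-second sleep above is fine for a demo; for production polling, a minimal sketch with exponential backoff and a timeout (reusing the rsd client and time module from the snippet above; the cap and timeout values are arbitrary) might look like this:
def wait_for_completion_backoff(statement_id, timeout_seconds=300):
    # Poll DescribeStatement with exponential backoff, capped at 10 seconds
    delay = 0.5
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        desc = rsd.describe_statement(Id=statement_id)
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            return desc
        time.sleep(delay)
        delay = min(delay * 2, 10)
    raise TimeoutError(f"Statement {statement_id} did not finish within {timeout_seconds}s")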
Step 3: Query the system metrics table using the internal query ID
Use the Redshift query ID to extract scan metrics from the system views. This query aggregates segment-level data to provide total bytes scanned.
-- Core query for extracting scan bytes
SELECT COALESCE(SUM(metric_value)::bigint, 0) AS scan_bytes
FROM svl_query_metrics_summary
WHERE query = :qid
AND metric_name = 'scan_bytes';
Execute this metrics query via the Data API:
metrics_stmt = rsd.execute_statement(
    Database="dev",
    WorkgroupName="my-serverless-workgroup",
    SecretArn="arn:aws:secretsmanager:region:account:secret:name",
    Sql="""
        SELECT COALESCE(SUM(metric_value)::bigint, 0) AS scan_bytes
        FROM svl_query_metrics_summary
        WHERE query = :qid AND metric_name = 'scan_bytes'
    """,
    Parameters=[{"name": "qid", "value": str(redshift_query_id)}]
)
System Table Insights:
SVL_QUERY_METRICS_SUMMARY Structure
├── query → Redshift internal query ID
├── metric_name → 'scan_bytes', 'cpu_time', 'rows_returned'
├── metric_value → Actual measured value
└── Aggregation → SUM() combines all query segments
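If you want to see everything the view reports for a query, not just scan bytes, you can reuse the same Data API call with a broader filter. This sketch assumes the metric_name/metric_value layout described above:
# Dump all summary metrics for one query (assumes the layout described above)
all_metrics_stmt = rsd.execute_statement(
    Database="dev",
    WorkgroupName="my-serverless-workgroup",
    SecretArn="arn:aws:secretsmanager:region:account:secret:name",
    Sql="""
        SELECT metric_name, SUM(metric_value)::bigint AS metric_value
        FROM svl_query_metrics_summary
        WHERE query = :qid
        GROUP BY metric_name
    """,
    Parameters=[{"name": "qid", "value": str(redshift_query_id)}]
)
wait_for_completion(all_metrics_stmt["Id"])
for name_field, value_field in rsd.get_statement_result(Id=all_metrics_stmt["Id"])["Records"]:
    print(name_field["stringValue"], value_field["longValue"])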
Step 4: Extract scan bytes and convert to meaningful units for monitoring
Retrieve the results and format them for logging, alerting, or cost analysis.
# Get metrics query results
metrics_desc = wait_for_completion(metrics_stmt["Id"])
result = rsd.get_statement_result(Id=metrics_stmt["Id"])
scan_bytes = int(result["Records"][0][0]["longValue"])

# Format for human readability
def format_bytes(bytes_val):
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if bytes_val < 1024 or unit == "PB":
            return f"{bytes_val:.2f} {unit}"
        bytes_val /= 1024

print(f"Query {redshift_query_id} scanned: {format_bytes(scan_bytes)}")
# Calculate approximate cost (example rate: ~$5.37 per TB scanned)
PRICE_PER_TB = 5.37
cost_estimate = (scan_bytes / 1024**4) * PRICE_PER_TB
print(f"Estimated cost: ${cost_estimate:.4f}")
Production Implementation Patterns:
Cost Guardrails:
# Fail fast on expensive queries
MAX_SCAN_GB = 100
if scan_bytes > MAX_SCAN_GB * 1024**3:
    raise RuntimeError(f"Query exceeded scan limit: {format_bytes(scan_bytes)}")
Performance Telemetry:
# Log structured metrics for analysis (logger and duration come from your application context)
logger.info({
    "redshift_query_id": redshift_query_id,
    "scan_bytes": scan_bytes,
    "scan_gb": scan_bytes / 1024**3,
    "duration_seconds": duration,
    "cost_estimate": cost_estimate
})
Complete Implementation: For full working examples in Python and Node.js, including error handling, retry logic, and Lambda deployment templates, see the Redshift Query Cost Monitoring Toolkit.
Conclusion
Getting programmatic access to Redshift scan metrics requires mapping Data API statements to internal query IDs and querying system tables. This pattern enables powerful cost monitoring, automated guardrails, and performance telemetry that scales with your data operations.
The key insight is that svl_query_metrics_summary provides reliable, aggregated metrics for any query executed through Redshift, regardless of whether it originated from the console, the Data API, or a traditional database connection. By implementing this pattern in your data pipelines, you gain the visibility needed to optimize both performance and costs proactively.
Next Steps:
- Implement basic scan monitoring in your ETL jobs
- Set up CloudWatch alarms for unexpected cost spikes (see the sketch after this list)
- Build dashboards showing scan trends across teams and applications
- Consider automated query optimization recommendations based on scan patterns
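As a starting point for the CloudWatch alarm idea above, you could publish scan bytes as a custom metric and alarm on it. This is only a sketch; the namespace and dimension names are arbitrary choices, not an established convention:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish scan bytes as a custom metric; create the alarm on this metric in CloudWatch
cloudwatch.put_metric_data(
    Namespace="RedshiftScanMonitoring",  # arbitrary example namespace
    MetricData=[{
        "MetricName": "QueryScanBytes",
        "Dimensions": [{"Name": "Workgroup", "Value": "my-serverless-workgroup"}],
        "Value": float(scan_bytes),
        "Unit": "Bytes"
    }]
)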
This article is part of the comprehensive Redshift optimization series. Start with query optimization fundamentals for maximum impact.
Aaron Rose is a software engineer and technology writer at tech-reader.blog.