# AWS CloudWatch Metrics: The Complete Reference for Grafana Users
If you are serious about observability on AWS, you will spend a lot of time in CloudWatch. It is the single source of truth for service-level metrics, and it integrates directly with Grafana. The problem is that AWS documentation scatters metric information across dozens of service guides.
This article brings it all together. It explains the metrics available from CloudWatch for the most commonly used AWS services, what each metric means, and how you can use them effectively in Grafana.
## How CloudWatch Metrics Work

- **Namespaces** group metrics by service. Examples include `AWS/EC2`, `AWS/RDS`, and `AWS/Lambda`.
- **Dimensions** identify the resource the metric applies to, such as `InstanceId`, `FunctionName`, or `TableName`.
- **Statistics** define how values are summarised: `Average`, `Sum`, `Minimum`, `Maximum`, or percentiles like `p95`.
- **Resolution:** Basic monitoring collects metrics at 5-minute intervals. Detailed monitoring enables 1-minute resolution.
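These four concepts map directly onto the parameters of a CloudWatch `GetMetricData` request. The sketch below builds one query as a plain dict (the instance ID is a placeholder); with credentials configured you would pass the result to `boto3.client("cloudwatch").get_metric_data(**params)`.

```python
# Sketch: how namespace, dimensions, statistic, and period fit together
# in a GetMetricData request. The instance ID is a placeholder.
from datetime import datetime, timedelta, timezone

def cpu_query(instance_id: str, period: int = 300) -> dict:
    """Build one GetMetricData query for EC2 CPUUtilization."""
    return {
        "Id": "cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EC2",            # namespace: which service
                "MetricName": "CPUUtilization",
                "Dimensions": [                    # dimension: which resource
                    {"Name": "InstanceId", "Value": instance_id},
                ],
            },
            "Period": period,                      # resolution: 300s basic, 60s detailed
            "Stat": "Average",                     # statistic: Average, Sum, p95, ...
        },
        "ReturnData": True,
    }

params = {
    "MetricDataQueries": [cpu_query("i-0123456789abcdef0")],
    "StartTime": datetime.now(timezone.utc) - timedelta(hours=3),
    "EndTime": datetime.now(timezone.utc),
}
```

Grafana's CloudWatch data source fills in the same fields (namespace, metric, dimensions, statistic, period) from its query editor, so the vocabulary carries over directly.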
## EC2 Metrics (`AWS/EC2`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `CPUUtilization` | Percentage of CPU used | `InstanceId` | Percent | Spot bottlenecks, plan scaling |
| `DiskReadOps` / `DiskWriteOps` | Read/write operations on instance store volumes | `InstanceId` | Count | Understand IOPS demand |
| `DiskReadBytes` / `DiskWriteBytes` | Data read/written on instance store volumes | `InstanceId` | Bytes | Analyse throughput patterns |
| `StatusCheckFailed` | Combined instance/system health | `InstanceId` | Count (0/1) | Alert if >0 |
| `StatusCheckFailed_Instance` | Instance-level failure | `InstanceId` | Count (0/1) | Debug configuration/OS issues |
| `StatusCheckFailed_System` | AWS hardware failure | `InstanceId` | Count (0/1) | Indicates AWS-side issue |
| `CPUCreditUsage` | CPU credits consumed (T2/T3/T4 instances) | `InstanceId` | Credits | Track burst usage |
| `CPUCreditBalance` | Remaining CPU credits | `InstanceId` | Credits | Alert when balance is low |
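The "alert if >0" guidance for `StatusCheckFailed` translates into a simple CloudWatch alarm. Here is a minimal sketch of the `put_metric_alarm` parameters, built as a dict; the instance ID and SNS topic ARN are placeholders.

```python
# Sketch: alarm that fires when StatusCheckFailed is non-zero for two
# consecutive minutes. Instance ID and SNS topic ARN are placeholders.
def status_check_alarm(instance_id: str, topic_arn: str) -> dict:
    return {
        "AlarmName": f"status-check-failed-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",       # any failure in the period counts
        "Period": 60,
        "EvaluationPeriods": 2,       # two consecutive breaches before alerting
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }

alarm = status_check_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:eu-west-1:111122223333:ops-alerts",
)
# With credentials: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Requiring two evaluation periods filters out transient blips while still catching genuine hardware or OS failures quickly.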
## EBS Metrics (`AWS/EBS`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `VolumeReadOps` / `VolumeWriteOps` | Read/write operations per volume | `VolumeId` | Count (per period) | Monitor storage demand |
| `VolumeReadBytes` / `VolumeWriteBytes` | Data throughput | `VolumeId` | Bytes (per period) | Detect throughput-heavy workloads |
| `VolumeTotalReadTime` / `VolumeTotalWriteTime` | Total time spent on read/write operations in the period | `VolumeId` | Seconds | Spot performance bottlenecks |
| `VolumeIdleTime` | Time volume spent idle | `VolumeId` | Seconds | Useful for cost tuning |
| `VolumeQueueLength` | Outstanding I/O requests | `VolumeId` | Count | High queues = contention risk |
## RDS Metrics (`AWS/RDS`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `CPUUtilization` | CPU used by DB instance | `DBInstanceIdentifier` | Percent | High CPU = scaling/queries issue |
| `DatabaseConnections` | Number of DB connections | `DBInstanceIdentifier` | Count | Alert when near max connections |
| `FreeableMemory` | Available RAM | `DBInstanceIdentifier` | Bytes | Low memory = poor performance |
| `FreeStorageSpace` | Disk capacity remaining | `DBInstanceIdentifier` | Bytes | Critical to avoid outages |
| `ReadIOPS` / `WriteIOPS` | Reads/writes per second | `DBInstanceIdentifier` | Count/sec | Understand workload demand |
| `ReadLatency` / `WriteLatency` | Time per read/write | `DBInstanceIdentifier` | Seconds | Latency impacts UX |
| `ReplicaLag` | Replication delay | `DBInstanceIdentifier` | Seconds | Important for read replicas |
## DynamoDB Metrics (`AWS/DynamoDB`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `ConsumedReadCapacityUnits` / `ConsumedWriteCapacityUnits` | Throughput consumed | `TableName`, `GlobalSecondaryIndexName` | Count | Track against provisioned capacity |
| `ThrottledRequests` | Requests rejected due to capacity limits | `TableName` | Count | Alert if >0 |
| `ReadThrottleEvents` / `WriteThrottleEvents` | Breakdown of throttling | `TableName` | Count | Debug workload patterns |
| `SuccessfulRequestLatency` | Latency for successful requests | `TableName` | Milliseconds | Measure user experience impact |
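Tracking throttles against consumed capacity (as the table suggests) is a natural fit for CloudWatch metric math. A sketch of a `GetMetricData` query set, with the table name as a placeholder:

```python
# Sketch: overlay DynamoDB write throttles against consumed write capacity,
# plus a metric-math ratio, so throttling can be read in context.
# The table name passed in is a placeholder.
def dynamo_throttle_queries(table: str) -> list:
    dims = [{"Name": "TableName", "Value": table}]

    def stat(query_id: str, metric: str) -> dict:
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/DynamoDB",
                    "MetricName": metric,
                    "Dimensions": dims,
                },
                "Period": 60,
                "Stat": "Sum",
            },
        }

    return [
        stat("throttles", "WriteThrottleEvents"),
        stat("consumed", "ConsumedWriteCapacityUnits"),
        # Metric math: throttle events per consumed WCU, as a percentage
        {
            "Id": "ratio",
            "Expression": "100 * throttles / consumed",
            "Label": "Write throttle %",
        },
    ]

queries = dynamo_throttle_queries("orders")
```

Grafana's CloudWatch data source supports the same math expressions, so the ratio can live directly in a dashboard panel rather than in code.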
## Lambda Metrics (`AWS/Lambda`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `Invocations` | Number of function calls | `FunctionName` | Count | Track workload volume |
| `Errors` | Failed invocations | `FunctionName` | Count | Alert on spikes |
| `Duration` | Execution time per call | `FunctionName` | Milliseconds | Track p95/p99 for UX |
| `Throttles` | Throttled invocations | `FunctionName` | Count | Indicates concurrency issues |
| `ConcurrentExecutions` | Functions running at once | `FunctionName` | Count | Watch concurrency limits |
| `IteratorAge` | Lag for stream triggers | `FunctionName` | Milliseconds | Spot delays in stream processing |
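The "track p95/p99" advice for `Duration` is just a change of statistic: CloudWatch accepts percentile statistics as strings like `"p95"`. A minimal sketch, with the function name as a placeholder:

```python
# Sketch: request p95 Lambda Duration instead of Average. Percentile
# statistics are passed as strings ("p95", "p99"). Function name is
# a placeholder.
def lambda_duration_query(function_name: str, stat: str = "p95") -> dict:
    return {
        "Id": "duration_" + stat,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": "Duration",
                "Dimensions": [
                    {"Name": "FunctionName", "Value": function_name},
                ],
            },
            "Period": 60,
            "Stat": stat,     # "Average" hides tail latency; "p95" shows it
        },
    }

q = lambda_duration_query("checkout-handler")
```

In Grafana the equivalent is typing `p95` into the Statistic field of the CloudWatch query editor.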
## API Gateway Metrics (`AWS/ApiGateway`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `Count` | Number of requests | `ApiName`, `Stage` | Count | Traffic trends |
| `Latency` | End-to-end latency | `ApiName`, `Stage` | Milliseconds | Track user impact (use p95/p99) |
| `IntegrationLatency` | Backend processing time | `ApiName`, `Stage` | Milliseconds | Diagnose backend vs gateway |
| `4XXError` | Client errors | `ApiName`, `Stage` | Count | Spot misuse or auth failures |
| `5XXError` | Server errors | `ApiName`, `Stage` | Count | Alert on backend/system faults |
## CloudFront Metrics (`AWS/CloudFront`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `Requests` | Number of requests | `DistributionId`, `Region` (`Global`) | Count | Request volume |
| `BytesDownloaded` / `BytesUploaded` | Data transfer | `DistributionId`, `Region` | Bytes | Bandwidth usage |
| `TotalErrorRate` | Error % across all requests | `DistributionId`, `Region` | Percent | High values = reliability issue |
| `4xxErrorRate` / `5xxErrorRate` | Client vs server error breakdown | `DistributionId`, `Region` | Percent | Debug caching vs origin issues |
| `CacheHitRate` | % served from cache | `DistributionId`, `Region` | Percent | Optimise caching behaviour |
Note: CloudFront metrics are only available in `us-east-1`, with the dimension `Region=Global`.
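In practice that means two things at once: the CloudWatch client must point at `us-east-1`, and the query must carry the `Region=Global` dimension. A sketch (the distribution ID is a placeholder):

```python
# Sketch: querying CloudFront Requests. Both the client region (us-east-1)
# and the Region=Global dimension are required. Distribution ID is a
# placeholder.
def cloudfront_requests_query(distribution_id: str) -> dict:
    return {
        "Id": "requests",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/CloudFront",
                "MetricName": "Requests",
                "Dimensions": [
                    {"Name": "DistributionId", "Value": distribution_id},
                    {"Name": "Region", "Value": "Global"},
                ],
            },
            "Period": 300,
            "Stat": "Sum",
        },
    }

q = cloudfront_requests_query("E1A2B3C4D5E6F7")
# With credentials, the client must be created in us-east-1:
# boto3.client("cloudwatch", region_name="us-east-1").get_metric_data(...)
```

In Grafana the same applies: point the CloudWatch data source (or the query's region override) at `us-east-1` for CloudFront panels.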
## CloudWatch Logs Metrics (`AWS/Logs`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `IncomingLogEvents` | Number of log events ingested | `LogGroupName` | Count | Track log volume |
| `IncomingBytes` | Data volume ingested | `LogGroupName` | Bytes | Estimate storage cost |
| `DeliveryErrors` | Failed delivery attempts | `LogGroupName`, `DestinationType` | Count | Debug subscription issues |
| `DeliveryThrottling` | Throttled log events | `LogGroupName`, `DestinationType` | Count | Detect limits exceeded |
| `ErrorCount` | API errors | `Service`, `Resource` | Count | Detect problems in log delivery |
## CloudWatch Agent Metrics (`CWAgent`)
| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| `cpu_usage_active` / `cpu_usage_idle` / `cpu_usage_system` | CPU breakdown | `Host`, `InstanceId` | Percent | Detailed CPU analysis |
| `mem_used` / `mem_free` / `mem_available` | Memory statistics | `Host` | Bytes | Memory pressure detection |
| `disk_used_percent` / `disk_free` | Disk usage | `MountPath`, `Device` | Percent/Bytes | Disk capacity alerts |
| `swap_used` / `swap_used_percent` | Swap activity | `Host` | Bytes/Percent | Spot performance issues |
| `processes_running` / `processes_sleeping` | Process states | `Host` | Count | Detect overloaded hosts |
## Using Metrics Effectively in Grafana

- **Group by user impact:** Focus dashboards on latency, errors, and throttles.
- **Overlay related metrics:** Plot API latency alongside DynamoDB throttles and consumed capacity to spot correlations.
- **Alert on symptoms, not noise:** A single throttle is irrelevant. A sustained latency increase plus throttles is actionable.
- **Use percentiles, not averages:** Percentiles reflect what users actually see.
- **Keep dashboards alive:** Review and refine them as your architecture evolves.
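The "percentiles, not averages" point is easy to demonstrate with a synthetic latency sample: a workload where most requests are fast but a tail is very slow.

```python
# Sketch: why percentiles beat averages. 90 fast requests, 10 very slow
# ones — the average looks tolerable while the tail is awful.
import statistics

latencies_ms = [20] * 90 + [900] * 10

avg = statistics.mean(latencies_ms)
# Simple nearest-rank p95: the value below which 95% of samples fall
p95 = sorted(latencies_ms)[int(0.95 * len(latencies_ms)) - 1]

print(f"average: {avg:.0f} ms")   # 108 ms — looks almost healthy
print(f"p95:     {p95} ms")       # 900 ms — what the slowest users actually see
```

An average of 108 ms would pass most alert thresholds, while one request in ten takes nearly a second. That is exactly the gap a p95 or p99 panel exposes.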
## Final Thoughts
CloudWatch metrics are AWS’s built-in observability layer. Grafana makes them usable by connecting signals across services. Together, they provide visibility and confidence without requiring a heavy observability stack.
If you want to move faster, reduce wasted effort, and calm down those “will it scale?” conversations, this is where you start: collect the right metrics, display them clearly, and use them to make decisions.
Gary Worthington is a software engineer, delivery consultant, and agile coach who helps teams move fast, learn faster, and scale when it matters. He writes about modern engineering, product thinking, and helping teams ship things that matter.
Through his consultancy, More Than Monkeys, Gary helps startups and scaleups improve how they build software, from tech strategy and agile delivery to product validation and team development.
Visit morethanmonkeys.co.uk to learn how we can help you build better, faster.
Follow Gary on LinkedIn for practical insights into engineering leadership, agile delivery, and team performance.