AWS CloudWatch Metrics: The Complete Reference for Grafana Users

If you are serious about observability on AWS, you will spend a lot of time in CloudWatch. It is the single source of truth for service-level metrics, and it integrates directly with Grafana. The problem is that AWS documentation scatters metric information across dozens of service guides.

This article brings it all together. It explains the metrics available from CloudWatch for the most commonly used AWS services, what each metric means, and how you can use them effectively in Grafana.


How CloudWatch Metrics Work

  • Namespaces group metrics by service. Examples include AWS/EC2, AWS/RDS, and AWS/Lambda.

  • Dimensions identify the resource the metric applies to, such as InstanceId, FunctionName, or TableName.

  • Statistics define how values are summarised: Average, Sum, Minimum, Maximum, or percentiles like p95.

  • Resolution: For EC2, basic monitoring publishes metrics at 5-minute intervals and detailed monitoring enables 1-minute resolution. Many other services publish at 1-minute intervals by default.
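
The four concepts above map onto the fields of a CloudWatch GetMetricData query, which is the API Grafana's CloudWatch data source calls on your behalf. The following is a minimal sketch, not a definitive implementation: the helper name build_metric_query and the instance ID are illustrative assumptions, and the resulting dict is shaped as one entry of the MetricDataQueries list.

```python
def build_metric_query(query_id, namespace, metric_name, dimensions, stat, period_seconds):
    """Build one MetricDataQueries entry for CloudWatch GetMetricData."""
    return {
        "Id": query_id,  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,      # e.g. AWS/EC2
                "MetricName": metric_name,   # e.g. CPUUtilization
                "Dimensions": [
                    {"Name": name, "Value": value}
                    for name, value in dimensions.items()
                ],
            },
            "Period": period_seconds,        # 300 = basic monitoring, 60 = detailed
            "Stat": stat,                    # Average, Sum, Minimum, Maximum, p95, ...
        },
        "ReturnData": True,
    }


# Hypothetical instance ID, used only for illustration.
query = build_metric_query(
    "cpu", "AWS/EC2", "CPUUtilization",
    {"InstanceId": "i-0123456789abcdef0"}, "Average", 300,
)
```

In Grafana you fill in the same four fields (namespace, metric name, dimensions, statistic) in the query editor, plus a period.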


EC2 Metrics (AWS/EC2)

| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| CPUUtilization | Percentage of CPU used | InstanceId | Percent | Spot bottlenecks, plan scaling |
| DiskReadOps / DiskWriteOps | Read/write operations on instance store volumes | InstanceId | Count | Understand IOPS demand |
| DiskReadBytes / DiskWriteBytes | Data read/written on instance store volumes | InstanceId | Bytes | Analyse throughput patterns |
| StatusCheckFailed | Combined instance/system health | InstanceId | Count (0/1) | Alert if >0 |
| StatusCheckFailed_Instance | Instance-level failure | InstanceId | Count (0/1) | Debug configuration/OS issues |
| StatusCheckFailed_System | AWS hardware failure | InstanceId | Count (0/1) | Indicates AWS-side issue |
| CPUCreditUsage | CPU credits consumed (burstable T2/T3/T3a/T4g instances) | InstanceId | Credits | Track burst usage |
| CPUCreditBalance | Remaining CPU credits | InstanceId | Credits | Alert when balance low |
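
To turn "alert when balance low" into something concrete, you can estimate how long the current credit balance will last. The sketch below assumes CPUCreditUsage is read as a Sum over 5-minute periods, and that you look up the hourly earn rate for your instance type yourself. The value 12 in the example is purely illustrative, and the function name is hypothetical.

```python
def hours_until_credits_exhausted(balance, usage_per_5min, earned_per_hour):
    """Estimate hours until CPUCreditBalance reaches zero at the current burn rate.

    balance          -- latest CPUCreditBalance value
    usage_per_5min   -- CPUCreditUsage summed over one 5-minute period
    earned_per_hour  -- credits earned per hour (assumption: taken from AWS's
                        published table for the instance type)
    Returns None when credits are not being depleted.
    """
    spent_per_hour = usage_per_5min * 12  # twelve 5-minute periods per hour
    net_burn_per_hour = spent_per_hour - earned_per_hour
    if net_burn_per_hour <= 0:
        return None
    return balance / net_burn_per_hour


# Illustrative numbers only: 144 credits left, burning 2 credits per 5 minutes.
remaining = hours_until_credits_exhausted(144, 2, 12)
```

This is a linear extrapolation, so treat the result as a rough warning signal rather than a forecast.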

EBS Metrics (AWS/EBS)

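| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| VolumeReadOps / VolumeWriteOps | Total read/write operations completed in the period (divide by the period in seconds for IOPS) | VolumeId | Count | Monitor storage demand |
| VolumeReadBytes / VolumeWriteBytes | Total bytes read/written in the period when using the Sum statistic (divide by the period in seconds for throughput) | VolumeId | Bytes | Detect throughput-heavy workloads |
| VolumeTotalReadTime / VolumeTotalWriteTime | Total seconds spent on all read/write operations completed in the period (divide by the operation count for average latency) | VolumeId | Seconds | Spot performance bottlenecks |
| VolumeIdleTime | Seconds in the period during which no I/O was submitted | VolumeId | Seconds | Useful for cost tuning |
| VolumeQueueLength | Outstanding I/O requests | VolumeId | Count | High queues = contention risk |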
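
EBS reports totals per period, not rates, so IOPS, throughput, and per-operation latency have to be derived. The sketch below shows the arithmetic, assuming every input is the Sum statistic for a single period. The function name and sample numbers are illustrative.

```python
def ebs_derived_metrics(read_ops, write_ops, read_bytes, write_bytes,
                        total_read_time, total_write_time, period_seconds):
    """Derive rates and average latencies from per-period EBS totals (Sum statistic)."""
    return {
        # Operations per second across reads and writes
        "iops": (read_ops + write_ops) / period_seconds,
        # Bytes per second across reads and writes
        "throughput_bytes_per_sec": (read_bytes + write_bytes) / period_seconds,
        # Average time per operation, converted from seconds to milliseconds
        "avg_read_latency_ms": total_read_time / read_ops * 1000 if read_ops else 0.0,
        "avg_write_latency_ms": total_write_time / write_ops * 1000 if write_ops else 0.0,
    }


# Illustrative values for one 5-minute (300 s) period.
derived = ebs_derived_metrics(
    read_ops=30_000, write_ops=60_000,
    read_bytes=300_000_000, write_bytes=600_000_000,
    total_read_time=15, total_write_time=60,
    period_seconds=300,
)
```

In Grafana the same divisions can be expressed as CloudWatch math expressions over the raw metrics, so the derived series can sit on the dashboard next to the originals.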

RDS Metrics (AWS/RDS)

| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| CPUUtilization | CPU used by DB instance | DBInstanceIdentifier | Percent | High CPU = scaling/queries issue |
| DatabaseConnections | Number of DB connections | DBInstanceIdentifier | Count | Alert when near max connections |
| FreeableMemory | Available RAM | DBInstanceIdentifier | Bytes | Low memory = poor performance |
| FreeStorageSpace | Disk capacity remaining | DBInstanceIdentifier | Bytes | Critical to avoid outages |
| ReadIOPS / WriteIOPS | Reads/writes per second | DBInstanceIdentifier | Count/sec | Understand workload demand |
| ReadLatency / WriteLatency | Time per read/write | DBInstanceIdentifier | Seconds | Latency impacts UX |
| ReplicaLag | Replication delay | DBInstanceIdentifier | Seconds | Important for read replicas |
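
FreeStorageSpace and DatabaseConnections are absolute numbers, so they only become useful alert inputs once you compare them with a limit. A small sketch follows, assuming you supply the allocated storage and the connection limit yourself (neither comes from the metrics above). The function and parameter names are hypothetical.

```python
def rds_headroom(free_storage_bytes, allocated_storage_gib, connections, max_connections):
    """Express RDS storage and connection usage as percentages of their limits."""
    allocated_bytes = allocated_storage_gib * 1024 ** 3  # GiB to bytes
    return {
        "storage_free_pct": 100 * free_storage_bytes / allocated_bytes,
        "connections_used_pct": 100 * connections / max_connections,
    }


# Illustrative values: 20 GiB free of 100 GiB, 170 of 200 connections in use.
headroom = rds_headroom(20 * 1024 ** 3, 100, 170, 200)
```

A percentage threshold survives instance resizes better than a fixed byte or connection count.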

DynamoDB Metrics (AWS/DynamoDB)

| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits | Throughput consumed | TableName, GlobalSecondaryIndexName | Count | Track against provisioned capacity |
| ThrottledRequests | Requests rejected due to capacity limits | TableName, Operation | Count | Alert on sustained values above 0 |
| ReadThrottleEvents / WriteThrottleEvents | Breakdown of throttling | TableName, GlobalSecondaryIndexName | Count | Debug workload patterns |
| SuccessfulRequestLatency | Latency for successful requests | TableName, Operation | Milliseconds | Measure user experience impact |
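
Consumed capacity is reported as a total per period, while provisioned capacity is a per-second figure, so the two need to be put on the same scale before comparing them. The sketch below assumes the Sum statistic and a provisioned value you supply yourself. The names are illustrative.

```python
def capacity_utilisation(consumed_sum, period_seconds, provisioned_units_per_sec):
    """Fraction of provisioned capacity used, from the Sum of consumed units in a period."""
    average_units_per_sec = consumed_sum / period_seconds
    return average_units_per_sec / provisioned_units_per_sec


# Illustrative values: 24,000 read units consumed in 60 s against 500 provisioned RCU.
utilisation = capacity_utilisation(24_000, 60, 500)
```

Because this is an average over the period, short bursts can still throttle even when the ratio looks comfortable, which is why it is worth plotting alongside the throttle metrics.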

Lambda Metrics (AWS/Lambda)

| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| Invocations | Number of function calls | FunctionName | Count | Track workload volume |
| Errors | Failed invocations | FunctionName | Count | Alert on spikes |
| Duration | Execution time per call | FunctionName | Milliseconds | Track p95/p99 for UX |
| Throttles | Throttled invocations | FunctionName | Count | Indicates concurrency issues |
| ConcurrentExecutions | Functions running at once | FunctionName | Count | Watch concurrency limits |
| IteratorAge | Lag for stream triggers | FunctionName | Milliseconds | Spot delays in stream processing |
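
Raw error and throttle counts scale with traffic, so rates are usually easier to alert on. The sketch below assumes all three inputs are Sum statistics for the same period. It also relies on the fact that throttled requests are not counted in Invocations, so they are added back in to get total attempts. The function name is illustrative.

```python
def lambda_rates(invocations, errors, throttles):
    """Compute error and throttle rates from per-period Lambda counts (Sum statistic)."""
    attempts = invocations + throttles  # throttled requests never become invocations
    return {
        "error_rate": errors / invocations if invocations else 0.0,
        "throttle_rate": throttles / attempts if attempts else 0.0,
    }


# Illustrative values for one period.
rates = lambda_rates(invocations=980, errors=49, throttles=20)
```

With these sample numbers, 5% of invocations failed and 2% of attempts were throttled.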

API Gateway Metrics (AWS/ApiGateway)

| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| Count | Number of requests | ApiName, Stage | Count | Traffic trends |
| Latency | End-to-end latency | ApiName, Stage | Milliseconds | Track user impact (use p95/p99) |
| IntegrationLatency | Backend processing time | ApiName, Stage | Milliseconds | Diagnose backend vs gateway |
| 4XXError | Client errors | ApiName, Stage | Count | Spot misuse or auth failures |
| 5XXError | Server errors | ApiName, Stage | Count | Alert on backend/system faults |
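
The "backend vs gateway" diagnosis comes down to one subtraction: Latency minus IntegrationLatency is the time spent in API Gateway itself. The sketch below assumes both inputs use the Average statistic for the same period, since differences between percentiles are not strictly meaningful. The function name is illustrative.

```python
def split_latency(latency_ms, integration_latency_ms):
    """Split end-to-end API Gateway latency into gateway overhead and backend share."""
    overhead_ms = max(latency_ms - integration_latency_ms, 0.0)
    backend_share = integration_latency_ms / latency_ms if latency_ms else 0.0
    return {"gateway_overhead_ms": overhead_ms, "backend_share": backend_share}


# Illustrative averages: 250 ms end to end, 200 ms of it in the backend.
breakdown = split_latency(250.0, 200.0)
```

A backend share close to 1.0 points at the integration (Lambda, HTTP backend) rather than the gateway.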

CloudFront Metrics (AWS/CloudFront)

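| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| Requests | Number of requests | DistributionId, Region=Global | Count | Request volume |
| BytesDownloaded / BytesUploaded | Data transfer | DistributionId, Region=Global | Bytes | Bandwidth usage |
| TotalErrorRate | Error % across all requests | DistributionId, Region=Global | Percent | High values = reliability issue |
| 4xxErrorRate / 5xxErrorRate | Client vs server error breakdown | DistributionId, Region=Global | Percent | Debug caching vs origin issues |
| CacheHitRate | % served from cache (an additional metric that must be enabled on the distribution) | DistributionId, Region=Global | Percent | Optimise caching behaviour |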

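Note: CloudFront metrics are published only in the US East (N. Virginia) Region (us-east-1), and every metric carries the dimension Region=Global. Point the Grafana query at us-east-1 regardless of where your origins run.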


CloudWatch Logs Metrics (AWS/Logs)

| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| IncomingLogEvents | Number of log events ingested | LogGroupName | Count | Track log volume |
| IncomingBytes | Data volume ingested | LogGroupName | Bytes | Estimate storage cost |
| DeliveryErrors | Failed delivery attempts | LogGroupName, DestinationType | Count | Debug subscription issues |
| DeliveryThrottling | Throttled log events | LogGroupName, DestinationType | Count | Detect limits exceeded |
| ErrorCount | API errors | Service, Resource | Count | Detect problems in log delivery |

CloudWatch Agent Metrics (CWAgent)

| Metric | Description | Key Dimensions | Units | Use in Grafana |
|---|---|---|---|---|
| cpu_usage_active / cpu_usage_idle / cpu_usage_system | CPU breakdown | host, InstanceId (if appended in the agent config) | Percent | Detailed CPU analysis |
| mem_used / mem_free / mem_available | Memory statistics | host | Bytes | Memory pressure detection |
| disk_used_percent / disk_free | Disk usage | path, device, fstype | Percent / Bytes | Disk capacity alerts |
| swap_used / swap_used_percent | Swap activity | host | Bytes / Percent | Spot performance issues |
| processes_running / processes_sleeping | Process states | host | Count | Detect overloaded hosts |
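
None of these metrics exist until the CloudWatch agent is configured to collect them. The sketch below builds an agent configuration covering the metrics in the table and prints it as JSON. It is a starting point under assumptions, not a recommended production config: the measurement selection and the choice to append InstanceId as a dimension are illustrative, so check them against the agent's configuration reference.

```python
import json

# Assumed minimal "metrics" section of a CloudWatch agent configuration file.
agent_config = {
    "metrics": {
        "namespace": "CWAgent",
        # Adds the EC2 instance ID as a dimension on every collected metric.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "cpu": {
                "measurement": ["cpu_usage_active", "cpu_usage_idle", "cpu_usage_system"],
                "totalcpu": True,  # report an aggregate across all cores
            },
            "mem": {"measurement": ["mem_used", "mem_free", "mem_available"]},
            "disk": {"measurement": ["used_percent", "free"], "resources": ["*"]},
            "swap": {"measurement": ["swap_used", "swap_used_percent"]},
            "processes": {"measurement": ["running", "sleeping"]},
        },
    }
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

Whatever dimensions the agent publishes are the ones you must select in Grafana, so keep the dashboard queries and this file in sync.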

Using Metrics Effectively in Grafana

  • Group by user impact: Focus dashboards on latency, errors, and throttles.

  • Overlay related metrics: Plot API latency alongside DynamoDB throttles and consumed capacity to spot correlations.

  • Alert on symptoms, not noise: A single throttle is rarely worth a page. A sustained latency increase accompanied by throttles is actionable.

  • Use percentiles, not averages: Averages hide the slow tail, while p95 and p99 show what your slowest requests actually look like to users.

  • Keep dashboards alive: Review and refine them as your architecture evolves.
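
The advice on percentiles is easiest to see with numbers. The sketch below uses made-up latency samples and a simple nearest-rank percentile. That method is an assumption chosen for clarity, and CloudWatch's own percentile calculation may differ in detail.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Made-up data: 90 fast requests and 10 very slow ones.
latencies_ms = [100] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)  # 290
p50_ms = percentile(latencies_ms, 50)    # 100
p95_ms = percentile(latencies_ms, 95)    # 2000
```

The average of 290 ms describes no real request: most users saw 100 ms and one in ten saw 2 seconds, which only the p95 exposes.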


Final Thoughts

CloudWatch metrics are AWS’s built-in observability layer. Grafana makes them usable by connecting signals across services. Together, they provide visibility and confidence without requiring a heavy observability stack.

If you want to move faster, reduce wasted effort, and calm down those “will it scale?” conversations, this is where you start: collect the right metrics, display them clearly, and use them to make decisions.


Gary Worthington is a software engineer, delivery consultant, and agile coach who helps teams move fast, learn faster, and scale when it matters. He writes about modern engineering, product thinking, and helping teams ship things that matter.

Through his consultancy, More Than Monkeys, Gary helps startups and scaleups improve how they build software, from tech strategy and agile delivery to product validation and team development.

Visit morethanmonkeys.co.uk to learn how we can help you build better, faster.