Using Grafana and AWS Metrics to Bring Observability to Any Project

Most teams talk about observability, but many do not actually have it. Logs are buried somewhere, CloudWatch alarms are set up because “the console suggested it,” and a few dashboards magically appear after a nasty outage. That is not observability. That is hoping your customers are generous enough to report bugs for you.

Without observability, you are blind. You cannot explain why a service slowed down, you cannot prove your architecture is safe to scale, and you cannot make sensible trade-offs. You are left with guesswork and gut feel, which is a polite way of saying “fingers crossed.”

The good news is that observability does not require a PhD or a six-figure tooling budget. If you are running on AWS, the metrics are already there. Add Grafana, and suddenly you have clarity instead of confusion.

Why Observability Matters

Monitoring and observability are not the same thing. Teams often use the words interchangeably, which is usually a sign that they have neither.

Monitoring tells you when something is broken.
Observability lets you explain why it broke, and answer questions you did not plan for.

An alarm that says CPU is at 95% is monitoring. A dashboard that shows API latency rising in line with DynamoDB throttles and a surge of requests, and lets you say “that is why response times went up,” is observability.

For startups, this is the difference between confidently launching features and praying nothing falls over. For scaleups, it is the difference between investors nodding along in meetings and investors panicking that you do not have your house in order.

What AWS Gives You for Free

CloudWatch quietly collects a mountain of metrics. Most teams only look at the surface. Underneath, the gold is already there:

EC2: CPU, network, and with the agent, memory and disk.
RDS: CPU, connections, read and write IOPS, replication lag.
DynamoDB: consumed read and write capacity, throttles, and latency.
API Gateway: request counts, latencies, error rates.
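Lambda: invocations, durations, errors, throttles. Cold starts are not a standard metric; they show up as Init Duration in the function's REPORT log lines.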

If you are not using these, you are paying AWS for the privilege of ignoring useful data.
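If you want to see that data outside the console, here is a minimal sketch of pulling two of these metrics with boto3's get_metric_data call. The resource names my-api and my-table are placeholders, the helper names are invented for illustration, and the ApiName dimension assumes a REST API in API Gateway. Treat it as a starting point rather than a finished tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical resource names; replace with your own.
API_NAME = "my-api"
TABLE_NAME = "my-table"


def metric_query(query_id, namespace, metric, dimensions, stat, period=60):
    """Build one entry for the MetricDataQueries list of GetMetricData."""
    return {
        "Id": query_id,  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric,
                "Dimensions": [
                    {"Name": name, "Value": value}
                    for name, value in dimensions.items()
                ],
            },
            "Period": period,  # seconds
            "Stat": stat,
        },
        "ReturnData": True,
    }


def build_queries():
    """Queries for API latency (p99) and DynamoDB read throttles (sum)."""
    return [
        metric_query("api_latency", "AWS/ApiGateway", "Latency",
                     {"ApiName": API_NAME}, "p99"),
        metric_query("ddb_read_throttles", "AWS/DynamoDB", "ReadThrottleEvents",
                     {"TableName": TABLE_NAME}, "Sum"),
    ]


def fetch_last_hour(queries):
    """Call CloudWatch. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the query builders work without it

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=queries,
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )
    # Pagination (NextToken) is ignored here; an hour of data fits in one page.
    return {
        result["Id"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }
```

Calling fetch_last_hour(build_queries()) returns a dictionary keyed by query Id, each holding (timestamp, value) pairs for the last hour.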

Grafana: Turning Metrics Into Insight

Grafana takes CloudWatch data and makes it usable. You can combine services, overlay related signals, and drill into specific problems.

Health Dashboard

Show together:

API Gateway p99 latency
Lambda duration and error rate
DynamoDB throttles
RDS connections

If latency spikes, you can immediately see whether the problem is the database, the code, or the network. Much better than guessing and bouncing services at random.

User Experience Dashboard

Show:

CloudFront edge latency by region
API Gateway latency
HTTP error rate

This stops the “works for me” defence. If users in Asia are suffering and Europe is fine, the evidence is right there.

Cost vs Performance Dashboard

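Show (cost figures come from billing data, for example the AWS/Billing EstimatedCharges metric or Cost Explorer, rather than from the EC2 and RDS metric namespaces):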

EC2 average CPU utilisation
EC2 cost
RDS storage cost

If CPU is flat at 15% while spend climbs, you are over-provisioned. If CPU averages 70% and spend is stable, you are efficient. Numbers are a lot harder to argue with than opinions.
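The same judgement can be written down as a small check. This is a sketch only: the function name, the 20% CPU threshold, and the 10% spend-growth threshold are assumptions chosen for illustration, not AWS guidance, so tune them to your own workload.

```python
from statistics import mean


def provisioning_verdict(cpu_percent, daily_spend, low_cpu=20.0, growth=1.10):
    """Classify a period from average CPU and the trend in spend.

    cpu_percent: CPU utilisation samples (0-100) for the period.
    daily_spend: cost per day, oldest first, in a single currency.
    low_cpu and growth are illustrative thresholds (assumptions).
    """
    if not cpu_percent or len(daily_spend) < 2:
        raise ValueError("need CPU samples and at least two days of spend")

    avg_cpu = mean(cpu_percent)
    half = len(daily_spend) // 2
    # Spend counts as rising when the later half averages more than
    # `growth` times the earlier half.
    spend_rising = mean(daily_spend[half:]) > mean(daily_spend[:half]) * growth

    if avg_cpu < low_cpu and spend_rising:
        return "over-provisioned"
    if avg_cpu < low_cpu:
        return "underused"
    return "ok"
```

For example, provisioning_verdict([15, 14, 16], [100, 100, 130, 140]) reports "over-provisioned", while a fleet averaging 70% CPU on flat spend comes back as "ok".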

Worked Example: DynamoDB Capacity and API Latency

One of the most common startup problems is an API that “feels” slow. Engineers blame Lambda cold starts, founders blame AWS, and no one actually proves anything.

With Grafana, you can show API Gateway latency and DynamoDB consumed capacity on the same graph.

When API latency rises in sync with DynamoDB throttles, the issue is provisioned capacity.
When API latency rises without throttles, the issue is probably code or external calls.

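Grafana panel JSON (API Gateway latency + DynamoDB capacity; field names follow the CloudWatch data source in Grafana 8 and later, and my-api and my-table are placeholders):

{
  "type": "timeseries",
  "title": "API latency vs DynamoDB consumed capacity",
  "datasource": { "type": "cloudwatch" },
  "targets": [
    {
      "refId": "A",
      "region": "default",
      "namespace": "AWS/ApiGateway",
      "metricName": "Latency",
      "dimensions": { "ApiName": "my-api" },
      "statistic": "p99",
      "period": "60"
    },
    {
      "refId": "B",
      "region": "default",
      "namespace": "AWS/DynamoDB",
      "metricName": "ConsumedReadCapacityUnits",
      "dimensions": { "TableName": "my-table" },
      "statistic": "Sum",
      "period": "60"
    }
  ]
}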

One graph like this can end days of finger pointing. Suddenly, everyone is looking at the same reality instead of arguing from gut feel.
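The rule of thumb in this worked example is simple enough to write as code. Below is a minimal sketch, assuming you already have per-minute p99 latency and throttle counts as two lists. The function name, the labels, and the 500ms threshold are illustrative choices, not part of any AWS or Grafana API.

```python
def explain_latency(latency_ms, throttles, threshold_ms=500.0):
    """Label each minute as 'ok', 'capacity' or 'code-or-external'.

    latency_ms: API Gateway p99 latency per minute, in milliseconds.
    throttles: DynamoDB throttle events for the same minutes.
    """
    if len(latency_ms) != len(throttles):
        raise ValueError("both series must cover the same minutes")

    labels = []
    for latency, throttled in zip(latency_ms, throttles):
        if latency <= threshold_ms:
            labels.append("ok")
        elif throttled > 0:
            # Slow and throttled: look at provisioned capacity first.
            labels.append("capacity")
        else:
            # Slow without throttles: suspect code or external calls.
            labels.append("code-or-external")
    return labels
```

It is deliberately crude. The point is that the argument becomes a lookup, not a debate.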

Making Metrics Actionable with Alerts

Dashboards are nice, but they do not wake you up at 3am. Alerts do, as does that annoying neighbour who thinks it's OK to put his bins out at that hour (thanks Brian, who likes sleep anyway?).

Grafana can notify Slack, PagerDuty, or email. The trick is not to spam the team until they stop caring.

Bad alert: fire every time DynamoDB throttles. Congratulations, you have just created background noise.
Good alert: fire when API latency exceeds 500ms for 2 minutes and DynamoDB throttles are happening. That is a real symptom of user pain.

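Alert rule (YAML, Prometheus-style rule syntax as used by Prometheus or Grafana Mimir; metric names assume a CloudWatch exporter and will differ in your setup):

groups:
  - name: api-latency
    interval: 1m
    rules:
      - alert: HighLatency
        # Assumed metric names from a CloudWatch exporter; check your own.
        # API Gateway reports Latency in milliseconds, so the threshold is 500.
        expr: |
          aws_apigateway_latency_average > 500
          and on() sum(aws_dynamodb_read_throttle_events_sum) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "API latency above 500ms for 2 minutes while DynamoDB is throttling"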
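
The "good alert" described above combines a sustained latency breach with throttling. As a sketch of that logic in plain Python, assuming one sample per minute and with a function name and defaults chosen purely for illustration:

```python
def should_alert(latency_ms, throttles, threshold_ms=500.0, sustained=2):
    """True when the last `sustained` samples all breach the latency
    threshold AND show DynamoDB throttling.

    Both lists hold one sample per minute, oldest first.
    """
    if len(latency_ms) != len(throttles):
        raise ValueError("both series must cover the same minutes")
    if len(latency_ms) < sustained:
        return False  # not enough history to call it sustained

    recent = zip(latency_ms[-sustained:], throttles[-sustained:])
    return all(lat > threshold_ms and thr > 0 for lat, thr in recent)
```

A single slow minute, or throttles without slow responses, stays quiet. Only the combination that hurts users fires.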

Observability as a Leadership Tool

Observability is not only for engineers.

Founders want confidence the system will survive a marketing campaign.
Investors want proof that risks are known and manageable.
Engineers just want fewer arguments about whose code is at fault.

I have been in investor meetings where a live Grafana dashboard killed the discussion. Instead of debating scaling risk, we showed throughput, latency, and auto-scaling behaviour in real time. That is not a debate, that is evidence.

Getting Started

You do not need a six-month “observability strategy” to start. Do this:

Check that CloudWatch metrics are flowing for EC2, RDS, DynamoDB, API Gateway, and Lambda. Most are published by default; EC2 memory and disk need the CloudWatch agent.
Run Grafana (EC2, ECS, or Grafana Cloud) with IAM permissions to read metrics.
Build three dashboards: health, user experience, cost vs performance.
Add alerts for symptoms of customer pain.
Review dashboards regularly, otherwise they become expensive wallpaper.
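
For the IAM step, here is a sketch of a read-only policy that Grafana could use for metrics. The function name and Sid values are made up for illustration. The action list is a reasonable minimum for metric queries, not an official Grafana policy, so compare it with the current Grafana CloudWatch documentation before relying on it.

```python
import json


def grafana_metrics_policy():
    """Return an IAM policy document granting read-only metric access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadCloudWatchMetrics",
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:ListMetrics",
                    "cloudwatch:GetMetricData",
                    "cloudwatch:GetMetricStatistics",
                ],
                # These actions do not support resource-level restrictions.
                "Resource": "*",
            },
            {
                "Sid": "DiscoverRegionsAndTags",
                "Effect": "Allow",
                "Action": ["ec2:DescribeRegions", "tag:GetResources"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into the IAM console or a template.
    print(json.dumps(grafana_metrics_policy(), indent=2))
```

Nothing in this policy can change or delete anything, which is exactly what you want from a dashboard tool.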

This takes less than a week. If you are not doing it, it is not because it is hard. It is because you have not made it a priority.

The Bigger Picture

Observability is not a “nice to have.” It is one of the simplest ways to reduce wasted time, calm nervous founders, and give investors confidence.

AWS metrics provide the data. Grafana provides the visibility. Together, they provide confidence that the system is working, that scaling is under control, and that problems can be explained and fixed quickly.

Observability is not about pretty graphs. It is about being able to say, without bluffing, “we know exactly what is happening inside our system and we can prove it.”

Gary Worthington is a software engineer, delivery consultant, and agile coach who helps teams move fast, learn faster, and scale when it matters. He writes about modern engineering, product thinking, and helping teams ship things that matter.

Through his consultancy, More Than Monkeys, Gary helps startups and scaleups improve how they build software, from tech strategy and agile delivery to product validation and team development.
