SLOs: Easier Said Than Done (But You Can Get Them Right)

Service Level Objectives (SLOs) sound simple—define what good service looks like, measure it, and call it a day. But getting them right is tricky. This guide will show you how to skip the fluff, focus on what matters, and avoid the common pitfalls that lead to SLO regret.

What Are SLOs, Really?

An SLO (Service Level Objective) is a measurable target that answers a critical question:

  • Are you delivering the service your users expect?

An SLO combines:

  • SLIs (Service Level Indicators): Metrics that measure performance, like latency, availability, or error rate.

  • Thresholds: What’s considered acceptable (e.g., “99.9% of API requests succeed”).

  • Time Windows: How long you’re measuring against (e.g., over 30 days).
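To make those three parts concrete, here is a minimal sketch in Python. The `SLO` class, its field names, and the sample counts are all hypothetical, chosen for illustration; this is not how Datadog or any other platform models an SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical container for the three parts of an SLO."""

    sli_name: str      # what is measured, e.g. "api_request_success"
    target: float      # threshold as a fraction, e.g. 0.999 for 99.9%
    window_days: int   # time window the target applies to

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Return True if the measured good/total ratio meets the target."""
        if total_events == 0:
            return True  # assumption: no traffic counts as compliant
        return good_events / total_events >= self.target


api_slo = SLO(sli_name="api_request_success", target=0.999, window_days=30)

# Made-up counts for one 30-day window.
print(api_slo.is_met(good_events=999_200, total_events=1_000_000))  # True
print(api_slo.is_met(good_events=998_500, total_events=1_000_000))  # False
```

Treating an empty window as compliant is a judgment call; some teams prefer to flag it as "no data" instead.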

Think of SLOs as your team’s North Star—they focus your efforts on what matters most to your users and the business. Without them, you’re flying blind—or worse, wasting time chasing metrics nobody cares about.

For a deeper dive into SLOs and how they fit into the Datadog platform, check out Datadog’s SLO documentation.

The Good, The Bad, and The Ugly of SLOs

Not all SLOs are created equal. Here’s how to tell the difference:

  • Good: Specific, measurable, and focused on user impact.

    • Example: “99.9% of API requests respond in under 600ms.”

  • Bad: Vague, unmeasurable, or based on things you can’t control.

    • Example: “Our app is always reliable.”

  • Ugly: Unrealistic or overly ambitious.

    • Example: “100% uptime, forever.”

Pro Tip: A bad or ugly SLO doesn’t just waste your time—it actively works against you. Stick to good ones, or don’t bother.
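What makes the "good" example good is that you can compute it. A rough sketch, assuming you already have per-request latencies in milliseconds (the `latency_sli` function and the sample data are made up for illustration):

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 600.0) -> float:
    """Fraction of requests that finished strictly under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests to measure")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


# Made-up sample: 9,995 fast requests and 5 slow ones.
sample = [120.0] * 9_995 + [850.0] * 5
sli = latency_sli(sample)
print(f"{sli:.2%} of requests under 600 ms; meets 99.9% target: {sli >= 0.999}")
```

Try writing the same function for “Our app is always reliable” and you will see why it counts as a bad SLO: there is nothing to count.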

How to Build SLOs That Actually Work

Step 1: Start With What Users Care About

Your users care about your uptime and whether your app does what they need, when they need it.

Ask yourself:

  • Can users complete transactions without errors?

  • Is the app responsive when they interact with it?

  • Are critical features available when they expect them?

Example:
For an e-commerce platform:

  • “99.95% of checkout transactions complete without errors over 30 days.”

What NOT to Do:

  • “Our uptime is good.” (Good? By whose definition?)
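The checkout objective above is measured over 30 days, which in practice usually means a rolling window. Here is a minimal sketch, assuming you can get daily counts of successful and total checkouts; the `RollingSLO` class and the traffic numbers are hypothetical.

```python
from collections import deque


class RollingSLO:
    """Tracks good/total counts over the most recent `window_days` daily buckets."""

    def __init__(self, target: float, window_days: int = 30) -> None:
        self.target = target
        # A bounded deque drops the oldest day automatically once it is full.
        self.days: deque[tuple[int, int]] = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def compliance(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0

    def is_met(self) -> bool:
        return self.compliance() >= self.target


checkout = RollingSLO(target=0.9995)
for _ in range(29):
    checkout.record_day(good=10_000, total=10_000)  # clean days
checkout.record_day(good=9_800, total=10_000)       # one bad day: 200 failures

print(f"{checkout.compliance():.4%}", checkout.is_met())
```

With 300,000 checkouts in the window, 99.95% leaves room for only 150 failures, so a single bad day with 200 failed checkouts is enough to miss the objective.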

Which brings us to the big question: can you actually deliver on your goals?

Step 2: Be Realistic About Your Goals

Perfection isn’t just overrated—it’s expensive. Setting SLOs that aim for the stars doesn’t make you a hero; it makes you look like you don’t understand how systems (or budgets) work.

What the Nines Really Mean
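Those extra nines in your reliability target? They’re not just numbers; they’re downtime realities. Over a 30-day window, here’s roughly what you’re actually signing up for:

  • 99% allows about 7.2 hours of downtime.

  • 99.9% allows about 43 minutes.

  • 99.99% allows about 4.3 minutes.

  • 99.999% allows about 26 seconds.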

If your cloud provider promises 99.99%, you’re not hitting 99.999% no matter how much you want it (unless you get really lucky, and you should never rely on luck).
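The arithmetic behind that ceiling is simple multiplication. The sketch below assumes failures are independent and that every layer is a hard dependency, which is a simplification; the five-nines figure for your own stack is hypothetical and generous.

```python
import math


def composite_availability(*layers: float) -> float:
    """Availability of a request path that needs every layer to be up.

    Assumes independent failures and hard (serial) dependencies.
    """
    return math.prod(layers)


provider = 0.9999     # what the cloud provider promises
own_stack = 0.99999   # hypothetical: your own code and config at five nines

combined = composite_availability(provider, own_stack)
print(f"combined availability: {combined:.5%}")
print("reaches 99.999%:", combined >= 0.99999)
```

Even with a near-perfect stack of your own, the combined figure lands just below the provider’s 99.99%. Redundancy across providers or regions can change the math, but that is a different (and more expensive) architecture.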

Focus on What Users Actually Care About
The reality: users won’t notice if you go from 99.9% to 99.99%, but they’ll notice buffering during a video stream or errors at checkout. Set goals that matter.

Good Target Example:

  • “95% of video streams play without buffering, measured over a rolling 30-minute window.”

Bad Target Example:

  • “Every request is perfect all the time.” (Spoiler alert: it won’t be.)

Balance the Trade-Offs
Every extra nine comes with a cost—infrastructure, complexity, and your sanity. Be realistic and save yourself from the heartbreak of impossible promises.

Step 3: Focus on a Few Critical Metrics

You don’t need an SLO for every corner of your infrastructure. Start with the metrics that matter most:

  1. Critical Services: Things that break the app if they fail.

  2. High-Impact Features: What users interact with most.

  3. SLIs That Matter: Latency, availability, error rate—metrics with direct user impact.

Step 4: Iterate and Improve

SLOs aren’t static. They evolve with your service and your users.

  • Review Regularly: Do your metrics, thresholds, and goals reflect the current reality in an ever-changing environment? Has a new feature or service been added to production that’s not accounted for?

  • Adjust as Needed: If your service grows more complex, modify thresholds or add new SLOs.

The Role of Error Budgets

An SLO without an error budget is just a number.

Error budgets define how much failure is acceptable before you take action. They help you:

  • Decide when to pause feature releases.

  • Prioritize fixes over new development.

  • Protect your team from burnout by setting clear boundaries for reliability.

Pro Tip: Think of error budgets as a buffer—they’re there to guide your decisions, not punish your team.
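An error budget is just the flip side of the target: a 99.9% SLO leaves 0.1% of events free to fail. A minimal sketch, with hypothetical function names, made-up traffic numbers, and an arbitrary 25% cutoff standing in for whatever policy your team agrees on:

```python
def error_budget(target: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates over the window."""
    return (1.0 - target) * total_events


def budget_remaining(target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(target, total_events)
    if budget == 0:
        return 0.0 if failed_events == 0 else float("-inf")
    return 1.0 - failed_events / budget


# Hypothetical 30-day numbers for a 99.9% SLO.
remaining = budget_remaining(target=0.999, total_events=2_000_000, failed_events=1_700)
print(f"budget remaining: {remaining:.0%}")

if remaining < 0.25:  # example policy threshold, not a recommendation
    print("pause feature releases and prioritize fixes")
```

Here the budget is about 2,000 failed requests and 1,700 are already spent, so roughly 15% is left and the example policy says to slow down.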

Why SLOs Matter Beyond Engineering

Error budgets make SLOs actionable, but their value goes beyond engineering.

SLOs are a common language between engineering and the business. They:

  • Align Priorities: Focus on what users and the business value most.

  • Build Trust: Transparent metrics build confidence between teams.

  • Drive Decisions: Help stakeholders understand trade-offs (e.g., speed of delivery vs. reliability).

Wrap-Up: SLOs Done Right

SLOs are your team’s compass. Start with what matters to your users, keep it simple, and don’t overthink it.

  • Focus on critical metrics.

  • Be realistic about what you can achieve.

  • Iterate as your service evolves.

Need help getting your SLOs right the first time? Reach out, and let’s cut through the noise together.
