Day 6 | 💸 The famous green character that stole your cloud budget: the cardinality problem
Published December 15, 2025
Every December, engineering teams unwrap the same unwanted gift: their annual observability bill. And every year, it's bigger than the last.
You know the pattern. Services multiply. Traffic grows. Someone discovers OpenTelemetry and suddenly every microservice is emitting 50 spans per request instead of 5. Then January rolls around and your observability platform sends an invoice that's 30% higher than last quarter.
Your VP of Engineering wants to know why.
You could blame it on the famous green character who hates Christmas, or you could join other teams who are getting serious about cost-efficient observability. That is, collecting telemetry data based on value, not volume.
Why "collect everything" no longer works
The old playbook was simple: instrument everything, store it all, figure out what you need later. Storage was cheap enough. Queries were fast enough. No need to overthink it.
Then, three things happened:
- OpenTelemetry went mainstream. Teams migrated from vendor agents to OTel and began adding spans for everything. This added more visibility, but with 10x the data.
- AI observability tools arrived. Platforms started using LLMs to analyze traces and suggest root causes. Powerful, but also expensive to run against terabytes of unfiltered trace data.
- CFOs started asking questions. "Our traffic grew 15% but observability costs grew 40%. Explain."
Ripping out instrumentation isn't an option, and you still want to make informed decisions. But the biggest culprit hiding in your telemetry stack is cardinality.
Cardinality will eat your budget
Cardinality is the observability villain. It sneaks in quietly, one innocent-looking label at a time, and before you know it, it's stolen your entire cloud budget. What is cardinality? It's just the number of unique time series your metrics generate, but it's also the main driver of observability costs that nobody sees coming.

Low cardinality: http_requests_total tracked by method and status_code. Maybe 20 unique combinations. Fairly manageable.
High cardinality: Same counter, but now you've added user_id, request_id, and session_token as labels. Those three labels alone create millions of unique time series. Each one needs storage, indexing, and query compute. This compounds your bill faster than you can say deck the halls, except you won't be decking any halls, you'll be paying off your usage bill.
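To make the multiplication concrete, here's a back-of-the-envelope sketch: the number of unique time series is roughly the product of each label's distinct-value count. The counts below are hypothetical.

```python
def series_count(label_values: dict[str, int]) -> int:
    """Multiply the distinct-value counts of every label on a metric."""
    total = 1
    for label, count in label_values.items():
        total *= count
    return total

# Low cardinality: method x status_code
low = series_count({"method": 4, "status_code": 5})  # 20 series

# High cardinality: one user_id label and the count explodes
high = series_count({"method": 4, "status_code": 5, "user_id": 250_000})

print(low, high)  # 20 5000000
```

That single user_id label turns 20 series into five million, each billed for storage, indexing, and query compute.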
Stopping the green character: set cardinality budgets
Most teams don't set limits on how many time series a service can create, even though they should. You can.
Start by auditing what you're currently generating. Look for metrics with >100K unique time series, or labels that include UUIDs, request IDs, or email addresses. These are your problem children.
Then set budgets. Give each service a limit, like 50K time series max. Assign team quotas so the checkout team knows they get 200K total across all their services. Create attribute allowlists that define exactly which labels are allowed in production. Yes, this feels restrictive at first. Your developers will complain. They'll argue that they need that user_id label for debugging. And sometimes they're right. But forcing that conversation up front means they have to justify the cost, not just add labels reflexively.
Finally, enforce budgets through linters that flag high-cardinality attributes in code review, CI checks that fail if estimates get too high, and dashboards that alert when cardinality spikes. This isn't about being restrictive. It's about being intentional. If you're adding a label, you should know why and what it'll cost.
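A minimal sketch of what such a CI check could look like: scan metric definitions for labels on a denylist of known high-cardinality attributes. The metric definitions and denylist below are hypothetical.

```python
# Labels that almost always explode cardinality (illustrative denylist).
DENYLIST = {"user_id", "request_id", "session_token", "email"}

def lint_metrics(metrics: dict[str, list[str]]) -> list[str]:
    """Return 'metric: label' violations for any denylisted label."""
    violations = []
    for name, labels in metrics.items():
        for label in labels:
            if label in DENYLIST:
                violations.append(f"{name}: {label}")
    return violations

metrics = {
    "http_requests_total": ["method", "status_code"],
    "checkout_latency_ms": ["endpoint", "user_id"],  # problem child
}

problems = lint_metrics(metrics)
if problems:
    print("High-cardinality labels found:", problems)
    # In a real CI job you would exit non-zero here to fail the build.
```

Wiring a check like this into code review makes the "do we really need user_id here?" conversation happen before the bill arrives.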
Cardinality budgets solve the metrics problem, but what about traces? That's where sampling comes in.
Sampling without the guilt

Not all sampling strategies are created equal, and picking the right one depends on what you're trying to protect.
Head-based sampling is pretty strict. You decide whether to keep a trace at the very start of a request, before you know if it'll be interesting. Fast checkout gets dropped. Slow checkout that times out also gets dropped, because the decision happened too early. Not great.
Tail-based sampling is smarter. Wait until the trace completes, then decide based on what actually happened. Keep errors, high latency, or specific user cohorts. Sample down the boring stuff. This costs more (you have to buffer complete traces) but you keep what matters.
Probabilistic sampling is the middle ground. Keep 10% of everything, regardless of content. Predictable cost reduction, but you'll still lose some critical events. Works fine for stable services where trends matter more than individual traces.
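A tail-based decision can combine these ideas: always keep errors and slow traces, and sample the rest probabilistically. Here's a minimal sketch; the 2-second threshold and 10% baseline are assumptions, not recommendations.

```python
import random

def keep_trace(duration_ms: float, is_error: bool, baseline: float = 0.10) -> bool:
    """Decide after the trace completes, once error and latency are known."""
    if is_error:
        return True                      # always keep failures
    if duration_ms > 2_000:
        return True                      # always keep slow requests
    return random.random() < baseline    # sample down the boring rest

print(keep_trace(150, is_error=True))     # True: errors are always kept
print(keep_trace(3_500, is_error=False))  # True: over the latency threshold
```

The buffering cost mentioned above comes from having to hold every span of a trace in memory until this decision can run.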
Now rule-based sampling is where things get interesting, and honestly where most teams should be spending their energy. The idea is dead simple: different traffic deserves different sampling rates. You keep 100% of traces during feature rollouts because you actually care about every request when youâre validating a new flow.
If you're using LaunchDarkly for progressive rollouts, you can tie sampling rates directly to flag evaluations. 100% sampling for users in the new variant, 10% for the control group. Your main API endpoints can run at 50% since they're stable and high-volume. Internal health checks that just verify the service is alive need maybe 5%, or even less. I've seen teams go down to 1% for health checks and never miss it.
The key is that you're making these decisions based on the actual value of the signal, not just applying a blanket rate across everything. Adjust based on context: feature flags, experiments, specific endpoints, user cohorts, whatever makes sense for your system.
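One way to sketch that kind of rule table in Python; the rules, rates, and flag name are illustrative assumptions, not any particular vendor's API.

```python
# Ordered rules: first matching predicate wins. Rates are illustrative.
RULES = [
    (lambda t: t.get("flag_variant") == "new_checkout", 1.00),  # rollout: keep all
    (lambda t: t.get("endpoint") == "/healthz", 0.01),          # health checks
    (lambda t: t.get("endpoint", "").startswith("/api"), 0.50), # stable, high-volume API
]
DEFAULT_RATE = 0.10  # blanket fallback for everything else

def sampling_rate(trace_attrs: dict) -> float:
    """Pick a sampling rate from the first rule matching the trace's attributes."""
    for predicate, rate in RULES:
        if predicate(trace_attrs):
            return rate
    return DEFAULT_RATE

print(sampling_rate({"endpoint": "/healthz"}))          # 0.01
print(sampling_rate({"flag_variant": "new_checkout"}))  # 1.0
```

Ordering matters: the rollout rule sits first so a flagged user's health-check noise doesn't shadow the 100% capture you actually want during validation.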

Sampling isn't about compromising visibility. It's about amplifying signals. The noisy 90% of traces you're storing never get looked at anyway.
Once youâve decided what to keep, you still need to decide how long to keep it and at what resolution.
Downsample vs. discard: know when to do which
Not all data reduction is the same, and mixing up downsampling with discarding is how teams accidentally delete data they actually need.
Downsample when you need historical context but not full precision. SLO burn rates don't need second-by-second granularity, so you can downsample to 1-minute intervals and still catch every trend. An additional practice is to keep high-res data for a week, then downsample to hourly for long-term retention.
Discard when the data is redundant or has served its purpose. For instance, debug spans from a canary that passed three days ago can be deleted. Or if you captured an error in both a trace and a log, you can pick one source of truth and drop the duplicate.
The rule of thumb here: if you'll never query it, don't store it. If you might need it for trends in six months, downsample it. If you need it immediately when something breaks, keep it at full resolution with an aggressive retention policy.
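As a minimal downsampling sketch, here's per-second points averaged into 1-minute buckets; the data points are made up.

```python
from collections import defaultdict

def downsample_1m(points: list[tuple[int, float]]) -> dict[int, float]:
    """Average (timestamp_seconds, value) samples into 60-second buckets,
    keyed by the start of each minute."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 60].append(value)
    return {minute: sum(vs) / len(vs) for minute, vs in buckets.items()}

points = [(0, 1.0), (30, 3.0), (60, 10.0)]
print(downsample_1m(points))  # {0: 2.0, 60: 10.0}
```

Averaging is the simplest choice; for latency metrics you'd typically keep max or a percentile per bucket instead, so spikes survive the downsample.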
What this actually looks like

Cost-efficient observability isn't about cutting capabilities. It's about cutting waste.
Start by auditing your cardinality. Find the metrics generating hundreds of thousands of time series because someone added user_id as a label. Then, set budgets like 50K per service, 200K per team and enforce them through linters and CI checks. Create ways to encourage developers to justify high-cardinality labels before they ship, not after the bill arrives.
Then you'll be ready to tackle sampling. Drop the blanket 10% probabilistic rate and switch to rule-based sampling tied to actual value. Keep 100% of traces during feature rollouts. Sample stable endpoints at 10%. Go as low as 1% for health checks. If you're running feature flags, tie sampling to flag evaluations so you capture what matters and discard what doesn't.
Finally, clean up retention: downsample SLO metrics to 1-minute intervals, discard debug spans from canaries that passed days ago, and delete duplicate error data.
This not only leads to lower bills, but also cleaner dashboards, faster queries, fewer noisy alerts, and teams that spend less time swimming through telemetry and more time fixing actual problems.
Observability ROI isnât measured in data volume. Itâs measured in how fast you detect and resolve issues.
The teams figuring this out in 2025 aren't collecting everything. They're collecting what matters.

