    Prometheus
    Cardinality
    SRE
    Monitoring

    How One Team Slashed Prometheus Memory From 60GB to 20GB - And Exposed the Silent Cardinality Crisis

    January 22, 2026
    8 min read
There’s a moment every Prometheus operator hits eventually. The UI gets sluggish. Scrapes start failing. PromQL queries time out. Dashboards hang just long enough to make you question your life choices. Then you check memory. And Prometheus is casually chewing through 50 to 60GB of RAM.

That’s not “slightly overprovisioned.” That’s “this thing is about to become your biggest incident.”

One engineer shared exactly that scenario. A cluster where Prometheus had ballooned to ~60GB. Scrape reliability was degrading. Queries were timing out. Stability was slipping.

This wasn’t theoretical tuning. This was survival. And the fix wasn’t exotic. It was ruthless.

---

## The Real Culprits: Three Predictable Killers

The investigation uncovered three classic Prometheus memory traps:

1. **Duplicate scraping**
2. **Histogram overload**
3. **Label explosion**

If you’ve run Prometheus in Kubernetes long enough, you’ve probably been bitten by at least one. Most teams get hit by all three.

---

## 1. Duplicate Scraping: The Accidental 2x Multiplier

Prometheus was scraping ingress metrics from both pods *and* a ServiceMonitor. That sounds harmless. It’s not. Every duplicate scrape doubles:

- Time series count
- Memory usage
- WAL pressure
- Head block churn

And unless you’re explicitly deduplicating downstream, you’ve now doubled cardinality for that entire metric set.

This is the silent multiplier problem. You don’t feel it immediately. But it compounds across clusters. One small config oversight can cost you tens of gigabytes.

The fix? Disable unnecessary pod-level scraping. Simple. Brutal. Effective.

---

## 2. Histogram Overload: Death by Buckets

Histograms are powerful. They’re also expensive. Metrics like `*_duration_seconds_bucket` can generate hundreds of thousands of time series if:

- You have many buckets.
- You have many label combinations.
- You have many replicas.

Multiply buckets × labels × pods × clusters. Now multiply that by retention in the head block. Suddenly your memory graph looks like a hockey stick.

Histograms are often enabled by default in exporters. Teams rarely revisit whether they need all buckets. Or whether high-cardinality labels are attached to them. The result? Prometheus becomes a histogram warehouse.

---

## 3. Label Explosion: The Cardinality Monster

This one is the most dangerous. Labels like:

- `replicaset`
- `path`
- `container_id`

were producing 10k+ unique values. That’s not “a bit high.” That’s a cardinality bomb.

Each unique label combination equals a unique time series. If `path` includes raw URLs with IDs embedded? If `container_id` changes every deploy? If `replicaset` rotates constantly? You are generating thousands of new time series per rollout.

Prometheus keeps them in memory. Not because it’s broken. Because that’s how it works.

---

## The Fix Wasn’t Fancy — It Was Disciplined

The remediation steps were straightforward:

- Drop unused metrics (after validating dashboards and alerts)
- Disable redundant scraping
- Remove high-cardinality labels that weren’t actually used
- Write scripts to verify what was safe to drop

That last one is important. Blindly dropping labels can cause ingestion errors. A moderator stepped in with a crucial warning: `labeldrop` removes the label — it does not remove the series. If that label was required for uniqueness, removing it can cause duplicate series collisions and ingestion failures.

That’s the kind of subtle Prometheus behavior that trips people up. You’re not deleting series. You’re collapsing them. And collapse without thinking leads to chaos.
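To make the distinction concrete, here is a minimal relabeling sketch. It is not taken from the post; the job, metric, and label names are hypothetical. The point is the difference in behavior: `labeldrop` strips a label but keeps (and can collide) the underlying series, while `action: drop` removes matching series entirely.

```yaml
scrape_configs:
  - job_name: example-app                # hypothetical job
    static_configs:
      - targets: ["example-app:9090"]
    metric_relabel_configs:
      # labeldrop: removes the 'container_id' label from every ingested series.
      # The series themselves survive; if container_id was the only thing
      # keeping two series distinct, they now collide at ingestion time.
      - action: labeldrop
        regex: container_id
      # drop: removes matching series entirely. Use this when the data itself
      # (here, the buckets of a hypothetical noisy histogram) is not worth storing.
      - source_labels: [__name__]
        regex: example_request_duration_seconds_bucket
        action: drop
```

If you keep a metric but not one of its labels, check first that the remaining labels still uniquely identify every series; otherwise prefer dropping the series or fixing the exporter.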
The author updated their post accordingly after the correction. That exchange alone highlights something important: Prometheus tuning isn’t guesswork. It’s precision work.

---

## The Result: 60GB → 20GB

After cleanup, memory dropped from roughly 60GB to about 20GB. And stability returned. Scrapes normalized. UI responsiveness improved. PromQL stopped timing out.

Same cluster. Same workloads. Different discipline. That delta — 40GB — wasn’t magic compression. It was removing waste.

---

## The Uncomfortable Pattern Across Teams

The comment section reveals something telling. Multiple teams said they’re facing similar issues. Someone joked “Laughs in VictoriaMetrics.” Another said, “NetData ;)”

That’s the usual cycle:

1. Prometheus grows.
2. Memory balloons.
3. People blame the database.
4. Alternative TSDB vendors enter the chat.

But here’s the hard truth: Most Prometheus memory explosions aren’t Prometheus’s fault. They’re architectural. Duplicate scraping. Unbounded label cardinality. Overzealous histograms. Kubernetes default metric overload.

In fact, one strong recommendation was to review the `action: drop` defaults in `kube-prometheus-stack`, because Kubernetes apiserver and cAdvisor metrics can overwhelm small setups. Kubernetes emits a *lot* of metrics. If you ingest all of them blindly, you’re volunteering for a memory crisis.

---

## The Real Lesson: Prometheus Doesn’t Forgive Laziness

Prometheus is brutally honest. It stores exactly what you tell it to store. It doesn’t compress away your bad labeling decisions. It doesn’t automatically protect you from cardinality explosions. It doesn’t warn you when histograms multiply by replicas.

It assumes you know what you’re doing. That’s power. And danger.

When people say “Prometheus doesn’t scale,” what they often mean is: “We didn’t manage cardinality.”

---

## If You’re Sitting at 40GB+ RAM Right Now

Here’s the uncomfortable checklist:

- Are you scraping the same endpoint twice?
- Do you actually need pod-level metrics for every service?
- How many histogram buckets are you exporting?
- Are you attaching user-level or path-level labels to high-volume metrics?
- Are container IDs or replica hashes in your labels?
- Have you audited unused metrics recently?
- Are you reviewing drop rules in `kube-prometheus-stack`? (A hedged starting point is sketched at the end of this post.)

If you haven’t asked those questions, your memory graph is just waiting for its turn.

---

## Prometheus at 20GB Isn’t Small. It’s Just Controlled.

The takeaway isn’t “run less.” It’s “run intentionally.”

A Prometheus instance using 20GB responsibly for a large Kubernetes cluster can be perfectly healthy. A Prometheus instance using 60GB because it’s duplicating ingress metrics and tracking every container ID ever created? That’s waste.

The difference is discipline. And in observability, discipline scales better than hardware. Every time.
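As a starting point for the `kube-prometheus-stack` item on the checklist above, here is a hedged sketch of Helm values that trim two of the heaviest built-in sources. The exact value paths and the metrics worth dropping vary by chart version and by what your dashboards actually use, so treat every name below as an assumption to verify against your own `values.yaml`, not a drop-in recommendation.

```yaml
# Sketch of kube-prometheus-stack Helm values; keys may differ between
# chart versions, so confirm them against the chart's documented values.
kubelet:
  serviceMonitor:
    # cAdvisor is often the single largest source of series in a cluster.
    cAdvisorMetricRelabelings:
      - source_labels: [__name__]
        regex: container_(tasks_state|memory_failures_total)
        action: drop
kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      # apiserver histogram buckets multiply across verbs, resources, and scopes.
      - source_labels: [__name__]
        regex: apiserver_request_duration_seconds_bucket
        action: drop
```

Whatever you drop, validate against dashboards and alerts first, exactly as the team in this story did.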