Vendor comparison

Datadog vs Prometheus + Grafana: true TCO math

Verified June 2026

The most common cost-savings pitch in cloud monitoring is to migrate from Datadog to self-hosted Prometheus and Grafana. The pitch is true at very small and very large scale and false in the middle. Honest TCO accounting requires including engineering time, on-call burden, and the compounding cost of observability outages.

TL;DR

Self-hosted Prometheus is genuinely cheaper at very small scale (single-VM Prometheus for under 20 hosts) and at very large scale (dedicated platform team for 1,000+ hosts). In the middle, Datadog or Grafana Cloud is almost always cheaper once engineering cost is honestly counted. The hidden cost is on-call burden during incidents when both the application and the observability platform need attention simultaneously.

The framing problem

Why "Prometheus is free" is misleading

The migration pitch goes: Prometheus is free, Grafana is free, the licence cost drops to zero, and the team pays only for compute and storage. The arithmetic is correct on the licence side and incomplete on the operational side. It leaves out engineering time, on-call burden during the worst possible incidents, and the compounding cost of observability platform outages.

At very small scale (under 20 hosts, single platform engineer), the operational burden is genuinely modest. A single Prometheus VM scraping 20 hosts, with Grafana pointed at it and Loki for logs on the same VM, fits comfortably in 30 to 50 hours of initial setup and about 5 hours per month of ongoing maintenance. The Datadog free tier covers up to 5 hosts; at 20 hosts, Datadog costs roughly $360 per month. Self-hosted is genuinely cheaper at this scale because the engineering tax is small.

At very large scale (1,000+ hosts, dedicated platform engineering team), self-hosted Prometheus with proper scaling (Mimir or Cortex for long-term storage, Loki for logs, Tempo for traces, all running production-grade with high availability) requires 3 to 4 platform engineers full-time. At a fully loaded cost of $13,500 per FTE per month, that is roughly $40,500 to $54,000 per month in engineering. Datadog at 1,000 hosts with negotiated enterprise rates is typically $60,000 to $120,000 per month. Self-hosted is cheaper at this scale, sometimes by $20,000 to $60,000 per month, but only because the platform team's cost is amortised across the broader observability function.

The middle is where the false economy lives. At 50 to 500 hosts, the engineering cost of running production-grade Prometheus + Loki + Tempo with high availability, long-term retention, on-call rotation, and ongoing capacity management exceeds the Datadog or Grafana Cloud hosted bill at the same scale. A team that hires one observability engineer at $13,500 per month loaded cost is spending more on the engineer than they would on Grafana Cloud at 100 hosts.

Three scenarios with honest engineering cost

The TCO comparison

Scenario: Small team (10 hosts, 1 platform engineer)

Datadog: $200 to $400 / month. Free tier covers most of this. Trivial operational burden.

Self-hosted: $0 license + ~50 hrs setup + ongoing time. Single-VM Prometheus + Grafana works at this scale. ~$0 cloud cost. ~5 hours/month maintenance.

Cheaper at this scale: Self-hosted

Scenario: Mid-market (100 hosts, dedicated observability engineer)

Datadog: $5,500 to $9,000 / month. Predictable monthly bill. Operational burden is configuring and maintaining Datadog dashboards/alerts.

Self-hosted: $1,500 cloud + $13,500 engineering = ~$15,000 / month. 1 FTE platform engineer at ~$13.5K loaded cost per month for observability infrastructure ownership. Cloud storage for Prometheus/Loki adds $1K-$2K.

Cheaper at this scale: Datadog

Scenario: Enterprise (1,000 hosts, dedicated platform team)

Datadog: $60,000 to $120,000 / month. Negotiated rates apply. No platform team needed for observability infrastructure itself.

Self-hosted: $8,000 cloud + $50,000 platform team = ~$58,000 / month. 3-4 platform engineers running Mimir/Cortex + Loki + Tempo. ~$50K loaded cost. Cloud storage ~$8K.

Cheaper at this scale: Self-hosted
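
The arithmetic behind the mid-market and enterprise scenarios can be sketched in a few lines. This is a minimal sketch rather than a pricing tool: the function names are illustrative, the dollar figures are the ones quoted in the scenarios above, and the "depends on negotiated rate" outcome is an assumption about how to treat a self-hosted figure that falls inside the Datadog range.

```python
FTE_MONTHLY = 13_500  # loaded cost per engineer per month, as used on this page


def self_hosted_monthly(cloud_cost, engineering_cost):
    """Monthly self-hosted TCO: infrastructure plus engineering time."""
    return cloud_cost + engineering_cost


def cheaper_option(datadog_low, datadog_high, self_hosted):
    """Compare a self-hosted figure against a Datadog price range."""
    if self_hosted < datadog_low:
        return "self-hosted"
    if self_hosted > datadog_high:
        return "datadog"
    # Assumption: inside the range, the answer depends on the actual quote.
    return "depends on negotiated rate"


# Mid-market: $1,500 cloud plus one FTE.
mid_market = self_hosted_monthly(1_500, 1 * FTE_MONTHLY)
# Enterprise: $8,000 cloud plus the ~$50,000 platform team.
enterprise = self_hosted_monthly(8_000, 50_000)

print(mid_market, cheaper_option(5_500, 9_000, mid_market))
print(enterprise, cheaper_option(60_000, 120_000, enterprise))
```

The enterprise result is sensitive to the low end of the Datadog range: at $58,000 self-hosted against a $60,000 quote, the margin is thin, which is why the negotiated rate matters more than the list price at that scale.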

The hidden costs

What teams underestimate when self-hosting

On-call burden during incidents

When the application is on fire and Prometheus is also having issues, the same on-call engineer is debugging two systems simultaneously. Mean-time-to-resolution lengthens; engineer stress compounds; the cost is real but rarely measured.

Version upgrade tax

Prometheus ships a new minor release roughly every six weeks. Mimir, Loki, Tempo, and Grafana release frequently. Each upgrade requires testing, scheduling, and occasionally schema migration. The cumulative engineering time is 5 to 10 percent of an FTE per platform component.
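
To put the upgrade tax in dollar terms, the per-component share can be multiplied out. This is a rough sketch under stated assumptions: the five-component stack is taken from the components named above, the $13,500 loaded monthly cost is the figure used elsewhere on this page, and `upgrade_tax` is an illustrative name.

```python
FTE_MONTHLY = 13_500  # loaded cost per engineer per month, as used on this page
COMPONENTS = ["Prometheus", "Mimir", "Loki", "Tempo", "Grafana"]  # assumed stack


def upgrade_tax(components, share_low=0.05, share_high=0.10,
                fte_monthly=FTE_MONTHLY):
    """Monthly upgrade cost range at 5 to 10 percent of an FTE per component."""
    n = len(components)
    return n * share_low * fte_monthly, n * share_high * fte_monthly


low, high = upgrade_tax(COMPONENTS)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

For a five-component stack that is a quarter to half of an engineer spent on upgrades alone, before any capacity planning or on-call work.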

Knowledge transfer risk

Self-hosted observability becomes institutional knowledge held by specific engineers. When those engineers leave, the team is left operating the stack from runbooks that are out of date or were never written. Hosted vendors absorb this knowledge management cost.

When self-hosting is the right call

Three valid reasons to self-host

Self-hosted Prometheus has three legitimate use cases that go beyond pure cost arbitrage. The first is the very small team where the operational burden is genuinely modest. A single-VM Prometheus + Grafana at 20 hosts costs nothing in licence, requires 5 hours per month of maintenance, and provides credible production observability. The team is not trying to scale; they just want Datadog-equivalent visibility without the bill. Self-hosted is the right answer.

The second is the very large team with observability as a core competency. Banks, hyperscalers, large e-commerce platforms, and multi-billion-dollar SaaS companies routinely operate dedicated observability platform teams that build deep expertise across Prometheus federation, long-term storage with Mimir or Thanos, custom metric discipline, and incident response tooling. At this scale, the platform team produces real engineering leverage, the hosted vendors' per-series or per-host pricing becomes uncompetitive, and self-hosted is structurally cheaper plus organisationally aligned with the team's capabilities.

The third is the regulatory or sovereignty case. Some customers (defence, federal government, certain financial services jurisdictions, certain healthcare regulations) cannot legally send observability data to a multi-tenant SaaS provider. Self-hosted Prometheus on customer-owned infrastructure is the only viable path, and the engineering cost is the cost of doing business in that regulatory regime.

Outside these three cases, the rational economic choice for most teams is hosted observability. Datadog if the team values integration breadth and zero-config auto-instrumentation. Grafana Cloud if the team is Prometheus-aligned and wants the architectural openness of OpenTelemetry. New Relic if the team wants a single unified data model with predictable per-GB billing.

If you do migrate

What the engineering programme looks like

For teams that have honestly evaluated the TCO and decided self-hosted is the right call, the migration programme typically takes 3 to 9 months depending on team size and existing observability footprint. The standard phases are: stand up Prometheus + Grafana on a small subset of services, validate the data quality and dashboard fidelity, expand to the full host fleet, parallel-run Datadog and Prometheus for 30 to 60 days to preserve historical context, then decommission Datadog. Logs and traces follow the same pattern with Loki and Tempo respectively.

The most common failure mode is underestimating the dashboard rebuild cost. Datadog dashboards do not translate cleanly to Grafana, and the operational tribal knowledge embedded in five years of Datadog dashboards is rebuilt from scratch. Plan for 2 to 5 hours of dashboard rebuild per existing Datadog dashboard, depending on complexity. A team with 200 Datadog dashboards is looking at 400 to 1,000 hours of rebuild work, plus alert rule rewrites, plus new on-call runbooks for the self-hosted platform.
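
The rebuild estimate is simple enough to check in code. The sketch below reproduces the 2 to 5 hours per dashboard range from the paragraph above; the conversion to engineer-months assumes 160 working hours per month, which is an illustrative figure and not from the source.

```python
HOURS_PER_ENGINEER_MONTH = 160  # assumption: roughly 40 hours x 4 weeks


def rebuild_hours(dashboards, hours_low=2, hours_high=5):
    """Range of rebuild effort, in hours, for a number of Datadog dashboards."""
    return dashboards * hours_low, dashboards * hours_high


low, high = rebuild_hours(200)
print(f"{low} to {high} hours")
print(f"{low / HOURS_PER_ENGINEER_MONTH:.1f} to "
      f"{high / HOURS_PER_ENGINEER_MONTH:.1f} engineer-months")
```

The estimate covers dashboards only. Alert rule rewrites and new runbooks come on top of it.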

The second most common failure mode is underestimating long-term retention requirements. Vanilla Prometheus retains 15 days by default. Compliance, audit, and historical capacity planning typically require 90 days to 12 months. Building proper long-term retention via Mimir, Cortex, or Thanos is its own engineering project, easily 4 to 12 weeks of work for a competent platform engineer to set up correctly with high availability and proper backup procedures.
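
A first-order view of why retention becomes its own project comes from the disk-sizing rule of thumb in the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample (around 1 to 2 bytes on average). The sketch below applies it; the 1,000,000 active series, the 15-second scrape interval, and the 2 bytes per sample are assumptions for illustration, and the result ignores WAL overhead, compaction headroom, and replication.

```python
GIB = 1024 ** 3


def prometheus_disk_bytes(active_series, scrape_interval_s, retention_days,
                          bytes_per_sample=2.0):
    """Rough local TSDB size: retention seconds x samples/second x bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 1M active series scraped every 15 seconds.
for days in (15, 90, 365):
    size = prometheus_disk_bytes(1_000_000, 15, days)
    print(f"{days:>3} days retention: about {size / GIB:,.0f} GiB")
```

Disk grows linearly with retention, so moving from the 15-day default to 12 months multiplies local storage by roughly 24. That is the point at which object-storage-backed systems such as Mimir, Cortex, or Thanos start to make sense.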

Verify before you commit

Citation and reference data

Datadog pricing verified against datadoghq.com/pricing in April 2026. Engineering cost estimates use fully loaded compensation of approximately $162,000 per FTE annually ($13,500 per month) for senior platform engineers in major US tech markets, derived from BLS Occupational Employment Statistics for software engineers plus typical 30 to 40 percent overhead for benefits, infrastructure, and management. Self-hosted Prometheus operational time estimates are anchored to the Prometheus operations documentation and public conference talks (PromCon, KubeCon) on production Prometheus operations.

Cross-references

/datadog-pricing: Datadog pricing breakdown

/grafana-cloud-pricing: Grafana Cloud pricing breakdown

/open-source-vs-paid: Open source vs paid TCO

/datadog-vs-grafana-cloud: Datadog vs Grafana Cloud

/comparison: Six-vendor comparison

/calculator: Multi-vendor cost calculator

/kubernetes-monitoring: Kubernetes monitoring cost mechanics

/reduce-monitoring-costs: Twelve cost-reduction strategies

/methodology: How we research pricing

Frequently asked

Is self-hosted Prometheus actually cheaper than Datadog?

It depends on team size and honest engineering cost accounting. At very small scale (under 20 hosts, single-VM Prometheus, no high availability), self-hosted is genuinely cheaper because the operational burden is minimal. At very large scale (1,000+ hosts with dedicated platform engineering capacity), self-hosted is often cheaper because Datadog's negotiated rates plateau and the cost of dedicated engineering is amortised across the broader observability function. The middle (50 to 500 hosts) is usually cheaper hosted, because the engineering cost of running production-grade Prometheus + Loki + Tempo with high availability and long-term retention exceeds the Datadog or Grafana Cloud bill at the same scale.

What does it actually cost to run Prometheus in production?

Three real cost categories. Cloud infrastructure (compute for Prometheus servers, storage for time-series data, optional object storage for long-term retention via Mimir or Cortex or Thanos) typically runs $500 to $5,000 per month at small to mid scale. Engineering time (platform engineering capacity for capacity planning, version upgrades, alerting infrastructure, on-call for the observability stack itself) typically runs 0.5 to 3 FTE depending on scale; at a fully loaded cost of about $13,500 per FTE per month, that is roughly $6,750 to $40,500 per month. Indirect cost (downstream impact of observability outages, slower incident response when the platform itself has issues) is harder to quantify but real.

When does self-hosted Prometheus break down?

Three common failure modes. First, when long-term retention is needed (90+ days) and the team has not invested in object-storage-backed Prometheus via Mimir, Cortex, or Thanos; vanilla Prometheus does not handle this well. Second, when high availability is needed and the team has not implemented federation or replication; a single-VM Prometheus is a single point of failure that can blind operations at the worst possible time. Third, when cardinality grows beyond what a single Prometheus instance can handle (typically 5 to 10 million series); the team needs to shard, federate, or migrate to Mimir/Cortex.

Is Grafana Cloud just managed Prometheus?

Approximately, yes. Grafana Cloud's metrics backend is managed Mimir, the hosted equivalent of self-hosted Prometheus backed by Mimir (or its predecessor, Cortex). The pricing maps reasonably well to the underlying engineering cost: customers pay $6.50 per 1,000 active series per month, which works out to be comparable to running production-grade Prometheus at scale once you account for engineering time. The hosted option removes the on-call burden, the version-upgrade tax, and the capacity-planning work in exchange for the per-series fee. For most teams above 50 hosts, hosted Grafana Cloud is cheaper than self-hosted Prometheus once engineering cost is honestly counted.
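
The per-series fee can be turned into a quick break-even check against one engineer. This is a sketch under stated assumptions: the $6.50 per 1,000 active series and the $13,500 loaded monthly cost come from this page, the 1,000 series per host figure is purely illustrative, and the calculation ignores any included free allotment as well as logs and traces.

```python
PRICE_PER_1K_SERIES = 6.50  # USD per 1,000 active series per month, from this page
FTE_MONTHLY = 13_500        # loaded cost per engineer per month, from this page
SERIES_PER_HOST = 1_000     # assumption: varies widely with exporters and labels


def grafana_cloud_metrics_monthly(active_series):
    """Monthly metrics bill at the per-series rate quoted above."""
    return active_series / 1_000 * PRICE_PER_1K_SERIES


def break_even_series(engineering_monthly):
    """Active series at which the per-series bill equals an engineering budget."""
    return engineering_monthly / PRICE_PER_1K_SERIES * 1_000


bill_100_hosts = grafana_cloud_metrics_monthly(100 * SERIES_PER_HOST)
print(f"100 hosts: ${bill_100_hosts:,.0f} per month")
print(f"Break-even with one FTE: {break_even_series(FTE_MONTHLY):,.0f} series")
```

Under these assumptions the metrics bill for 100 hosts is a small fraction of one engineer's loaded cost, and the two only meet at around two million active series.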

Should I use Datadog or self-host Prometheus?

If you have under 20 hosts, no dedicated observability engineer, and modest reliability requirements: self-host Prometheus + Grafana on a single VM. If you have 20 to 1,000 hosts and observability is not your team's differentiating capability: use Datadog or Grafana Cloud (Grafana Cloud is usually cheaper for Kubernetes-shaped workloads). If you have 1,000+ hosts, dedicated platform engineering capacity, and observability infrastructure is a core competency: self-host Prometheus + Loki + Tempo with proper Mimir-based scaling, accepting the engineering investment as a competitive advantage.
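
The rule of thumb above can be written down as a small function. This is a simplified sketch of the guidance in this answer, not a complete decision model: the function and parameter names are illustrative, and any combination not named in the text (for example, a tiny fleet with strict reliability needs) falls through to the hosted recommendation as an assumption.

```python
def recommend(hosts, has_platform_team=False, observability_is_core=False,
              modest_reliability=True):
    """Map fleet size and team shape to the recommendation in the text."""
    # Under 20 hosts, no dedicated engineer, modest reliability needs.
    if hosts < 20 and not has_platform_team and modest_reliability:
        return "self-host Prometheus + Grafana on a single VM"
    # 1,000+ hosts, dedicated platform capacity, observability as a core competency.
    if hosts >= 1_000 and has_platform_team and observability_is_core:
        return "self-host Prometheus + Loki + Tempo with Mimir-based scaling"
    # Everything else, including cases the text does not name (assumption).
    return "hosted: Datadog or Grafana Cloud"


print(recommend(10))
print(recommend(100))
print(recommend(1_500, has_platform_team=True, observability_is_core=True))
```

The regulatory and sovereignty case is deliberately left out, since it overrides the cost logic entirely.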

What is the hidden cost of self-hosting?

The cost most teams underestimate is the second-order operational impact when the observability platform itself has issues. When Datadog has a partial outage, Datadog engineers fix it. When self-hosted Prometheus has issues during an incident response, the same on-call engineers are debugging both the application incident and the observability platform simultaneously. The compounding stress and the lengthened mean-time-to-resolution are real. The team also pays for version upgrades (Prometheus ships a minor release roughly every six weeks; Mimir and Loki release frequently), security patches, and the institutional knowledge required to maintain the stack across team turnover.
