Vendor comparison
Datadog vs Prometheus + Grafana: true TCO math
The most common cost-savings pitch in cloud monitoring is to migrate from Datadog to self-hosted Prometheus and Grafana. The pitch is true at very small and very large scale and false in the middle. Honest TCO accounting requires including engineering time, on-call burden, and the compounding cost of observability outages.
TL;DR
Self-hosted Prometheus is genuinely cheaper at very small scale (single-VM Prometheus for under 20 hosts) and at very large scale (dedicated platform team for 1,000+ hosts). In the middle, Datadog or Grafana Cloud is almost always cheaper once engineering cost is honestly counted. The hidden cost is on-call burden during incidents when both the application and the observability platform need attention simultaneously.
The framing problem
Why "Prometheus is free" is misleading
The pitch for replacing Datadog with self-hosted Prometheus and Grafana goes: Prometheus is free, Grafana is free, the licence cost drops to zero, and the team pays only for compute and storage. The arithmetic is correct on the licence side and incomplete on the operational side. Honest TCO accounting requires including engineering time, on-call burden during the worst possible incidents, and the compounding cost of observability platform outages.
At very small scale (under 20 hosts, single platform engineer), the operational burden is genuinely modest. A single Prometheus VM scraping 20 hosts, a Grafana installation pointing at it, and Loki for logs on the same VM fit comfortably in 30 to 50 hours of initial setup and 5 hours per month of ongoing maintenance. The Datadog free tier covers up to 5 hosts; above that the plan is paid, and at 20 hosts Datadog is roughly $360 per month. Self-hosted is genuinely cheaper at this scale because the engineering tax is small.
At very large scale (1,000+ hosts, dedicated platform engineering team), self-hosted Prometheus with proper scaling (Mimir or Cortex for long-term storage, Loki for logs, Tempo for traces, all running production-grade with high availability) requires 3 to 4 platform engineers full-time. At a fully loaded cost of $13,500 per FTE per month, that is $40,500 to $54,000 per month in engineering. Datadog at 1,000 hosts with negotiated enterprise rates is typically $60,000 to $120,000 per month. Self-hosted is cheaper at this scale, sometimes by $20,000 to $60,000 per month, but only because the platform team's cost is amortised across the broader observability function.
The middle is where the false economy lives. At 50 to 500 hosts, the engineering cost of running production-grade Prometheus + Loki + Tempo with high availability, long-term retention, on-call rotation, and ongoing capacity management exceeds the Datadog or Grafana Cloud hosted bill at the same scale. A team that hires one observability engineer at $13,500 per month loaded cost is spending more on the engineer than they would on Grafana Cloud at 100 hosts.
Three scenarios with honest engineering cost
The TCO comparison
Scenario: Small team (10 hosts, 1 platform engineer)
Datadog: $200 to $400 / month. The free tier covers only up to 5 hosts, so a 10-host fleet is on a paid plan. Trivial operational burden.
Self-hosted: $0 licence + ~50 hrs setup + ongoing time. Single-VM Prometheus + Grafana works at this scale. ~$0 cloud cost. ~5 hours/month maintenance.
Cheaper at this scale: Self-hosted
Scenario: Mid-market (100 hosts, dedicated observability engineer)
Datadog: $5,500 to $9,000 / month. Predictable monthly bill. Operational burden is configuring and maintaining Datadog dashboards/alerts.
Self-hosted: $1,500 cloud + $13,500 engineering = ~$15,000 / month. 1 FTE platform engineer at ~$13.5K loaded cost per month for observability infrastructure ownership. Cloud storage for Prometheus/Loki adds $1K-$2K.
Cheaper at this scale: Datadog
Scenario: Enterprise (1,000 hosts, dedicated platform team)
Datadog: $60,000 to $120,000 / month. Negotiated rates apply. No platform team needed for observability infrastructure itself.
Self-hosted: $8,000 cloud + $50,000 platform team = ~$58,000 / month. 3-4 platform engineers running Mimir/Cortex + Loki + Tempo. ~$50K loaded cost. Cloud storage ~$8K.
Cheaper at this scale: Self-hosted
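The three verdicts above follow from a simple range comparison, which can be sketched in a few lines of Python. This is a minimal illustration, not a pricing model: the `Scenario` class and `cheaper_option` function are hypothetical names introduced here, the dollar figures are copied from the scenarios above, and the small-team self-hosted cost is entered as $0 cash outlay (as the scenario states), leaving setup and maintenance hours out of the dollar comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    datadog_low: int    # USD per month, low end of the quoted Datadog range
    datadog_high: int   # USD per month, high end of the quoted Datadog range
    self_hosted: int    # USD per month, cloud + engineering as quoted


def cheaper_option(s: Scenario) -> str:
    """Return the cheaper option, or 'Depends' if self-hosted falls inside the Datadog range."""
    if s.self_hosted < s.datadog_low:
        return "Self-hosted"
    if s.self_hosted > s.datadog_high:
        return "Datadog"
    return "Depends"


# Figures taken from the three scenarios in this article.
SCENARIOS = [
    Scenario("Small team (10 hosts)", 200, 400, 0),
    Scenario("Mid-market (100 hosts)", 5_500, 9_000, 1_500 + 13_500),
    Scenario("Enterprise (1,000 hosts)", 60_000, 120_000, 8_000 + 50_000),
]

if __name__ == "__main__":
    for s in SCENARIOS:
        print(f"{s.name}: self-hosted ${s.self_hosted:,}/mo vs Datadog "
              f"${s.datadog_low:,}-${s.datadog_high:,}/mo -> {cheaper_option(s)}")
```

Substituting a team's own quotes and loaded engineering cost for the constants is the point of the exercise; a "Depends" result signals that the decision turns on negotiated rates rather than on scale alone.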
The hidden costs
What teams underestimate when self-hosting
On-call burden during incidents
Version upgrade tax
Knowledge transfer risk
When self-hosting is the right call
Three valid reasons to self-host
Self-hosted Prometheus has three legitimate use cases that go beyond pure cost arbitrage. The first is the very small team where the operational burden is genuinely modest. A single-VM Prometheus + Grafana at 20 hosts costs nothing in licence, requires 5 hours per month of maintenance, and provides credible production observability. The team is not trying to scale; they just want Datadog-equivalent visibility without the bill. Self-hosted is the right answer.
The second is the very large team with observability as a core competency. Banks, hyperscalers, large e-commerce platforms, and multi-billion-dollar SaaS companies routinely operate dedicated observability platform teams that build deep expertise across Prometheus federation, long-term storage with Mimir or Thanos, custom metric discipline, and incident response tooling. At this scale, the platform team produces real engineering leverage, the hosted vendors' per-series or per-host pricing becomes uncompetitive, and self-hosted is structurally cheaper plus organisationally aligned with the team's capabilities.
The third is the regulatory or sovereignty case. Some customers (defence, federal government, certain financial services jurisdictions, certain healthcare regulations) cannot legally send observability data to a multi-tenant SaaS provider. Self-hosted Prometheus on customer-owned infrastructure is the only viable path, and the engineering cost is the cost of doing business in that regulatory regime.
Outside these three cases, the rational economic choice for most teams is hosted observability. Datadog if the team values integration breadth and zero-config auto-instrumentation. Grafana Cloud if the team is Prometheus-aligned and wants the architectural openness of OpenTelemetry. New Relic if the team wants a single unified data model with predictable per-GB billing.
If you do migrate
What the engineering programme looks like
For teams that have honestly evaluated the TCO and decided self-hosted is the right call, the migration programme typically takes 3 to 9 months depending on team size and existing observability footprint. The standard phases are: stand up Prometheus + Grafana on a small subset of services, validate the data quality and dashboard fidelity, expand to the full host fleet, parallel-run Datadog and Prometheus for 30 to 60 days to preserve historical context, then decommission Datadog. Logs and traces follow the same pattern with Loki and Tempo respectively.
The most common failure mode is underestimating the dashboard rebuild cost. Datadog dashboards do not translate cleanly to Grafana, and the operational tribal knowledge embedded in five years of Datadog dashboards has to be rebuilt from scratch. Plan for 2 to 5 hours of dashboard rebuild per existing Datadog dashboard, depending on complexity. A team with 200 Datadog dashboards is looking at 400 to 1,000 hours of rebuild work, plus alert rule rewrites, plus new on-call runbooks for the self-hosted platform.
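The rebuild estimate is plain multiplication, and it is worth running against a real dashboard inventory before committing. A minimal sketch in Python: `rebuild_hours` is a hypothetical helper, the 2 and 5 hour bounds come from the paragraph above, and the 160 working hours per engineer-month is an assumption introduced here, not a figure from this article.

```python
def rebuild_hours(dashboards: int,
                  low_per_dashboard: float = 2.0,
                  high_per_dashboard: float = 5.0) -> tuple[float, float]:
    """Return (low, high) total hours to rebuild the given number of dashboards."""
    if dashboards < 0:
        raise ValueError("dashboards must be non-negative")
    return dashboards * low_per_dashboard, dashboards * high_per_dashboard


# ASSUMPTION (not from the article): one engineer-month is about 160 working hours.
HOURS_PER_ENGINEER_MONTH = 160

low, high = rebuild_hours(200)
print(f"{low:.0f} to {high:.0f} hours")
print(f"{low / HOURS_PER_ENGINEER_MONTH} to {high / HOURS_PER_ENGINEER_MONTH} engineer-months")
```

Under that 160-hour assumption, 200 dashboards come to 2.5 to 6.25 engineer-months before alert rules and runbooks are counted.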
The second most common failure mode is underestimating long-term retention requirements. Vanilla Prometheus retains 15 days of data by default. Compliance, audit, and historical capacity planning typically require 90 days to 12 months. Building proper long-term retention via Mimir, Cortex, or Thanos is its own engineering project, easily 4 to 12 weeks of work for a competent platform engineer to set up correctly with high availability and proper backup procedures.
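To get a feel for what longer retention means in storage terms, the Prometheus storage documentation gives a rough sizing rule for local disk: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes on average). The Python sketch below applies that rule. The fleet shape (100 hosts, 1,000 active series per host, 15-second scrape interval) is a hypothetical assumption chosen for illustration, and `tsdb_disk_bytes` is a name introduced here.

```python
SECONDS_PER_DAY = 86_400


def tsdb_disk_bytes(retention_days: float,
                    samples_per_second: float,
                    bytes_per_sample: float = 2.0) -> float:
    """Rough local TSDB disk need: retention seconds * samples/s * bytes/sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# ASSUMPTIONS (hypothetical, not from the article): 100 hosts, 1,000 active
# series per host, one sample per series every 15 seconds.
hosts, series_per_host, scrape_interval_s = 100, 1_000, 15
samples_per_second = hosts * series_per_host / scrape_interval_s

for days in (15, 90, 365):
    gib = tsdb_disk_bytes(days, samples_per_second) / 2**30
    print(f"{days:>3} days retention: ~{gib:,.1f} GiB")
```

Disk grows linearly with retention, so going from 15 days to 12 months multiplies the raw footprint by roughly 24. The sketch covers only raw sample storage on a single Prometheus; the high availability, backup, and long-term store work described above is where the engineering weeks go.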
Verify before you commit
Citation and reference data
Datadog pricing verified against datadoghq.com/pricing in April 2026. Engineering cost estimates use fully loaded compensation of approximately $200,000 per FTE annually for senior platform engineers in major US tech markets, derived from BLS Occupational Employment Statistics for software engineers plus typical 30 to 40 percent overhead for benefits, infrastructure, and management. Self-hosted Prometheus operational time estimates are anchored to the Prometheus operations documentation and public conference talks (PromCon, KubeCon) on production Prometheus operations.