Datadog MCP: AI-Powered Alert Triage and Dashboard Queries (Bits AI Setup)

🐶 MCP Recipe

What you'll build: An on-call AI agent that triages alerts, queries metrics, searches logs and pulls APM traces from your Datadog account
MCP server: Official Datadog Bits AI MCP (GA March 2026), hosted at mcp.datadoghq.com
Time to complete: 5 minutes
Difficulty: Beginner-friendly — no install, no self-hosting

Datadog shipped the Bits AI MCP server to GA in March 2026 — a hosted, remote MCP endpoint that exposes APM, logs, metrics, monitors, dashboards, security signals and LLM Observability as MCP tools. Unlike most MCP servers you've seen, this one needs no install: it's a Streamable HTTP endpoint at mcp.datadoghq.com that any MCP client can connect to with two API headers.

This recipe shows how to wire Claude, GPT-5, Gemini or any other model up to Datadog in 5 minutes, then walks through the four queries that actually save on-call time: alert triage, metric trend analysis, log pattern search, and APM trace investigation.

What the Datadog Bits AI MCP Provides

The server exposes Datadog's product surface as toolsets — grouped collections of tools you can enable per request. Key ones:

Toolset	What it covers
monitors	List, search and inspect monitors. Surface active alerts, mute/unmute, read alert messages and notification settings.
metrics	Query timeseries and gauges with Datadog's metric query syntax: aggregations, rollups, group-bys.
logs	Full-text and structured log search. Filter by service, host, trace ID, status, time range.
apm	Pull trace details, list slow spans, walk service maps. Bottleneck analysis on a per-trace basis.
dashboards	Read dashboard definitions, query the underlying widgets, summarise current state.
incidents	List and read Datadog incidents, including timeline and affected services.
security	Read Cloud SIEM signals, posture findings, runtime security alerts.
llm_observability	Query LLM traces — token usage, latency by model, prompt/completion samples.

Default behaviour exposes a focused subset. Append ?toolsets=all to the URL to enable everything, or specific ones like ?toolsets=monitors,logs,apm for a tight on-call config.

Prerequisites

Requirement	Details
Datadog account	Any plan with API access (most do)
API Key	Identifies your org. Org Settings → API Keys
Application Key	Identifies the user. Org Settings → Application Keys

Step 1: Generate Your Two Keys

Datadog uses a two-key model: the API key identifies your organization, and the Application key identifies the user (and carries that user's permissions). You need both.

API Key: app.datadoghq.com → Organization Settings → API Keys → create a new key (or reuse an existing one). Copy it.
Application Key: app.datadoghq.com → Organization Settings → Application Keys → create a new key. Pick the scopes you need — read-only is enough for triage; add write scopes only if you want the agent to mute monitors or comment on incidents.

Scopes matter

An Application Key inherits the permissions of the user who created it. For a triage-only agent, create a dedicated user with read-only roles and generate the App Key under that user. Don't use a key tied to an admin account unless the agent really needs write access.

Step 2: Connect from MCP Agent Studio

Open the pre-built Datadog Agent template — it's wired up with the right URL, both header fields, and a system prompt tuned for the on-call workflow.

Visit /templates/datadog-agent and click Open in Studio.
Paste your API key into the DD-API-KEY field, your Application key into DD-APPLICATION-KEY.
If you're on EU, US3, US5, AP1 or AP2, change the URL host to match your site (e.g. mcp.datadoghq.eu/api/unstable/mcp-server/mcp for EU).
Send: "Which monitors are alerting right now?"

Or, in Claude Code:

claude mcp add --transport http datadog \
  https://mcp.datadoghq.com/api/unstable/mcp-server/mcp \
  --header "DD-API-KEY: ${DD_API_KEY}" \
  --header "DD-APPLICATION-KEY: ${DD_APPLICATION_KEY}"
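If your client is a script rather than Claude Code, the same two headers go on every request. The sketch below only assembles the first MCP message (`initialize`) for a Streamable HTTP endpoint and does not send it. The function name, the `clientInfo` values and the protocol version string are illustrative assumptions, not something this recipe or Datadog prescribes.

```python
import json

# US1 URL from this recipe; swap the host for other sites.
MCP_URL = "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"


def build_initialize_request(api_key, app_key, url=MCP_URL):
    """Return (url, headers, body) for an MCP `initialize` call."""
    headers = {
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
        "Content-Type": "application/json",
        # Streamable HTTP servers may reply with plain JSON or an SSE stream.
        "Accept": "application/json, text/event-stream",
    }
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            # Assumed version string; use whatever your MCP client negotiates.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": "oncall-sketch", "version": "0.1"},
        },
    }
    return url, headers, json.dumps(payload).encode("utf-8")
```

Pass the returned tuple to whichever HTTP client you already use, and read the keys from environment variables rather than hard-coding them.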

Step 3: Four On-Call Workflows That Actually Save Time

1. Morning alert triage

Prompt

"Stand-up summary: every monitor that fired between 10pm and 8am, grouped by service. For each one, tell me the alert message and whether it auto-recovered or is still firing."

The agent calls list_monitors with state filters, groups by tag, and outputs a digest. Replaces the 15 minutes you spend clicking through the alerts feed before stand-up.
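To make the grouping step concrete, here is a minimal sketch of the digest logic the agent effectively performs. The monitor shape (`name`, `state`, `tags`) and the set of states treated as firing are simplifying assumptions for illustration; the actual tool output may be shaped differently.

```python
from collections import defaultdict

# Assumed: states that count as "still firing" in this sketch.
FIRING_STATES = {"Alert", "Warn", "No Data"}


def service_of(monitor):
    """Return the value of the first `service:` tag, or "untagged"."""
    for tag in monitor.get("tags", []):
        if tag.startswith("service:"):
            return tag.split(":", 1)[1]
    return "untagged"


def standup_digest(monitors):
    """Group monitors by service and label each as recovered or still firing."""
    digest = defaultdict(list)
    for m in monitors:
        status = "still firing" if m["state"] in FIRING_STATES else "recovered"
        digest[service_of(m)].append(f'{m["name"]}: {status}')
    return dict(digest)


# Hypothetical sample data, not real tool output.
sample = [
    {"name": "Checkout p95 latency", "state": "Alert",
     "tags": ["service:checkout", "team:payments"]},
    {"name": "Checkout 5xx rate", "state": "OK", "tags": ["service:checkout"]},
    {"name": "Disk usage", "state": "OK", "tags": ["env:prod"]},
]
for service, lines in standup_digest(sample).items():
    print(service, lines)
```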

2. "Is this latency spike real?"

Prompt

"p95 latency on the checkout service for the last 6 hours, in 5-minute buckets. Compare to the same window yesterday. Is it actually spiking or is this normal noise?"

Calls query_metrics twice (today's window and the same window yesterday), computes the delta, and returns a one-paragraph verdict. The comparison works because the model sees the spread of the whole series, not just a single point value.
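One way to picture the "signal or noise" check is a comparison of today's mean against yesterday's spread. The sketch below is an illustration of that idea, not what the model or Datadog actually computes; the function name, the threshold of 3 standard deviations and the sample values are all assumptions.

```python
import math
from statistics import mean, stdev


def spike_verdict(today, baseline, threshold=3.0):
    """Express today's mean shift in units of the baseline standard deviation."""
    base_mean = mean(baseline)
    base_sd = stdev(baseline)  # needs at least two baseline points
    delta = mean(today) - base_mean
    if base_sd > 0:
        z = delta / base_sd
    elif delta == 0:
        z = 0.0
    else:
        z = math.copysign(math.inf, delta)  # flat baseline, any shift is notable
    return {"delta_ms": round(delta, 1), "z": round(z, 2), "spike": z > threshold}


# Hypothetical p95 latency buckets in milliseconds.
yesterday = [200, 210, 190, 205, 195]
today = [400, 420, 410, 390, 405]
print(spike_verdict(today, yesterday))
```

A small delta against a noisy baseline yields a low score, which is the "normal noise" case in the prompt.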

3. Log pattern search

Prompt

"Search logs for 'stripe webhook failed' across all services in the last 2 hours. Group by error message and surface the top 5 patterns with their counts."

The agent doesn't dump raw logs — it summarises patterns and counts. The default system prompt in the template explicitly forbids returning more than ~20 raw log lines unless you ask for them.
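Summarising patterns rather than raw lines amounts to normalising the variable parts of each message and counting what remains. A rough sketch under that assumption follows; the normalisation rules and sample messages are hypothetical and far simpler than a real log-clustering feature.

```python
import re
from collections import Counter


def normalize(message):
    """Collapse variable parts so similar messages share one pattern."""
    # Long lowercase hex runs are treated as identifiers.
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message)
    # Any remaining digit run becomes a numeric placeholder.
    return re.sub(r"\d+", "<n>", message)


def top_patterns(messages, n=5):
    """Return the n most common normalised patterns with their counts."""
    return Counter(normalize(m) for m in messages).most_common(n)


# Hypothetical log messages.
sample_logs = [
    "stripe webhook failed: timeout after 3000 ms",
    "stripe webhook failed: timeout after 5000 ms",
    "stripe webhook failed: signature mismatch for evt 9f8e7d6c5b4a",
    "stripe webhook failed: signature mismatch for evt 1a2b3c4d5e6f",
    "stripe webhook failed: HTTP 502 from upstream",
]
for pattern, count in top_patterns(sample_logs):
    print(count, pattern)
```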

4. APM trace bottleneck

Prompt

"Pull the slowest 10 traces for POST /api/orders in the last hour. What do they have in common — same downstream service, same DB query, same customer?"

Calls list_traces, ranks by duration, walks each trace's spans. Looks for a common bottleneck — usually a slow DB call or a downstream service. Beats clicking through 10 trace flame graphs by hand.
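The "what do they have in common" step can be approximated by finding the span with the most self time in each trace and counting repeats. This is a sketch under assumed conditions: the span fields (`span_id`, `parent_id`, `service`, `resource`, `duration_ms`) are hypothetical, and self time is computed as if child spans run sequentially, which parallel children would break.

```python
from collections import Counter


def self_times(spans):
    """Self time = span duration minus the durations of its direct children."""
    child_total = Counter()
    for s in spans:
        if s["parent_id"] is not None:
            child_total[s["parent_id"]] += s["duration_ms"]
    return {s["span_id"]: s["duration_ms"] - child_total[s["span_id"]] for s in spans}


def bottleneck(trace):
    """Return (service, resource) of the span with the largest self time."""
    times = self_times(trace["spans"])
    worst = max(trace["spans"], key=lambda s: times[s["span_id"]])
    return worst["service"], worst["resource"]


def common_bottleneck(traces):
    """Return ((service, resource), count) for the most frequent bottleneck."""
    return Counter(bottleneck(t) for t in traces).most_common(1)[0]


def span(span_id, parent_id, service, resource, duration_ms):
    return {"span_id": span_id, "parent_id": parent_id, "service": service,
            "resource": resource, "duration_ms": duration_ms}


# Hypothetical traces for POST /api/orders.
sample_traces = [
    {"spans": [span(1, None, "api", "POST /api/orders", 900),
               span(2, 1, "orders-db", "SELECT orders", 700),
               span(3, 1, "payments", "charge", 100)]},
    {"spans": [span(1, None, "api", "POST /api/orders", 1200),
               span(2, 1, "orders-db", "SELECT orders", 1000)]},
]
print(common_bottleneck(sample_traces))
```

Ranking by self time rather than raw duration keeps the root span, which always encloses its children, from winning every time.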

Picking the Right Model

Datadog tools return a lot of structured data — pick a model that handles long contexts well and reasons over numbers cleanly.

Model	When to pick it
Claude Sonnet 4.5	Default for triage. Good at "is this signal or noise" reasoning over metric data.
Claude Opus 4.7	When you need correlation across 3+ services or a deep trace investigation. Best at synthesizing many tool-call results.
GPT-5.4	Very strong on log pattern extraction. Tends to be more verbose than Claude — set a max-words instruction in your prompt.
Gemini 3.1 Pro	1M context — ideal for "give me everything from the last 24 hours and find the anomaly" prompts.

Regions and the URL

Each of the six Datadog sites below has its own MCP host. Match the URL to your site:

Region	MCP URL
US1 (default)	`https://mcp.datadoghq.com/api/unstable/mcp-server/mcp`
US3	`https://mcp.us3.datadoghq.com/api/unstable/mcp-server/mcp`
US5	`https://mcp.us5.datadoghq.com/api/unstable/mcp-server/mcp`
EU	`https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp`
AP1	`https://mcp.ap1.datadoghq.com/api/unstable/mcp-server/mcp`
AP2	`https://mcp.ap2.datadoghq.com/api/unstable/mcp-server/mcp`

You can also append ?toolsets=all or ?toolsets=monitors,logs,apm to scope which tools the agent sees.
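If you generate client configs for several teams, the region table and the toolsets parameter combine into one small helper. The host names and path are copied from the table above; the function and constant names are hypothetical.

```python
# Hosts taken from the region table in this recipe.
MCP_HOSTS = {
    "US1": "mcp.datadoghq.com",
    "US3": "mcp.us3.datadoghq.com",
    "US5": "mcp.us5.datadoghq.com",
    "EU": "mcp.datadoghq.eu",
    "AP1": "mcp.ap1.datadoghq.com",
    "AP2": "mcp.ap2.datadoghq.com",
}
MCP_PATH = "/api/unstable/mcp-server/mcp"


def mcp_url(region="US1", toolsets=None):
    """Build the MCP URL for a site, optionally scoped to specific toolsets."""
    url = f"https://{MCP_HOSTS[region.upper()]}{MCP_PATH}"
    if toolsets:
        # Toolset names are joined with commas, as in ?toolsets=monitors,logs,apm
        url += "?toolsets=" + ",".join(toolsets)
    return url


print(mcp_url("EU", ["monitors", "logs", "apm"]))
```

An unknown region raises `KeyError`, which is usually preferable to silently falling back to US1.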

Production Notes

Use a dedicated read-only user for the App Key. Application Keys inherit user permissions.
The endpoint is rate-limited per the standard Datadog API limits. For high-frequency batch jobs, throttle on your side.
OAuth is supported as an alternative to the two-header pattern. For most chat-style integrations, headers are simpler — switch to OAuth when you need per-user scoping in a multi-tenant app.
Never paste an Application Key into a public chat or log it. Treat it like a password — anyone with the key acts as that Datadog user.
Cap log result sizes in your prompts. Returning thousands of log lines into the model's context is expensive and rarely useful — ask for summaries and counts first.
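For the batch-job case in the notes above, client-side throttling can be as simple as enforcing a minimum gap between calls. The class below is a generic sketch, not a Datadog feature; the interval you choose is an assumption you should check against your own account's rate limits.

```python
import time


class Throttle:
    """Allow at most one call every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Call `wait()` immediately before each tool call in the batch loop; the first call returns without sleeping.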

Frequently Asked Questions

Is the Datadog MCP server hosted, or do I need to run it myself?

It's fully hosted by Datadog at mcp.datadoghq.com. You don't install anything — just point your MCP client at the URL with your two API headers. This is one of the few major SaaS MCPs (alongside Linear, Notion, Vercel, Supabase) with a managed endpoint.

What's the difference between the API key and the Application key?

The API key identifies your organization. The Application key identifies the user and carries that user's role-based permissions. Both are required. You can have many of each — keep them scoped to specific use cases for easier rotation.

Can I scope which tools the agent has access to?

Yes — append a toolsets query parameter to the URL: ?toolsets=monitors,logs,apm for a tight on-call config, or ?toolsets=all for the kitchen sink. Default is a focused subset. Scoping reduces the schema sent to the model, which is faster and cheaper.

Does it work with EU, US3, US5, AP1, AP2 sites?

Yes. Change the host to match your site: EU is mcp.datadoghq.eu, US3 is mcp.us3.datadoghq.com, and so on. The path stays the same: /api/unstable/mcp-server/mcp.

Can the agent mute monitors or perform write actions?

Only if your Application Key has the matching scope. For triage-only workflows, create the App Key under a read-only user — that prevents accidental writes even if the model tries. The default system prompt in the Datadog Agent template also requires explicit confirmation before any mutation.