Observability · Data

Datadog Agent

Datadog's official Bits AI MCP server (GA March 2026) exposes APM traces, logs, metrics, monitors, dashboards and security signals as MCP tools. This template wires it up with a system prompt tuned for the on-call workflow — alert triage, dashboard-driven investigations, log searches and metric queries — so you can answer "what is broken right now?" without leaving chat.

Default model: Claude Sonnet 4.5 · 1 server · Access token required

Default model

Claude Sonnet 4.5

MCP servers

mcp.datadoghq.com

Auth

Datadog API key + Application key (Organization Settings → API Keys / Application Keys)

What you can do

A few things this template does well out of the box.

  • Triage active alerts: which monitors are firing, what changed, who owns the affected service
  • Query metrics over a time range and summarise the trend (latency, error rate, throughput)
  • Search logs for a trace ID, error string, or user ID across services
  • Pull APM trace details for a slow request and explain the bottleneck (slowest spans, downstream calls, DB queries)
  • Summarise overnight incident activity for a stand-up: alerts fired, services affected, recovery time

How it works

Three steps to go from template to a live chat.

1. Click "Use this template"

Agent Studio opens with the MCP server, model and system prompt pre-filled.

2. Add your access token

Datadog API key + Application key (Organization Settings → API Keys / Application Keys)

3. Start chatting

Ask a question, watch live tool calls and switch models at any time to compare answers.

MCP servers used

The endpoints this template connects to by default. You can swap any of them in Agent Studio.

https://mcp.datadoghq.com/api/unstable/mcp-server/mcp

mcp.datadoghq.com

HTTP
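
For readers who want to see what a client sends to this endpoint, here is a minimal sketch of the first two MCP messages (`initialize`, then `tools/list`) as JSON-RPC 2.0 payloads. It only builds the headers and bodies and does not send anything. Two things are assumptions rather than facts from this page: that the Agent Studio field names `DD-API-KEY` and `DD-APPLICATION-KEY` map directly onto HTTP header names, and that the server accepts the `2025-06-18` MCP protocol version. The client name is a placeholder.

```python
import json

# Default endpoint listed above (US1).
MCP_URL = "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"


def build_headers(api_key, app_key):
    """Headers for an MCP request over HTTP.

    Assumption: the header names match the DD-API-KEY and
    DD-APPLICATION-KEY field names shown in Agent Studio.
    """
    return {
        "Content-Type": "application/json",
        # MCP's HTTP transport may answer with plain JSON or an event stream.
        "Accept": "application/json, text/event-stream",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }


def jsonrpc(method, params=None, request_id=1):
    """Serialise a JSON-RPC 2.0 request; 'params' is omitted when not given."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# First message of an MCP session. The protocol version is an assumption.
initialize_body = jsonrpc(
    "initialize",
    {
        "protocolVersion": "2025-06-18",
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
)

# After initialisation, tools/list asks the server which tools it exposes,
# so tool names never need to be hard-coded.
list_tools_body = jsonrpc("tools/list", request_id=2)
```

Agent Studio handles this handshake for you. The sketch is only useful if you want to probe the endpoint from your own script, in which case you would POST each body to `MCP_URL` with the headers from `build_headers`.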

Getting your access token

A quick walkthrough for the two credentials this template needs.

  1. **API Key**: Datadog → **Organization Settings** → **API Keys** → create or copy a key. This identifies your org.
  2. **Application Key**: Datadog → **Organization Settings** → **Application Keys** → create a key with the scopes your workflow needs (read-only is fine for triage). This authenticates the user.
  3. Paste both values into the **DD-API-KEY** and **DD-APPLICATION-KEY** fields in Agent Studio.
  4. **Region**: The default URL targets `mcp.datadoghq.com` (US1). For EU change to `https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp`. For US3/US5/AP1/AP2, swap the subdomain accordingly.
  5. Optional: append `?toolsets=all` to the URL to enable every toolset (APM, Logs, Metrics, Monitors, Dashboards, LLM Observability, Software Delivery, Security). Default is a focused subset.
  6. Send *"Which monitors are alerting right now?"* to confirm the connection.
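
The region and toolset steps above both come down to editing one URL. As a rough sketch, the helper below assembles it from a host and an optional `toolsets` value. The function name `mcp_url` is hypothetical. Only the US1 and EU hosts are taken from this page; for other regions, pass the host for your site yourself rather than relying on a guess.

```python
from urllib.parse import urlencode, urlunsplit

# Path shared by the US1 and EU URLs given in the steps above.
MCP_PATH = "/api/unstable/mcp-server/mcp"


def mcp_url(host="mcp.datadoghq.com", toolsets=None):
    """Build the MCP server URL for a given host.

    host:     "mcp.datadoghq.com" (US1, default) or "mcp.datadoghq.eu" (EU);
              other regions need their own host supplied by the caller.
    toolsets: optional value for the ?toolsets= query parameter, e.g. "all".
    """
    query = urlencode({"toolsets": toolsets}) if toolsets else ""
    return urlunsplit(("https", host, MCP_PATH, query, ""))


us1_default = mcp_url()
eu_all_toolsets = mcp_url("mcp.datadoghq.eu", toolsets="all")
```

Here `us1_default` is the default URL listed under "MCP servers used", and `eu_all_toolsets` is the EU URL with `?toolsets=all` appended.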

Try these prompts

Copy one into the studio to see the agent in action.

  • Which monitors are alerting right now? Group them by service and show me the alert message for each.

  • p95 latency on the `checkout` service for the last 6 hours — is it trending up?

  • Search logs for "stripe webhook failed" in the last 2 hours and summarise the error patterns.

  • Pull the slowest 10 traces for `POST /api/orders` in the last hour and tell me what they have in common.

  • Build me a stand-up summary: every alert that fired between 10pm and 8am, the service it hit, and whether it auto-recovered.

System prompt

The default instructions the model starts with. Edit them any time inside Agent Studio.

You are a senior site-reliability engineer connected to Datadog via the official Bits AI MCP server. You help on-call engineers triage alerts and investigate production issues.

Use the available tools to:
- List active monitors and alerts; for each, surface the service, the metric/condition, and the most recent state change
- Query metrics (timeseries, gauges, distributions) over a user-specified window — summarise trends, not raw numbers
- Search logs by query string, service, trace ID or time range; surface unique error patterns instead of dumping every line
- Pull APM trace details and explain bottlenecks: slowest spans, downstream services, database calls
- Read dashboards and incident details when the user asks for a higher-level view

Operating principles:
- Lead with the answer (e.g. "checkout p95 has spiked from 220ms to 480ms in the last 30 minutes"), then back it up with the evidence (the metric, the time range, the affected hosts)
- When triaging multiple alerts, group them by likely root cause — don't just enumerate
- For log searches, return summaries and counts before raw log lines
- Never recommend a destructive mitigation (mute monitor, restart service) without explicit confirmation from the user
- If a query needs a tag or service name you don't have, ask — don't guess

Ready to try the Datadog Agent?

Open Agent Studio with this template pre-loaded. Add your token, pick any model, and start chatting.

