CodeAlive

From Alert to Root Cause in Minutes, Not Hours

Wire CodeAlive into your investigation agent over MCP. It joins code context to your metrics and traces, so an LLM can walk both sides of an incident.

Why Complex Bugs Take So Long

  • Complex bugs span multiple services and require correlating logs, metrics, traces, and code.
  • Engineers spend hours jumping between Grafana, logs, and the IDE trying to connect the dots.
  • The person investigating often didn't write the code and lacks context.
  • Observability tools show what is happening, but not why at the code level.
  • AI agents can analyze metrics or code, but rarely both together.
  • Post-mortems are incomplete because the full picture is scattered across tools.

Metrics Plus Code, Driven by One Agent

Pair CodeAlive MCP with Grafana MCP in one investigation agent. Grafana surfaces the spike; CodeAlive points to the function that caused it. The agent posts a draft root-cause analysis (RCA) to Slack before the on-call engineer opens the alert.

An MCP-native investigation toolkit

Agentic Pipeline Integration

Combine multiple MCP tools in one investigation: CodeAlive for code understanding, Grafana for metrics, logs, and traces.

Code-Grounded Investigation

Ask why a service is timing out and get back the exact call chain, the timeout values, and where they're set.

Metric-to-Code Correlation

Connect observability anomalies to specific code paths: from metric spike, to service, to exact function.

Recent-Change Awareness

Surface the PRs and deploys whose timing lines up with the incident window, with diffs already explained.

Draft Post-Mortems

Generate a timeline, root cause, evidence, fix, and blast radius, formatted and ready for review.

On-Call Assist

Posts a Slack or Teams summary the moment an alert fires, so the on-call engineer opens the alert with context already in hand.

From alert to draft RCA

  1. Alert Fires

    A PagerDuty or Grafana alert triggers the investigation agent with the alert payload and time range.

  2. Grafana MCP Pulls Context

    The agent gathers error rates, latencies, log patterns, and impacted endpoints from your dashboards.

  3. CodeAlive MCP Explains the Code

    The agent traces the failing path, pulls timeout configs, and identifies recent changes touching the same files.

  4. Cross-Reference Timing

    Both MCPs are queried together to align metric anomalies with code deploys and config flips.

  5. Draft Root-Cause Report

    The agent produces a markdown summary with timeline, evidence, fix, and blast radius for human review.
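
The five steps above can be sketched as a single function. This is a minimal illustration, not CodeAlive's or Grafana's actual API: the `Alert` shape, the `AskTool` callback, the prompt wording, and the 120-minute lookback are all assumptions. The tool caller is passed in so the sketch does not depend on any particular MCP client library.

```typescript
// Hypothetical shapes for illustration; real alert payloads and MCP tool
// calls depend on your alerting system and agent framework.
interface Alert {
  title: string;
  service: string;
  firedAt: Date;
}

type Server = "grafana" | "codealive";

// Injected so the sketch assumes no particular MCP client library.
type AskTool = (server: Server, prompt: string) => Promise<string>;

// Step 1: turn the alert into a time range to investigate.
function incidentRange(alert: Alert, lookbackMinutes: number): string {
  const from = new Date(alert.firedAt.getTime() - lookbackMinutes * 60 * 1000);
  return `${from.toISOString()} to ${alert.firedAt.toISOString()}`;
}

async function investigate(alert: Alert, ask: AskTool): Promise<string> {
  const range = incidentRange(alert, 120);

  // Step 2: observability context from Grafana.
  const metrics = await ask(
    "grafana",
    `Error rates, latencies, and log patterns for ${alert.service}, ${range}.`,
  );

  // Step 3: code context for the failing path from CodeAlive.
  const codePath = await ask(
    "codealive",
    `Explain the failing code path in ${alert.service}. Symptoms: ${metrics}`,
  );

  // Step 4: query both servers concurrently to line up anomalies with changes.
  const [anomalyTimes, recentChanges] = await Promise.all([
    ask("grafana", `When did each anomaly start for ${alert.service}, ${range}?`),
    ask("codealive", `Which PRs and deploys touched that path, ${range}?`),
  ]);

  // Step 5: markdown draft for human review.
  return [
    `# Draft RCA: ${alert.title}`,
    `Window: ${range}`,
    "## Evidence",
    `- Metrics: ${metrics}`,
    `- Code path: ${codePath}`,
    "## Timeline",
    `- Anomalies: ${anomalyTimes}`,
    `- Recent changes: ${recentChanges}`,
  ].join("\n");
}
```

In a real agent the LLM decides which tools to call and in what order; the fixed sequence here only makes the data flow between the two servers explicit.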

What changes during incidents

  • Cut the time between alert and root cause.
  • Have an initial root-cause hypothesis ready before the on-call engineer opens the alert.
  • Replace 'finger in the air' debugging with evidence from metrics and code.
  • Catch regressions tied to specific deploys faster.
  • Produce post-mortems that are complete on the first draft.

What an agentic investigation looks like

Figure: Agentic investigation flow combining CodeAlive MCP and Grafana MCP, with both MCP servers feeding one investigation agent.
Figure: Side-by-side Grafana metrics and CodeAlive code context for a payment incident.
Figure: Slack integration delivering the automated investigation report when an alert fires.

Shorten every incident

Plug CodeAlive into your investigation agent and let it explain the code while Grafana explains the metrics.

CodeAlive MCP plus Grafana MCP, in Action

A fintech company experiences intermittent payment failures. The bug is subtle: it only occurs for specific currency conversions during high-load periods, and standard debugging hasn't identified the cause.

Step 1 — Gather Observability Context (Grafana MCP)

Agent prompt

Query payment service metrics for the last 2 hours. Show error rates, latency P99, and any anomalies.

Error rate spike at 08:45 UTC (2.3% to 8.7%). Latency P99 normal. Affected endpoint /api/v2/payments/process. 87% of errors are CONVERSION_FAILED. 94% from EU region. Correlated log: 'Currency conversion timeout after 5000ms'.

Step 2 — Understand the Code Path (CodeAlive MCP)

Agent prompt

Explain the code path for /api/v2/payments/process when handling currency conversion. Include timeout handling.

Entry point: PaymentController.processPayment(). Flow: PaymentService to CurrencyConverter to ExternalRateAPI. Timeout: 5000ms, hardcoded at CurrencyConverter.java:142, with no fallback on timeout. Recent change: PR #4521 (3 days ago) modified rate caching. Finding: cache invalidation runs synchronously on a miss.

Step 3 — Correlate Timing (Both MCPs)

Agent prompt

Show cache hit rate for currency-rate-cache and overlay with error rate timeline. Then analyze PR #4521.

Cache hit rate dropped from 95% to 23% at 08:42 UTC. Error spike follows by ~3 minutes; pattern repeats every 4 hours (cache TTL). PR #4521 reduced TTL from 24h to 4h without updating warm-up batch size, so a thundering herd hits the rate-limited ExternalRateAPI.
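
The timing check in this step can be sketched in a few lines: find when each series first crosses a threshold, measure the lag, and project the next drop one TTL later. The `Sample` shape, the helper names, and the 0.5 and 0.05 thresholds are assumptions for illustration. The values and clock times come from the walkthrough; the calendar date is made up.

```typescript
// One observation of a metric. Timestamps are ISO 8601 strings.
interface Sample {
  t: string;
  value: number;
}

// Time of the first sample that satisfies the condition, if any.
function firstCrossing(
  series: Sample[],
  crossed: (value: number) => boolean,
): Date | undefined {
  const hit = series.find((s) => crossed(s.value));
  return hit ? new Date(hit.t) : undefined;
}

// Minutes from the suspected cause to the observed effect.
function lagMinutes(cause: Date, effect: Date): number {
  return (effect.getTime() - cause.getTime()) / (60 * 1000);
}

// Illustrative samples: values and clock times follow the walkthrough,
// the calendar date and one-minute resolution are made up.
const cacheHitRate: Sample[] = [
  { t: "2024-01-01T08:41:00Z", value: 0.95 },
  { t: "2024-01-01T08:42:00Z", value: 0.23 },
];
const errorRate: Sample[] = [
  { t: "2024-01-01T08:44:00Z", value: 0.023 },
  { t: "2024-01-01T08:45:00Z", value: 0.087 },
];

// Hypothetical thresholds: hit rate below 50%, error rate above 5%.
const cacheDrop = firstCrossing(cacheHitRate, (v) => v < 0.5);
const errorSpike = firstCrossing(errorRate, (v) => v > 0.05);

if (cacheDrop && errorSpike) {
  const lag = lagMinutes(cacheDrop, errorSpike);
  // If the drop is TTL-driven, the next one should arrive one TTL (4h) later.
  const ttlMs = 4 * 60 * 60 * 1000;
  const nextExpectedDrop = new Date(cacheDrop.getTime() + ttlMs);
  console.log(`lag=${lag}min, next expected drop=${nextExpectedDrop.toISOString()}`);
  // prints "lag=3min, next expected drop=2024-01-01T12:42:00.000Z"
}
```

A cause that consistently leads the effect by a few minutes and recurs on the TTL boundary is what points the investigation at the cache rather than at the external API itself.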

Step 4 — Generate Fix Recommendations

Agent prompt

Suggest fixes for the thundering herd in CurrencyRateCache. Show similar patterns in the codebase we can reference.

Recommended: staggered cache refresh (see InventoryCache.java:156), a circuit breaker (pattern in AuthService), and fallback to the last-known-good rate on timeout. A similar bug was solved in InventoryService in Q2 via PR #3892. Quick mitigation: revert the TTL to 24h (one line at CurrencyRateCache.java:89).
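
Two of the recommended fixes are small enough to sketch. The walkthrough's codebase is Java, so this TypeScript version shows only the techniques, not the code at InventoryCache.java:156 or in CurrencyRateCache. The function names, the jitter fraction, and the return shape are assumptions.

```typescript
// Staggered refresh: spread expiry times so entries written together
// do not all expire (and refetch) at the same moment.
function jitteredTtlMs(
  baseTtlMs: number,
  jitterFraction: number,
  random: () => number = Math.random,
): number {
  // random() in [0, 1) maps to a factor in [1 - jitterFraction, 1 + jitterFraction).
  const factor = 1 + (random() * 2 - 1) * jitterFraction;
  return Math.round(baseTtlMs * factor);
}

interface RateResult {
  rate: number;
  stale: boolean; // true when the last-known-good value was served
}

// Fallback: serve the last-known-good rate if the fetch is too slow or fails.
async function rateWithFallback(
  fetchRate: () => Promise<number>,
  lastKnownGood: number,
  timeoutMs: number,
): Promise<RateResult> {
  let cancel = (): void => {};
  const timeout = new Promise<"timeout">((resolve) => {
    const id = setTimeout(() => resolve("timeout"), timeoutMs);
    cancel = () => clearTimeout(id);
  });
  try {
    const result = await Promise.race([fetchRate(), timeout]);
    return result === "timeout"
      ? { rate: lastKnownGood, stale: true }
      : { rate: result, stale: false };
  } catch {
    // The fetch rejected (for example, it was rate limited).
    return { rate: lastKnownGood, stale: true };
  } finally {
    cancel(); // do not leave the timer pending after the race settles
  }
}
```

With a 4h base TTL and a jitter fraction of 0.1, each entry expires somewhere between 3h36m and 4h24m after it was written, which flattens the refresh burst. The `stale` flag lets callers log or alert on fallback use instead of hiding it.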

Integration Architecture
         ┌────────────────────────────────────────────┐
         │      AI Investigation Agent (MCP client)    │
         └────────────────────────────────────────────┘
                 │                          │
                 ▼                          ▼
       ┌──────────────────┐       ┌──────────────────┐
       │  CodeAlive MCP   │       │   Grafana MCP    │
       │  code, deps, PRs │       │ metrics, logs    │
       └──────────────────┘       └──────────────────┘
                 │                          │
                 ▼                          ▼
       ┌──────────────────┐       ┌──────────────────┐
       │   Your Codebase  │       │  Grafana Stack   │
       └──────────────────┘       └──────────────────┘

One AI investigation agent, two MCP servers, both sides of the picture.

Setup Example
// Investigation agent with dual MCP integration.
// `Agent` stands in for the agent class of your MCP-capable framework;
// import it from there, and expect option names to differ between frameworks.
const investigationAgent = new Agent({
  model: "claude-sonnet-4-6",
  mcpServers: [
    // CodeAlive MCP: code, dependencies, PRs.
    { name: "codealive", url: "https://mcp.codealive.ai/api/",
      auth: { token: process.env.CODEALIVE_API_KEY } },
    // Grafana MCP: metrics, logs, traces.
    { name: "grafana", url: "http://localhost:3001/mcp",
      auth: { token: process.env.GRAFANA_API_KEY } },
  ],
  systemPrompt: `You are an incident investigation agent.
    Use Grafana MCP to gather observability data.
    Use CodeAlive MCP to understand code context.
    Combine both to identify root causes.`,
});

// Start an investigation from the alert details and time range.
await investigationAgent.investigate({
  alert: "Payment failure rate exceeded threshold",
  timeRange: "last 2 hours",
  service: "payment-service",
});
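
The setup above reads both API keys from the environment, so a missing key would only surface when the first tool call fails. A small guard, sketched here with a hypothetical `requireEnv` helper, fails fast at startup instead. The environment is passed in as a plain object so the helper is easy to test.

```typescript
type Env = Record<string, string | undefined>;

// Return the requested variables, or throw one error naming every missing one.
function requireEnv(env: Env, names: string[]): Record<string, string> {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const values: Record<string, string> = {};
  for (const name of names) {
    values[name] = env[name] as string; // presence was checked above
  }
  return values;
}
```

Calling `requireEnv(process.env, ["CODEALIVE_API_KEY", "GRAFANA_API_KEY"])` before constructing the agent would stop a misconfigured deployment before any alert reaches it.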

Wire both MCP servers into one agent and let it investigate.