AI SRE
Learn how to use the AI SRE observability assistant.
Note: This is a private preview feature. For more information, contact your Observe representative or open Help > Contact support in the product.
What is AI SRE?
Observe AI SRE is your AI-powered site reliability engineering (SRE) observability assistant. Use AI SRE to investigate incidents, debug slow or error-prone services, and summarize observability insights across logs, metrics, and traces, all through natural language.
AI SRE uses Claude and OpenAI models securely. It relies on the same Observe MCP Server used by agentic tools like Augment, Cursor, Claude Code, Claude Desktop, OpenAI Codex CLI, and Windsurf, so that AI SRE can reason directly over your telemetry data.
How it works
Access AI SRE by selecting AI SRE in the navigation bar.
Begin by asking AI SRE a question in the text field, such as "Why did latency spike last night?" or "Show the slowest services in my cluster."
- AI SRE interprets your intent using Claude Sonnet 4.5 and calls the Observe MCP Server.
- The MCP Server accesses your Context Graph Tool and Query Generation Tool to find relevant Datasets (logs, metrics, traces, resources, correlations).
- Observe executes the query and returns summarized insights directly in the chat.
Context Graph updates
The AI SRE Context Graph (Knowledge Graph) refreshes twice daily at 01:00 UTC and 13:00 UTC. After a refresh completes, it may take an additional 2–3 hours for updated content to become available for search.
Changes to your environment, such as adding or updating reference tables, modifying Dataset definitions, or adjusting links, are reflected in the Context Graph after the next scheduled refresh.
Note: To verify that your changes have been picked up, wait until after the next scheduled refresh window, then ask AI SRE a question that references the updated content.
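The refresh schedule above lends itself to a quick calculation. The following Python sketch estimates when a change should be searchable, using only the times stated in this section (01:00 UTC and 13:00 UTC refreshes, plus up to 3 hours of additional delay). It is an illustration, not part of the product: the function names are hypothetical, and the sketch assumes the refresh itself takes negligible time, which this page does not state.

```python
from datetime import datetime, time, timedelta, timezone

# Refresh start times stated in this section (UTC).
REFRESH_TIMES_UTC = (time(1, 0), time(13, 0))
# Upper bound of the stated 2-3 hour delay after a refresh completes.
MAX_INDEXING_DELAY = timedelta(hours=3)


def next_refresh(change_time: datetime) -> datetime:
    """Return the first scheduled refresh start strictly after change_time.

    change_time must be timezone-aware. A change made exactly at a refresh
    time is assumed (conservatively) to miss that refresh.
    """
    change_utc = change_time.astimezone(timezone.utc)
    for day_offset in (0, 1):
        day = change_utc.date() + timedelta(days=day_offset)
        for refresh_time in REFRESH_TIMES_UTC:
            candidate = datetime.combine(day, refresh_time, tzinfo=timezone.utc)
            if candidate > change_utc:
                return candidate
    raise AssertionError("unreachable: a refresh occurs at least every 24 hours")


def estimated_searchable_by(change_time: datetime) -> datetime:
    """Estimate the latest time the change should be searchable.

    Assumption: refresh duration is ignored, so the real time may be later.
    """
    return next_refresh(change_time) + MAX_INDEXING_DELAY


if __name__ == "__main__":
    # Hypothetical example: a Dataset definition edited at 14:30 UTC.
    changed = datetime(2025, 1, 15, 14, 30, tzinfo=timezone.utc)
    print("Next refresh:  ", next_refresh(changed).isoformat())
    print("Searchable by: ", estimated_searchable_by(changed).isoformat())
```

In this example the 13:00 UTC refresh has already passed, so the change waits for the 01:00 UTC refresh on the following day. Treat the result as a rough lower bound on how long to wait before asking AI SRE about the updated content.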
Try these questions
Try asking AI SRE the following questions to get started:
- Why did error rate increase after deployment?
- Show top 5 services by latency this morning.
- Which Lambda functions are slow or error-prone?
- What resources had CPU bottlenecks?
- Summarize service health in the last 24 hours.