📘
Note
This is a private preview feature. For more information, contact your Observe representative, or open Help > Contact support in the product.

What is AI SRE?

Observe AI SRE is your AI-powered site reliability engineering (SRE) observability assistant. Use AI SRE to investigate incidents, debug slow or error-prone services, and summarize observability insights across logs, metrics, and traces, all through natural language.

AI SRE securely uses Claude and OpenAI models. It relies on the same Observe MCP Server used by agentic tools like Augment, Cursor, Claude Code, Claude Desktop, OpenAI Codex CLI, and Windsurf, so that AI SRE can reason directly over your telemetry data.

How it works

Access AI SRE by selecting AI SRE in the navigation bar.

Begin by asking AI SRE a question in the text field, such as Why did latency spike last night? or Show the slowest services in my cluster.

AI SRE interprets your intent using Snowflake Cortex and calls the Observe MCP Server.
The MCP Server accesses your Context Graph Tool and Query Generation Tool to find relevant Datasets (logs, metrics, traces, resources, correlations).
Observe executes the query and returns summarized insights directly in the chat.

Knowledge Graph updates

The AI SRE Knowledge Graph refreshes twice daily at 01:00 UTC and 13:00 UTC. After a refresh completes, it may take an additional 2–3 hours for updated content to become available for search.

Changes to your environment, such as adding or updating reference tables, modifying Dataset definitions, or adjusting links, are reflected in the Knowledge Graph after the next scheduled refresh.
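As a rough illustration of the timing described above, the following Python sketch estimates when a change should be picked up and when it should be searchable at the latest. It is not part of any Observe API: the function names and the refresh_duration parameter are hypothetical, and because this page does not state how long a refresh takes, the sketch assumes zero duration unless you supply your own estimate.

```python
from datetime import datetime, time, timedelta, timezone

# Scheduled refresh start times (UTC), as stated on this page.
REFRESH_TIMES_UTC = (time(1, 0), time(13, 0))
# Upper end of the 2-3 hour post-refresh delay stated on this page.
MAX_SEARCH_DELAY = timedelta(hours=3)


def next_refresh(change_time: datetime) -> datetime:
    """Return the first scheduled refresh start strictly after change_time."""
    if change_time.tzinfo is None:
        raise ValueError("change_time must be timezone-aware")
    change_utc = change_time.astimezone(timezone.utc)
    # Check today's slots first, then tomorrow's, in chronological order.
    for day_offset in (0, 1):
        day = change_utc.date() + timedelta(days=day_offset)
        for slot in REFRESH_TIMES_UTC:
            candidate = datetime.combine(day, slot, tzinfo=timezone.utc)
            if candidate > change_utc:
                return candidate
    raise AssertionError("unreachable: a slot always exists within two days")


def latest_expected_searchable(
    change_time: datetime,
    refresh_duration: timedelta = timedelta(0),  # assumption: unknown, defaults to zero
) -> datetime:
    """Estimate the latest time the change should become searchable."""
    return next_refresh(change_time) + refresh_duration + MAX_SEARCH_DELAY
```

For example, a change saved at 14:30 UTC on 2024-05-01 misses the 13:00 UTC refresh, so next_refresh returns 01:00 UTC on 2024-05-02, and latest_expected_searchable returns 04:00 UTC that day under the zero-duration assumption.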

To view the Knowledge Graph, perform the following steps:

Hover on your name in the left navigation, then select Manage account.
Click AI SRE settings.
The Knowledge graph tab should be selected by default. If it is not, select it.

📘
Note
If you need to verify that your changes have been picked up, ask AI SRE a question that references the updated content after the next scheduled refresh window.

Try these questions

To help you get started, try asking AI SRE the following questions:

Why did error rate increase after deployment?
Show top 5 services by latency this morning.
Which Lambda functions are slow or error-prone?
What resources had CPU bottlenecks?
Summarize service health in the last 24 hours.