Creating a Threshold Monitor

Note

The Monitors v2 engine is currently in private preview. Contact your Observe Data Engineer to enable this feature flag. See documentation for Monitors v1.

The Threshold monitor allows you to build expressions for monitoring numeric thresholds from metric, log, or other types of data. For instance, you might want to monitor CPU usage for an application and receive an alert when it exceeds a specified threshold.

Thresholds are ideal for metrics data, but can be used with any dataset that has a single numeric column representing a metric value.

From an Explorer

The Log, Metric, and Trace Explorers all offer a Create Monitor button, which starts the creation of a Threshold monitor.

Start with a scenario that you want to alert on. For instance, you might use one of the following scenarios:

Metrics Explorer

Open Metrics Explorer from the left navigation rail and select kube_pod_container_status_waiting from kubernetes/Pod Metrics. Use the filter bar to limit the data to labels.reason = CrashLoopBackOff. Set the data aggregation to Sum by Container Average over 1m. You should now see a chart of Kubernetes containers backing off from crash loop detection, which is ideally a very small number or zero.

To create a Threshold monitor from this data, click Create a monitor in the top right of the Explorer to open the Monitor creation form with a preconfigured input.

Log Explorer

Open Log Explorer from the left navigation rail and select kubernetes/Event. Use the filter bar to limit the data to sourceComponent = "kubelet" and reason = "ProbeWarning". You should now see a list of Kubernetes containers warning that their readiness probes had to be terminated, which is ideally uncommon. The Count column will be used as the threshold metric.

To create a Threshold monitor from this data, click Create a monitor in the top right of the Explorer to open the Monitor creation form with a preconfigured input.

From a Worksheet

Start with a scenario that you want to alert on. For instance, you might have a more complex query for Kubernetes Probe warnings that also looks up the DevOps team responsible and the service names affected. To create a Threshold monitor, click the ellipsis button at the top right of the active stage on the left side of the screen, then click Create a monitor and Threshold monitor to open the Monitor creation form with a preconfigured input.

Note

While Monitors v2 is in preview, only single-stage worksheets can be used for Monitors. Collapse multiple stages to a single stage before proceeding.

See Monitors Introduction for more details on alerting rules and actions.

From the Monitors List

Click New Monitor on the top right, then Threshold. Select a dataset to proceed.

Monitor name

Name the monitor before proceeding. Monitors must have a unique name within the instance. You can prepend a name with an App name and a slash for organizational purposes.

Monitor query

No matter how you started a Threshold monitor, the remaining steps are the same. First, review the Monitor query to ensure that it is gathering the data that you intend to monitor. Use the time selector at the top right of the preview panel to adjust the previewed time range. You have access to the entire set of Observe data manipulation tools: click Chart to organize the data, use filters to trim it, and add queries or formulas to enrich the monitor.

Queries and time

A monitor applies to a sliding window of time. As new data arrives and triggers an evaluation, the monitor query will:

  • Start at the time set by the stabilization delay, if configured. See Monitor query, Advanced options, Delay monitor evaluation to review or change.

  • Look back the amount of time set by the evaluation period. See Monitor query, Evaluate the number of rows over the last time period to review or change.

For instance, a Threshold monitor that has a stabilization delay of 5 minutes and a lookback of 10 minutes will continuously monitor a sliding window from 15 minutes ago to 5 minutes ago. A Threshold monitor with no stabilization delay and a lookback of 120 seconds will monitor from two minutes ago to now.
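The window arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the two examples in the text, not Observe's implementation; the function name `evaluation_window` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def evaluation_window(now, lookback, stabilization_delay=timedelta(0)):
    """Return (start, end) of the sliding window evaluated at `now`.

    Hypothetical helper: the window ends at `now` minus the stabilization
    delay and reaches back by the lookback (evaluation) period.
    """
    end = now - stabilization_delay
    start = end - lookback
    return start, end


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# 5-minute delay, 10-minute lookback: 15 minutes ago to 5 minutes ago.
start, end = evaluation_window(now, timedelta(minutes=10), timedelta(minutes=5))
print(start.time(), end.time())  # 11:45:00 11:55:00

# No delay, 120-second lookback: two minutes ago to now.
start, end = evaluation_window(now, timedelta(seconds=120))
print(start.time(), end.time())  # 11:58:00 12:00:00
```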

Rules

You can construct multiple rules in a monitor, using conditional tests from the data to set a severity level. The preview panel will update in real time so you can review where your rules are matching.

Threshold monitors accept rules based on the count of rows within the sliding evaluation window. Given a ten-minute window, if the count matches the condition in a rule at any point in that window, the rule triggers and an alert is created.

To further constrain matches or set severity by grouped values in your data, click “For any group” and select a grouped value. For instance, if you are grouping by an Alarm ID, select “Alarm ID”, choose “equal to” or “not equal to”, and enter an Alarm ID.
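As a rough mental model of how several rules with different severities and an optional group constraint might be evaluated over one window, consider the sketch below. It assumes a simplified rule shape; the `Rule` class, its fields, the severity names, and the sample group values are all hypothetical and do not reflect Observe's actual rule schema or evaluation engine.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    # Hypothetical rule shape, used only for illustration.
    compare: Callable[[float, float], bool]  # e.g. operator.gt
    threshold: float
    severity: str
    group_equals: Optional[str] = None       # optional "For any group" constraint


def matching_severities(rules, points):
    """Return the severities of rules matched by any (group, value) point in the window."""
    matched = []
    for rule in rules:
        for group, value in points:
            if rule.group_equals is not None and group != rule.group_equals:
                continue  # this point belongs to a group the rule ignores
            if rule.compare(value, rule.threshold):
                matched.append(rule.severity)  # one match in the window is enough
                break
    return matched


rules = [
    Rule(operator.gt, 80.0, "Warning"),
    Rule(operator.gt, 95.0, "Critical", group_equals="alarm-42"),
]
points = [("alarm-7", 97.0), ("alarm-42", 85.0)]
print(matching_severities(rules, points))  # ['Warning']
```

Here the Critical rule does not match: the only value above 95 belongs to a different group than the one the rule is constrained to.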

Description

Use the Monitor description field as a free-form text entry to inform users, link runbooks, or tag monitors. You can search Monitors or alerts by the contents of this field.

Notification actions

Once an alert is created, it can trigger notifications. If no notification is configured, the alert will still be visible in monitoring logs and Alert Explorer.

Observe supports Email, Slack, PagerDuty, and generic Webhook actions. For each action, use the Conditions area to select the matching severities that will trigger this action. For instance, you might use Slack for a Warning, but PagerDuty for a Critical.
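The severity-based routing described above can be pictured as a simple lookup. The sketch below is illustrative only: the dictionary layout and the `actions_for` helper are hypothetical and are not Observe's configuration format, which is set through the Conditions area in the UI.

```python
# Hypothetical action definitions: each lists the severities that trigger it.
actions = [
    {"type": "slack", "severities": {"Warning"}},
    {"type": "pagerduty", "severities": {"Critical"}},
    {"type": "email", "severities": {"Warning", "Critical"}},
]


def actions_for(severity, actions):
    """Return the action types whose conditions include this severity."""
    return [a["type"] for a in actions if severity in a["severities"]]


print(actions_for("Warning", actions))   # ['slack', 'email']
print(actions_for("Critical", actions))  # ['pagerduty', 'email']
```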

All actions can use Observe’s extended Mustache formatting to refer to data. See Customizing Alert Messages for details.

Actions can send reminders on a periodic basis; this can be useful for Slack or Email to larger teams. Click Send Reminders beneath the action to select a time frame, such as “1 day”. Mustache variables can be used to control these alternate behaviors.

Actions can send end notifications, which are frequently used to close a ticket in a receiving system such as PagerDuty or OpsGenie. Click Send an update when the monitor has stopped triggering beneath the action to enable this. Mustache variables can be used to control these alternate behaviors.

Once configured, an action can be shared with your team members as a Saved Action, by clicking Share action with team in the title row. See Shared Actions for more information.

Sample Values

When a Threshold monitor produces an alert, it is useful to know by how much the threshold was breached. Observe will include a representative sample value from the breaching data.