Monitors and Alerts¶
Observe Monitors provide a flexible way to alert when conditions are matched in your data. You can use multiple rules and conditional tests to determine the severity of a condition, route to different destinations based on severity, and reactively or proactively mute alerting.
The Version 2 Upgrade¶
The Observe Monitoring engine upgrade brings many valuable new capabilities to your Observability Cloud experience, but safely changing monitoring systems takes caution. When Monitors v2 is enabled via feature flag, your existing monitoring system continues to run and can still be managed independently of Monitors v2. To access Monitors v1 from a Monitors v2 enabled instance, go to Monitors and click View Monitors v1 at the top right. All features of the v1 system are accessible there.
What does a Monitor do?¶
A Monitor watches a dataset for a particular condition, such as a count of events or a specific text value. When you create a Monitor, Observe makes a new dataset based on the contents of the source dataset and your conditions. This allows multiple Monitors built on the same dataset to be independent of each other.
There are three monitoring types available:
Threshold - Send an alert when a value crosses a threshold over a period of time. A Threshold monitor is ideal for metrics correlations (CPU is high and transactions are low), logging alerts (errors are high), or negative monitors (service data is no longer arriving).
Count - Send an alert when the count of matching records crosses a threshold over a period of time. A Count monitor is ideal for measuring the volume of matching data rather than its contents.
Promote - Send the data as an alert when new matching data arrives.
When does a Monitor execute?¶
Observe Monitors execute as soon as data is available. Incoming data is ingested, shaped into a Dataset, and is then ready for use. A Monitor is just another Dataset, and it evaluates when data is added to it.
Some data sources cannot produce data in a timely and ordered fashion. If your monitor relies on data that is not complete for a few minutes, there are some approaches to solving this problem:
Delayed evaluation: In the Monitor Query section of the monitor, under Advanced options, introduce a delay before evaluation.
Freshness goal adjustment: Use Acceleration Manager to reduce the Monitor’s freshness goal.
OPAL editing: Use OPAL to prevent evaluation of the “ragged right edge” data. The window and frame functions can be used together to filter the evaluated data set.
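The approaches above share one idea: hold back the newest data until it has had time to settle. The following is a minimal Python sketch of that idea only, not Observe's implementation and not OPAL. The `Record` type, the `settled_records` helper, and the five-minute delay are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    timestamp: datetime
    value: float


def settled_records(records, now, delay):
    """Return only records older than `now - delay`.

    Hypothetical helper: it drops the "ragged right edge", the most
    recent slice of data that may still be incomplete.
    """
    cutoff = now - delay
    return [r for r in records if r.timestamp <= cutoff]


now = datetime(2024, 1, 1, 12, 0, 0)
records = [
    Record(now - timedelta(minutes=10), 1.0),
    Record(now - timedelta(minutes=6), 2.0),
    Record(now - timedelta(minutes=1), 3.0),  # still inside the ragged edge
]
evaluated = settled_records(records, now, timedelta(minutes=5))
print([r.value for r in evaluated])
```

A longer delay trades alert speed for completeness, so the delay should be no larger than the time the source typically needs to deliver its data.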
Monitors Overview¶
When you select Monitors from the left-hand rail, Observe displays a list of existing Monitors configured on your instance.
You can filter and search monitors by any attribute, including the Description field. Use Created By and Modified By to search by people who have worked on a Monitor.
Click the Last Triggered value to see Alerts generated by this Monitor.
Clicking a Monitor opens it in read-only mode; click the Edit button at top right to change its definition if your account has access. The read-only page for a Monitor includes access to that Monitor’s logs and insight into its metrics, so you can evaluate its performance.
There are five Status options for monitors:
Running - the Monitor is active and healthy
Triggering - the Monitor is active and has currently active Alerts
Degraded - the Monitor has recognized data issues or has been unable to send notifications
Error - the Monitor has not been able to execute
Inactive - the Monitor is disabled by an administrator
You can access v1 Monitors, Shared Actions, Mute Windows, and the New Monitor creation tool from the top right.
Types of Monitors¶
There are three types of monitors in Monitors v2:
Threshold¶
The threshold monitor alerts when a value crosses a threshold over a period of time. Thresholds are ideal for metrics data, where a numeric value is set as part of the dataset definition. You can also use other datasets, such as logs, traces, or resources, by selecting a single numeric column as the metric value.
For example, here are some threshold monitor use cases:
Alert when the CrashLoopBackOff metric in Kubernetes Pod Metrics is high
Alert when the bytesSent in AWS S3 Access Logs is higher or lower than expected
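To make the threshold idea concrete, here is a small Python sketch that evaluates one numeric column over a lookback window. It is an illustration under stated assumptions rather than how Observe evaluates monitors: the function name `threshold_breached`, the use of the mean as the aggregate, and the sample CPU values are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def threshold_breached(points, now, lookback, threshold):
    """Return True when the mean value in the lookback window exceeds the threshold.

    `points` is a list of (timestamp, value) pairs. Using the mean is an
    assumption for this sketch; a real monitor may aggregate differently.
    """
    window = [value for ts, value in points if now - lookback <= ts <= now]
    if not window:
        return False  # no data in the window, so nothing to compare
    return mean(window) > threshold


now = datetime(2024, 1, 1, 12, 0)
cpu = [
    (now - timedelta(minutes=12), 95.0),  # outside the 5-minute window
    (now - timedelta(minutes=4), 70.0),
    (now - timedelta(minutes=2), 90.0),
    (now - timedelta(minutes=1), 95.0),
]
print(threshold_breached(cpu, now, timedelta(minutes=5), threshold=80.0))
```

Note that this sketch returns False when the window is empty. A negative monitor, which alerts when data stops arriving, would treat the empty window as the alerting case instead.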
Count¶
The count monitor alerts when the number of rows in a monitored set crosses a threshold over a period of time. Counts are ideal for measuring volumes of data instead of contents of data and are good for negative monitors.
For instance, a count monitor use case would be to alert on the number of User Access Logs matching an error condition and a URL regular expression.
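The use case above can be sketched in Python as follows. This is a hedged illustration, not Observe's evaluation logic: the row fields (`timestamp`, `status`, `url`), the rule that a status of 500 or higher counts as an error, and the function name `count_breached` are assumptions made for the example.

```python
import re
from datetime import datetime, timedelta


def count_breached(rows, now, lookback, url_pattern, max_count):
    """Return True when more than `max_count` matching rows fall in the window."""
    url_re = re.compile(url_pattern)
    matching = [
        row for row in rows
        if now - lookback <= row["timestamp"] <= now
        and row["status"] >= 500          # assumed error condition
        and url_re.search(row["url"])     # URL regular expression
    ]
    return len(matching) > max_count


now = datetime(2024, 1, 1, 12, 0)
access_logs = [
    {"timestamp": now - timedelta(minutes=3), "status": 500, "url": "/api/v1/orders"},
    {"timestamp": now - timedelta(minutes=2), "status": 503, "url": "/api/v1/cart"},
    {"timestamp": now - timedelta(minutes=2), "status": 200, "url": "/api/v1/cart"},
    {"timestamp": now - timedelta(minutes=1), "status": 500, "url": "/health"},
]
alerting = count_breached(access_logs, now, timedelta(minutes=5), r"^/api/", max_count=1)
print(alerting)
```

Only the number of matching rows matters here; the values inside the rows are used for filtering and never compared against the threshold.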
Promote¶
The promote monitor sends the matching data in a monitored set to the destination. Promotes are ideal for sending actionable alerts to human operators or analysts, because you can include all relevant data directly into the message.
Some example promote monitor use cases include:
Crash reports that link in the affected customer, responsible engineer, and triggering condition from other datasets
Customer feedback alerts that include contextual data or links to investigative tools
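As a rough illustration of the promote behavior, the Python sketch below turns every matching row into its own alert payload and carries the full row along as context. The payload shape (`summary`, `details`), the `promote` function, and the sample crash reports are hypothetical; Observe's actual alert format may differ.

```python
def promote(rows, matches):
    """Turn every matching row into its own alert payload (hypothetical shape)."""
    return [
        {"summary": f"Crash in {row['service']}", "details": dict(row)}
        for row in rows
        if matches(row)
    ]


crash_reports = [
    {"service": "checkout", "customer": "acme", "owner": "dana", "fatal": True},
    {"service": "search", "customer": "globex", "owner": "lee", "fatal": False},
]
alerts = promote(crash_reports, lambda row: row["fatal"])
print(alerts[0]["summary"])
```

Unlike the threshold and count sketches, nothing is aggregated: one matching row produces one alert.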
Conversion from earlier Monitor types¶
Monitors from legacy monitoring are not automatically converted or migrated. To plan a migration of existing monitors, contact your Observe Data Engineer.
Monitors v1 | Monitors v2 |
---|---|
Metrics Threshold | Threshold |
Log Threshold | Threshold |
Count | Count |
Text Value / Facet | Count |
Promote | Promote |
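When planning a migration, the mapping in the table above can be written as a simple lookup. The Python sketch below is only a planning aid: the dictionary mirrors the table, while the `plan_migration` helper and the sample inventory are hypothetical and do not call any Observe API.

```python
from collections import Counter

# Mirrors the conversion table: legacy monitor type -> Monitors v2 type.
V1_TO_V2 = {
    "Metrics Threshold": "Threshold",
    "Log Threshold": "Threshold",
    "Count": "Count",
    "Text Value / Facet": "Count",
    "Promote": "Promote",
}


def plan_migration(v1_types):
    """Count how many v2 monitors of each type an inventory would need."""
    return Counter(V1_TO_V2[t] for t in v1_types)


inventory = ["Metrics Threshold", "Log Threshold", "Text Value / Facet", "Promote"]
plan = plan_migration(inventory)
print(dict(plan))
```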
Muting Monitors¶
An active Observe Monitor always produces Alerts when the rules match, but you can suppress delivery of Alert notifications by muting the monitor. Observe provides two ways to mute: ad hoc and scheduled.
Ad hoc mutes - Using the context menu of a single Monitor or multi-selecting several Monitors, you can start an ad hoc mute. The selected Monitor(s) will be muted starting now for the selected time period. An ad hoc mute is good for suppressing alerts from a known issue so that you can concentrate on solving the issue.
Scheduled mutes - Scheduled mutes apply globally to all monitors that match your conditions. Click View mute windows in the top right of the Monitors page, then New mute window. Set a time range, then add key=value conditions to determine which monitors will be muted. A scheduled mute is good for preparing for planned activity, such as a deployment to a customer cluster.
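The scheduled mute logic, a time range plus key=value conditions, can be sketched in Python as shown below. This is an illustration under assumptions, not Observe's implementation: the `MuteWindow` class, the attribute names, the rule that every condition must match, and the exclusive end time are all hypothetical choices.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MuteWindow:
    start: datetime
    end: datetime
    conditions: dict = field(default_factory=dict)  # key=value pairs

    def mutes(self, monitor_attrs, at):
        """Return True if the monitor is muted by this window at time `at`.

        Assumes all conditions must match and that `end` is exclusive.
        """
        in_range = self.start <= at < self.end
        matches = all(monitor_attrs.get(k) == v for k, v in self.conditions.items())
        return in_range and matches


window = MuteWindow(
    start=datetime(2024, 1, 1, 22, 0),
    end=datetime(2024, 1, 2, 2, 0),
    conditions={"cluster": "customer-a", "env": "prod"},
)
monitor = {"name": "API errors", "cluster": "customer-a", "env": "prod"}
print(window.mutes(monitor, datetime(2024, 1, 1, 23, 30)))
```

In this sketch a mute only affects notification delivery decisions; consistent with the text above, the monitor itself would keep producing Alerts.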
Unmuting Monitors¶
An ad hoc mute is visible in the Monitors list page and can be disabled from there. Select one or more muted Monitors and use the context menu to select Unmute.
A scheduled mute must be managed from the View mute windows area at the top right of the Monitors list page. Click the button to see the list of active mutes. Delete mute windows that are no longer needed.
Note that ad hoc mutes are also visible in the Mute windows list and can be deleted from there as well.
Creating a New Monitor¶
To create a new Monitor in Observe, use the following steps:
Log into Observe and click the Monitors icon on the left side navigation.
On the Monitors page, click New Monitor.
From the Select your monitor type panel, select the type of monitor you want to create: Threshold, Count, or Promote.
Monitors can also be created from data browsing, such as Explorers or Worksheets.
In Log Explorer, click the Action menu at top right and select Create a monitor.
If the explorer visualization is raw data, a new Count monitor creation form will open using this data.
If the explorer visualization is a chart, a new Threshold monitor creation form will open using this data.
In Metrics Explorer, click the Action menu at top right and select Create a monitor. A new Threshold monitor creation form will open using this data.
In Trace Explorer, click the Action menu at top right and select Create a monitor. A new Count monitor creation form will open using this data.
On a Worksheet, click the context ellipses menu for a stage and select Create a monitor. Choose the type of monitor and proceed as above.
Note
Known Issue: Monitors v2 is currently restricted to a single stage when making a new monitor from a worksheet. Convert your worksheet’s logic to one stage to proceed.
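The defaults described above can be summarized as a small decision function. The Python sketch below only restates the documented behavior; the function name `default_monitor_type` and the string identifiers for each source and visualization are hypothetical labels, not Observe API values.

```python
def default_monitor_type(source, visualization=None):
    """Return the monitor type the creation form opens with, or None if you choose."""
    if source == "log_explorer":
        if visualization == "chart":
            return "Threshold"
        if visualization == "raw":
            return "Count"
        return None  # visualization not covered by this sketch
    if source == "metrics_explorer":
        return "Threshold"
    if source == "trace_explorer":
        return "Count"
    return None  # e.g. a Worksheet stage, where you choose the type yourself


print(default_monitor_type("log_explorer", "chart"))
```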
Reviewing a Monitor’s Data Lineage¶
You may need to review a Monitor’s data sources to understand the data it relies on, the latency of that data, and the size of the queried window over time. Balancing the speed and cost of a monitor requires controlling the freshness of the data that monitor relies on.
To review metrics such as latency and queried window size, open a monitor in read-only mode and click the Insights tab.
Upstream latency - time from ingest to monitor
Result latency - time from ingest to alarm
Evaluation time - time spent in monitor evaluation
Query window size - time frame evaluated by the monitor
To quickly assess the effective freshness of a monitor with a complex data lineage, use the Acceleration Manager. From Settings at the lower left, click Workspace Settings, Acceleration Manager, and Monitors. Sort the list by Effective Freshness. Monitors with an Effective Freshness that is worse than the Freshness Goal have a warning icon and colored display. Hover over these lines to get a context menu, and click the More icon to edit the Monitor definition. You can also access monitor definitions from the Acceleration Manager’s Monitors list page. Hover over the monitor to access the context menu, and click the Pencil icon to edit the Monitor definition. Note that v1 Monitors are on a separate page and can also be edited.
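The comparison that drives the warning icon can be sketched in a few lines of Python. This is a hedged illustration: it assumes both freshness values are durations and that "worse" means a larger duration, and the `behind_goal` function and sample monitors are hypothetical.

```python
from datetime import timedelta


def behind_goal(monitors):
    """Return monitors whose effective freshness is worse than their goal, worst first."""
    late = [m for m in monitors if m["effective_freshness"] > m["freshness_goal"]]
    return sorted(late, key=lambda m: m["effective_freshness"], reverse=True)


monitors = [
    {"name": "CPU high", "freshness_goal": timedelta(minutes=2),
     "effective_freshness": timedelta(minutes=1)},
    {"name": "Error count", "freshness_goal": timedelta(minutes=2),
     "effective_freshness": timedelta(minutes=9)},
    {"name": "Crash reports", "freshness_goal": timedelta(minutes=5),
     "effective_freshness": timedelta(minutes=6)},
]
flagged = [m["name"] for m in behind_goal(monitors)]
print(flagged)
```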
Edit the Monitor by clicking the pencil icon at top right or from the context menu, then navigate to the Monitor Query. Click Manage Inputs at the right side to list the input datasets, and click the Open or Edit buttons to review their data or definitions. Note that the Manage Inputs button is hidden when using an expression builder; click the OPAL button at the right side to expose it.