Negative Monitoring

DevOps teams faced with observability needs are often challenged to answer seemingly simple questions, such as “is the system down?” While it is straightforward to monitor for metric values or log counts breaching thresholds, it is much harder to tell whether needed data is late, out of order, or permanently missing.

Because no one wants to be paged for a false positive, it is worth spending some effort to ensure that negative monitors are designed to account for data production and delivery problems.

Host Is Down: Heartbeats and Crash Signals

The ideal answer to a negative monitoring question is to turn it into a positive monitoring problem. Monitoring for reliably bad states, such as crash looping in Kubernetes or a kernel panic in Linux, or watching for a regular heartbeat or metric delivery from agents, can quickly indicate that a system has stopped functioning.
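
For instance, a positive monitor for crash looping might watch an event dataset of Kubernetes container state changes. The following OPAL is a minimal sketch under assumed names: the reason and podName columns are hypothetical and will differ depending on how your Kubernetes data is shaped.

// minimal sketch: surface containers reporting CrashLoopBackOff
// `reason` and `podName` are assumed column names, for illustration only
filter reason = "CrashLoopBackOff"
// count occurrences per pod so a Threshold monitor can alert
// when the count rises above zero
timechart 1m, crashloop_count:count(), group_by(podName)

A Threshold monitor on crashloop_count then fires on the presence of a bad state rather than on the absence of data.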

Data Is Missing: Stabilization Delays

The next best answer is to minimize transformation and maximize delivery assurance. Basing monitors on data earlier in the chain of dataset definitions reduces the amount of work Observe must do to prepare that data. However, the systems upstream of Observe can also cause problems: upstream delays or misordering of data can lead to false positives from negative monitors. To address that concern, use the stabilization delay option. In the edit view of a monitor, go to Monitor query, then Advanced options, and adjust the Delay monitor evaluation value. This option shifts the monitor’s evaluation window back so that potential upstream delivery issues can settle before an alarm fires. Note that the lookback period of a monitor is measured from the stabilization delay.

For example, historical analysis of a log datastream may show an average delta of 90 seconds between the origination timestamp in the records and the BUNDLE_TIMESTAMP when Observe received the data. In this case, setting the stabilization delay of a negative monitor to two minutes would allow the upstream system adequate time to deliver its records. If that monitor’s lookback is 10 minutes, it will effectively monitor a sliding window from 2 minutes ago to 12 minutes ago.
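
To estimate that delta for your own datastream, one lightweight approach is to compute the per-record lag in a worksheet and summarize it. The sketch below is illustrative only; it assumes the record’s origination time lives in a column named timestamp, which will vary by dataset.

// minimal sketch: compute per-record delivery lag as a duration
// `timestamp` (the origination time) is an assumed column name;
// BUNDLE_TIMESTAMP is when Observe received the record's bundle
make_col delivery_lag:BUNDLE_TIMESTAMP - timestamp

Averaging delivery_lag over a representative period, and noting its worst case, gives a defensible starting point for the stabilization delay.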

Example: Heartbeat Negative Monitor

A common negative monitoring scenario is to alert when one or many systems have reported no data for a specific interval. In this example, we create a negative monitor based on the up metric contained in the Host Quickstart Metric dataset. The monitoring configuration uses a 10-second lookback frame, aggregates by SourceAttributes.host.name values, and alerts when there is no up metric data for more than 5 minutes. The initial query can be built with the Expression Builder, which automatically produces OPAL similar to the following:

align 1m, frame(back: 10s), A_up_sum:sum(m("up"))
aggregate A_up_sum:sum(A_up_sum), group_by(SourceAttributes."host.name")

This OPAL is then combined with a subquery that first fills gaps in the data with null points and then converts those null points to the value 0 for our up metric. The finished negative monitoring OPAL will look similar to the following, and has been commented to explain each step. You can use this pattern for any monitor of type Threshold.

// convert the UI-generated OPAL into a parent query
// using the @name <- @{ ... } syntax
// we will call this parent query metric_process
// note that dot-separated values like "host.name" get converted
// to underscore-separated names - in this case host.name becomes host_name
@metric_process <- @{
    // this is the query initially produced by the 
    // expression builder
    align 1m, frame(back: 10s), A_up_sum:sum(m("up"))
    aggregate A_up_sum:sum(A_up_sum), group_by(SourceAttributes."host.name")
}
// add the following sub-query OPAL
<- @metric_process {
    // use make_event to create synthetic events
    make_event
    // use make_resource to create a synthetic resource with two configs
    // options(expiry) should be the align value from the parent query,
    // in this case 1m, plus up to 2 hours into the future
    // primary_key should correspond to the group_by key in the parent
    make_resource options(expiry:1m+2h), primary_key(host_name)
    // then perform a leftjoin, joining the parent query's key
    // @metric_process.host_name to the current subquery's @.host_name
    // (remember that host.name gets converted to host_name)
    // the if_null statement also converts null values/missing data to zeros
    leftjoin on(@.host_name = @metric_process.host_name), A_telemetryspcsevent_table_metric_heartbeat_sum: if_null(@metric_process.A_up_sum, 0)
    // lastly, use timechart so the result renders properly in the monitor preview line chart
    timechart 1m, A_telemetryspcsevent_table_metric_heartbeat_sum:any(A_telemetryspcsevent_table_metric_heartbeat_sum), group_by(host_name)
}

Note that if the data for your metric regularly arrives late, you can use the Advanced options > Delay monitor evaluation by... feature to give the monitoring engine a more stable window of data to evaluate.