Negative Monitoring¶
DevOps teams faced with Observability needs are often challenged to answer seemingly simple questions, such as “is the system down?” While it is straightforward to monitor for metric values or counts of logs breaching thresholds, it can be challenging to tell if needed data is late, out of order, or permanently missing.
Because no one wants to be paged for a false positive, it is worth spending some effort to ensure that negative monitors are designed to account for data production and delivery problems.
Host Is Down: Heartbeats and Crash Signals¶
The ideal answer for a negative monitoring question is to make it into a positive monitoring problem. Monitoring for reliably bad states such as Crash Looping in Kubernetes or Kernel Panic in Linux, or monitoring for a regular heartbeat or metric delivery from agents can indicate quickly that a system has stopped functioning.
Data Is Missing: Stabilization Delays¶
The second most ideal answer is to minimize transformation and maximize delivery assurance. Basing monitors off of data earlier in the chain of dataset definitions can reduce the amount of work that Observe is doing to prepare that data. However, all the systems upstream of Observe can also produce issues. Upstream delays or misordering of data can lead to false positives from negative monitors. To address that concern, use the stabilization delay option. In the edit view of a monitor, go to Monitor query, Advanced options, and adjust the Delay monitor evaluation value. This option shifts the monitor’s evaluation window back so that potential upstream delivery issues can settle before an alarm is fired. Note that the lookback period of a monitor is measured from the stabilization delay.
For example, historical analysis of a log datastream may show an average delta of 90 seconds between the origination timestamp in the records and the BUNDLE_TIMESTAMP
when Observe received the data. In this case, setting the stabilization delay of a negative monitor to two minutes would allow the upstream system adequate time to deliver its records. If that monitor’s lookback is 10 minutes, it will effectively monitor a sliding window from 2 minutes ago to 12 minutes ago.