Observe Performance Cookbook: Create Intermediate Datasets

Problem

A complex dataset takes a long time to materialize. It may have trouble meeting the intended dataset freshness level.

Solution

Where possible, create an intermediate event dataset by publishing partially shaped data as a new event dataset. For example, consider a dataset that starts with IAM access logs containing IP addresses, filters them by an infrastructure provider list and a threat intelligence list using Basic Threat Intel, and then performs a geographic lookup.

Explanation

Queries and further derived datasets typically have to read much less data than if you create them directly on top of the original input dataset. In our example, breaking the steps of the dataset construction into datasets that filter results reduces the number of rows passed to the next dataset. This also makes it easier to inspect outcomes and verify that results are as expected.

This technique is especially effective if the intermediate dataset applies a selective filter to the input dataset. For instance, you might pick a subset of input columns with pick_col, or extract JSON paths from an input column before dropping it, or use timechart or statsby to reduce volume.