Working directly in OPAL allows a wider range of options when modeling data. Below are some recommendations for better performance from your OPAL pipelines.
Limit query time window size
By default, worksheets read 4 hours of data. Depending on the input dataset, that can be a lot of data. Consider reducing the query time window to 1 hour or less while actively modeling.
Create intermediate datasets
Where possible, create an intermediate event dataset by publishing partially shaped data as a new event dataset. Queries and further derived datasets will typically have to read much less data than if they were created directly on top of the original input dataset.
This technique is especially effective if the intermediate dataset applies a selective filter to the input dataset, picks only a subset of input columns, or extracts JSON paths from an input column and then drops the original column.
Avoid defining datasets directly on the Observation dataset, as it contains all ingested data in the workspace.
make_resource time range
By default, the make_resource verb reads a large time range of input events: 24 hours. The reason for this behavior is that make_resource must compute the state of each resource at the beginning of the query time range, and, by default, it looks for events up to 24 hours in the past. Thus, a query with make_resource that has a query time range of 4 hours actually reads at least 28 hours of input data.
More than 24 hours of input can be a lot of data, especially if the input dataset is the Observation dataset. For that reason, avoid defining resource datasets directly on the Observation dataset.
Most resource types receive events much more frequently than every 24 hours. We recommend adding options(expiry:duration_hr(...)) to your make_resource command to reduce its lookback where appropriate.
For example, if it is known that the live instances of some resource dataset receive events at least every 15 minutes, it would be appropriate to set the resource expiration to 1 hour, thereby greatly reducing the amount of data read by make_resource:
make_resource options(expiry:duration_hr(1)), col1:col1, primary_key(pk1, pk2)
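The lookback arithmetic in this section can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above, not part of OPAL: the helper names `min_input_read` and `hours` are hypothetical, and the sketch assumes that setting the expiry to 1 hour shortens the lookback to exactly 1 hour.

```python
from datetime import timedelta

# Default make_resource lookback, as stated in the text above.
DEFAULT_LOOKBACK = timedelta(hours=24)


def min_input_read(query_window: timedelta,
                   lookback: timedelta = DEFAULT_LOOKBACK) -> timedelta:
    """Lower bound on the span of input events read: lookback plus query window."""
    return lookback + query_window


def hours(span: timedelta) -> float:
    """Express a time span in hours."""
    return span.total_seconds() / 3600


window = timedelta(hours=4)

# Default behavior: 24-hour lookback on top of the 4-hour query window.
default_read = min_input_read(window)

# Assumption: expiry of 1 hour means the lookback is also 1 hour.
reduced_read = min_input_read(window, lookback=timedelta(hours=1))

print(f"default lookback: at least {hours(default_read):g} hours of input")
print(f"1-hour expiry:    at least {hours(reduced_read):g} hours of input")
```

Under these assumptions, the same 4-hour query goes from reading at least 28 hours of input to about 5 hours.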