Performance#

Working directly in OPAL allows a wider range of options when modeling data. The following recommendations may give you better performance from your OPAL pipelines.

Limit query time window size#

By default, worksheets read 4 hours of data. Depending on the input dataset, that can be a lot of data. Consider reducing the query time window to 1 hour or less while actively modeling.

Create intermediate datasets#

Where possible, create an intermediate event dataset by publishing partially shaped data as a new event dataset. Queries and further derived datasets typically have to read much less data than if you create them directly on top of the original input dataset.

This technique is especially effective if the intermediate dataset applies a selective filter to the input dataset. For instance, you might pick a subset of input columns with pick_col, extract JSON paths from an input column before dropping it, or use timechart or statsby to reduce data volume.
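As a sketch, an intermediate dataset might combine these techniques: filter to a subset of events, keep only the needed columns, extract a JSON field with an explicit cast, and drop the raw JSON column. The column and field names here (container, FIELDS, cluster_uid) are hypothetical; substitute your own:

filter container = "api-server"
pick_col timestamp, log, FIELDS
make_col cluster_uid:string(FIELDS.cluster_uid)
drop_col FIELDS

Publishing the result of this pipeline as a new event dataset lets downstream queries read the smaller, pre-shaped data instead of the full source dataset.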

A field extracted from a JSON column has the variant type. Processing variant values can be significantly more expensive during query and transform execution. Observe recommends casting fields extracted from a JSON column to an explicit type (for instance, use make_col cluster_uid:string(FIELDS.cluster_uid) instead of make_col cluster_uid:FIELDS.cluster_uid).

Limit make_resource time range#

By default, the make_resource verb reads a large time range of input events: 24 hours. The reason for this behavior is that make_resource must compute the state of each resource at the beginning of the query time range, and, by default, it looks for events up to 24 hours in the past. Thus, a query with make_resource that has a query time range of 4 hours actually reads at least 28 hours of input data.

Reading 24+ hours of input can be a lot of data, especially if the input dataset is a source dataset from a Datastream. In particular, avoid defining resource datasets directly on source datasets; create filtered-down intermediate datasets instead.

Most resource types receive events much more frequently than every 24 hours. We recommend adding options(expiry:duration_hr(...)) to your make_resource command to reduce its lookback where appropriate.

For example, if you know that the live instances of some resource dataset receive events at least every 15 minutes, it would be appropriate to set the resource expiration to 1 hour, thereby greatly reducing the amount of data read by make_resource:

make_resource options(expiry:duration_hr(1)), col1:col1, primary_key(pk1, pk2)

Apply filtering verbs (filter, ever, exists) early in an OPAL script#

These verbs reduce the data volume and make downstream operations faster. The optimizer strives to perform this optimization automatically, but it is still good practice to do it manually.

Do

filter stream="stderr"
make_col kvs:parse_kvs(log)
make_col cid:string(kvs.customer_id)

Don’t

make_col kvs:parse_kvs(log)
make_col cid:string(kvs.customer_id)
filter stream="stderr"

Use flatten_single rather than flatten if you only need single-level flattening#

By default, flatten recursively flattens every nested array and object, which is more expensive than flatten_single, which flattens only the current level.
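For example, to expand only the top-level elements of a nested JSON column, flatten_single can replace flatten. The column names here (FIELDS, items) are hypothetical:

make_col items:FIELDS.items
flatten_single items

If the elements of items are themselves objects, extract the specific fields you need from each flattened row (with explicit casts, as recommended above) rather than recursively flattening the entire structure.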