Observe Performance Cookbook: Prefer Join to Lookup

Problem

Using lookup in explorers or worksheets that use Resources or Intervals performs more slowly than expected.

Solution

Using the OPAL editor, change lookup usage to join and re-verify that the use cases are still working as desired.

Explanation

A join is a relational inner join, while a lookup is a relational left outer join. The difference is most prominent when dealing with Resource datasets as inputs. A lookup will always return all the resources and their full time intervals of the left input (first Input dataset in OPAL), even for those resources and time intervals where no match is found in the right input. A join instead performs an intersection of all the interval rows that make up the resources, and will only return the time intervals where both resources were live.

Be aware that intersected intervals can become very small, so queries may produce incomplete results, especially when multiple join verbs are stacked up without resetting the valid-from and valid-to columns in between.

OPAL Join Verb Efficiency

Join verbs link multiple datasets together. They may generate more, fewer, or the same number of output rows as input rows, depending on the join condition and verb.

Joins are a powerful tool that can increase the amount of data collected and credits consumed. In order to keep your queries efficient, it is useful to check that the join being used is the most efficient one that will work for your use case.

From most expensive to least expensive:

  • lookup or leftjoin between two Resource or Interval datasets

    • This uses an interval overlapping band-join and an additional temporal left outer join. Prefer join to lookup in this case.

  • join between two Resource or Interval datasets

    • This only uses an interval overlapping band-join

  • lookup, leftjoin, or join between one Event and one Resource or Interval dataset (with or without window functions)

  • join or leftjoin between two Event datasets with window functions

    • These all use a point-in-interval band-join

  • lookup, leftjoin, or join between two Event datasets (without window functions), or between one Table and another dataset.

    • These use simple relational joins