What is best practice for managing the schema interface between datasets?¶
Pick columns on output¶
When managing multiple datasets, you will often need to verify the schema of the datasets you depend on.
If you build a dataset via the UI, you may end up dropping columns here and there while shaping the data, but that does little to make the resulting schema legible.
By convention, we explicitly pick the columns that make up the dataset's schema towards the end of the dataset definition, using pick_col.
Bad Terraform OPAL:
inputs = {
  "datastream" = var.datastream.dataset.oid
}

stage {
  input    = "datastream"
  pipeline = <<-EOF
    filter OBSERVATION_KIND = "elastic"
    filter string(FIELDS.agent.type) = "metricbeat"
    make_col
      hostname:string(FIELDS.host.hostname),
      host_id:string(FIELDS.host.id)
  EOF
}
Good Terraform OPAL:
inputs = {
  "datastream" = var.datastream.dataset.oid
}

stage {
  input    = "datastream"
  pipeline = <<-EOF
    filter OBSERVATION_KIND = "elastic"
    filter string(FIELDS.agent.type) = "metricbeat"
    pick_col
      BUNDLE_TIMESTAMP,
      hostname:string(FIELDS.host.hostname),
      host_id:string(FIELDS.host.id)
  EOF
}
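For context, the inputs and stage blocks above normally sit inside a dataset resource definition. The sketch below is a non-authoritative example that assumes the Observe Terraform provider's observe_dataset resource and its oid attribute; the workspace reference, resource and dataset names, and the downstream is_null filter are illustrative assumptions rather than part of the example above. It shows why picking columns matters: a downstream dataset can depend on the picked columns (BUNDLE_TIMESTAMP, hostname, host_id) as a stable interface instead of reaching into the raw FIELDS payload.

# Sketch only: the resource shape and the oid attribute assume the Observe
# Terraform provider; the workspace reference and names are placeholders.
resource "observe_dataset" "metricbeat_hosts" {
  workspace = var.workspace.oid
  name      = "Metricbeat Hosts"

  inputs = {
    "datastream" = var.datastream.dataset.oid
  }

  stage {
    input    = "datastream"
    pipeline = <<-EOF
      filter OBSERVATION_KIND = "elastic"
      filter string(FIELDS.agent.type) = "metricbeat"
      pick_col
        BUNDLE_TIMESTAMP,
        hostname:string(FIELDS.host.hostname),
        host_id:string(FIELDS.host.id)
    EOF
  }
}

# A consumer only sees the picked schema, so its pipeline stays stable even if
# the shape of the upstream FIELDS payload changes.
resource "observe_dataset" "hosts_with_id" {
  workspace = var.workspace.oid
  name      = "Hosts With ID"

  inputs = {
    "hosts" = observe_dataset.metricbeat_hosts.oid
  }

  stage {
    input    = "hosts"
    pipeline = <<-EOF
      filter not is_null(host_id)
    EOF
  }
}

Because the upstream dataset ends with pick_col, the schema that hosts_with_id depends on is spelled out in one place, which is the point of picking columns on output.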