histogram

Type of operation: Aggregate

Description

Generate an approximate equi-width histogram for the selected expression.

Histograms are useful for visualizing your data, as they approximate the underlying data density distribution.

Based on the minimum and maximum value of the expression in the input, divide the range into contiguous, equal-sized “bins”. For each bin, count the number of input rows that fall into that bin. The returned dataset contains bins rows (one per bin) and the following columns:

  • bin_start: starting value of the bin (inclusive)

  • bin_end: end value of the bin (exclusive)

  • bin_count: the number of input rows that fall into the bin.

bin_start and bin_end have the same type as the input column; bin_count is always an integer.
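The binning described above can be sketched in a few lines of Python. This is an illustration only, not the verb's implementation: it counts rows exactly instead of approximating with tdigest, the function name equi_width_histogram is hypothetical, and placing the maximum value in the last bin is an assumption made so that every row is counted.

```python
from typing import List, Tuple

def equi_width_histogram(values: List[float], bins: int = 10) -> List[Tuple[float, float, int]]:
    """Return (bin_start, bin_end, bin_count) rows using exact counting."""
    if not values or bins <= 0:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate range: everything lands in a single bin.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins  # every bin has the same width
    counts = [0] * bins
    for v in values:
        # Assumption: the maximum value is clamped into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(bins)]

rows = equi_width_histogram([1.0, 2.0, 2.5, 3.0, 9.0], bins=4)
for bin_start, bin_end, bin_count in rows:
    print(bin_start, bin_end, bin_count)
```

With this input the range 1.0 to 9.0 is split into four bins of width 2.0, and the third bin is empty.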

Histograms generated on Int64, Duration, Timestamp and Numeric columns are guaranteed to have integer bin_start and bin_end, and therefore never contain more bins than the data range (max_value - min_value + 1). To keep the bins contiguous and to cover the whole data range without exceeding it, the width of some bins at the end of the histogram may be increased by 1. When the input is not large enough to generate the requested number of bins, the verb generates the maximum number of bins possible for that input.

Unlike for integer values, the exclusivity of bin_end is not guaranteed for Float64 values.
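One way to picture the integer case is the Python sketch below. It is an assumption about how such edges could be laid out, not the verb's actual algorithm: the function name integer_bin_edges is hypothetical, and it assumes the last bin_end equals max_value + 1 so that the maximum stays inside an exclusive upper bound.

```python
from typing import List, Tuple

def integer_bin_edges(min_value: int, max_value: int, bins: int = 10) -> List[Tuple[int, int]]:
    """Return (bin_start, bin_end) pairs with integer edges, bin_end exclusive."""
    span = max_value - min_value + 1   # number of distinct integer values in the range
    n = min(bins, span)                # never more bins than the data range
    base, extra = divmod(span, n)      # `extra` bins must be one unit wider
    edges = []
    start = min_value
    for i in range(n):
        # Assumption: the wider bins are the last `extra` ones.
        width = base + 1 if i >= n - extra else base
        edges.append((start, start + width))
        start += width
    return edges

print(integer_bin_edges(0, 10, bins=4))   # 11 values spread over 4 bins
print(integer_bin_edges(0, 2, bins=10))   # only 3 values, so only 3 bins
```

In the first call the 11 integer values cannot be split evenly into 4 bins, so the last three bins are one unit wider than the first. In the second call the requested 10 bins are reduced to 3.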

You can control the presence of empty bins (bins with a count of 0) in the output dataset via options(include_empty: <bool>) (default true).

options(bins: <positive-integer>) controls how many bins the data is distributed into. Defaults to 10.

Histogram counts are approximate in the current implementation, because the verb uses tdigest to calculate the values of the bins.
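The effect of include_empty can be thought of as a filter over the output rows. The Python sketch below is illustrative only: the helper name apply_include_empty and the sample rows are hypothetical, and the default of true follows the description above.

```python
from typing import List, Tuple

Row = Tuple[float, float, int]  # (bin_start, bin_end, bin_count)

def apply_include_empty(rows: List[Row], include_empty: bool = True) -> List[Row]:
    """Keep every bin, or drop bins whose count is 0 when include_empty is False."""
    if include_empty:
        return list(rows)
    return [row for row in rows if row[2] > 0]

# Hypothetical histogram output with one empty bin.
sample_rows = [(1.0, 3.0, 3), (3.0, 5.0, 1), (5.0, 7.0, 0), (7.0, 9.0, 1)]
print(apply_include_empty(sample_rows))                       # all four bins
print(apply_include_empty(sample_rows, include_empty=False))  # empty bin dropped
```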

The approximation is particularly evident when the input data is highly skewed.

Usage

histogram [ options ], col

| Argument | Type | Optional | Repeatable | Restrictions |
|----------|------|----------|------------|--------------|
| options | options | yes | no | constant |
| col | numeric, duration, or tdigest | no | no | none |

Options

| Option | Type | Meaning |
|--------|------|---------|
| bins | int64 | How many bins to use for the histogram. Defaults to 10. |
| include_empty | bool | Whether to include empty bins or not. Defaults to true. |

Accelerable

histogram is never accelerable. A dataset that only uses accelerable verbs can be accelerated, making queries on the dataset respond faster.

Examples

histogram temperature

Creates a histogram with 10 bins that represent the distribution of values in the temperature column.

histogram options(bins:25), temperature

Creates a histogram with 25 bins that represent the distribution of temperatures.

histogram options(include_empty: false, bins: 100), receivedBytes

Creates a histogram with 100 bins that represent the distribution of received payload sizes, omitting empty bins.