histogram
Type of operation: Aggregate
Description
Generate an approximated equi-width histogram for the selected expression.
Histograms are useful to visualize your data, as they are an approximation of the underlying data density distribution.
Based on the minimum and maximum values of the expression in the input, the range is divided into contiguous, equal-sized “bins”. For each bin, the verb counts the number of input rows that fall into it.
The returned dataset contains bins rows (one row per bin) and the following columns:
- bin_start: starting value of the bin (inclusive)
- bin_end: end value of the bin (exclusive)
- bin_count: the value of each bin (the number of items in the bin)
bin_start and bin_end have the same type as the input column; bin_count is always an integer.
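The binning described above can be sketched in Python. This is only an illustration under stated assumptions, not the verb's implementation: it counts exactly (the verb approximates), the function name equi_width_histogram is hypothetical, and the handling of an empty input and of a single-valued input is an assumption made for the sketch.

```python
from typing import List


def equi_width_histogram(values: List[float], bins: int = 10) -> List[dict]:
    """Exact equi-width histogram returning bin_start, bin_end, bin_count rows."""
    if bins <= 0:
        raise ValueError("bins must be a positive integer")
    if not values:
        return []  # assumption: no input rows produce no bins
    lo, hi = min(values), max(values)
    if lo == hi:
        # Assumption: a zero-width range collapses into a single bin.
        return [{"bin_start": lo, "bin_end": hi, "bin_count": len(values)}]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the index so the maximum value is counted in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [
        {
            "bin_start": lo + i * width,
            "bin_end": lo + (i + 1) * width,
            "bin_count": counts[i],
        }
        for i in range(bins)
    ]


rows = equi_width_histogram([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], bins=4)
```

In this sketch the last bin also holds the maximum value, so its bin_end is not strictly exclusive.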
Histograms generated on Int64, Duration, Timestamp and Numeric columns are guaranteed to have integer bin_start and bin_end, and therefore will not generate more bins than the data range (max_value - min_value + 1). Moreover, the width of an arbitrary number of bins at the end of the histogram might be increased by 1, to keep the bins contiguous and to cover the whole data range without exceeding it.
When the input is not large enough to generate the desired number of bins, the verb generates the maximum number of bins possible for that input.
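One way to picture the integer case is the following Python sketch. It is an assumption about how such edges could be laid out, not the verb's actual algorithm: the function name integer_bin_edges is hypothetical, and the choice to widen exactly the last bins by 1 is one reading of the description above.

```python
from typing import List, Tuple


def integer_bin_edges(min_value: int, max_value: int, bins: int = 10) -> List[Tuple[int, int]]:
    """Return contiguous (bin_start, bin_end) pairs with integer edges."""
    if bins <= 0:
        raise ValueError("bins must be a positive integer")
    if max_value < min_value:
        raise ValueError("max_value must not be smaller than min_value")
    span = max_value - min_value + 1   # number of distinct integer values
    n = min(bins, span)                # never more bins than the data range
    base, extra = divmod(span, n)      # `extra` bins need one more unit of width
    edges = []
    start = min_value
    for i in range(n):
        # Assumption: the widened bins are the last `extra` ones.
        width = base + 1 if i >= n - extra else base
        edges.append((start, start + width))  # bin_end is exclusive
        start += width
    return edges


print(integer_bin_edges(0, 9, bins=4))   # [(0, 2), (2, 4), (4, 7), (7, 10)]
print(integer_bin_edges(5, 7, bins=10))  # [(5, 6), (6, 7), (7, 8)]
```

The second call requests 10 bins but only 3 integer values exist in the range, so the sketch produces 3 bins.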
Unlike with integer numbers, the exclusivity of bin_end is not guaranteed for Float64 values.
You can control whether empty bins (bins with a count of 0) appear in the output dataset via options(include_empty: <bool>) (default true).
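The effect of include_empty can be pictured as a filter over the output rows. The Python sketch below is illustrative only; the function name apply_include_empty and the row layout (dictionaries keyed by the column names above) are assumptions.

```python
from typing import List


def apply_include_empty(rows: List[dict], include_empty: bool = True) -> List[dict]:
    """Keep every bin, or drop bins whose bin_count is 0."""
    if include_empty:
        return list(rows)
    return [row for row in rows if row["bin_count"] != 0]


rows = [
    {"bin_start": 0, "bin_end": 5, "bin_count": 3},
    {"bin_start": 5, "bin_end": 10, "bin_count": 0},
    {"bin_start": 10, "bin_end": 15, "bin_count": 1},
]
non_empty = apply_include_empty(rows, include_empty=False)
```

With include_empty set to false the remaining bins are no longer contiguous, which matters if the output feeds a chart that expects evenly spaced bars.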
options(bins: <positive-integer>) controls how many bins the data is distributed into. Defaults to 10.
Histogram counts are approximate in the current implementation, because the verb uses tdigest to calculate the values of the bins.
The approximation is particularly evident when the input data is highly skewed.
Usage
histogram [ options ], col
Argument | Type | Optional | Repeatable | Restrictions |
---|---|---|---|---|
options | options | yes | no | constant |
col | numeric, duration, or tdigest | no | no | none |
Options
Option | Type | Meaning |
---|---|---|
bins | int64 | How many bins to use for the histogram. Defaults to 10. |
include_empty | bool | Whether to include empty bins or not. Defaults to true. |
Accelerable
histogram is never accelerable. A dataset that only uses accelerable verbs can be accelerated, making queries on the dataset respond faster.
Examples
histogram temperature
Creates a histogram with 10 bins that represent the distribution of values in the temperature column.
histogram options(bins:25), temperature
Creates a histogram with 25 bins that represent the distribution of temperatures.
histogram options(include_empty: false, bins: 100), receivedBytes
Creates a histogram with 100 bins that represent the distribution of received payload sizes, omitting empty bins.