Manage application data volume¶
With the Observe Agent, you can filter and sample trace data in order to optimize ingest volumes, egress costs, and query performance with Observe.
Filtering¶
Filtering is especially useful when working with noisy instrumentation that generates a large amount of internal spans or excessive span events. This is also useful when instrumentation adds redundant spans or span attributes.
Observe APM requires certain spans and span events (see APM data collection). If these spans are filtered out, APM may not work correctly.
Example observe-agent.yaml configuration that drops some spans and span events based on their container, application, or protocol, using the filter processor that is bundled with the Observe Agent:
otel_config_overrides:
  processors:
    filter/ottl:
      error_mode: ignore
      traces:
        span:
          - 'attributes["container.name"] == "app_container_1"'
          - 'resource.attributes["host.name"] == "localhost"'
          - 'name == "app_3"'
        spanevent:
          - 'attributes["grpc"] == true'
          - 'IsMatch(name, ".*grpc.*")'
  service:
    pipelines:
      traces/forward:
        receivers: [otlp]
        processors: [resourcedetection, resourcedetection/cloud, filter/ottl]
        exporters: [otlphttp/observe]
Example observe-agent.yaml configuration that filters out some attributes related to EC2 instances, using the attributes processor that is bundled with the Observe Agent:
otel_config_overrides:
  processors:
    attributes:
      actions:
        - key: "instance_size"
          action: "delete"
        - key: "instance_os"
          action: "delete"
  service:
    pipelines:
      traces/forward:
        receivers: [otlp]
        processors: [resourcedetection, resourcedetection/cloud, attributes]
        exporters: [otlphttp/observe]
Sampling¶
When sampling span data, apply sampling rules to entire traces rather than to individual spans; otherwise traces arrive incomplete and become difficult to debug in Observe.
Note that Observe APM calculates metrics based on ingested spans only; sampling prior to ingest, especially if it is non-uniform, will result in metric skew.
For further reading, check out the OpenTelemetry docs on sampling.
Uniform sampling¶
The simplest form of sampling is uniform sampling: sample one in every N traces.
Example observe-agent.yaml configuration that samples 1 in every 10 traces, thereby reducing ingest volume by 90%, using the probabilistic sampler processor that is bundled with the Observe Agent. Because the sampler derives its decision from the trace ID, all spans of a given trace receive the same decision:
otel_config_overrides:
  processors:
    probabilistic_sampler:
      sampling_percentage: 10
      mode: "proportional"
  service:
    pipelines:
      traces/forward:
        receivers: [otlp]
        processors: [resourcedetection, resourcedetection/cloud, probabilistic_sampler]
        exporters: [otlphttp/observe]
Tail sampling¶
Tail sampling allows you to implement more sophisticated sampling rules, such as sampling 1 in every 100 “normal” traces while capturing every “slow” or “error” trace to aid in debugging, or keeping every trace that comes from a particular critical service.
Tail sampling requires the same collector instance to process all of the spans for a given trace. This means you’ll need to set up an additional deployment of OpenTelemetry Collectors in a gateway pattern and configure each of your Observe Agents to forward all of their data to the same collector instance in the gateway.
OTel Collector gateway¶
Start with a single instance of the OpenTelemetry Collector as your gateway to validate tail sampling functionality. Configure the gateway with the tail sampling processor and ensure it receives data from upstream Observe Agents.
Example configuration:
receivers:
  # Receive OTLP data forwarded from the upstream Observe Agents.
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  tail_sampling:
    policies:
      - name: drop-k8s-probes
        type: string_attribute
        string_attribute:
          key: http.user_agent
          values: [^kube-probe, ^kube-resource-report]
          enabled_regex_matching: true
          invert_match: true
      - name: keep-error-traces
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: keep-slow-traces
        type: latency
        latency:
          threshold_ms: 5000
      - name: heavily-sample-everything-else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
exporters:
  otlphttp:
    endpoint: <OBSERVE_ENDPOINT>
    headers:
      authorization: <OBSERVE_TOKEN>
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlphttp]
See Configure the Collector to export to Observe’s OTLP endpoint to configure the exporter.
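Tail sampling policies can also be used to always keep traces from a particular critical service: a string_attribute policy keyed on the service.name resource attribute can be added to the policy list above. A minimal sketch, with checkout-service as a hypothetical service name:
processors:
  tail_sampling:
    policies:
      # Keep every trace that contains spans from the critical service (hypothetical name).
      - name: keep-critical-service
        type: string_attribute
        string_attribute:
          key: service.name
          values: [checkout-service]
      # ... the remaining policies from the example above ...
Because a trace is generally kept when any single policy decides to sample it, matching traces are retained even when the probabilistic policy alone would have dropped them.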
Tuning and scaling the gateway¶
As you move from a proof of concept to production, the single gateway collector may need to be tuned or scaled to handle higher trace volumes.
Monitor the CPU and memory usage of the gateway:
CPU utilization: Aim to keep CPU usage below 80%. Sustained usage above 80% or frequent spikes to 90-100% indicate that additional resources or instances are needed.
Memory usage: The tail sampling processor retains spans for active traces in memory. Monitor memory consumption and ensure it stays well below the system limit to avoid out-of-memory crashes or data loss.
Manage the gateway’s memory by tuning the max number of traces that can be stored in memory, and the maximum length of a trace. See the documentation on tuning the tail sampling processor for more details.
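A minimal sketch of those tuning knobs, using the tail sampling processor’s decision_wait and num_traces settings; the values shown are illustrative starting points rather than recommendations:
processors:
  tail_sampling:
    # How long spans are buffered before a sampling decision is made for a trace.
    # Longer waits capture late-arriving spans but hold more data in memory.
    decision_wait: 30s
    # Maximum number of traces held in memory at once; raise this (along with the
    # collector's memory allocation) to handle higher trace volumes.
    num_traces: 100000
    policies:
      # ... same policies as in the gateway example above ...
      - name: heavily-sample-everything-else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10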
You can scale the gateway collector vertically (add more CPU/memory) or horizontally (add more instances).
Tradeoffs:
Vertical scaling is simpler to set up but limited by machine size and lacks redundancy.
Horizontal scaling provides better fault tolerance and supports higher volumes, but requires a load balancer to ensure all spans for a trace ID are routed to the same instance.
Observe Agent exporter configuration when tail sampling¶
Configure your Observe Agents to export to the gateway instead of to Observe directly:
otel_config_overrides:
  exporters:
    otlphttp/observe:
      endpoint: http://<gateway-collector-host>:4318
When using multiple gateway collector instances, all spans for a given trace must be processed by the same collector instance to ensure correct tail-based sampling. The OpenTelemetry documentation recommends using the loadbalancingexporter for this, but we do not recommend it because it relies on DNS-based load balancing, which introduces the following issues:
Unpredictable DNS behavior: the OS-level resolver and DNS caching behaviors can result in uneven distribution or outdated routing.
Client-side load balancing: this is inherently less robust than server-side load balancing, lacking advanced features like health checks, connection pooling, and traffic draining.
Operational complexity: requires maintaining accurate and responsive DNS records for the collector cluster.
We instead recommend using a server-side load balancer like HAProxy, NGINX, or a cloud-based load balancer to route spans to gateway collectors based on trace ID. Here is an example HAProxy configuration:
frontend otlp_front
    bind *:4317
    use_backend otlp_backend

backend otlp_backend
    balance roundrobin
    hash-type consistent
    server collector1 <collector1-host>:4317 check
    server collector2 <collector2-host>:4317 check