Kubernetes

Installation

Getting started

To proceed with this step, you will need a customer ID and token.

Observe provides a manifest which installs all the necessary components for collecting telemetry data from Kubernetes. This manifest can be retrieved directly from https://api.observeinc.com/v1/kubernetes/manifest.

At its simplest, the install process can be reduced to two steps:

$ kubectl apply -f https://api.observeinc.com/v1/kubernetes/manifest && \
	kubectl -n observe create secret generic credentials \
	--from-literal=customer=${OBSERVE_CUSTOMER?} \
	--from-literal=token=${OBSERVE_TOKEN?}

This example is for illustrative purposes only. In production environments, we recommend downloading the manifest separately and tracking changes over time using your configuration management tool of choice. If you have a preferred installation process which you would like us to support, please let us know on Slack.
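One way to follow this recommendation is to keep a reviewed copy of the manifest in version control and replace it only when the upstream manifest changes. The sketch below assumes curl and cmp are available; the function names are hypothetical and not part of any Observe tooling.

```shell
# Sketch: keep a reviewed copy of the manifest under version control.
MANIFEST_URL="https://api.observeinc.com/v1/kubernetes/manifest"

# fetch_manifest DEST: download the manifest to DEST, failing on HTTP errors.
fetch_manifest() {
  curl --fail --silent --show-error --location --output "$1" "$MANIFEST_URL"
}

# manifest_changed TRACKED NEW: succeed when NEW differs from TRACKED,
# or when TRACKED does not exist yet.
manifest_changed() {
  [ ! -f "$1" ] || ! cmp -s "$1" "$2"
}
```

A typical run would call fetch_manifest with a temporary path and, if manifest_changed reports a difference, review the diff, commit the new file, and apply the committed copy with kubectl apply -f.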

By default, our manifest creates an observe namespace which contains all our collection infrastructure. Only once this namespace exists can we create a secret containing the appropriate credentials for sending data to Observe.

If you are monitoring multiple clusters, it is useful to provide a human-readable name for each one. You can attach an identifier by providing an observeinc.com/cluster-name annotation:

$ kubectl annotate namespace observe observeinc.com/cluster-name="My Cluster"

Validating installation

Once your manifest is applied, you can wait for all pods within the namespace to be ready:

$ kubectl wait pods -n observe --for=condition=Ready --all

To verify data is streaming out correctly, you can check the egress logs going through the proxy:

$ kubectl logs -n observe -l name=observe-proxy
172.20.59.66 - - [01/Jul/2020:17:33:44 +0000] "POST /v1/http/kubernetes/logs HTTP/1.1" 202 11
172.20.51.151 - - [01/Jul/2020:17:33:44 +0000] "POST /v1/http/kubernetes/events HTTP/1.1" 202 11
172.20.51.151 - - [01/Jul/2020:17:33:44 +0000] "POST /v1/http/kubernetes/logs HTTP/1.1" 202 11
172.20.46.135 - - [01/Jul/2020:17:33:45 +0000] "POST /v1/http/kubernetes/logs HTTP/1.1" 202 11
172.20.59.52 - - [01/Jul/2020:17:33:45 +0000] "POST /v1/http/kubernetes/logs HTTP/1.1" 202 11

You should see requests logged in Apache access log format. All requests are forwarded towards /v1/http/kubernetes/*. If your credentials are incorrect, you may see 401 status codes.
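To spot rejected requests quickly, the access log can be summarized by status code. This is a minimal sketch that assumes the log format shown above, where the status code is the ninth whitespace-separated field; count_statuses is a hypothetical helper name.

```shell
# count_statuses: read access-log lines on stdin and print "<status> <count>"
# for each HTTP status code, sorted by status.
count_statuses() {
  awk '{ counts[$9]++ } END { for (s in counts) print s, counts[s] }' | sort
}

# Usage (requires cluster access):
#   kubectl logs -n observe -l name=observe-proxy | count_statuses
```

Any 401 entries in the summary suggest that the credentials secret should be checked.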

Overriding defaults

Our manifest is derived from a template which is populated at query time. You can override certain settings through the use of query parameters. For example, to install our agent into the kube-system namespace instead, you would override the namespace parameter:

$ kubectl apply -f 'https://api.observeinc.com/v1/kubernetes/manifest?namespace=kube-system'

To override multiple parameters:

$ kubectl apply -f 'https://api.observeinc.com/v1/kubernetes/manifest?namespace=kube-system&multiline=true'
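When several parameters are involved, it can help to assemble the URL programmatically so that it is always passed as a single quoted argument. The helper below is a hypothetical sketch; it joins key=value pairs as given and does not URL-encode them.

```shell
# manifest_url [key=value ...]: print the manifest URL with the given
# query parameters appended. Values are NOT URL-encoded.
manifest_url() (
  base="https://api.observeinc.com/v1/kubernetes/manifest"
  query=""
  for pair in "$@"; do
    query="${query:+${query}&}${pair}"
  done
  printf '%s%s\n' "$base" "${query:+?${query}}"
)

# Usage (requires cluster access):
#   kubectl apply -f "$(manifest_url namespace=kube-system multiline=true)"
```

Keeping the parameter list in one place also makes it easier to reuse exactly the same URL later.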

The following table documents accepted query parameters:

| Parameter | Default | Description |
|---|---|---|
| version | 0.0.2 | Manifest version. You may want to specify this value explicitly to avoid breaking changes when applying updates. |
| namespace | observe | Namespace used for install. The manifest will create a new namespace only if its name starts with observe. All other values will be assumed to belong to an existing namespace. |
| collector | collect.observeinc.com | API endpoint to send data to. |
| coordination | true | Enable use of lease locks. This must be disabled for Kubernetes versions older than 1.14. |
| prometheus | | Port number for prometheus proxying. This feature is experimental and disabled by default. Please contact support for more information. |
| zipkin | | Port number for zipkin proxying. This feature is experimental and disabled by default. Please contact support for more information. |
| startupProbe | false | Enable the use of a startupProbe that does end-to-end validation that data can be submitted to Observe before marking a proxy as ready. This is only supported in Kubernetes 1.16 onward. |
| proxyReplicas | 1 | Number of replicas in proxy deployment. If you hit resource limits, we recommend increasing the number of replicas. |
| multiline | false | Enable multiline log parsing. When set to true, log records beginning with a whitespace are coalesced with the previous message. This option is disabled by default as the heuristic used is susceptible to false positives. |
| otel | false | Include an OpenTelemetry agent container as part of the observe-agent daemonset. This option is disabled by default. |
| otelVersion | latest | Specify a version of the OpenTelemetry agent to use. The default is the most recent version. |
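The multiline heuristic described above can be illustrated with a few lines of awk. This is only a sketch of the rule as documented (a record starting with whitespace is appended to the previous one), not the agent's actual parser; the literal \n joiner and the function name are choices made for this example.

```shell
# coalesce_multiline: read log lines on stdin; append any line that starts
# with a space or tab to the previous record, joined by a literal "\n".
coalesce_multiline() {
  awk '
    /^[ \t]/ && started { record = record "\\n" $0; next }
    { if (started) print record; record = $0; started = 1 }
    END { if (started) print record }
  '
}
```

With this rule, an indented stack trace collapses into the record of the error line before it. An unrelated message that happens to start with whitespace would be merged in the same way, which is the false-positive risk the parameter description warns about.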

Uninstalling

The cleanest way to remove our integration is to delete all resources included in the original installation manifest. You must include any query parameters you provided on install in order to delete the correct set of resources:

$ kubectl delete -f https://api.observeinc.com/v1/kubernetes/manifest

Manifest contents

Collection architecture

By default, the manifest creates an observe namespace, within which there are two main components:

  • the observe-proxy deployment and corresponding proxy service, through which all data egresses the cluster

  • the observe-agent daemonset, which is responsible for collecting data

All data collected by Observe goes through a proxy service, which maps to pods in the observe-proxy deployment. The proxy adds the following to every request it forwards to Observe:

  • authentication data provided in the credentials secret

  • a clusterId tag on all observation data

This avoids the need for configuring credentials for every process wishing to post data to Observe, and ensures that rotating credentials can be achieved through a single deployment rollout. Any request sent to http://proxy.observe.svc.cluster.local will be forwarded towards https://collect.observeinc.com with the appropriate Authorization header and cluster ID.
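From a workload's point of view, this means a plain HTTP request to the proxy service is enough. The following sketch assumes curl is available in the calling container and that the default observe namespace is in use; the helper names and the /v1/http/example path are hypothetical and only illustrate that no credentials are configured on the client side.

```shell
# Assumes the default "observe" namespace.
PROXY_BASE="http://proxy.observe.svc.cluster.local"

# proxy_endpoint PATH: print the in-cluster URL for a collector PATH.
proxy_endpoint() {
  printf '%s/%s\n' "$PROXY_BASE" "${1#/}"
}

# post_via_proxy PATH JSON: POST a JSON body through the proxy. No
# Authorization header is set here; per the text above, the proxy adds it.
post_via_proxy() {
  curl --fail --silent --show-error \
    --header 'Content-Type: application/json' \
    --data "$2" "$(proxy_endpoint "$1")"
}

# Usage (from a pod inside the cluster; the path is hypothetical):
#   post_via_proxy /v1/http/example '{"message": "hello"}'
```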

A final advantage of decoupling where data is collected from where data egresses the cluster is that finer-grained network policies can be applied. For example, we can restrict external network access within a cluster to a subset of nodes without affecting data collection.

The observe-agent daemonset ensures our collectors run on every node. It is responsible for collecting container logs, Kubernetes state changes, and kubelet metrics.

Container log collection

Container logs are written to /var/log/containers on the host node’s filesystem. The observe-agent daemonset runs a fluentbit container, which reads all files within this directory and parses metadata from each log filename (podName, containerName, containerId, etc).
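The metadata embedded in a log filename can be recovered with plain shell parameter expansion. This sketch assumes the conventional kubelet naming scheme, <pod>_<namespace>_<container>-<containerId>.log; it illustrates the idea and is not the agent's actual fluentbit parser configuration. The function name is hypothetical.

```shell
# parse_container_log_name FILE: print the metadata encoded in a container
# log filename of the form <pod>_<namespace>_<container>-<containerId>.log.
parse_container_log_name() (
  name=$(basename "$1" .log)
  pod=${name%%_*}            # text before the first underscore
  rest=${name#*_}
  namespace=${rest%%_*}      # text between the first and second underscore
  rest=${rest#*_}
  container_id=${rest##*-}   # text after the last hyphen
  container=${rest%-*}       # container names may themselves contain hyphens
  printf 'podName=%s namespace=%s containerName=%s containerId=%s\n' \
    "$pod" "$namespace" "$container" "$container_id"
)
```

Splitting the container ID at the last hyphen matters because pod and container names may contain hyphens, while underscores only appear as separators.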

This stream of data is batched and shipped to the /v1/http/kubernetes/logs endpoint. As a result, container logs will show up in Observe with kind http and path /kubernetes/logs.

The fluentbit container is additionally configured to track its current state in a SQLite3 database mounted on the host node. This allows log processing to resume seamlessly across pod restarts, so any update to the observe-agent daemonset can be rolled out safely.

Kubernetes state changes

The Kubernetes API allows watching for changes to any resource type. Our agent runs a kubelog container which subscribes to all resource changes, and emits them in JSON format. Rather than submit this data directly to the proxy, we instead write the data to the fluentbit container in the same pod. This allows us to reuse the same batching and retry logic we have in place for shipping container logs.

If every node were to run kubelog concurrently, we would get multiple copies of the same set of events. Instead, we ensure that only one kubelog is running at any given time through the use of a Kubernetes LeaseLock. This is more convenient than managing a separate deployment, since it reduces the number of moving parts in our manifest and maintains the abstraction of running a single agent per node.

The Lease type in the coordination.k8s.io API group was only promoted to v1 in Kubernetes 1.14. For legacy Kubernetes versions, please contact support.

Kubernetes API updates are streamed from fluentbit to Observe over the /v1/http/kubernetes/events HTTP endpoint. As a result, this data will appear in the Observation table with kind http and path /kubernetes/events.

Kubelet metrics

This collection method is experimental and disabled by default.

The kubelet agent runs on every node, and exposes a set of metrics over an HTTPS endpoint. If metrics collection is enabled, the observe-agent pod will have an additional telegraf container. This container will periodically poll kubelet for metrics, and submit the data directly to the proxy under /v1/http/kubernetes/telegraf.

FAQ

What Kubernetes versions do you support?

Kubernetes maintains a concept of “supported versions”, which are the three most recent minor releases. For example, if the most recent release is 1.18, the supported Kubernetes versions are 1.18, 1.17 and 1.16.

Our support policy is as follows:

As of July 2020, the oldest release we officially support is 1.13. In order to do so, you will need to disable the use of lease locks (?coordination=false) when generating a manifest.
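The lease lock rule is easy to encode when generating manifest URLs for several clusters. A minimal sketch, assuming version strings of the form 1.MINOR or 1.MINOR.PATCH; coordination_param is a hypothetical helper name.

```shell
# coordination_param VERSION: print the coordination query parameter suited
# to a Kubernetes version. Lease locks require Kubernetes 1.14 or newer.
# Assumes a major version of 1.
coordination_param() (
  minor=${1#*.}        # drop the major version: "1.13.2" -> "13.2"
  minor=${minor%%.*}   # drop any patch version: "13.2" -> "13"
  if [ "$minor" -lt 14 ]; then
    echo "coordination=false"
  else
    echo "coordination=true"
  fi
)
```

The printed value can be appended to the manifest URL as a query parameter.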

What container runtimes do you support?

The container runtime only affects log collection. Our current fluentbit configuration has been validated on both docker and containerd. Other runtimes or older versions may require minor configuration adjustments - please reach out to support for assistance.

Can collection be restricted to a specific namespace?

We do not currently support this. Collecting Kubernetes state in particular requires accessing resources that are not namespaced, such as node and persistentvolume. We may be able to restrict log collection to specific namespaces - if you are interested in this feature please contact support.

How can I disable scheduling the agent on a specific node?

In order to provide full coverage of your cluster, the observe-agent daemonset is by design scheduled onto all nodes. If you wish to remove it from a subset of nodes, you can add a taint:

$ kubectl taint nodes ${NODENAME?} observeinc.com/unschedulable

This taint is only verified during scheduling. If an observe-agent pod is already running on the node, you will have to delete it manually:

$ kubectl delete pods -n observe -l name=observe-agent --field-selector=spec.nodeName=${NODENAME?}

Retry on failure

Fluent Bit retries on 5XX and 429 Too Many Requests errors. It will stop reading new log data when its buffer fills and resume when possible. During such failures, however, kubelog memory usage will increase; if the failure is prolonged, kubelog may run out of memory.

Fluent Bit does not retry on other 4XX errors.
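The retry policy above amounts to a simple classification of HTTP status codes. The function below restates that rule as a sketch, for example for use when scanning proxy logs; it is not Fluent Bit's implementation, and should_retry is a hypothetical name.

```shell
# should_retry STATUS: succeed (exit 0) for status codes that are retried
# according to the policy above: 429 and any 5XX. Everything else fails.
should_retry() {
  case "$1" in
    429|5[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}
```

Under this rule a 401 caused by incorrect credentials is not retried, so data submitted while credentials are invalid should not be expected to arrive later.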