Getting Started with Observe

Overview

As a Site Reliability Engineer (SRE) for your online retail company, Astronomy Shop, you ensure site uptime and accessibility for your customers. The Astronomy Shop services are running in Kubernetes, and have been instrumented with OpenTelemetry. You’ve been receiving problem reports from customers indicating that the new and important “You May Also Like” feature may be broken. As no alerts have fired yet, you will want to do some initial triage of the issue via Observe.

What is in this data?

For this tutorial, you will be using datasets bundled with the Demodata app. These datasets receive logs, metrics, and traces from Astronomy Shop services in real time. You can interact with the application by going to https://astronomy.sandbox.sockshop.biz/. After logging into Observe and locating the Demodata/Container Logs dataset in Logs Explorer, you will open it and perform a few simple operations to better assess the situation.

Note

This tutorial is based on data bundled with the Observe free trial, via the Demodata app. The Demodata app is a fork of Observe’s Kubernetes app, and is purely for educational purposes.

Let’s Get Started

Ad-hoc Log Investigation

1. Log into Observe and click Investigate > Logs.

2. Enter Demodata/Container in the Search bar.

Figure 1 - Search for Container Logs Dataset

3. Click on the Container Logs Dataset to open it.

4. You should now see a list of containers on the left side of Logs Explorer. Adjust the time range to Past 4 hours, and select recommendationservice from the list. The Query Builder bar updates automatically to reflect your selection, filtering the logs down to just the recommendationservice container over the last 4 hours. You can also filter on specific column values via the filter action in each column; try filtering down to just stderr in the stream column.

Figure 2 - Filter down to recommendationservice & expand time range

5. In the Query Builder bar, type the term WARNING and select the Contains warning option (an equivalent OPAL sketch for the filters in steps 4 and 5 follows Figure 3).

Figure 3 - Search for the term WARNING in container logs
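
If you are curious what those point-and-click selections look like as a query, the sketch below expresses the same filters in OPAL, Observe's query language, which you will write directly later in this tutorial. This is a minimal sketch rather than the exact query the Query Builder generates; in particular, the contains() string function is an assumption, while label(^Container), stream, and log come from the Container Logs dataset you are already viewing.

// a minimal OPAL sketch of the filters from steps 4 and 5 (not the exact Query Builder output)
filter label(^Container)="recommendationservice"
// keep only stderr output
filter stream="stderr"
// assumption: contains() does substring matching on the raw log column
filter contains(log, "WARNING")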

Pivot to Metrics

1. The initial set of logs doesn't point to an obvious root cause. You will now use Observe's pivot functionality: hover over a table cell in the Container column, click the green context menu (you can also double-click the cell), and select Inspect.

Figure 4 - Pivot into details about recommendationservice container

2. You should now see the Inspect Rail on the right side of the screen. Click the Metrics tab and type memory into the search box.

Figure 5 - Inspect metrics for recommendation service from Log Explorer

3. At a glance, you can see that memory utilization for recommendationservice spikes at regular intervals. To drill down further, open a direct link to Metrics Explorer by clicking the open-in-new-tab icon next to any metric name.

4. In your new tab, you will see Metrics Explorer with the metric you selected, filtered to your recommendationservice container, and using the same time range as your Log Explorer query.

Figure 6 - Metrics Explorer Drill Down

Create a dashboard

1. Now that you've seen some unexpected behavior from recommendationservice, it makes sense to save this view as a dashboard to refer back to later. Click the Actions menu in the top right of Metrics Explorer, select Add to dashboard, and then select Create new dashboard.

Note

There are multiple ways to share context in Observe when doing ad-hoc investigation. While dashboards are very common, you can also save any metric or log query as a monitor, or simply share your current query and associated context as a link.

Figure 7 - Add Metrics Explorer Query to Dashboard

2. In a new tab you will see an unnamed dashboard. Click the pencil icon in the title field and rename your dashboard to Reco Service Memory. Note that your time range was carried over but remains editable. You can edit your new Metrics Explorer panel, as well as add additional panels from other datasets (logs, traces, resources, etc.). Remember to click Save changes in the top right, and then Leave editor.

Figure 8 - Reco Service Memory Dashboard

Shaping and Joining Data

Now that you’ve uncovered one potential symptom, let’s use Observe to create a custom Log Dataset that joins in Span and Trace data for your Recommendation Service.

1. Navigate to the Datasets lister page and find the Container Logs dataset under the Demodata namespace. Click the “Open in New Worksheet” icon.

Figure 9 - Open Container Logs in a New Worksheet

2. With your Worksheet open, bring up the OPAL Console from the bottom of the page, and select the Inputs tab. Click Add Input and search for Trace, then select the Trace dataset under the Demodata/otel namespace. Repeat this for the Span Dataset as well, again selecting from the Demodata/otel namespace.

Figure 10 - Adding an Input to a worksheet

3. Update the name values for your new inputs to be trace and span, and set their Role to be Reference.

Figure 11 - Rename your newly added Worksheet sources

4. Switch to the OPAL tab in the console, paste the following OPAL script in, and click run.

// filter down to just your recommendation service container logs
filter label(^Container)="recommendationservice"
// parse the raw log message and extract your span and trace IDs
make_col log_reco: parse_json(log)
make_col otelTraceID:string(log_reco.otelTraceID),
         otelSpanID:string(log_reco.otelSpanID),
         appmessage:string(log_reco.message)
// keep only logs that have a trace ID (filter out logs without traces)
filter not is_null(otelTraceID)
// hide your parsed values, just to keep things tidy
set_col_visible log_reco:false
// use the power of linking / joining to wire your trace and span info directly into your logs
set_link ^"Demodata/otel/Trace", otelTraceID: @trace.trace_id
set_link ^"Demodata/otel/Span", otelSpanID: @span.span_id, otelTraceID:@span.trace_id
// export this to the Explorer via the interface verb
interface "log", "log":log

Each major section has comments explaining what each OPAL command is doing to shape the data. The most powerful command in your script is set_link, which creates two linked columns, Demodata/otel/Trace and Demodata/otel/Span, joining your Trace and Span datasets to the logs emitted by recommendationservice. You can now publish your newly shaped logs by clicking Publish New Dataset. Name the new dataset Demodata/Reco Service App Logs.

Figure 12 - Shaping, Linking and Publishing your new Logs

5. Navigate back to Logs Explorer and locate your newly published Demodata/Reco Service App Logs. Your new columns are now available to pivot from, and the message field has been parsed into its own appmessage column. Click the drop-down arrow in the top right of one of the cells in the Demodata/otel/Trace column, and click the Trace ID under Selected resource > Instance Page. This should open a new tab.

Figure 13 - Pivoting from Logs to Traces

Note

Double-clicking the linked Demodata/otel/Trace or Demodata/otel/Span columns also gives you immediate access to summary data about traces and spans.

6. In your new tab, you will see your Trace Resource dashboard. Double-click the span grpc.hipstershop.RecommendationService/ListRecommendations associated with the frontend service. The right rail will contain a summary of the trace data and other attributes. You can also pivot into other Resource dashboards, such as Operation.

Figure 14 - Trace Resource Dashboard

Note

As service failures are intermittent, you may not immediately see a trace with an error. In the Log Explorer, you can filter down to just the appmessage values that contain cache miss (as sketched below); those log lines are more likely to link to traces with errors.
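
For reference, here is a minimal OPAL sketch of that filter on your published dataset, assuming the contains() string function is available (the Query Builder can produce the same result through the UI):

// hypothetical sketch: keep only log lines whose parsed message mentions a cache miss
filter contains(appmessage, "cache miss")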

Congratulations! Now that you have tied together logs, metrics, and traces, you can feel confident that the recommendation service is likely suffering from out-of-memory (OOM) issues, possibly due to application changes that have been exacerbated by a recent increase in user traffic. For now, you've paged the on-call developer for the service and will start on the inevitable Post Incident Review.