Troubleshoot slow databases / n+1 issues

Here’s how you can use Observe APM to detect n+1 issues and other types of database slowness in your distributed service.

Let’s say you’re an on-call developer, DevOps engineer, or SRE that’s responsible for the e-commerce site & you’ve received a report of abnormal latency in the checkout service. Inspect the checkout service in Observe APM and see from the latency distribution that there are some outliers, and from the performance breakdown chart that there’s a major spike in latency starting a few minutes ago.

Service inspector for the checkout service
View the performance breakdown by downstream service to see that most of the time spent in the checkout service is actually waiting for the cart service to return a response. Time spent breakdown for the checkout service
Navigate to the downstream cart service from the service map for the checkout service. Time spent breakdown for the checkout service
The cart service is also experiencing a latency spike in the last few minutes. Service inspector for the cart service
View the performance breakdown by operation to see that most of the time spent in the cart service is LREM operations in Redis. Time spent breakdown for the cart service
Narrow down the degradation further to a single endpoint, namely the `EmptyCart` endpoint in the cart service, which had a spike in latency in the last few minutes, unlike the other two endpoints in the cart service whose latency remains stable. Endpoint list for the cart service
The performance breakdown of the `EmptyCart` endpoint reveals that the latency spike can be explained by LREM operations in Redis, which means we've narrowed down the issue to just this endpoint. EmptyCart endpoint in the cart service
It's time to look at slow traces for the `EmptyCart` endpoint. The traces tab in the endpoint inspector is automatically filtered to the slowest traces. Slow traces for the EmptyCart endpoint
The waterfall for one of these traces shows a lot of LREM calls, a telltale sign of an n+1 issue. Trace waterfall
Inspect the attributes on these LREM calls to see the specific queries being issued: Database span attribute
Now we have enough information to conclude what caused the latency spike in the checkout service: when users empty their cart, the frontend calls the checkout service, which calls the `EmptyCart` endpoint in the cart service, which runs a database query for each individual cart item, a classic n+1 scenario. The fix is simple: update the logic in this endpoint so it removes all the cart items in a single query.