Why user-focused observability helps mobile teams resolve issues faster

Mobile applications present unique observability challenges that general-purpose platforms weren’t designed to handle. While tools like Grafana offer powerful capabilities for backend systems, they aren’t quite as useful when it comes to debugging mobile-specific issues. This post examines a real-world troubleshooting scenario to illustrate the fundamental differences between traditional observability approaches and user-centric solutions.

The challenge: failed checkout transactions

Your support team is receiving multiple reports: “I can’t complete my purchase – the app does nothing when I tap ‘Buy Now’.” As the observability engineer, you need to understand what’s causing these checkout failures.
The root cause? A 15-minute token is expiring during a 30-second checkout flow. The app doesn’t notice, sends a request with an expired token, and gets rejected, causing the flow to break without explanation. Now let’s see how you would find this out with a traditional observability setup versus a user-centric solution.
Troubleshooting with Grafana + OpenTelemetry SDK data
As an observability engineer, you begin your investigation using familiar tools and workflows. Your open source observability setup includes a mobile app and backend services instrumented with OpenTelemetry SDKs, with logs, metrics, and traces exported to Grafana for monitoring and analysis.
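For reference, client-side instrumentation in a setup like this is typically initialized along the following lines. This is a minimal sketch assuming the OpenTelemetry SDK for the JVM and an OTLP collector endpoint; it is not the exact configuration of the app in question, and the endpoint value is an assumption.

import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
import io.opentelemetry.sdk.OpenTelemetrySdk
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor

// Build an OpenTelemetry instance that batches spans and ships them over OTLP;
// a collector then forwards the data to Tempo/Grafana for analysis.
fun buildOpenTelemetry(collectorEndpoint: String): OpenTelemetry {
    val spanExporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint(collectorEndpoint) // e.g. "http://otel-collector:4317" (assumed endpoint)
        .build()

    val tracerProvider = SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
        .build()

    return OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .build()
}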
Step 1: Check aggregate metrics
You begin by checking your Grafana dashboards for checkout conversion metrics and error rates, but even identifying the right dashboards can be a challenge. You search for anomalies in metrics like payment_flow_duration_seconds, network_error_rate, crash_rate, and anr_rate, hoping to find clues. But these signals are often indirect, at best.
For instance, a spike in payment_flow_duration_seconds might suggest authentication timeouts, but it could just as easily be caused by network latency or backend slowness. A rise in the network_error_rate metric might include the 401 errors triggered by the expired tokens, but there’s no clear way to separate those from general connectivity or login issues. Even if crash or ANR rates increase due to token expiration, these symptoms are too broad to pinpoint a specific root cause.
Ultimately, you’re left trying to interpret high-level performance signals that are several steps removed from the actual issue. And in many cases with an issue like token expiration, there may be no noticeable spike in the metrics at all.
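To see why these aggregate signals sit so far from the root cause, consider how a metric like payment_flow_duration_seconds is typically recorded on the client. The sketch below is an assumption about how an app like this might use the OpenTelemetry metrics API; the metric and attribute names are illustrative:

import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.api.common.Attributes

// Record how long a checkout flow took and whether it succeeded. Nothing here
// captures *why* a slow or failed flow happened, which is why the aggregate
// view alone can't point at an expired token.
private val meter = GlobalOpenTelemetry.getMeter("checkout")
private val paymentFlowDuration = meter.histogramBuilder("payment_flow_duration_seconds")
    .setUnit("s")
    .build()

fun recordCheckout(durationSeconds: Double, succeeded: Boolean) {
    paymentFlowDuration.record(
        durationSeconds,
        Attributes.of(AttributeKey.booleanKey("checkout.success"), succeeded)
    )
}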
Step 2: Analyze distributed traces
Next, you turn to your tracing tool, Tempo or Jaeger, and search for traces related to the checkout flow, hoping to uncover insights from the Purchase trace. You spot several failed payment attempts marked with generic error indicators. Occasionally, a 401 Unauthorized response appears in the span attributes, but it’s buried in metadata and not immediately clear in the broader context. More importantly, a single trace doesn’t show that the token started out valid and gradually expired over time. Without seeing the full timeline, it’s easy to misinterpret the 401 as a straightforward login error, rather than the result of a slowly expiring token in a purchase session. The trace tells you that authentication failed, but it doesn’t explain why: you’re left without the context that connects session age, token expiration, and the failed checkout.
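As a rough sketch of where that buried signal comes from, the client-side payment span might be recorded roughly like this; the span name and attribute follow common OpenTelemetry conventions and are assumptions rather than the app’s actual instrumentation:

import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.StatusCode

private val tracer = GlobalOpenTelemetry.getTracer("checkout")

fun recordPaymentAttempt(responseStatusCode: Int) {
    val span = tracer.spanBuilder("POST /payments").startSpan()
    try {
        // ... the request is sent here; the server rejects it because the token expired ...
        span.setAttribute("http.response.status_code", responseStatusCode.toLong())
        if (responseStatusCode >= 400) {
            // The span records *that* the call failed, not that a once-valid token went stale.
            span.setStatus(StatusCode.ERROR, "payment request rejected")
        }
    } finally {
        span.end()
    }
}

The span can faithfully carry the 401, but it has no notion of when the token was issued or how old the session is.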
Step 3: Dive into application logs
With the 401s from the checkout traces as your lead — assuming you were able to spot them buried in span attributes and recognize their significance — you pivot to your logging dashboard. You begin searching for authentication-related log entries that occurred around the same time as the failed checkout spans, using queries like:
{service="payment-api"} |= "401" |= "unauthorized"
You then add filters to try to isolate just the error-level log entries.
This is where things get time-consuming. The log volume is high, and the mobile logs often lack consistent user identifiers, so you can’t easily correlate a specific user’s checkout failure with a specific expired token event. You manually compare timestamps between logs and traces to piece together the sequence of events. Eventually, you notice that some 401 checkout errors align closely with prior “token expired” log entries, suggesting that the token expired during checkout.
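One common workaround, sketched below purely as an assumption about what a team could add rather than what this setup already does, is to stamp every checkout span (and the related log lines) with a stable session identifier so failures can be joined back to earlier token events:

import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.api.trace.Span

// Hypothetical attribute; the key name is illustrative.
private val SESSION_ID: AttributeKey<String> = AttributeKey.stringKey("app.session.id")

fun startCheckoutSpan(sessionId: String): Span =
    GlobalOpenTelemetry.getTracer("checkout")
        .spanBuilder("checkout")
        .setAttribute(SESSION_ID, sessionId)
        .startSpan()

Even with that in place, you would still be assembling the session picture by hand across three different views.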
This entire investigative path relies on a series of well-informed discoveries: first, identifying the 401 in trace metadata, then understanding it points to an authentication failure rather than a generic permission issue, and finally knowing to correlate it with token expiration logs.
Why this is so hard
The problem isn’t just technical; it’s systemic. Traces, logs, and metrics each give you fragments of the story, but none of them offer a complete, user-centric view. There’s no obvious way to trace a session’s health from login to checkout, or to see how the session aged and when the token expired. You can’t see that the token expired as the user tapped “Buy Now” unless you manually stitch everything together. You’re left guessing, inferring, and hoping you’ve interpreted the signals correctly.
Time to resolution: Often hours, sometimes days depending on how quickly you can correlate everything manually.
Troubleshooting with Embrace
As a mobile observability engineer, you take a fundamentally different approach focused on user session analysis:
Step 1: Locate the affected user session
You start by searching for one of the users who reported checkout failures and open their complete session timeline. Immediately, you can see the user’s entire journey: app launch, browsing products, adding items to cart, and finally the checkout attempt that ends in the error.
Alternatively, with Embrace you can also start at a high level, viewing aggregate performance metrics for the checkout flow, and immediately drill down into a specific user session that exemplifies the issue. For instance, if you notice increased checkout latency across your user base in a performance dashboard, you can jump straight into an affected user session to investigate further.
Step 2: Examine the User Timeline for the affected user
Within the session view, you navigate to the checkout attempt and observe the sequence of events as they unfolded. You can see the user tap “Buy Now,” which triggers the payment API call initialization. At the exact same moment, you notice the authentication token expiration event appearing in the integrated logs on the timeline: the token expires literally as the payment request is being constructed. The failing network call and the token expiration log entry are displayed together in chronological order, making the causal relationship immediately apparent.
Step 3: Analyze the failure chain with full context
The integrated timeline shows you the complete failure sequence: The app constructs the payment request with what it believes is a valid token, but by the time the request reaches the payment service, the token has expired. The payment service rightfully rejects the request with a 401 error, causing the checkout to fail. You can also see device context like network conditions and app state, confirming this isn’t related to connectivity issues.
Root cause identified: The mobile app’s token lifecycle management has a race condition. The token becomes invalid just after the user starts checkout, but before the payment request reaches the server. The app doesn’t catch this in time, so it sends an expired token and the request fails.
Time to resolution: Minutes, with complete confidence in the diagnosis and clear direction for the fix (implementing token validation before payment API calls or extending token lifetime for checkout flows).
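As a minimal sketch of the first fix direction, validating and refreshing the token immediately before the payment call, something like the following would close the race. The token model, helper names, and 60-second skew are assumptions for illustration, not the app’s actual code:

import java.time.Duration
import java.time.Instant

// Hypothetical token model and refresh hook; names are illustrative.
data class AuthToken(val value: String, val expiresAt: Instant)

interface TokenProvider {
    fun current(): AuthToken
    fun refresh(): AuthToken
}

// Refresh proactively when the token would expire within the next minute,
// so a token minted ~15 minutes ago can no longer lapse mid-checkout.
fun tokenForPayment(provider: TokenProvider, skew: Duration = Duration.ofSeconds(60)): AuthToken {
    val token = provider.current()
    return if (token.expiresAt.isBefore(Instant.now().plus(skew))) {
        provider.refresh()
    } else {
        token
    }
}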
The technical difference: context vs. correlation
The fundamental distinction isn’t about data collection; both approaches can gather the same telemetry. The difference lies in how that data is presented and contextualized.
General platforms: metrics, logs, and traces in isolation
General platforms organize logs into authentication events and error messages, while traces capture API call performance and dependencies. Metrics provide aggregate performance indicators, but engineers must manually reconstruct user sessions by connecting these separate data sources across different views.
User-focused observability: a user session-based view
Mobile-specific platforms present a timeline view of the complete user session from launch to completion, with contextual integration that unifies all data types by user session. The platform automatically handles correlation between related events while integrating mobile-specific context like device state, network conditions, and app lifecycle.
Making the right technical choice
This comparison is about selecting the right tool for the specific technical challenge. General observability platforms like Grafana provide excellent infrastructure monitoring, but mobile and web applications benefit from tools designed specifically to measure performance and reliability from the user’s perspective.
The authentication token example demonstrates a broader principle: The toughest mobile and web issues can rarely be identified from a single technical indicator such as a metric, a log, or a trace. They’re user experience problems that require understanding the complete context of how applications behave in real-world environments.
For development teams, the ability to see complete user sessions in a unified timeline doesn’t just improve developer experience during issue investigations; it fundamentally changes how quickly and accurately they can identify and resolve issues that directly impact user experience.
If you’d like to learn more about user-focused observability with Embrace, you can start a free trial of our platform or request a custom demo for your team.
Get started today with 1 million free user sessions.