Solving Android app issues with OpenTelemetry: Beyond local profiling


11 December 2024 • 10 min read

As an Android developer, my first instinct for solving a bug, measuring performance, or improving the overall experience of an app is to test it and profile it locally. Tools like the Android Studio Profiler provide powerful capabilities to detect and address all kinds of performance issues, such as UI thread blocking, memory leaks, or excessive CPU usage.

While these local tools are indispensable, they do have limitations. Certain problems don’t show up in controlled environments with consistent network connectivity, predictable user behavior, and a limited range of test devices. In the real world, users interact with apps in unexpected ways, on diverse hardware, and under varying conditions, exposing issues that are difficult to replicate locally.

This is where OpenTelemetry comes in.

OpenTelemetry is a framework for collecting, processing, and exporting data about application performance. While relatively new to mobile, it’s become a fast-growing standard for backend performance management.

The benefits of using this framework for mobile are significant. OpenTelemetry enables developers to collect observability data from production environments, providing a window into the app’s real-world behavior.

Local profiling and production observability

Local profiling has its uses

Local profiling is invaluable for identifying issues that are reproducible in a controlled environment.

There are many common issues that can be detected and solved locally:

Main thread blockage: long-running tasks on the main thread can cause app freezes or Application Not Responding (ANR) errors, because that thread handles user interactions and renders the UI.
Memory leaks: objects that are no longer needed but are never released keep consuming memory, which can eventually cause out-of-memory (OOM) errors.
Capacity-related jank: when resources like the CPU or the GPU are overloaded, frames might not be rendered within their time budget, so the UI stutters.

These issues are quite straightforward to reproduce during testing, and local profiling tools are great for detecting and fixing them.

When production observability is needed

While local profiling covers a wide array of issues, not all problems are evident in a local setup. Observability in production is essential for diagnosing:

Unexpected user behavior: Users may upload massive files, perform rapid actions, or navigate the app in unplanned ways, exposing edge cases.
Device-specific crashes: Android’s diversity means issues can arise on specific devices or OS versions, often undetectable during local testing.
Low network connectivity: Real-world users often face slow or unreliable internet, causing timeouts or prolonged loading, which can be hard to emulate.

Production-ready observability tools like OpenTelemetry are essential for uncovering and resolving these challenges.

OpenTelemetry in Android

OpenTelemetry is a powerful observability framework that helps developers collect, process, and export telemetry data like traces, metrics, and logs.

There are many advantages to using OpenTelemetry for performance monitoring instead of proprietary tools. SDKs built on OpenTelemetry are very flexible, allowing engineers to easily extend their instrumentation to third-party libraries. As an open source, widely adopted framework, OpenTelemetry also lets organizations avoid vendor lock-in and keep more control over their own data.

By integrating OpenTelemetry into your Android app, you can track the performance of individual operations, identify bottlenecks, and gain insights into how your app performs under various real-world conditions. Let’s walk through how to do this.

Initial integration and set-up

To add the OpenTelemetry SDK to your app, you can include the OTel bill of materials along with some necessary dependencies, like this:

# libs.versions.toml
[versions]
opentelemetry-bom = "1.44.1"
opentelemetry-semconv = "1.28.0-alpha"

[libraries]
opentelemetry-bom = { group = "io.opentelemetry", name = "opentelemetry-bom", version.ref = "opentelemetry-bom" }
opentelemetry-api = { group = "io.opentelemetry", name = "opentelemetry-api" }
opentelemetry-context = { group = "io.opentelemetry", name = "opentelemetry-context" }
opentelemetry-exporter-otlp = { group = "io.opentelemetry", name = "opentelemetry-exporter-otlp" }
opentelemetry-exporter-logging = { group = "io.opentelemetry", name = "opentelemetry-exporter-logging" }
opentelemetry-extension-kotlin = { group = "io.opentelemetry", name = "opentelemetry-extension-kotlin" }
opentelemetry-sdk = { group = "io.opentelemetry", name = "opentelemetry-sdk" }
opentelemetry-semconv = { group = "io.opentelemetry.semconv", name = "opentelemetry-semconv", version.ref = "opentelemetry-semconv" }
opentelemetry-semconv-incubating = { group = "io.opentelemetry.semconv", name = "opentelemetry-semconv-incubating", version.ref = "opentelemetry-semconv" }

// build.gradle.kts
dependencies {
    // The BOM keeps the versions of the core OpenTelemetry artifacts aligned.
    implementation(platform(libs.opentelemetry.bom))
    implementation(libs.opentelemetry.api)
    implementation(libs.opentelemetry.context)
    implementation(libs.opentelemetry.exporter.otlp)
    implementation(libs.opentelemetry.exporter.logging)
    implementation(libs.opentelemetry.extension.kotlin)
    implementation(libs.opentelemetry.sdk)
    implementation(libs.opentelemetry.semconv)
    implementation(libs.opentelemetry.semconv.incubating)
}

Then, we can create an OpenTelemetry instance that acts as a central configuration point, managing the tracer provider, resources, and exporters.

A tracer provider creates and manages tracers, which in turn generate spans. A resource contains metadata about the app and is attached to every span, helping to contextualize telemetry data. An exporter defines where the telemetry data will be sent, such as a backend observability platform or a local file for inspection.

// Resources that will be attached to telemetry to provide better context.
// This is a good place to add information about the app, device, and OS.
val resource = Resource.getDefault().toBuilder()
    .put(ServiceAttributes.SERVICE_NAME, "[app name]")
    .put(DeviceIncubatingAttributes.DEVICE_MODEL_NAME, Build.MODEL)
    .put(OsIncubatingAttributes.OS_VERSION, Build.VERSION.RELEASE)
    .build()

// The tracer provider will create spans and export them to the configured span processors.
// For now, we will use a simple span processor that logs the spans to the console.
val sdkTracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(SimpleSpanProcessor.create(LoggingSpanExporter.create()))
    .setResource(resource)
    .build()
    
// The OpenTelemetry SDK is the entry point to the OpenTelemetry API. It is used to create spans, metrics, and other telemetry data.
// Create it and register it as the global instance.
val openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(sdkTracerProvider)
    .buildAndRegisterGlobal()

Once everything is initialized, we can get a tracer with openTelemetry.sdkTracerProvider.get() and use it to create spans.

A trace represents a single operation or workflow within a distributed system. For Android apps, it could capture the entire journey of a user request or an action through the app. Within this journey, a span represents an individual unit of work, such as a network request, database query, or UI rendering task, providing detailed information about its duration and context. Here’s how it looks in code:

val tracer = openTelemetry.sdkTracerProvider.get("testAppTracer")
val span = tracer.spanBuilder("someUserAction").startSpan()

try {
    someAction()
} catch (e: Exception) {
    // Attach the exception to the span and mark the span as failed.
    span.recordException(e)
    span.setStatus(StatusCode.ERROR)
} finally {
    // Always end the span, otherwise it is never exported.
    span.end()
}
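
Because the same try/catch/finally pattern repeats for every span, it can be convenient to wrap it in a small helper. The following is only a sketch under a few assumptions: withSpan is a hypothetical extension function, not part of the OpenTelemetry API, and unlike the snippet above it rethrows the exception after recording it so that callers still see the failure. It also makes the span current while the block runs, so spans started inside the block become its children.

```kotlin
import io.opentelemetry.api.trace.Span
import io.opentelemetry.api.trace.StatusCode
import io.opentelemetry.api.trace.Tracer

/**
 * Hypothetical helper (not part of the OpenTelemetry API): runs [block] inside a span
 * called [name], records any exception on the span, and always ends the span.
 */
inline fun <T> Tracer.withSpan(name: String, block: (Span) -> T): T {
    val span = spanBuilder(name).startSpan()
    try {
        // makeCurrent() returns a Scope that must be closed; use { } closes it for us.
        return span.makeCurrent().use { block(span) }
    } catch (e: Throwable) {
        span.recordException(e)
        span.setStatus(StatusCode.ERROR)
        throw e
    } finally {
        span.end()
    }
}
```

With this in place, the earlier example could shrink to something like tracer.withSpan("someUserAction") { someAction() }.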

Solving problems with OpenTelemetry

Now that we understand how to set up an OpenTelemetry instance in our Android app, let’s look at some common types of issues and how this framework actually helps us track them.

Network latency issues

Network performance is one of the most unpredictable factors in a production environment. While local testing occurs under stable, high-speed conditions, real-world users face diverse scenarios. They might encounter intermittent mobile connections, unreliable public Wi-Fi, or backend delays during periods of heavy traffic. These challenges can lead to long request times, failed operations, or even app abandonment.

With OpenTelemetry, you can instrument network requests to measure their durations and identify bottlenecks. By tagging spans with metadata like endpoint URLs, request sizes, or response statuses, you can analyze trends such as:

Endpoints causing the longest delays: Identify APIs that consistently perform poorly and prioritize their optimization.
Impact of network conditions on user experience: Correlate high-latency spans with user drop-offs to measure the effect of slow responses.
Response time variability by region: Understand how performance differs geographically and tailor improvements for the most affected areas.

Let’s take a look at an example.

Suppose we have an endpoint where users upload images to a server. Network performance might vary based on the image size, user location, or connectivity type. By instrumenting the network request using OpenTelemetry, we can capture relevant metadata and analyze trends, such as whether larger images or specific regions are associated with longer upload times. Here’s how we can instrument this scenario:

 
fun uploadImage(image: ByteArray, networkType: String, region: String) {
    val span = tracer.spanBuilder("imageUpload")
        .setAttribute(HttpIncubatingAttributes.HTTP_REQUEST_SIZE, image.size.toLong())
        .setAttribute(NetworkIncubatingAttributes.NETWORK_CONNECTION_TYPE, networkType)
        .setAttribute("region", region)
        .startSpan()
    try {
        doNetworkRequest()
    } catch (e: Exception) {
        span.recordException(e)
        span.setStatus(StatusCode.ERROR)
    } finally {
        span.end()
    }
}
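
The function above records request metadata but not the response status mentioned earlier. Here is a hedged sketch of one way to add it. The names uploadImageTraced and request are hypothetical, with request standing in for the app's real networking call and returning the HTTP status code. The attribute keys are written as plain strings here; treating any status of 400 or above as an error is an assumption you may want to adjust.

```kotlin
import io.opentelemetry.api.trace.StatusCode
import io.opentelemetry.api.trace.Tracer

/**
 * Runs [request] inside an "imageUpload" span and records the HTTP status code it returns.
 * [request] is a hypothetical stand-in for the app's real upload call.
 */
fun uploadImageTraced(tracer: Tracer, image: ByteArray, request: (ByteArray) -> Int): Int {
    val span = tracer.spanBuilder("imageUpload")
        .setAttribute("http.request.size", image.size.toLong())
        .startSpan()
    try {
        val statusCode = request(image)
        span.setAttribute("http.response.status_code", statusCode.toLong())
        // Assumption: any 4xx or 5xx response counts as a failed upload.
        if (statusCode >= 400) {
            span.setStatus(StatusCode.ERROR, "HTTP $statusCode")
        }
        return statusCode
    } catch (e: Exception) {
        // Network failures (timeouts, dropped connections) end up here.
        span.recordException(e)
        span.setStatus(StatusCode.ERROR)
        throw e
    } finally {
        span.end()
    }
}
```

Recording the status code alongside the request size makes it possible to tell slow but successful uploads apart from uploads that failed outright.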

OS version or device-specific issues

Android’s ecosystem is vast, with apps running on a wide variety of devices, OS versions, and hardware configurations. This diversity makes it challenging to ensure a consistent user experience across all devices. Certain crashes or bugs may surface only on specific devices or under particular conditions, making them hard to find in a controlled testing environment.

With OpenTelemetry, you can capture device-specific metadata in a centralized way by adding it to the resource configuration during the OpenTelemetry setup. Because the resource is shared, this contextual information is automatically and consistently attached to spans, logs, and metrics.

By analyzing this metadata, you can uncover trends like:

Frequent crashes on certain device models: users on older or budget devices can encounter crashes due to insufficient resources, and detecting this pattern might allow optimizing memory usage or offering a lighter version of the app.
Behavioral changes across Android versions: certain crashes may occur only on specific OS versions due to changes in Android APIs, stricter permission requirements, or bugs introduced in updates. With this data, you can prioritize compatibility fixes or update your app’s dependencies to avoid deprecated API usage.
Hardware-specific rendering issues: Some devices may have unique graphics drivers or hardware quirks that cause rendering issues like visual glitches or artifacts in the UI. For example, a custom animation might behave unexpectedly on devices with non-standard screen resolutions or refresh rates. Metadata about screen specs or GPU details can help pinpoint and address these inconsistencies.

Let’s see how to set this up:

// Add some useful attributes to the Resource object.
val resource = Resource.getDefault().toBuilder()
    .put("device.model", Build.MODEL)
    .put("device.manufacturer", Build.MANUFACTURER)
    .put("os.version", Build.VERSION.SDK_INT.toString())
    .put("screen.resolution", getResolution()) // getResolution() is an app-defined helper, not part of OpenTelemetry
    .build()

// Use the resource object to build the tracer, logs and other telemetry providers
val sdkTracerProvider = SdkTracerProvider.builder()
    .setResource(resource)
    .build()
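
The snippet above only wires the resource into the tracer provider. As a rough sketch of how the same resource could be shared with the metrics and logs providers, assuming the opentelemetry-sdk dependency from earlier: buildSdkWithSharedResource is a hypothetical function name, and span processors, metric readers, and log processors are left out to keep it short, so nothing is exported by this code on its own.

```kotlin
import io.opentelemetry.sdk.OpenTelemetrySdk
import io.opentelemetry.sdk.logs.SdkLoggerProvider
import io.opentelemetry.sdk.metrics.SdkMeterProvider
import io.opentelemetry.sdk.resources.Resource
import io.opentelemetry.sdk.trace.SdkTracerProvider

/**
 * Hypothetical helper: builds an SDK whose traces, metrics, and logs all carry the same
 * [resource]. Processors, readers, and exporters are intentionally omitted in this sketch.
 */
fun buildSdkWithSharedResource(resource: Resource): OpenTelemetrySdk {
    val tracerProvider = SdkTracerProvider.builder()
        .setResource(resource)
        .build()
    val meterProvider = SdkMeterProvider.builder()
        .setResource(resource)
        .build()
    val loggerProvider = SdkLoggerProvider.builder()
        .setResource(resource)
        .build()

    return OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .setMeterProvider(meterProvider)
        .setLoggerProvider(loggerProvider)
        .build()
}
```

Building the resource once and passing it to every provider keeps device and OS attributes identical across all three signals.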

Unexpected user behavior

Real users often interact with apps in unexpected ways. This unpredictability can lead to performance issues, crashes, or unoptimized user experiences that aren’t caught in local testing.

For example, users might upload files much larger than anticipated, causing memory or performance bottlenecks. Others might repeatedly perform actions in rapid succession, like submitting forms or refreshing pages, leading to race conditions or server overload. Some users might navigate through the app in untested sequences, triggering unexpected states or errors.

By leveraging OpenTelemetry to instrument user interactions, you can capture and analyze spans that detail how users actually use your app. This data provides invaluable insights into unexpected patterns, allowing you to:

Detect resource-intensive actions: Track spans representing operations like image uploads, database queries, or API calls to identify scenarios where excessive usage impacts performance.
Uncover uncommon navigation paths: By monitoring user navigation flows, you can discover sequences that frequently result in errors or crashes, helping you prioritize fixes for real-world issues.
Identify high-demand features: Analyze spans to see which actions or features are used most often, even if they weren’t part of your initial test cases. This can guide both optimization efforts and feature prioritization.

Let’s consider a scenario where users frequently navigate back and forth between two screens (e.g., a product listing and a product details page) in rapid succession. While this behavior may seem harmless, it could inadvertently cause resource leaks or worsen the rendering performance.

By tagging spans with navigation metadata like the screen name, a timestamp, and some other user interactions, you can analyze patterns in navigation behaviors:

Users may be toggling between screens at an unexpectedly high frequency, highlighting a need for caching or lazy loading mechanisms to reduce resource strain.
A particular screen might consistently produce errors, revealing edge cases or bottlenecks in its rendering logic.
Insights into navigation sequences can help refine user experience flows, making the app more intuitive for common behaviors while handling edge cases more gracefully.
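
To make this concrete, here is a minimal sketch of how screen navigation could be turned into spans. Everything app-specific is an assumption: ScreenTracker is a hypothetical class, the "screenView" span name and the "screen.name" and "screen.previous" attribute keys are illustrative choices, and the class expects to be called from a single thread (for example, the main thread). Span start and end times are recorded by OpenTelemetry itself, so no explicit timestamp attribute is needed.

```kotlin
import io.opentelemetry.api.trace.Span
import io.opentelemetry.api.trace.Tracer

/**
 * Hypothetical helper: keeps one open span per visible screen, so each span's duration
 * is the time the user spent on that screen. Not thread-safe.
 */
class ScreenTracker(private val tracer: Tracer) {
    private var currentSpan: Span? = null

    var currentScreen: String? = null
        private set

    /** Call whenever a new screen becomes visible. */
    fun onScreenShown(screenName: String) {
        // Leaving the previous screen ends its span.
        currentSpan?.end()

        val builder = tracer.spanBuilder("screenView")
            .setAttribute("screen.name", screenName)
        // Recording the previous screen makes navigation sequences queryable later.
        currentScreen?.let { previous -> builder.setAttribute("screen.previous", previous) }

        currentSpan = builder.startSpan()
        currentScreen = screenName
    }

    /** Call when the app goes to the background so the last span is not left open. */
    fun onAppBackgrounded() {
        currentSpan?.end()
        currentSpan = null
        currentScreen = null
    }
}
```

A user toggling rapidly between the listing and details screens would then show up as many short "screenView" spans alternating between the same two screen names.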

This ability to uncover and address unexpected user behavior ensures your app remains reliable and performant, even under unconventional usage scenarios.

Next steps: forward your data where you want to analyze it

As we’ve discussed, instrumenting your Android app with OpenTelemetry is incredibly helpful for monitoring and understanding common performance issues.

Once you’ve started collecting data, however, you’ll need to set up a place for it to go. One of the great things about OpenTelemetry as a framework is that there are many, many observability tools that support ingesting this type of data. You may choose to forward it to a vendor-specific backend or to any number of open source tools, like Jaeger for spans or Loki for logs.

To forward OpenTelemetry data from an SDK, you add one or more exporters, which give the data a destination once it has been generated.

An exporter is the component that connects the SDK capturing the data to whatever receives it, such as an external OpenTelemetry Collector or an observability backend. Exporters are designed with the OpenTelemetry data model in mind, emitting OpenTelemetry data without any loss of information. Many language-specific exporters are available.

The OpenTelemetry Collector is a vendor-agnostic way to receive, process, and export telemetry data. It is not always necessary, as you can send data directly to your backend of choice via the exporter. However, running a collector is good practice if you’re managing multiple sources of data ingest and sending to multiple observability backends. It allows your service to offload data quickly, and the collector can take care of additional handling like retries, batching, encryption, or even sensitive data filtering.
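
As a rough illustration, the logging exporter from the setup section could be swapped for an OTLP exporter from the opentelemetry-exporter-otlp dependency added earlier. This is a sketch under assumptions: buildExportingTracerProvider is a hypothetical function name, and the endpoint is a placeholder you would replace with the address of your own collector or backend. A batch span processor is used here instead of the simple one so that spans are sent in groups rather than one at a time.

```kotlin
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter
import io.opentelemetry.sdk.resources.Resource
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor

/**
 * Hypothetical helper: builds a tracer provider that sends spans over OTLP/HTTP.
 * [endpoint] must be the full traces URL, e.g. "https://collector.example.com/v1/traces"
 * (a placeholder address, not a real service).
 */
fun buildExportingTracerProvider(endpoint: String, resource: Resource): SdkTracerProvider {
    val exporter = OtlpHttpSpanExporter.builder()
        .setEndpoint(endpoint)
        .build()

    return SdkTracerProvider.builder()
        // Batching buffers finished spans and exports them together in the background.
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .setResource(resource)
        .build()
}
```

The resulting provider can be passed to OpenTelemetrySdk.builder().setTracerProvider(...) in the same way as the logging-based provider shown earlier.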

Embrace has a tutorial walk-through on how to set up an OpenTelemetry exporter for Android, if using our SDK. You can also use the basic OpenTelemetry Android SDK, which is built on top of the Java SDK, and which has its own exporter resources.

Final takeaways

In a world where apps run on countless devices, in varied environments, and are used by diverse users, achieving optimal performance and reliability requires more than just local testing. While tools like Android Studio Profiler excel at addressing issues reproducible in controlled environments, production observability fills the gap for uncovering real-world problems that only surface under specific conditions.

OpenTelemetry provides a robust framework for collecting and analyzing telemetry data, giving developers the insights they need to understand and optimize their apps in production. By instrumenting spans and attaching meaningful metadata, you can pinpoint bottlenecks, diagnose device- or OS-specific issues, and uncover unexpected user behaviors that impact the app’s performance or user experience.

Interested in getting started for yourself? Check out the OpenTelemetry SDK for Android. Or, for more advanced monitoring, you can use Embrace’s open source Android SDK. It’s built on OpenTelemetry and uses the same data conventions, but has added functionality for tracking complex Android issues, like ANRs, in a way that is OpenTelemetry-compliant.


Author

Francisco Prieto Cardelle
