Developing a mobile crash model for OpenTelemetry

This article was originally published on The New Stack.

OpenTelemetry (OTel) provides flexible, extensible, and vendor-neutral standards for instrumenting and monitoring applications. It’s completely changed the observability game within the last few years, prompting many solutions providers to participate in an ecosystem that encourages open standards.

So far, however, OpenTelemetry has largely focused on backend infrastructure monitoring.

With the growing importance of mobile as a means of transacting with businesses, plus users’ rising performance demands, it makes sense for mobile to become the next big frontier for OTel.

This is exactly what Embrace wanted to help with when we adopted the OTel standard and open sourced our software development kits (SDKs). We’ve long focused on providing hyper-specialized and ultra-granular means for collecting and analyzing the mobile observability signals that reveal the true user impact of app performance issues.

As part of this effort, we’ve been working with OpenTelemetry maintainers, contributors, and Special Interest Groups (SIGs) to develop standards for modeling mobile data within OpenTelemetry. Our latest project has been to adopt events, one of OTel’s emerging constructs, as a way to effectively model mobile crashes.

Modeling crashes as logs

Prior to the introduction of events, we had mapped out mobile crashes as regular LogRecords with a specific attribute, emb.type, to internally convey the schema. Each known value of emb.type maps to a well-known set of attributes, depending on the type of crash it is modeling. Unfortunately, no one outside Embrace knows this mapping, and even if we were to publicize it, it would be very much a solution specific to Embrace.
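To make the limitation concrete, here is a minimal sketch of how a private type attribute implies a schema. Only the emb.type key and the "sys.android.crash" value come from this article. The record shape (a plain dictionary), the REQUIRED_BY_TYPE table, and the attributes listed in it are hypothetical stand-ins, not Embrace's actual mapping.

```python
# Hypothetical private mapping from emb.type values to required attributes.
# Only "emb.type" and "sys.android.crash" appear in the article; the
# attribute sets below are illustrative assumptions.
REQUIRED_BY_TYPE = {
    "sys.android.crash": {"exception.type", "exception.stacktrace"},
}


def missing_attributes(log_record: dict) -> set:
    """Return the attributes a record lacks for its emb.type."""
    attributes = log_record.get("attributes", {})
    emb_type = attributes.get("emb.type")
    if emb_type not in REQUIRED_BY_TYPE:
        # A backend without the private table ends up here for every record.
        raise KeyError(f"unknown emb.type: {emb_type!r}")
    return REQUIRED_BY_TYPE[emb_type] - set(attributes)
```

A backend can only validate or interpret the record if it holds the same table, which is exactly what makes the approach vendor-specific.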

The lack of standardization for occurrences that are fairly standard in mobile meant the crashes we record are less portable, as no other backends understand our proprietary typing system. Having a common understanding and definition for mobile telemetry is key to solving this problem.

OpenTelemetry introduces the event data type

OpenTelemetry maintainers and contributors have been working on introducing the event data type for some time. Currently, it’s in an experimental state, which means that breaking changes are still allowed.

Events are the next evolution of structured logs in OTel. They are based on the LogRecord signal, so they share many of the same characteristics as their parent. The main difference is that events have a specific schema of both required and optional attributes that a LogRecord must or can have, respectively.

The schema is defined in the OTel Semantic Conventions. This allows backends to know what data they can expect in a particular LogRecord, and how to interpret the values of the expected attributes. Which schema applies is identified by the value of the required attribute, event.name, the existence of which qualifies a LogRecord as an event.

This event.name attribute functions in a very similar way to emb.type, except it is now part of the OpenTelemetry specification. Because of that status, all OpenTelemetry tooling will treat this as a first-class platform construct, so all backends supporting OpenTelemetry will be able to understand and use it.
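As a rough sketch of what this means for a consumer, the snippet below treats any record carrying event.name as an event and routes it by that name. It follows the description in this article, where event.name is an attribute. The dictionary record shape, the route helper, and the "app.crash" name are assumptions made for illustration, not part of the specification.

```python
def is_event(log_record: dict) -> bool:
    """A LogRecord qualifies as an event when it carries event.name."""
    return "event.name" in log_record.get("attributes", {})


def route(log_record: dict, handlers: dict):
    """Dispatch an event to the handler registered for its event.name.

    Returns None for plain log records and for events without a handler.
    """
    if not is_event(log_record):
        return None
    handler = handlers.get(log_record["attributes"]["event.name"])
    return handler(log_record) if handler else None


# "app.crash" is a hypothetical event name used only for this example.
handlers = {"app.crash": lambda record: "crash pipeline"}
```

Because the name is standardized rather than private, any backend can keep its own handler table without coordinating with the SDK vendor.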

Crashes are examples of events. Beyond crashes, other noteworthy happenings during the execution of mobile apps can be captured as events, including button clicks, session changes, or network changes. Anything that occurs at a point in time when a mobile app is active is eligible to be an event. So crashes are just the start.

Crashes as events

A crash in a mobile app is very much a “thing that happened at a point in time,” so using events to model them works well.

Because events are structured logs, the type of associated data that can be included with them is more useful, provides better context and is much easier to process. This makes OTel events ideal for use with an observability analysis platform, which does a lot of the heavy lifting in terms of producing aggregated metrics from disparate telemetry and providing visualizations like charts and dashboards.

It also makes crashes (as events) more useful when it comes to forwarding data from the SDK to external observability backends, because the structure of the payload is now well defined and part of OpenTelemetry.

Developing the model

As events become better established, more of them will be accepted into OpenTelemetry’s official semantic conventions. At that point, they will have a documented payload structure in the form of an event schema, semantics for the defined attributes, and stability and requirement levels.

Embrace has been working with the community to define mobile crashes as an event because we believe it is vital for mobile telemetry to have shared, vendor-agnostic definitions that exist and are usable within OpenTelemetry and its ecosystem.

Unique challenges with mobile crashes

The nature of mobile creates some unique challenges for developing a standard for mobile crashes as events in OTel.

Modeling the many flavors of crashes

The most challenging issue, in terms of data modeling, is that crashes come in many flavors.

Depending on the platform, the nature of the crash and the data source of the crash details, you can get vastly different information. Much of this information is not usable without additional data that may not be available to the instrumentation tooling (e.g., mapping files that deobfuscate stack traces) or decodable by the app (e.g., binary data).

Our proposed solution to this problem is adding an attribute to the schema that contains a data blob, along with another attribute that describes the relevant, unique combination of factors that affect how the crash data can be interpreted (e.g., Android crashes obtained from an UncaughtExceptionHandler implementation will be “android_jvm”).

Additional information, like encoding, is optionally included so backends can parse the blob. What the blob actually contains, however, will not be specified, as the structure can be complex and dynamic. Backends that understand specific types of crashes will need additional information to interpret the custom fields inside the blob, and that information is outside the scope of this event definition.
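The shape of this proposal can be sketched as follows. The "android_jvm" value comes from the description above. Every attribute key here ("crash.kind", "crash.blob", "crash.blob.encoding") and the "app.crash" event name are hypothetical placeholders, since the actual names are still being worked out, and base64 is just one assumed encoding.

```python
import base64


def build_crash_event(crash_kind: str, payload: bytes) -> dict:
    """Wrap opaque crash data in an event with a descriptor and an encoding hint."""
    return {
        "attributes": {
            "event.name": "app.crash",       # hypothetical event name
            "crash.kind": crash_kind,        # e.g. "android_jvm"
            "crash.blob": base64.b64encode(payload).decode("ascii"),
            "crash.blob.encoding": "base64", # optional hint for backends
        }
    }


def decode_blob(event: dict) -> bytes:
    """Recover the raw bytes; interpreting them is left to the backend."""
    attributes = event["attributes"]
    if attributes.get("crash.blob.encoding") != "base64":
        raise ValueError("unsupported blob encoding")
    return base64.b64decode(attributes["crash.blob"])
```

One event definition covers every flavor of crash: a backend that recognizes the descriptor can parse the blob, and one that does not can still store and forward the event intact.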

This proposal differs from our proprietary solution, which maps each unique combination to a specific emb.type attribute value. For instance, the emb.type for Android crashes obtained from an UncaughtExceptionHandler is “sys.android.crash.”

This solution doesn’t work for a more general approach to crashes, however, because it can lead to a proliferation of events that are all trying to model crashes with slightly different data. This inevitably leads to a lot of overlap between each event and its effort to model the crash, making it hard to keep consistent definitions as more and more crash types are modeled.

Processing delayed data

Another challenge is dealing with delayed data, as the app may not know that a crash has happened until the next time it is launched.

When a mobile app experiences a crash, it’s no longer able to send data to the server in real time. Not only that, but the logging of the crash by the SDK installed on the device may also be delayed. The SDK will have to wait until the user reopens the app in order to emit the data it captured. That could be seconds, hours or even days later, and if it is not reported correctly, the delay can distort the overall data timeline and the conclusions drawn from it.

We’ve had to be very explicit in dealing with this challenge when modeling the new crash event.

To do so, we specify that any fields on the event object should describe the state of the client at the time of the crash, not the time when the event is logged. This includes fields that are automatically captured as part of the event spec, like timestamp, as well as globally defined attributes via semantic convention like session ID. Therefore, when the backend sees a session ID or looks at the timestamp of the LogRecord, it will be based on the values at the time of the crash.
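A minimal sketch of that rule, assuming the SDK snapshots client state when the crash happens and builds the record on the next launch. The function names and the snapshot format are hypothetical, as is the "app.crash" event name. The time_unix_nano and observed_time_unix_nano fields mirror the two timestamps on an OTel LogRecord; using the observed timestamp for the delayed emission time is an assumption of this sketch, not something the crash event definition states.

```python
def capture_crash(session_id: str, now_ns: int) -> dict:
    """At crash time: snapshot the client state so it can be persisted."""
    return {"crash_time_unix_nano": now_ns, "session.id": session_id}


def emit_on_next_launch(pending: dict, now_ns: int) -> dict:
    """On the next launch: build the LogRecord from the persisted snapshot."""
    return {
        # When the crash happened, not when the record is being emitted.
        "time_unix_nano": pending["crash_time_unix_nano"],
        # When the record was actually produced, possibly days later.
        "observed_time_unix_nano": now_ns,
        "attributes": {
            "event.name": "app.crash",  # hypothetical event name
            # The session that crashed, not the session doing the emitting.
            "session.id": pending["session.id"],
        },
    }
```

With this split, a backend that orders records by timestamp places the crash inside the session where it occurred, regardless of how long the user waited before reopening the app.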

The ongoing process

Like many aspects of the OpenTelemetry initiative, the process for developing the mobile crash model is continuously evolving. If you’d like to learn more about it, including details on the schema being developed, head over to the docs section or feel free to follow along in the pull request (PR).

Additionally, check out some of the great ongoing work that’s being done by the OTel community to further define the measurement and modeling standards for mobile telemetry. If you’d like to learn more about Embrace, check out our open source SDKs or head over to our site.
