This year’s Mobile DevOps Summit (presented by Bitrise) was an event aimed at empowering developers, DevOps engineers, and teams to streamline their CI/CD processes. Through a series of great sessions, attendees learned about best practices, advanced techniques, and the latest innovations in the field of CI/CD. Craig Hawco, a Senior Software Engineer at Embrace, led a session called “Solve hard-to-fix crashes in minutes with Bitrise & Embrace”. In this session, you’ll learn…
- How to use the Embrace platform to get to the root cause of your crashes with speed and precision.
- How customers have successfully reduced their crash rates with Embrace.
- How our new Bitrise integration provides a seamless way to upload symbols as a CI/CD step.
Quick note: Embrace has a new integration with Bitrise, so mobile engineers can upload their symbol files during the CI/CD process and get a more seamless, secure, and reliable experience.
Here are a few previews of the insights Craig shared:
Why Embrace is different from other mobile monitoring approaches:
“You’ve probably used a bunch of different monitoring tools in the past… they’re fairly thin on the feature set. They show you what’s going on, but they don’t provide you with deep insights. Embrace takes a different approach. We’re very deep on the mobile side. We care about sessions, what happens, and what a user is doing before a crash happens to give you the most data to solve a crash…”
How Embrace provides complete visibility:
“We are the only solution that collects all the details from 100% of user sessions. We don’t down sample. We don’t compress things. We keep everything that happens and we let you search it. While most apps give you breadcrumbs and require a lot of instrumentation to get those breadcrumbs in place, [with Embrace] you instrument the SDK and 90% of that data is already collected.”
How Embrace helps adidas Runtastic:
“In the end, Paul (the Lead iOS Engineer at Adidas Runtastic) told us they were able to ship fixes twice as fast with Embrace.”
How Embrace helps GOAT:
“[GOAT] was able to report that on Black Friday they were seeing about a 44% increase in traffic, but they were remaining 99.99% crash free. That was thanks to the insights that we [Embrace] provide…”
Presented by Craig Hawco.
Video length: 25:04
Date premiered: Oct. 4, 2023
Craig Hawco: Hello and welcome to today’s talk, “Solve hard-to-fix crashes in minutes with Bitrise & Embrace.”
So my name is Craig Hawco. I’m a Senior Software Engineer at Embrace. I started working in mobile back in 2009 when it was a very, very different space. I saw a lot of things like BlackBerry and Symbian back then that we don’t really see anymore.
But since about 2012, I’ve been mostly working on the monitoring and analytics side. So just some… a little bit about some of my hobbies. I’m a very novice sailor. Last year I kind of took lessons and I learned. This past summer I’ve been spending a lot of time learning how to be a better sailor and how to race and those types of things. So it’s been a very interesting year.
So what are we going to do today? So we’re gonna start off with an introduction to Embrace. Kind of who we are, what we do. Those types of things.
Then we’re going to move into learning how to solve hard-to-find, hard-to-fix crashes with Embrace. So, you know, what can Embrace do to help you figure out what’s going on with a crash? Not just like a plain stack trace, but, you know, some extra detail that’s super helpful for figuring that stuff out.
And then at the end, we’re going to have a quick plug of our Bitrise integration, how you can, you know, plug our step into your Bitrise workflow and get your debug symbol information integrated with Embrace.
So what is Embrace? In short, we help engineers build better mobile experiences. You’ve probably used a bunch of different monitoring tools in the past that, you know, you integrate them, they give you some data, you get some plots back, you get some numbers… Maybe a stack trace and some breadcrumbs. But they’re fairly thin on the feature set.
They, you know, show you what’s going on, but they don’t really provide you with deep insights. Embrace takes a kind of a bit of a different approach where we’re very, very deep on the mobile side.
We care about sessions and what happens and what a user is doing and all of those things that happen before a crash happens to give you the most context possible. That gives you the best chance of solving a crash. Where, you know, with a lot of other tools you might just see a few breadcrumbs and the stack trace, like I said before, we go more in-depth. Maybe we’ll talk a little bit about some of the things our customers have accomplished with Embrace.
So Paul at Runtastic was having that classic issue with visibility and monitoring. Their existing monitoring solution required a lot of instrumentation. So when a crash would happen, they’d figure out that they didn’t… They couldn’t really see what was going on, so they’d go in and they’d add some instrumentation. Then they’d ship a release and then wait for the data to come in, and then they could see what was causing the crash and then they could fix the crash.
That whole workflow is not at all how Embrace works. So with Embrace, when you integrate the SDK, you get about 90% of the value just straight out of the box. So you just integrate with a few lines and that’s it. So if you want to add additional instrumentation for certain things, you know, we have facilities for that. But for the most part you get the vast majority of the value without doing any work at all.
So one thing Paul told us was that in the end they’re able to ship fixes twice as fast as they were previously by using Embrace.
I don’t know if you’ve heard of GOAT, but GOAT is the leading and trusted sneaker marketplace. And so like many ecommerce apps, they depend very, very heavily on holiday revenue. It makes up a large portion of their total revenue for the year. And so that means that things like Black Friday are super, super important for them as an event.
So GOAT was using Embrace to refine their app and make sure there were no issues with it leading up to Black Friday. And then they were able to report that on the actual day of Black Friday, they were seeing about a 44% increase in traffic, but they remained 99.99% crash free. And that was thanks to the insights that we provide and the real-time monitoring that we can do, on top of all the other stuff I talked about.
And next, we’re going to talk about games. So we have lots of gaming customers.
You guys probably know that discoverability in the App Store is like a huge factor in the success of an app, but that’s even way more true for a game.
If you didn’t know, having a high rate of Application Not Responding errors, or ANRs, is a signal that’s used in the Google Play Store, too, to figure out where to rank you. So if you have a lot of ANRs, you will get pushed down in the rankings and it’ll be hard for people to find your app.
And so Lucas at Wildlife was able to use Embrace to find those ANRs, figure out what was going on, fix them, and that resulted in them being featured way more frequently in the App Store, which, you know, is the way to win.
So we talked a little bit about what our customers do. Let’s talk a little bit more about what Embrace does. So we’ll see what we actually do.
So we are the only solution that collects all of the details from 100% of the user sessions. We don’t down sample. We don’t compress things. And in live data, we keep everything that happens and let you search it. You can figure out what’s going on. You can dive into any session that’s happened on your app within the last number of days.
So, you know, while most apps give you breadcrumbs, and they require a lot of instrumentation to get those breadcrumbs in place, you know, you integrate the SDK and 90% of the information that you need is auto-collected. You don’t even do anything.
The next thing that you need to think about is triaging all of these crashes. So today we’re going to dive into some specific examples of you… you picked a crash that you decided is important, but figuring out which crashes are important is also a key part of the whole lifecycle of an app.
And so, you know, we provide a lot of tools that help you figure out which crashes are more important than others for whatever your use is. So if it’s impacting more users or… whatever!
So like I said, today we’re going to be covering crashes, but there’s a whole bunch of other stuff that we also help with: things like recall failures, out-of-memory exceptions, etc.
We cover a whole bunch of different event types and we’ll actually see a couple of these in our exploration here in a few minutes. But I don’t want you to get the idea that we’re all about crashes and we don’t do any of this other stuff.
We do all kinds of this other stuff. And crashes are one important thing, but just one of the things that we do.
So let’s dive in a little bit. So I took us here to the crashes tab, and what we can see here right away is some top-level stats on your app. So this is like, you know, kind of key health metrics that you should keep a look at. Probably something you’re most interested in if you’re, you know, a manager or something like that. You want to see what the top-level stats are.
But for today’s talk, we’re more interested in digging deep into a specific crash. Like you’re an engineer and you’ve been told, Hey, we need to fix this specific crash.
So, you know, you can see some of the information here. I pre-selected a couple of crashes that we’re going to dig into a bit more, though.
So here we see the crash details. And so there’s a few things I want to call out here before we start looking at stack traces and whatnot.
The first is that we show you what version the crash occurs in. So if a new crash starts, you need to know what version of the app it started in so that you can go look through the code and figure out, Hey, was it something I changed from version one to version two?
Or if you fix a crash, more importantly, you want to know that you actually fixed it. So if I ship 3.1 94 with a fix for this particular crash, I want to make sure that it doesn’t show up on this plot.
The other thing you need to keep in mind is that you probably want to be able to reproduce crashes. So, you know, a lot of the time we’re taking our best shot. You know, given the information that we have, we’re saying we think we know what the crash issue is. But in order to really definitively know that we’ve solved an issue, we need to be able to reproduce it.
So we also provide some information for helping tease out some of that detail, such as when crashes are related to specific versions of the OS or specific devices. You know, you need to know that because if, you know, users are reporting a problem in a particular view and you load it on your phone and it doesn’t crash, you know, “works on my device” isn’t really acceptable. Right? So you need to figure out, well, what’s different between me and that user? I can look at the whole population here and see, oh well, this particular device model or this particular OS is experiencing this crash, you know, at ten times its normal or its expected rate or something.
So we’re able to see that very, very quickly. And then we know what we’re dealing with and how we can probably figure out how we need to reproduce this.
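The device and OS comparison described above can be sketched in a few lines. This is only an illustration under assumptions: the session records, the `device` and `crashed` field names, and the `crash_rate_ratios` helper are all made up for this example and are not part of the Embrace SDK or dashboard. It shows the idea of comparing each group's crash rate against the rate for the whole population.

```python
from collections import Counter


def crash_rate_ratios(sessions, key):
    """Per value of `key`, return that group's crash rate divided by the overall crash rate.

    `sessions` is a list of dicts with a boolean "crashed" field (hypothetical shape).
    """
    total_sessions = len(sessions)
    total_crashes = sum(1 for s in sessions if s["crashed"])
    if total_crashes == 0:
        return {}  # nothing crashed, so there is nothing to compare

    group_sizes = Counter(s[key] for s in sessions)
    group_crashes = Counter(s[key] for s in sessions if s["crashed"])

    # (group crashes / group size) / (total crashes / total sessions),
    # rearranged so there is a single division per group.
    return {
        value: (group_crashes[value] * total_sessions) / (size * total_crashes)
        for value, size in group_sizes.items()
    }


# Made-up sample: 90 sessions on one device model, 10 on another, one crash each.
sample = (
    [{"device": "Model A", "crashed": i == 0} for i in range(90)]
    + [{"device": "Model B", "crashed": i == 0} for i in range(10)]
)
ratios = crash_rate_ratios(sample, "device")
for device, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{device}: {ratio:.1f}x the overall crash rate")
```

In this made-up sample, "Model B" crashes at five times the overall rate, which is the kind of signal that tells you which device to pick up when you try to reproduce the crash.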
So I just wanted to call that out before we looked at the stack trace.
So in this first example, we’re going to look at something pretty straightforward. We can see this… the exception here that we see on Android, we’re going to see some of the exception messages and we see a list of the stack frames.
In this case, it’s some sort of tab that we’ve configured, some UI element, and we can see here that it’s, you know, some resource not found cause. So we know it’s some resource thing. We didn’t package the resource or whatever. You know, that’s something you can probably solve with just about any crash tool out there. You know, it’s not, not super, super interesting.
In this example that I picked, though, this is something that’s probably closer to a harder-to-fix crash. So here we can see we have an illegal argument exception and there’s these entries that are of different sizes. We expected three, the array size was two. Looks like a classic, you know, off-by-one error, but it’s not being experienced by this huge population.
So we scroll up, there are relatively few crashes here, certainly a lot fewer than some of the other crashes we’re seeing. So we know this isn’t just a plain coding error. We know there’s something bigger going on here.
We probably want to start thinking about things like, you know, is there a race condition? Again, is it potentially related to some specific device or OS version that we care about? We don’t know!
But what we do know from looking at this is that, you know, we’d really, really like to have more context about what’s happening. So we can do that on this page. Here we see these timeline details, and if I click this “Open user timeline,” we’ll be on this page.
So what we’re seeing here is the entire user session or, more properly, we’ll see here it says three sessions. What we’re actually seeing is three separate sessions, or three separate times the user foregrounded the app. But they were so close in time that we’ve stitched them together into a single session. So this is a single string of interactions with your application. And you can see that here we’re foregrounded for 4 seconds, in the background, in the foreground again, and then finally foregrounded again at the end for 12 minutes.
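The stitching of close-together foreground periods described above can be sketched roughly as follows. This is a minimal illustration, not Embrace's actual algorithm: the `Session` type, the `stitch_sessions` helper, and the five-minute gap threshold are all assumptions made for this example, since the talk does not say how close in time sessions have to be.

```python
from dataclasses import dataclass


@dataclass
class Session:
    start: float  # seconds, when the app came to the foreground
    end: float    # seconds, when the app went to the background


def stitch_sessions(sessions, max_gap_seconds=300.0):
    """Group sessions whose gap to the previous session is at most `max_gap_seconds`.

    The threshold is a hypothetical value chosen for illustration.
    """
    groups = []
    for session in sorted(sessions, key=lambda s: s.start):
        if groups and session.start - groups[-1][-1].end <= max_gap_seconds:
            groups[-1].append(session)  # close enough: same stitched timeline
        else:
            groups.append([session])    # too far apart: start a new timeline
    return groups


# Made-up timeline: a 4-second session, a short one, a 12-minute one,
# and a much later session that should not be stitched in.
timeline = [
    Session(start=0, end=4),
    Session(start=30, end=50),
    Session(start=60, end=780),
    Session(start=10_000, end=10_060),
]
stitched = stitch_sessions(timeline)
print([len(group) for group in stitched])
```

With this sample, the first three sessions end up in one stitched group and the last one stands alone.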
So there is quite a long session at the very end there. But it’s this first crash, this first short session here, where we have the crash. And so we can see a few other things here. On the right-hand side, just to call it out quickly, we see what I would call something closer to demographic data.
It’s not really demographic data, but it’s the nuts and bolts of when the session started and the version and the device type and whether it was a cold start or not. In this case, it was a cold start… environment, app version, all the standard properties that you’d expect to be associated with a session.
In the middle, we see the different events that happened within that session.
So here the first thing we see is something called a moment, which you can think of as a checkpoint in the app that we want to keep track of. And we see here it’s something related to search results. So we should keep that in mind for later.
We can look down through this list and see all the different things the app was doing so we can see a bunch of different network requests. Thankfully, they were all successful. Okay, looking good.
And there’s our crash. So the user didn’t do a whole lot here. We know they had something related to search happening in their app before the app started, so they had some sort of state on launch. A bunch of network requests happened, but they were all successful. And then we had this app crash with this off by one error.
So again, it’s not affecting a large population of the user base. So we’re still thinking maybe race condition, maybe something specific to this device or OS. In either of those cases, you know, it seems pretty likely that a user could have the same crash multiple times, right? And so if we do see that multiple times, then we can start doing some, you know, differential analysis here.
We can look at other users with similar OS and devices, see if they crash as much. Is it this user that’s crashing a whole bunch with this particular thing? We’re not sure, but we can go look.
So we see on the right here we have this Embrace user ID, and we can click on that and bring up all of the sessions for this Embrace user ID. Just a quick callout here about PII, which is that this is a randomly generated ID, and we provide that to you to do with it as you will. We typically recommend that you send that to your own back end and associate it with your user records that way.
The reason we recommend that is that if a specific user is saying, “Hey, I’m having this crash” or “Hey, I’m having all of these different crashes,” you can go in and look up their Embrace ID, plug it in here, which is the session view, and see all of the sessions that that user had.
And so, you know, what are we going to do here? Well, we’re interested in crashes, right? So we’re going to go here and we’re going to go pick “has crash” and we’re going to say “is true.” And if we click here, well, that’s the session. We saw it first for 412. And look, we got a second one here!
So, you know, maybe it’s the same crash, maybe it’s not. So we can go through. I’ve already gone through and scrolled down to where the crash is here, for time’s sake. But you can see here. Yeah. It’s a different crash. So, you know, it didn’t happen there. So we’re not… he’s not seeing repeated crashes for the same reason, at least.
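The check walked through above (filter one user's sessions to those with a crash, then see whether the same crash repeats) looks roughly like this when written out. The record shape, the field names `user_id`, `has_crash`, and `crash_signature`, and both helper functions are hypothetical; in the talk this is done by clicking through the session view, not through code.

```python
from collections import Counter


def crashed_sessions(sessions, user_id):
    """Return the sessions for `user_id` that ended in a crash."""
    return [s for s in sessions if s["user_id"] == user_id and s["has_crash"]]


def repeated_crashes(sessions):
    """Return crash signatures that occur more than once, with their counts."""
    counts = Counter(s["crash_signature"] for s in sessions)
    return {signature: n for signature, n in counts.items() if n > 1}


# Made-up records: user "u1" has two crashed sessions with different causes.
sample_sessions = [
    {"user_id": "u1", "has_crash": True, "crash_signature": "crash-A"},
    {"user_id": "u1", "has_crash": False, "crash_signature": None},
    {"user_id": "u1", "has_crash": True, "crash_signature": "crash-B"},
    {"user_id": "u2", "has_crash": True, "crash_signature": "crash-A"},
]

user_crashes = crashed_sessions(sample_sessions, "u1")
print(len(user_crashes), "crashed sessions;",
      "repeated:", repeated_crashes(user_crashes) or "none")
```

As in the demo, the sample user has two crashed sessions but no repeated crash, so the repeat-crash theory does not hold for this user.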
So, you know, we can continue looking through different details in this session view and see if we can spot something that might be helpful for us to debug this crash.
One of the things I did notice, though, before we clicked on that was that overall there’s quite a few memory warnings here. In fact, almost all of the sessions have a memory warning here. Now, the crash that we’ve been looking into is probably not related to a memory error, from what I would guess. But say that it had been related to memory. Well, you know, we have another way to slice the data here.
We can look at OOMs. So again, we have the standard population analysis here. You can see the different versions of your app and the prevalence of OOMs in those different versions, etc., etc.
But remember, this crash is related to search. So maybe we can find something related here…
We could see search activity here. So if we click through to that one, we get very similar things here. Now, again, if you’re trying to reproduce this, you want to know, “Hey, is this related to the device distribution? OS distribution?” Remember what I said earlier about, you know, this: you need to be able to reproduce it, right?
Well, if I’m running a different version of the OS than 12, by the looks of this, my chances of reproducing this particular OOM are very, very low. So that’s a key piece of information for me to debug this particular OOM as well. And you can see here we scroll down, we see all of the sessions related to this particular OOM.
So you can now go through and find commonalities, see things that are related to the out-of-memory error that you’re trying to debug and see if they’re doing similar things or if they’re accessing similar things and trying to figure out the patterns through there.
So we’re going to hop back.
And so I just wanted to reiterate, you know, to solve a crash, to definitively solve a crash, I should say, you need to be able to reproduce it.
So, particularly on Android, you know, you need to make sure that you have the same OS version, same device type, because vendors do customization of the OS. And so you need to know if those things are correlated to your crash and if they aren’t, you know, there’s some types of crashes like race conditions and OOMs that are correlated to those things. But even if they’re not, they’re hard. They’re hard to debug anyway. And so you need as much context as possible to figure out what’s going on for those. And you can find that through the Embrace tools.
So I just wanted to give a quick plug here before we end. So none of this is going to work without debug symbols.
So we saw stack traces. They had the proper symbol information in them… But if you’re not uploading the symbols, when you create your builds, you’re not going to be able to access any of this. It’s not going to make sense.
And so, you know, I’m recording this a little bit ahead of time, but so this may be live in the marketplace already, but if it’s not, expect to see it soon.
There is now a Bitrise step for uploading symbol information on your builds and…
My contact information is here if you want to learn more. That’s my email address there. Craig.email@example.com. And if you have more questions about… in particular about how the integration works and how you get that easy integration that I talked about…
If you want more details about that go to www.embrace.io.
Now, do we have any questions?
End of transcript
Want to try Embrace for yourself? Get started with a free trial today.