In this webinar, our CTO, Fredric Newberg, discusses how to solve Unity ANRs and boost your app’s ranking in the Play Store. Fredric also shares unique insights his team has uncovered that help customers better monitor their ANRs and stay below Google’s orange line.
In this talk, you’ll learn:
- Which ANR metrics negatively impact your game’s discoverability.
- What data you can use to monitor and troubleshoot ANRs (from Android, the GPC, and third-party solutions) and how Embrace brings that together in one place.
- The pros and cons of different ANR data sources.
- What it takes to finally address ‘com.unity3d.player.UnityPlayerActivity.onPause’ ANRs.
- Why flame graphs are critical for quickly surfacing actionable patterns as opposed to manually searching stack traces.
Fredric Newberg: Hey everybody! I’m Fredric Newberg. I’m the CTO and one of the co-founders of Embrace. If you’re here at this webinar, you probably already know that we monitor mobile apps to help people build better experiences, and we’ve worked with a lot of customers to help improve their ANR rates. We take what we believe to be a pretty unique approach to ANR data gathering and I’m going to go through that in this webinar.
So, why should you care about ANRs? In talking to folks, I find that, really, people fall in two different buckets. Either you’re very much focused on the end user experience or your focus is on Play Store rankings and, I find almost exclusively with our customers who are building games, that Play Store rankings are the primary focus.
I’m sure most of you know that it hurts your rankings if you exceed ANR limits, and you don’t get featured, and you don’t show up in places that you want to show up in to make your game more discoverable. So that’s what we’re here to talk about today: helping you avoid getting dinged in the Play Store so you get featured as you deserve to be.
So, let’s start off by talking a little bit about how ANRs are defined.
There’s documentation from Google on this, but it’s really not all that easy to decipher and it’s changed a bit over time. Oftentimes, the main thing that’s said is, “Oh, it’s when the UI thread of an Android app is blocked for too long. Then you get an ANR.” While that is a true statement, it is not exactly a complete statement.
ANRs can be caused by many other things as well and Google really has expanded this over time. There are services that can be slow to respond, broadcast receivers that aren’t performant, or maybe you aren’t starting foreground services correctly. In Android 14 they’re adding new things, for example the improper use of the job scheduler.
So probably a more accurate way of looking at this is that it’s not just that ANRs occur when the app isn’t responding to users trying to interact with it, it’s really when the app isn’t responding to system messages. A subset of that is, you know, input taps and other gestures, but it’s important to realize that it’s not just the, “Hey it hung and it took five seconds…” So there’s definitely some nuance there.
How are you evaluated by the Google Play Console? Well, it’s DAUs and I think how they’ve described it has gotten better over time. They used to talk about it as sessions and I think that was confusing to a lot of people. Certainly, our definition of a session is different than what their definition of a session was, and if you came from other tools maybe you thought of a session as, “Hey, it timed out after half an hour of usage.”
But the limit is based on DAU, and if a user experiences an ANR at any point in a day then you get dinged for that. And obviously with games where you’re trying to bring people back throughout the day, maybe it’s a little bit different if you have other types of app, but with games you really are trying to create that engagement. So you’re getting more chances to have an ANR occur and there’s no real recognition of that saying like, “Oh there’s more time spent in the app.” It’s just like, “Hey we’re just measuring this basically on DAU,” and that limit’s pretty, pretty strict, right? It’s like 1 in 213 users can have it before you go over it. So that definitely makes it challenging.
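The arithmetic behind the "1 in 213" figure can be sketched as follows. This is a minimal illustration, not Google's exact formula: the function names are made up for this example, the 0.47 percent limit is the one quoted later in the talk, and the last function assumes every session carries the same independent chance of an ANR, which real apps will not match exactly.

```python
ANR_LIMIT = 0.0047  # the 0.47% limit quoted in this talk


def daily_anr_rate(users_with_anr: int, daily_active_users: int) -> float:
    """Share of daily active users who hit at least one ANR that day."""
    if daily_active_users <= 0:
        raise ValueError("daily_active_users must be positive")
    return users_with_anr / daily_active_users


def users_per_anr_at_limit(limit: float = ANR_LIMIT) -> float:
    """Daily users needed per affected user to sit exactly at the limit."""
    return 1 / limit


def chance_of_anr_in_a_day(per_session_chance: float, sessions_per_day: int) -> float:
    """Chance that a user sees at least one ANR during a day.

    Assumption: each session has the same, independent chance of an ANR.
    """
    return 1 - (1 - per_session_chance) ** sessions_per_day
```

Under that independence assumption, a per-session chance of 0.1 percent stays under the limit at one session per day but goes over it at five sessions per day, which is why highly engaged games feel the limit more.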
I think another thing that is a little bit confusing is how Google Play console lists stuff. They list it in the same way as they show crashes, but it’s really important to recognize that ANRs are very, very different than crashes.
So let’s talk a little bit about what makes solving ANRs a challenge. So, as I just said ANRs are not crashes. Crashes are deterministic. Normally, you’ll get a stack trace that shows you where the crash happened — that doesn’t always make it easier to solve the crash. I’m not trying to trivialize that, but more to highlight the fact that solving ANRs is even more challenging.
Basically a snapshot is taken if you’re using GPC as your only tool approximately five seconds into an ANR. So instead of like having the car hit the wall and you’re like okay well that… that’s where the problem occurred, with ANRs you have more of a live environment where things are moving and maybe the snapshot captured part of it, or maybe it captured nothing at all. And also if you think about like what happens during an ANR, it’s not always going to be constant.
We’ll get into it a little bit later on how we got this data, but this basically shows the evolution of an ANR, where you can actually see that we’ve highlighted four ad SDKs here. I think there are actually a couple of smaller contributions as well.
So you can see that what happens during an ANR interval can be very dynamic and is not necessarily just a static thing where there’s one single cause of it. That makes it really challenging to solve because if you think about just taking a snapshot of what happened at the five second mark here, it really wouldn’t have given you any insight into what caused the majority of the time consumption to occur here.
Another thing that we find with a lot of our customers is that things are grouped under the somewhat nebulous nativePollOnce heading. Up until recently Google didn’t really say much about that but then they added this rather lengthy explanation and…
I’ll save the time of reading that and just say all it says is the stack trace we captured will not help you understand what caused the ANR, but you’re still being punished for it.
So that puts you in a bit of a difficult position where you’re being graded on something and you’re really not given any indication of what you did wrong — but you’re still dinged for it. That makes it pretty challenging to solve that.
If we look at it, and we’ve seen quite a few of these dashboards from our customers, it’s not uncommon… actually it’s quite common, especially for games, to be in that 30 to 60 percent range for what rolls up under nativePollOnce.
If you’re looking to go from 0.6 to 0.47 percent and you have 60 percent in this nativePollOnce group, really you have a very small group of data that is actionable given the data that GPC gives you. That makes the problem of hitting the limit harder because you’re just kind of blind on a lot of ANRs that are causing you to exceed that limit.
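To make that squeeze concrete, here is a rough sketch of the arithmetic. It assumes, as a simplification, that the overall rate splits linearly between an opaque group (the 60 percent with no usable stack trace) and an actionable remainder; because the real metric counts users rather than ANR events, treat the result as an approximation. The function name is invented for this example.

```python
def required_cut_in_actionable(overall_rate: float, target_rate: float,
                               opaque_share: float) -> float:
    """Fraction of the actionable ANRs that must be removed to hit the target.

    Assumption: the overall rate splits linearly into an opaque part
    (no usable stack trace) and an actionable part.
    """
    actionable_rate = overall_rate * (1 - opaque_share)
    needed = max(overall_rate - target_rate, 0.0)
    if actionable_rate == 0:
        return float("inf") if needed > 0 else 0.0
    return needed / actionable_rate


# Figures from the talk: 0.6% today, 0.47% target, 60% opaque.
cut = required_cut_in_actionable(0.6, 0.47, 0.6)
```

With the figures from the talk, a reduction of about a fifth of the overall rate turns into a reduction of more than half of the ANRs you can actually see.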
By the way, as Colin mentioned, if you have any questions please drop them in as we go and we will get to those at the end.
So what data can I get to solve the ANR? So, this may be kind of obvious, but I think it still is worth going over — pick your battles, right? If you can figure out that the ANR is in your code, great, you have the most ability to change it. Still might not be all that easy to do it, but you have the best understanding of that, and you have access to it.
We find that a lot of ANRs come out of SDKs, specifically ad SDKs. There, you may have some ability to change that, maybe how you’re using it, maybe how you’re initializing it, where in your code you’re using it.
We’ve definitely been able to help folks reduce their ANR rates by using an SDK slightly differently. But you are also then able to go to the vendors and talk with them, and not only show up and say, “Hey I think your SDK is causing some problems for me…” You can actually show up with more concrete data than what you have available on GPC alone if you are using Embrace.
Also Unity itself could be part of the problem. We’ve had some customers who run into issues there. Some of them have access to source, some do not, some have great relationships with Unity, some maybe have a little harder time getting their attention. So your mileage may vary, but it’s ultimately a numbers game right so start with where it’s the easiest.
Android’s probably the trickiest. There are custom builds on different manufacturers’ devices, and really, some devices probably will never see an Android update. So even if you can identify the issue, you’re probably not going to be able to fix it. It’s more a matter of figuring out how to work around it or maybe you decide, hey we’re just not going to show ads on this subset of devices because we’ve seen it causes this issue.
There are different solutions to how you tackle it too, not just changing code, but if you’re looking to do anything with Android code it’s more probably just for educational purposes. You really aren’t going to have a whole lot of ability to make any changes there that will end up on your customers’ devices…
Colin Contreary: I wanted to chime in real quick Fredric just because you mentioned this. So Wildlife Studios is one of the biggest mobile game studios in the world, and they did exactly what Fredric just mentioned, which is they had an ANR problem that was very much restricted to low power devices.
The solution was ultimately to disable ads entirely on specific device types, so they ran the numbers, figured out the revenue impact, and figured out, okay, how can we lower our ANR rate enough, such that it’s worth losing that revenue by just turning off ads for these devices. So, yeah, that’s very much a common solution.
Fredric Newberg: Yeah, that’s… it can be a pretty tricky trade-off, but it’s definitely worth considering. We’ve had some other folks too who’ve just turned off support for lower end devices or support for older OS versions where the math just didn’t work out. They weren’t generating enough revenue from those to justify the impact that they were having on their overall ANR rate.
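Running the numbers on that kind of trade-off could look roughly like the sketch below. Everything in it is hypothetical: the segment names, user counts, and revenue figures are invented for illustration and are not data from Wildlife Studios or Embrace, and `anr_reduction` is an assumed guess at what share of a segment's ANRs would disappear once ads are turned off there.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    daily_users: int
    users_with_anr: float
    daily_ad_revenue: float


def anr_rate(segments):
    """Overall share of daily users who experienced an ANR."""
    users = sum(s.daily_users for s in segments)
    affected = sum(s.users_with_anr for s in segments)
    return affected / users


def ads_off_estimate(segments, segment_name, anr_reduction):
    """Return (new overall ANR rate, daily ad revenue given up)."""
    adjusted = []
    revenue_lost = 0.0
    for s in segments:
        if s.name == segment_name:
            revenue_lost += s.daily_ad_revenue
            remaining = s.users_with_anr * (1 - anr_reduction)
            adjusted.append(Segment(s.name, s.daily_users, remaining, 0.0))
        else:
            adjusted.append(s)
    return anr_rate(adjusted), revenue_lost


# Invented numbers, for illustration only.
SEGMENTS = [
    Segment("low_end", 20_000, 300, 400.0),
    Segment("everything_else", 80_000, 300, 3_600.0),
]

new_rate, revenue_lost = ads_off_estimate(SEGMENTS, "low_end", anr_reduction=0.8)
```

In this made-up case the low-end segment holds a fifth of the users but half of the ANRs, so giving up its ad revenue moves the overall rate from 0.6 percent to under the 0.47 percent limit.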
So there’s quite a number of things you can do to get more data about your ANRs. They’re not necessarily easy to do. Obviously you have the GPC dashboard, there’s the ApplicationExitInfo [API] that was introduced in Android 11, there’s a thing called SIGQUIT monitoring, which I’ll talk about in the next slide…
Or you can write your own ANR detector and you can go and try to do all those things on your own. It’s all out there, but we put in the effort to bundle that all together to provide as comprehensive a solution as possible so that you don’t have to invest the time to go do that.
I’m not gonna go and read every single word on the slide, there’s a lot. This is more if you want to go back later [and read it on your own]. If you have any questions at the end please add them. The webinar’s recorded and will be distributed, but I want to just touch on a couple of these things.
Obviously, like the GPC Android vitals are, for better or worse, they are where you are graded. They are thus the ultimate source of truth in terms of whether it has an impact on the ranking. So you do have to look at that even if you don’t find it to be super actionable. And you can certainly use it to solve simple ANRs. We’ve had some customers who’ve had some pretty egregious ones pop up in releases and they pop up there, but for solving the more difficult ones, it’s probably not the best approach.
Application exit info was something that was introduced in Android 11. It has evolved a bit over time but it gives you similar data to what you get on the GPC dashboard, but it’s not quite the same. There are certain things that are missing in the data. The data, as I mentioned, has evolved from Android 11. A lot of the data that’s present in Android 13 is missing pieces on earlier versions, so that just helps you solve it for part of the population.
There are still the same issues with that data, like if we look back at the nativePollOnce stuff where you’re being told, “Hey, we didn’t capture anything,” well the AEI data is going to have the same problem as that. Sometimes there aren’t even stack traces. You’ll get a blob when you call that API and the blob will be largely empty. So there are definitely some data cleanliness things and going through that can be quite challenging.
One of the things that we have explored, and other people in the ecosystem have too, is leveraging the fact that Android sends a SIGQUIT signal to the app when an ANR is detected, so you have some ability to detect that on device. The application exit info only becomes available, as the name would imply, after the app instance has exited, so there’s a lag there as well. SIGQUIT, while initially somewhat enticing because it gives you an idea of when the ANR happened, is very limiting in that all you get is a timestamp. You can’t get the stack trace that was sent to GPC. So trying to line that up and get some additional data for troubleshooting is pretty challenging, and we found that there’s limited applicability of that.
And then, another option is to write your own ANR detector, where you can go look for main thread blockage, which is where a lot of these issues are coming from. You can capture a richer set of data. Now, there are challenges with that too. First of all, it’s not easy to write something that doesn’t make a bad situation worse. If you’re already experiencing an ANR, the last thing you want to do is add more load to the system that will make it take longer to resolve itself. Also, as I mentioned, GPC is the ultimate arbiter of whether you’re meeting ANR limits or not, so trying to line this up with the GPC data is definitely not an easy problem to solve.
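The general idea of a home-grown detector can be sketched as a watchdog: the monitored thread records a heartbeat whenever it gets work done, and a second thread samples its stack once the heartbeat goes stale. The sketch below shows that pattern in plain Python rather than on Android, so it illustrates the technique only and is not Embrace's SDK; the `Watchdog` class, its parameters, and the short timings are all assumptions chosen to keep the demo fast.

```python
import sys
import threading
import time
import traceback


class Watchdog:
    """Samples a monitored thread's stack whenever its heartbeat goes stale."""

    def __init__(self, thread_id, threshold_s=1.0, poll_s=0.1):
        self.thread_id = thread_id
        self.threshold_s = threshold_s
        self.poll_s = poll_s
        self.samples = []  # list of (seconds stalled, [function names])
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def beat(self):
        """Called by the monitored thread to signal that it is responsive."""
        self._last_beat = time.monotonic()

    def start(self):
        self._worker.start()

    def stop(self):
        self._stop.set()
        self._worker.join()

    def _run(self):
        # Wake up every poll_s seconds until stop() is called.
        while not self._stop.wait(self.poll_s):
            stalled_for = time.monotonic() - self._last_beat
            if stalled_for >= self.threshold_s:
                frame = sys._current_frames().get(self.thread_id)
                if frame is not None:
                    names = [f.name for f in traceback.extract_stack(frame)]
                    self.samples.append((round(stalled_for, 2), names))


def slow_task():
    time.sleep(0.5)  # stands in for work that blocks the main thread


def run_demo():
    dog = Watchdog(threading.get_ident(), threshold_s=0.1, poll_s=0.05)
    dog.start()
    dog.beat()
    slow_task()
    dog.beat()
    dog.stop()
    return dog.samples
```

Note the trade-off mentioned above: every sample costs something, so the polling interval and what gets captured per sample decide how much extra load the detector adds to an already struggling app.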
But there are also some other benefits, like five seconds is a fairly arbitrary duration for hang. If you look at what customers are willing to put up with, sure, maybe they’ll be willing to put up with a five second hang every once in a while, but if you have three second hangs all over the place, that’s going to be very frustrating to people and they will probably vote with their feet and leave.
So, if you do build your own ANR detector, what are you doing? You’re effectively getting more data to try to solve the ANRs if you take the approach that we’ve taken.
So going back to the analogy of the moving car, in the earlier slide, I showed where it’s like maybe you just got one of these pictures but instead of just getting one of them, here we can get all of them, and then we can use all that data on the back end and then do an analysis to present a more accurate picture of what’s going on.
And maybe you didn’t just see like the one event go by, maybe there were multiple events as we saw in the earlier slide…
And you can think of this as sort of like the different cars passing by. They map to the different ad SDKs, so you can see the evolution, and you’ve captured a larger set of data to help you understand truly what was going on rather than getting just that one snapshot that may have been useful but also may have been misleading.
Okay, so ultimately, if you’re going to attack ANRs you’ve got to start somewhere. You’ve got to decide what you are going to go tackle. We have taken a couple of swings at this and worked with customers to see what works the best.
Ultimately, having something that can be prioritized down to the method level seems to be what has worked the best for folks. Looking at things at the package level had some benefit, but it also had some challenges in terms of understanding if you made progress if you didn’t go down all the way to the method level.
So that is the approach that we have taken, but while the list may look similar to how you’d see a crash list, there is a lot more that is happening under the hood. So it’s great we collect all this additional data, but how do we avoid putting the burden on you, the user, to understand how to look at that data?
We’ve done a lot of analysis and come up with these three options for looking at the data and you can kind of look at this as the same set of data, but you’re looking at it from different angles.
First sample gets you closest to the start and given how we detect that, that’s one second away from when the thread was blocked, not five, and we found that generally, for most apps, again your mileage will vary depending on how your app is written and the ANR issues you have. But the first sample definitely helps a lot more than just looking at the last one. But also like most representative sample can be very helpful.
Like if we go back and look here, it’s like you know the areas that are much wider is what the most representative algorithm is going to go pick up and not necessarily pick up the thing that was the first one here.
Or let’s say you got unlucky and you’ve landed in the sample at five seconds and caught that one. That would be very misleading. So most representative helps with that, and then a lot of our customers are very much focused on ads as being sources of their ANRs which I think is, not meaning to throw shade at the ad networks, but there are definitely a lot of challenges when you pile in a lot of ad SDKs in an app.
So focusing on that as your short surveyance is a legitimate thing to be doing. The data definitely does support that, but we give you the ability to have multiple views and you can take a look at all of them and see which one gives you the most insight to go tackle your problem.
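The difference between these views can be sketched on a single, made-up ANR. The talk does not spell out how the "most representative" sample is chosen, so the version below simply assumes it is the frame that shows up in the most samples; the sample data and function names are invented, with one sample per second starting one second after the block began.

```python
from collections import Counter


def first_sample(samples):
    """The sample closest to the start of the blocked interval."""
    return samples[0]


def last_sample(samples):
    """Roughly what a single snapshot at the five second mark would show."""
    return samples[-1]


def most_representative_sample(samples):
    """Assumption: the frame seen in the most samples is the representative one."""
    return Counter(samples).most_common(1)[0][0]


# Invented ANR: one sample per second, first sample at the one second mark.
SAMPLES = [
    "game.Loader.init",
    "ads.sdk_a.Banner.load",
    "ads.sdk_a.Banner.load",
    "ads.sdk_a.Banner.load",
    "ads.sdk_b.Video.prepare",
]
```

Here the last sample points at `ads.sdk_b`, which accounts for only one second of the hang, while the most representative sample points at `ads.sdk_a`, which accounts for three.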
And if you’re used to going into GPC, this view will probably look painfully familiar — you’re given stack traces and you’re left to paginate and that’s kind of slow, and it’s also like how many times are you willing to click on the next button?
We’ve taken a bit of a different approach to that because we don’t think that is the ideal experience, so what we’ve done is we’ve combined things into a flame graph…
Which has typically been used for performance monitoring and profiling purposes, but if you can think of this as like think of every ANR that occurred as being a vertical slice in this graph. So you can start seeing commonalities that occur between different ANRs and so you can get a much more complete picture in a much shorter period of time.
So maybe the sample that you got was like the super narrow spike here, which really didn’t tell the whole picture and may even be a bit misleading. So this gives you the ability to see, in this case, I guess there are about 900 different samples being represented in a single view, and it’s still reasonably difficult, I think, to see all the patterns.
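Merging many stack traces into a flame graph comes down to building a tree where each node counts how many samples passed through that frame. The sketch below is a generic illustration of that merge with invented frame names, not the Embrace dashboard's implementation; stacks are assumed to be listed outermost frame first.

```python
def build_flame_tree(stacks):
    """Merge stacks (outermost frame first) into a tree of frame counts."""
    root = {"count": 0, "children": {}}
    for stack in stacks:
        root["count"] += 1
        node = root
        for frame in stack:
            node = node["children"].setdefault(frame, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def render(node, total, depth=0, lines=None):
    """Return indented text lines; the percentage stands in for bar width."""
    lines = [] if lines is None else lines
    ordered = sorted(node["children"].items(), key=lambda kv: -kv[1]["count"])
    for name, child in ordered:
        share = 100 * child["count"] / total
        lines.append(f"{'  ' * depth}{name} ({share:.0f}%)")
        render(child, total, depth + 1, lines)
    return lines


# Four invented ANR samples.
STACKS = [
    ["main", "showAd", "sdk_a.load"],
    ["main", "showAd", "sdk_a.load"],
    ["main", "showAd", "sdk_b.prepare"],
    ["main", "saveGame"],
]

tree = build_flame_tree(STACKS)
```

Calling `print("\n".join(render(tree, tree["count"])))` lists each frame with the share of samples that passed through it, so the wide frames stand out without paging through individual traces.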
So we also have a mode where you can just do a prioritization thing. The debug mode is great if you’re actually going to go try to dig in and solve it, but initially when you’re doing the prioritization pass, you probably just want a cleaner view where you see a lot of the noise from the previous one is gone and you very clearly see that here it’s com.google and GMS ads that is causing issues and then when it comes time to troubleshoot maybe you want to move back to the debug one.
This gives a good overview initially.
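One simple way to get a cleaner view for a first prioritization pass is to roll frames up to a package prefix and count how many samples touch each package. This is an assumption about how such a view could be built, not a description of how the Embrace prioritization mode works, and the class and method names in the sample stacks are invented.

```python
from collections import Counter


def package_of(frame: str, depth: int) -> str:
    """Keep the first `depth` dot-separated parts of a frame name."""
    return ".".join(frame.split(".")[:depth])


def rank_packages(stacks, depth=5):
    """Count, per package prefix, how many ANR samples contain it."""
    counts = Counter()
    for stack in stacks:
        # A set, so a package is counted once per sample.
        counts.update({package_of(frame, depth) for frame in stack})
    return counts.most_common()


# Invented samples; the class and method names are placeholders.
STACKS = [
    ["com.example.game.ui.menu.Menu.open", "com.google.android.gms.ads.Example.load"],
    ["com.example.game.ui.menu.Menu.open", "com.google.android.gms.ads.Example.show"],
    ["com.google.android.gms.ads.Example.load"],
]

ranking = rank_packages(STACKS)
```

The coarser the prefix, the less noise, but also the harder it is to tell whether a fix to one method moved the number, which matches the earlier point about eventually going down to the method level.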
The other thing that we allow you to do is then dig in to see a timeline of what happened for some sample ANRs. So you’ll be able to see that… you’ll see breadcrumbs, you’ll see network calls, maybe you didn’t expect this network call to occur right before that, maybe it’s an indicator that something went wrong.
You’ll be able to see that there were web views that occurred, maybe there were network errors that occurred, and getting that context we found is very helpful to solving ANRs more quickly.
One of the things that we’re super excited about, that is going to be coming out in Q4, is a more cohesive mapping to the data that you see coming out of AEI, which is a pretty close approximation of the GPC data.
I would say that when we’ve done analyses, we find that it’s about a 90 to 95 percent overlap, so there’s some stuff that’s missing in the AEI data, but it’s still directionally very much on point. And, by the way, all the support for what we are doing here already exists in the SDK.
So if you’re excited about trying this out, you can definitely go integrate SDKs now. I know that sometimes takes a bit of time, but the SDKs are all ready to go. We’re working on the back end portion of it — to combine.
So you can think of a layering where it’s like, hey, we talked earlier about the nativePollOnce stuff, so okay, well what does that map to exactly in the Embrace data? And this will give you a much better ability and a simpler path. You could still piece together a lot of stuff on your own, but we were finding that it was probably a little bit too much work for certain use cases to do this, so we wanted to make it easier for all of our customers to be able to make that correlation.
So this will be coming in Q4! Quite excited about this one.
Obviously, yeah, we were focused here on Unity. We have a lot of customers who develop apps in Unity and that’s probably why most of you are here on this webinar. So, what if the problem is in the Unity code?
So typically what happens, keeping in mind that the ANR is still going to be graded, is that in most cases the main thread in the JVM is blocked. So how does that end up happening if the problem is in your Unity code?
One of the things… one of the patterns we see a lot is the pause Unity. So what happens is you want to go show an ad, and the ad is going to be shown in a different activity, so you have to switch activities. When you switch activities, because there can only be one active one at any given time, then you end up having to pause the Unity Player activity in order to show the ad activity.
And as part of that, what happens under the hood is the Unity Player is called to pause and the Android part, the JVM part, sits there and waits for the player to actually pause, but you don’t have any real visibility into what is going on. That is changing though.
So… we have this in beta that we’re working with some customers on where we are able to monitor the Unity thread so if you have a lot of stuff that is being, you know, maybe you have cleanup that occurs, maybe you have other long-running processes that sometimes you get unlucky that are running when the pause comes and if you think about it’s like all it takes is one out of 213, right?
So the numbers are not in your favor and you need to make sure that you do all those things quickly and it’s not okay if you have something happen for 1 in 100 users.
So we added a feature to the SDK where we can start monitoring the Unity main thread and, similar to what you saw earlier here with flame graphs, we’re going to be doing the exact same thing with the Unity main thread so that will give you insight into what actually was happening when you were waiting to pause and again all the functionality for this exists in the SDK. So if you’re interested I would encourage you to reach out and we will happily set you up and then let you be part of that beta.
So ultimately, I think it comes down to putting all the pieces together. There are a lot of challenges, there’s no one solution for solving ANRs. You really want to get data from as many sources as possible.
I think it’s important to understand, when that data is being put together, what the pros and cons of the sources are. We’ve obviously invested a lot of time in that in order to give our customers the best possible way of analyzing ANRs. We believe, and we’ve certainly gotten the feedback from customers that they feel the same way, that with a richer set of data you will be able to more effectively solve your ANRs, lower the ANR rate, and get underneath the magical 0.47 percent rate that is required for your app to be fully featured in the Google Play Store.
End of transcript.