3 lessons I learned at Twitter while optimizing Android app startup performance

Embrace's Hanson Ho shares key insights on Android app startup performance that he gained over his 7+ years at the social network.

Before I came to Embrace, I spent nearly eight years at Twitter working on various aspects of Android performance and stability. During that time, I worked on a number of teams and initiatives focused on improving performance (perceived and actual) and addressing regressions throughout the app. I personally spent time working on low-level concerns (e.g. networking, thread usage, architecture), high-level perf-focused features (e.g. server-push updates, web browsing), and targeted initiatives aimed at improving usability on low-end devices and poor networks.

While these efforts yielded varying results, the one area of improvement on the client that consistently paid dividends — particularly in terms of user growth — was optimizing and improving app startup. With that in mind, my goal for this post is to share some of the things I learned with a wider audience of Android engineers. So, let’s talk about some lessons I learned in the trenches as my teammates and I looked to build a Twitter for Android that started just a little bit faster.

Lesson 1: Cold app startup is the single most useful performance metric to improve if you want to increase DAU

The client performance metric we talked about most at Twitter was Time to First Tweet, which measures the time from when the app starts up to when the main Activity in the app renders Tweets to the user.

There were four flavors of this metric that varied on two dimensions:

  • Whether the app startup was a cold start or a warm start;
  • And when we considered the workflow to be complete, i.e. when Tweets loaded from cache were displayed vs. when Tweets fetched from the server were displayed.

The Time to First Tweet flavor that correlates most strongly with Daily Active User (DAU) improvement is the cold-start one — specifically, a user opening the app through the launcher — that ends with cached Tweets displayed to the user. We called this TTFT-CC (cold-cached).
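To make this concrete, here is a minimal sketch of how a TTFT-CC-style metric could be instrumented. This is not Twitter’s actual implementation: the reportStartupMetric helper is hypothetical, and a production version would also need to exclude warm starts and launches that didn’t come from the launcher. It anchors on Process.getStartUptimeMillis() (API 24+) and reports just before the first frame containing cached Tweets draws (Activity.reportFullyDrawn() is the platform’s analogous signal).

```kotlin
import android.os.Process
import android.os.SystemClock
import android.view.View
import androidx.core.view.doOnPreDraw

// Hypothetical sketch of a TTFT-CC-style measurement. reportStartupMetric()
// is a stand-in for whatever analytics pipeline you use; a real version would
// also need to detect and exclude warm starts.
object StartupMetrics {
    @Volatile
    private var reported = false

    // Call once the timeline view has been bound to Tweets loaded from cache.
    fun onCachedTimelineBound(timelineView: View) {
        if (reported) return
        // Fires just before the first draw pass that includes the cached content,
        // a reasonable approximation of "cached Tweets displayed to the user."
        timelineView.doOnPreDraw {
            if (reported) return@doOnPreDraw
            reported = true
            // Process.getStartUptimeMillis() (API 24+) marks process start, which
            // approximates the launcher tap for a cold start.
            val ttftCcMs = SystemClock.uptimeMillis() - Process.getStartUptimeMillis()
            reportStartupMetric("ttft_cc_ms", ttftCcMs)
        }
    }

    private fun reportStartupMetric(name: String, valueMs: Long) {
        // Send to your metrics backend; left as a stub here.
    }
}
```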

That is not to say other performance improvements don’t move the needle. But in A/B tests, even in longer holdback experiments that ran for months, it was much harder to show a causal improvement in DAU from improving those other performance metrics.

For example, we found that improving other underlying performance metrics, like scroll jank, or even the flavor of TTFT that ends when new Tweets are fetched and displayed, didn’t have the same correlation to DAU. While we might see other important core metrics, like revenue, change positively, meaningful and sticky DAU changes were much less likely to be correlated with any performance metric beyond TTFT-CC.

More often than not, if mean TTFT-CC (after some amount of outlier filtering) is improved in a statistically significant way, you will see improvements in DAU if you observe the experiment buckets over a longer period of time. It’s really quite remarkable.

The insight for me is that until you’ve done all you can to improve app startup, you may not need to look elsewhere for ways to increase DAU through performance work. While flashier features targeting user-perceived performance, like prefetching or supporting HTTP/3, may seem attractive, their impact on DAU may not exceed that of simply making startup faster (for instance, by following some of the tips in this post).

Lesson 2: Startup performance optimization isn’t a project — it’s a practice

On mobile apps, performance will always deteriorate as you add new features. Nowhere is this more apparent than in app startup.

These types of regressions can be very subtle too; each cut imperceptible on its own, but when added together, they make a huge difference. This is especially true on lower-powered devices, where every delay is magnified.

Just as improvements to app startup performance will improve DAU, regressions in app startup performance will lead to a loss of DAU.

This may not be noticeable if you only monitor top-line production metrics, as there are many other factors that contribute to DAU changes. But at Twitter, our experiments showed that the TTFT-CC-to-DAU relationship is at least somewhat symmetrical: when TTFT-CC gets worse, DAU also drops.

For us, this meant that preventing app startup times from getting worse was just as important as making them faster. In fact, at Twitter, we made a habit of monitoring TTFT-CC in production, and we were working on automated synthetic performance tests in CI to catch regressions as they were introduced to the code base.

Doing commit-to-commit, or even version-to-version, monitoring of app startup time on Android is difficult given how much variance comes from factors beyond your app’s code. Even in Macrobenchmark tests, controlling for as many factors as you can (the device itself, device state, background activities, pre-loaded data, the number of previous cold starts, and so on), there’s always some variability in the results of each test run. This means that if you find a regression, it could be due to random variance rather than an actual code regression.
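For reference, a cold-startup test with Jetpack Macrobenchmark of the kind described above looks roughly like this. The package name is a placeholder, and in practice you would run many iterations and compare distributions across builds rather than single numbers, precisely because of that run-to-run variance.

```kotlin
import androidx.benchmark.macro.StartupMode
import androidx.benchmark.macro.StartupTimingMetric
import androidx.benchmark.macro.junit4.MacrobenchmarkRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class ColdStartupBenchmark {
    @get:Rule
    val benchmarkRule = MacrobenchmarkRule()

    @Test
    fun coldStartup() = benchmarkRule.measureRepeated(
        packageName = "com.example.app", // placeholder package name
        metrics = listOf(StartupTimingMetric()),
        iterations = 10,
        startupMode = StartupMode.COLD // kill the process between iterations
    ) {
        pressHome()
        // Launches the default Activity and waits for its first frame.
        startActivityAndWait()
    }
}
```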

The lesson here is that no matter how you do it, you have to know when your app startup deteriorates so you can address it, or you risk losing users. This work is better thought of as an ongoing practice, where you monitor the metric constantly and prioritize fixes when possible. It should be part of a process built into your software development lifecycle, rather than a one-time project a SWAT team takes on because complaints from customers and CEOs get too loud.

Lesson 3: App startup performance varies significantly on Android, and its impact isn’t uniform across your user base

The Android ecosystem, to put it mildly, is very heterogeneous.

While top-end flagship phones with the latest and greatest hardware, running the latest and greatest version of Android, dominate the public discourse, far more people are using older and more basic phones that are much less powerful. This is especially true in markets outside the West.

This lack of uniformity across devices and operating systems means cold app startup times can vary greatly. So if you want to monitor app startup time for regressions, what does that even mean? The mean in production? P50? The number you see on your test device?

At Twitter, we looked at production app startup metrics through a few different lenses.

Mean TTFT-CC was important, particularly because it’s easy to derive and understand. But looking at various percentiles (p50, p75, p95, etc.) was also important, because they told us the range of experiences our users were seeing, which depended largely on the quality and performance of the device (though not exclusively on hardware specs). To evaluate changes to app startup performance, we used all of these metrics: some changes have similar effects on all types of devices, while others disproportionately impact slower ones.

Furthermore, we found it useful to filter the population from which we took app startup measurements down to the cohort that is most impacted when app startup performance regresses: users with poorly performing devices (there are various ways of doing this; a simple sketch appears below).

This is because a 20% increase in app startup time for a person using a Pixel 7 Pro running the Android 14 Beta may take the wait from 200ms to 240ms — perhaps unnoticeable. However, a regression of similar magnitude for a person using a Moto G from 2015, running the same Android 5.1.1 that the device shipped with, could take app startup from 5 seconds to 6 seconds, which may be just enough for that user to stop using the app.

In other words, when you include users in the sample population for whom app startup performance is already very fast, you risk diluting the comparison and muting differences in your data because even relatively big changes may not be noticeable.

Conversely, focusing on users who are already experiencing slow app startups makes it clear how regressions are impacting them, and makes aggregations like mean or P50 more meaningful.

It could very well be that, because of this dilution, the mean and P50 across the whole population show no statistically significant change, while the filtered cohort does. Unless you do this kind of filtering, you may not even notice that your most vulnerable users are experiencing an app startup regression.
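As a rough illustration (not Twitter’s actual pipeline), one simple way to do this segmentation is to tag each startup sample on-device with a low-end heuristic such as ActivityManager.isLowRamDevice(), total RAM, or API level, and then aggregate percentiles per cohort when analyzing the data. A minimal sketch, assuming the samples have already been collected:

```kotlin
// Hypothetical offline aggregation over startup samples collected from
// production. "Low-end" is whatever heuristic you tag sessions with on-device
// (e.g. ActivityManager.isLowRamDevice(), total RAM, API level).
data class StartupSample(val ttftCcMs: Long, val isLowEndDevice: Boolean)

// Nearest-rank percentile over an already-sorted list of durations.
fun percentile(sortedMs: List<Long>, p: Double): Long {
    require(sortedMs.isNotEmpty()) { "no samples" }
    val index = ((sortedMs.size - 1) * p).toInt()
    return sortedMs[index]
}

fun summarize(samples: List<StartupSample>, label: String) {
    val sorted = samples.map { it.ttftCcMs }.sorted()
    if (sorted.isEmpty()) return
    println(
        "$label: n=${sorted.size} " +
            "mean=${sorted.average().toInt()}ms " +
            "p50=${percentile(sorted, 0.50)}ms " +
            "p95=${percentile(sorted, 0.95)}ms"
    )
}

fun report(samples: List<StartupSample>) {
    summarize(samples, "all devices")
    // Filtering to the most impacted cohort keeps regressions from being
    // diluted by users whose startups are already fast.
    summarize(samples.filter { it.isLowEndDevice }, "low-end devices")
}
```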

Suffice it to say, evaluating app startup performance changes isn’t just about looking at the average values at two points in time: it is a lot more involved than that.

Learn how Embrace can help with app performance

Embrace is a data-driven toolset that helps engineers manage the complexity of mobile and build better user experiences.

As mentioned above, app performance isn’t just about optimizing, but about monitoring and proactively maintaining those performance gains. Embrace gives mobile engineers unmatched visibility into their apps and helps them proactively spot performance risks for predefined user flows.

With Embrace, you can uncover every possible root cause of poor app performance with a full play-by-play of impacted user sessions.

Learn more about Embrace and how it can help you improve your app’s performance, here.
