Crash Fast: Square's Approach to Android Crashes

by Pierre-Yves Ricau

Feb 17 2016

The Square Register Android app has few crashes. Getting there requires a systematic approach: coding defensively, gathering information, measuring impact and improving architecture. In his talk at Øredev Conference 2015, Pierre-Yves Ricau presents concrete steps towards lowering the crash rate, from the general philosophy to the tools they use at Square, together with real crash examples.

Introduction (0:00)

When a mobile app regularly crashes, stops responding, or shows errors, people tend to uninstall it (source). There is a strong emotional connection to crashes. People hate them 😤. Yet developers delay fixing crashes because they prefer to work on new features. For instance, when Twitter announced the ‘like’ feature (they replaced “favorites” with “❤”️), the main reaction was: why are they wasting their time on that? Why don’t they fix the crashes first? It is your responsibility as a developer to fix a crash: it is crucial to the user’s experience.

What is a Crash? (3:26)

Android is a Linux system, where every app is a process. A crash is the process getting killed, and eventually restarted. There are two types of crashes: Java crashes and native crashes.

Java (3:58)

In Java, there is a class called Thread. When something throws an exception, you may or may not catch it. If you do not catch it, it bubbles up to the run method of the thread. If it is not caught there either, it is delegated to the thread’s UncaughtExceptionHandler. Every thread can have its own handler (one per thread). It is an interface that you implement, and the runtime calls into it to say: “got an exception in this thread, and this thread is now dying”. In most cases, however, you never set a handler on the thread itself; the exception is instead delegated to defaultUncaughtExceptionHandler, a static field on the Thread class. See video for an example.
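The dispatch order described above can be observed on a plain JVM. The following is a minimal sketch, not code from the talk: the class name HandlerDispatchDemo and its helper methods are made up for illustration, and only the standard Thread API is used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class HandlerDispatchDemo {
    // Records which handler saw which exception, in order.
    private static final List<String> EVENTS =
            Collections.synchronizedList(new ArrayList<String>());

    /** Starts a thread whose run method throws, then waits for the thread to die. */
    private static void crashOnNewThread(String message, Thread.UncaughtExceptionHandler handler) {
        Thread thread = new Thread(() -> {
            throw new IllegalStateException(message);
        });
        if (handler != null) {
            thread.setUncaughtExceptionHandler(handler);
        }
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    static List<String> demo() {
        EVENTS.clear();
        Thread.UncaughtExceptionHandler previous = Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler(
                (thread, error) -> EVENTS.add("default:" + error.getMessage()));
        try {
            // No per-thread handler: the exception reaches the default handler.
            crashOnNewThread("first", null);
            // Per-thread handler set: it is called instead of the default handler.
            crashOnNewThread("second",
                    (thread, error) -> EVENTS.add("thread:" + error.getMessage()));
        } finally {
            // Restore whatever was installed before the demo.
            Thread.setDefaultUncaughtExceptionHandler(previous);
        }
        return new ArrayList<>(EVENTS);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [default:first, thread:second]
    }
}
```

The second thread never reaches the default handler, which is why crash reporting tools hook the default handler rather than individual threads.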

In the implementation on Android, there is a class called RuntimeInit. Android is a Java program: it has a public static void main method (where the app starts). The first thing it does is set a default exception handler. Briefly, when there is an uncaught exception, it first logs it with logCrashToLogcat(); then it makes a system call to display an error dialog, displayErrorDialog(): “Unfortunately XX has crashed”. When the user selects OK, it tries to kill the process. When users get this error dialog, very few of them hit the report button. Another option is to set your own UncaughtExceptionHandler: you take the exception, serialize it, send it to the cloud, and then delegate to the default handler, which is going to crash the app. You really don’t want to write all of that yourself. Instead, you can use a library. A few examples are:

Crashlytics, from a company that was acquired by Twitter. It’s the easiest one on the market. Great when you are starting, but when you need APIs to report the crash or consume the crashes in a specific way, it is not that great. The UI on the website is terrible.
ACRA: a client library; you implement your own backend.
Bugsnag: better UI and an open source client API.
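For reference, the "take the exception, serialize it, then delegate to the default handler" pattern that such libraries build on can be sketched roughly as follows. This is an illustration under assumptions, not the code of any of these libraries: ReportingHandler and its CrashSink interface are hypothetical names, and a real implementation would also persist the report and upload it later.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class ReportingHandler implements Thread.UncaughtExceptionHandler {
    /** Hypothetical sink; a real app would write to disk and upload on next launch. */
    interface CrashSink {
        void save(String report);
    }

    private final Thread.UncaughtExceptionHandler delegate;
    private final CrashSink sink;

    ReportingHandler(Thread.UncaughtExceptionHandler delegate, CrashSink sink) {
        this.delegate = delegate;
        this.sink = sink;
    }

    /** Wraps whatever default handler is currently installed. */
    static void install(CrashSink sink) {
        Thread.setDefaultUncaughtExceptionHandler(
                new ReportingHandler(Thread.getDefaultUncaughtExceptionHandler(), sink));
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        try {
            // Serialize the stack trace to text.
            StringWriter out = new StringWriter();
            error.printStackTrace(new PrintWriter(out, true));
            sink.save("thread=" + thread.getName() + "\n" + out);
        } catch (Throwable ignored) {
            // Reporting must never mask the original crash.
        } finally {
            // Let the previous handler finish the job (on Android: dialog, then kill).
            if (delegate != null) {
                delegate.uncaughtException(thread, error);
            }
        }
    }
}
```

Delegating in a finally block keeps the platform behaviour intact even if the reporting code itself fails.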

Native crash (9:07)

Sometimes, because it is a Samsung device, you will get a native crash even though you did not do anything.

When there is a native crash, a signal is sent to a signal handler (installed in your C code). Google Breakpad is such a signal handler, and it captures information in a minidump. We make a string from it and create a fake exception. You can also set your own StackTrace on that exception, with arbitrary values. Then, we can upload that exception using Bugsnag, Crashlytics, or ACRA. We can then read the minidump, and use that information to understand what was happening.
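The "fake exception" step can be sketched in plain Java. The names below (NativeCrashException, the frame helper, the library and symbol names in the usage note) are hypothetical and do not come from the talk; the sketch only relies on Throwable.setStackTrace and the public StackTraceElement constructor.

```java
final class NativeCrashException extends RuntimeException {
    /**
     * @param signalName   e.g. the name of the signal that killed the native code
     * @param minidumpPath where the signal handler wrote the minidump
     * @param nativeFrames frames made up from whatever the native side reported
     */
    NativeCrashException(String signalName, String minidumpPath, StackTraceElement[] nativeFrames) {
        super("Native crash (" + signalName + "), minidump: " + minidumpPath);
        // Replace the Java stack trace (which only points at this constructor's
        // caller) with frames that describe the native side.
        setStackTrace(nativeFrames);
    }

    /** Builds a frame from a library name, a symbol and an offset (used as "line number"). */
    static StackTraceElement frame(String library, String symbol, int offset) {
        return new StackTraceElement(library, symbol, library, offset);
    }
}
```

A report built with frame("libexample.so", "render_frame", 4242) then shows up in the crash tool with that frame at the top, so native crashes are listed next to Java ones and can be matched to their minidump.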

StackTrace (10:09)

When we get a crash, the first thing to check is the app version that crashed and the matching sources. On the server side, you get a request that goes through every layer (every frame in the StackTrace is a layer), so you can understand causes and consequences. On the client side, the code is callback based: on Android, the cause was in a previous callback, and the StackTrace gives you no idea what created the current one.

Reproducing (12:17)

It is essential to be able to reproduce the crash. To reproduce a crash, you can call your customer (this only works at a small scale). How do we then channel frustration following a crash, and avoid one ⭐️ reviews?

You could show a custom dialog and ask for information (and maybe provide a link to support; see video for an example). To do that, I started listening to ActivityLifecycle callbacks to see whether there is an activity at the time of the crash. If there is no activity, I kill the process, and it gets restarted when the user returns. If the process crashes and someone brings back the app through recent apps, the process gets restarted and the saved state gets passed into the activity. Magical (you did not see anything; it did not crash). If the user was using the app, you can create an intent to start an activity that shows the dialog to input feedback. You should make sure that the current activity is finished, so that it does not come back. You tell the system to start this activity, then let the process ‘die’, and the system will need to start the activity again. This was inspired by Jake Wharton’s ProcessPhoenix code, and Twitter and Facebook are probably doing something similar.

What Else Can We Do About Crashes?: Static Info (17:39)

We also need to collect static information: for instance, the device type, the OS version, or the SHA of the build (the commit identifier in your code repository), so that you can check out the exact sources and go to the right line.

Current Screen (18:56)

Getting a picture of the screen would be great, but it would also mean uploading a bitmap: it is very heavy, and may show private information. Instead, we can describe the view hierarchy in text. For example, take the signature screen on Square Register (see video): there is a sign view, which contains the relative layout, and you see multiple views and their attributes (but no text, because we do not want private information showing). To find the views, we can look at Espresso, which has a private class called RootsOracle. RootsOracle uses reflection to access all the windows of your process. It has different code paths depending on the Android version, but in each case it lists all the root views of the app. You iterate over every root view, and make a string representation of that view. We have a library for that (soon to be open sourced).
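The idea of a text-only dump can be sketched without Android. In the sketch below, ViewNode is a hypothetical stand-in for android.view.View so that the code runs on a plain JVM, and HierarchyDumper is a made-up name; it is not the library mentioned above. The point it illustrates is that class names and layout attributes are written out while text content is deliberately skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for android.view.View, so the sketch runs on a plain JVM. */
final class ViewNode {
    final String className;
    final int width;
    final int height;
    final boolean visible;
    final String text; // may hold private data, so it is never written to the dump
    final List<ViewNode> children = new ArrayList<>();

    ViewNode(String className, int width, int height, boolean visible, String text) {
        this.className = className;
        this.width = width;
        this.height = height;
        this.visible = visible;
        this.text = text;
    }

    ViewNode add(ViewNode child) {
        children.add(child);
        return this;
    }
}

final class HierarchyDumper {
    /** Returns one line per view, indented by two spaces per nesting level. */
    static String dump(ViewNode root) {
        StringBuilder out = new StringBuilder();
        append(out, root, 0);
        return out.toString();
    }

    private static void append(StringBuilder out, ViewNode node, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.className)
                .append(" { ").append(node.width).append('x').append(node.height)
                .append(node.visible ? ", visible" : ", hidden")
                .append(" }\n");
        for (ViewNode child : node.children) {
            append(out, child, depth + 1);
        }
    }
}
```

On Android, the same traversal would walk ViewGroup children for each root view, and the resulting string would be attached to the crash report.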

History: High Level Log (20:56)

You also need a “flight recorder”: a high level history log. For example, knowing how long an HTTP call took helps you disentangle the origin of the crash (e.g. a timing issue). Also, “Show_Screen” entries give you steps for reproducing the crash. Usually I check the view hierarchy, then I check the log (past activity) and the most recent activity (navigation, HTTP calls). An easy way is to pipe JakeWharton/Timber logs directly into the crash logs (Timber.i("Navigating from %s to %s", outgoingPath, incomingPath)). OkHttp has interceptors, which make it very easy to log all the HTTP calls.
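A flight recorder is essentially a bounded, in-memory log that keeps only the most recent entries and is attached to the crash report. The following is a minimal sketch under that assumption; FlightRecorder is a hypothetical class, not Square's implementation, and wiring it to a logging library or an HTTP interceptor is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class FlightRecorder {
    private final int capacity;
    private final ArrayDeque<String> entries = new ArrayDeque<>();

    FlightRecorder(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds a formatted entry, dropping the oldest one when the buffer is full. */
    synchronized void log(String format, Object... args) {
        if (entries.size() == capacity) {
            entries.removeFirst();
        }
        entries.addLast(String.format(format, args));
    }

    /** Copy of the current entries, oldest first, for attaching to a crash report. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Keeping the buffer small and in memory means logging stays cheap during normal use; the snapshot is only taken when a crash is being reported.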

LeakCanary (22:37)

For a very common crash, OutOfMemoryError, the StackTrace is less useful (as it can happen anywhere). Instead, LeakCanary detects memory leaks early, to prevent out of memory errors later on.

How to Prevent Crashes (23:06)

We’ve talked a lot about how to fix crashes, but what I really want to address is how to prevent crashes. We’ll look more at that below.

Defensive programming (23:20)

There are different meanings to defensive programming. One example: I check the inputs and, if they do not match what I expect, I do nothing and shrug. But if you ignore issues that you could fix, they can manifest later on.

Offensive programming: Crash Fast (25:10)

Do not ignore the problems: make them crash faster. Assert preconditions (e.g. I expect this field to not be blank). When a bad value causes a crash much later, the StackTrace has no direct correlation with the cause; but if you move the checks earlier in the navigation, you catch the problem at an earlier time and have more information to fix it. For example, using a static method, we can say: I expect this to not be blank, otherwise throw an exception and crash. We create more crashes earlier, but we have fewer crashes down the line.
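A static precondition method of the kind described here might look like the following. This is a sketch with made-up names (BlankChecks, checkNotBlank), not the helper used at Square.

```java
final class BlankChecks {
    private BlankChecks() {
    }

    /**
     * Returns the value unchanged if it contains at least one non-whitespace
     * character; otherwise throws immediately, at the point where the bad
     * value enters the code.
     */
    static String checkNotBlank(String value, String name) {
        if (value == null) {
            throw new NullPointerException(name + " must not be null");
        }
        if (value.trim().isEmpty()) {
            throw new IllegalArgumentException(name + " must not be blank");
        }
        return value;
    }
}
```

Because the method returns its argument, it can be used inline in a constructor, for example this.merchantName = BlankChecks.checkNotBlank(merchantName, "merchantName"), so the object can never exist in an invalid state.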

Exception grouping (26:26)

One side effect of using preconditions is unwanted exception grouping: crash reporting tools tend to group exceptions by the top of the StackTrace, so every failure of the same precondition method can end up in one group. To avoid that (see video), we can customize the exception. In the checkNotNull method, I create a new NullPointerException, extract its StackTrace (an array of StackTrace elements), and create a new array that does not have the most recent element, which is the checkNotNull frame itself. I set that array back on the exception, and throw that exception. In the StackTrace, it looks as if the caller of that method is the one that threw the exception. Using the line that calls the precondition is what ungroups the exceptions.
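The stack trace trimming described here can be sketched as follows. The class name CallerPreconditions is hypothetical; the technique only relies on Throwable.getStackTrace and Throwable.setStackTrace, where index 0 of the array is the most recent frame.

```java
import java.util.Arrays;

final class CallerPreconditions {
    private CallerPreconditions() {
    }

    /** Throws a NullPointerException whose top frame is the caller, not this method. */
    static <T> T checkNotNull(T value, String name) {
        if (value == null) {
            NullPointerException error = new NullPointerException(name + " must not be null");
            StackTraceElement[] full = error.getStackTrace();
            // full[0] is this checkNotNull frame; dropping it makes the
            // calling line the top of the stack trace.
            if (full.length > 1) {
                error.setStackTrace(Arrays.copyOfRange(full, 1, full.length));
            }
            throw error;
        }
        return value;
    }
}
```

With the helper frame removed, two failed checks in different classes no longer share the same top frame, so a tool that groups by top frame reports them as separate crashes.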

Integration tests (27:48)

If we have more assertions/preconditions, we also get more crashes, so we need to catch them before release. We can write integration tests using Espresso. We run them on VMs (real devices tend to show false positives). You also need to work on making the build fast: if the tests take forever to run, you never run them, or you run them once a day and the feedback loop is much longer.

Smoke testing (28:41)

Smoke testing and manual QA (ideally, both) are other options to prevent crashes. You could also pay highly skilled people to work on destroying your app, or pay professional testers.

Dogfood / beta (28:37)

You can be a user of your own app. With Square Register that is hard, but I can go and work at a place where they are using my app. Then come betas.

Staged rollout (30:24)

You make a staged rollout to 5% of the people: it is going to crash more for them. For one or two days, you collect the crashes, you scale that up to the expected number of your users, and you get your crash rate. You should always focus on the crashes that are impacting the most people. (Facebook, for example, has a policy of releasing to one country, New Zealand: similar population to the US, but less media impact if issues arise.)

Android crash rate by day (31:45)

We built a tool that shows the crash rate for each release over time. For Square, what we care about is payments, i.e. transactions. When you take a payment, what is the chance that you get a crash? We divide the number of crashes by the number of transactions over that period of time, for that version. That gives an appropriate metric, core to our business.
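The metric itself is a simple ratio computed per version. The sketch below shows one way to accumulate it; CrashRateReport and the version strings in the usage note are hypothetical, and the real tool presumably reads these counts from crash reports and payment data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CrashRateReport {
    // version -> { crashes, transactions }, accumulated over the reporting period
    private final Map<String, long[]> countsByVersion = new LinkedHashMap<>();

    /** Adds one batch of counts (for example, one day) for a given app version. */
    void record(String version, long crashes, long transactions) {
        long[] counts = countsByVersion.computeIfAbsent(version, v -> new long[2]);
        counts[0] += crashes;
        counts[1] += transactions;
    }

    /** Crashes divided by transactions for one version; NaN if there is no data. */
    double crashesPerTransaction(String version) {
        long[] counts = countsByVersion.get(version);
        if (counts == null || counts[1] == 0) {
            return Double.NaN;
        }
        return (double) counts[0] / counts[1];
    }
}
```

For instance, recording 3 crashes over 1,000 transactions and then 1 crash over another 1,000 for the same version gives 4 / 2,000 = 0.002 crashes per transaction, which can then be compared across releases.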

To Sum Up: (32:45)

Reproducing
Static info (customer feedback)
Flight recorder
View hierarchy (current state of the app)
Exceptions, and make sure that everything is in the right state
Staged rollouts and manual testing, to make sure that all these exceptions that you are throwing around never get into the hands of your customers

About the content

This talk was delivered live in November 2015 at Øredev. The video was transcribed by Realm and is published here with the permission of the conference organizers.

Pierre-Yves Ricau

Pierre-Yves Ricau is an Android Baker at Square, working on the Square Register. He formerly worked at Siine in Barcelona. He also enjoys good wine and low entropy code.
