Android missing data investigation and fixes
We have determined that most of the missing Android data is from periods when the app's BackgroundService stopped running. During testing, we frequently observed the BackgroundService stop running and fail to restart, and we saw that coincide with periods of missing data.
Most of the missing data files come from periods when there are no data files of any type: no accelerometer data, no app log, no regular WiFi logs, etc. This pattern indicates that for those periods, the entire app wasn't running. Furthermore, at the end of a period of missing data, the data usually resume for many datastreams all at once (in the same second), and that usually coincides with the statement `started with flag 0` in the App Log file, which indicates that the BackgroundService just received a signal to start.
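The pattern described above can be checked mechanically. The following is a minimal sketch under stated assumptions, not the analysis that was actually run: each data file is reduced to its creation time in epoch seconds, the timestamps of all datastreams are merged into one array, and `GapFinder`, `findGaps`, and the two-hour threshold are hypothetical names and values introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GapFinder {
    /**
     * Returns {start, end} pairs (epoch seconds) for every stretch longer than
     * minGapSeconds in which no file of any datastream was created.
     */
    static List<long[]> findGaps(long[] fileTimestamps, long minGapSeconds) {
        long[] sorted = fileTimestamps.clone();
        Arrays.sort(sorted);
        List<long[]> gaps = new ArrayList<>();
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] - sorted[i - 1] > minGapSeconds) {
                gaps.add(new long[] {sorted[i - 1], sorted[i]});
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Two files share timestamp 30000: several datastreams resuming in the same second.
        long[] timestamps = {0, 3600, 7200, 30000, 30000, 33600};
        for (long[] gap : findGaps(timestamps, 2 * 3600)) {
            System.out.println("no files between " + gap[0] + " and " + gap[1]);
        }
    }
}
```

A gap that ends with files from several datastreams in the same second matches the "whole app was not running" signature; a gap in only one datastream would not show up in the merged array at all.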
- The app crashes. We had previously built a Crash Handler with code to restart the app if it crashed, but there was a bug in the way we were telling it to restart, so it wasn't working. We fixed the Crash Handler's restart code in commit 1a6c0e8, app version 2.3.0.
- The user closes the app using the App Switcher. When the user closed the Beiwe app from the phone's App Switcher, the BackgroundService was killed. In `BackgroundService.java`, we were already calling `restartService()` inside `onTaskRemoved()`, but that wasn't working until we fixed the restart code in the Crash Handler (described above).
- The operating system kills the app. To solve this:
  - We started the app with the `START_STICKY` flag, which tells the operating system to try to restart the app if it has killed it (documentation link). We did this in commit ebe91f5, app version 2.3.3.
  - We considered making the app a Foreground Service, because Foreground Services don't get killed when their apps are killed from the task switcher. We decided against doing this, because Foreground Services need to display an extra notification in the Notification Bar while they're running. (Foreground Services are mostly used for apps that play audio or make phone calls; the purpose of the notification is to allow you to control the audio player or phone app even when you've switched away from it; see this documentation link.)
- We added more Intent triggers to BootListener to restart the app, including `SMS_RECEIVED` and `NEW_OUTGOING_CALL`, meaning that any time the phone receives an SMS or initiates an outgoing phone call, it wakes up the Beiwe BackgroundService if it's not already running. We did this in commit b559862, app version 2.4.0.
- We added a repeating timer that goes off every two minutes and tells the BackgroundService to restart if it has stopped. We did this in commit ebe91f5, app version 2.3.3. We believe there’s no problem with sending a start signal to the BackgroundService as frequently as we please. If the BackgroundService is already running, calling `startService()` will not stop and restart the service. According to the documentation for the `startService` command, "If this service is not already running, it will be instantiated and started (creating a process for it if needed); if it is running then it remains running."

  The restart alarm persists, even when the app crashes or isn’t running. With the phone connected to a debugger, we used the command `adb shell dumpsys alarm | grep beiwe` to see the alarms registered for the Beiwe app. Alarms remained registered through app crashes, App Not Responding (ANR) errors, and cases where we killed the app from the task switcher with the app’s automatic restart code disabled. We weren’t able to test the case where the operating system kills the app, because the timing is unpredictable and it often takes several hours to occur.
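The restart strategy above relies on the start signal being idempotent: sending it to a running service changes nothing, and sending it to a stopped service brings it back. The sketch below models that idea in plain Java rather than with the Android APIs. `FakeService` and `watchdogTick` are hypothetical stand-ins for the BackgroundService and the two-minute alarm, and none of this is the app's actual code.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class RestartWatchdogSketch {
    /** Hypothetical stand-in for a service whose start signal is idempotent. */
    static final class FakeService {
        private final AtomicBoolean running = new AtomicBoolean(false);
        private final AtomicInteger timesCreated = new AtomicInteger(0);

        /** Creates the service only if it is not already running. */
        void start() {
            if (running.compareAndSet(false, true)) {
                timesCreated.incrementAndGet();
            }
        }

        void kill() { running.set(false); }
        boolean isRunning() { return running.get(); }
        int timesCreated() { return timesCreated.get(); }
    }

    /** What the repeating alarm does each time it fires: send a start signal. */
    static void watchdogTick(FakeService service) {
        service.start();
    }

    /** Simulates four alarm firings with one kill in between; returns how often the service was created. */
    static int simulate() {
        FakeService service = new FakeService();
        watchdogTick(service); // first firing creates the service
        watchdogTick(service); // already running: nothing happens
        service.kill();        // e.g. the operating system kills the app
        watchdogTick(service); // next firing brings it back
        watchdogTick(service); // running again: nothing happens
        return service.timesCreated();
    }

    public static void main(String[] args) {
        System.out.println("service created " + simulate() + " times");
    }
}
```

Because redundant start signals are harmless in this model, the timer never needs to know whether the service is alive; it only needs to keep firing.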
- Survey Answers Files: We found and fixed one issue on an old Android 4 phone where a Survey Answers file wasn’t being written properly for one survey on a study. The app was crashing because it assumed that a variable was a String when it wasn’t; we changed the code to force it to be a String. This issue was probably not widespread. It was apparent to the user: when the user submitted other surveys, their notifications went away, but when the user submitted this survey, the notification persisted. This is beiwe-android issue #31, and it should be fixed in commit 336f51a, app version 2.3.2.
- Audio Files: We’ve found stack traces in which the app crashed while trying to write an audio recording file. These are beiwe-android issue #22 (which we fixed) and issue #33 (which we didn't fix). We don't know if either of these issues caused audio recording files to fail to be written.
- Sometimes the app fails to write data because the phone has no free storage space. We’ve received several crash reports with the error message `write failed: ENOSPC (No space left on device)` (beiwe-android issue #32). There's no way for us to comprehensively fix this issue, because handling the out-of-storage scenario is complicated and unpredictable.
During our testing, we did not observe any files get stuck on the phone and fail to upload to the server. However, we did find and fix several code problems that could theoretically have caused this to happen:
- When the phone joins a WiFi network, we made it immediately start uploading files: previously, the Android app only uploaded data files when a repeating timer told it to do so. This could theoretically result in files getting stuck on the phone if the repeating timer was set to a low frequency and the phone wasn’t connected to WiFi very often. We supplemented the repeating timer by adding a listener so that when the app connects to a WiFi network, it should immediately try to upload data files. This should make file upload more reliable.
- Synchronized the file upload function: while examining the code, we realized that the function that uploads data files can be called multiple times simultaneously. This would be likelier to happen if the file upload frequency was set to an interval shorter than one hour. We don’t know what could have gone wrong if the file upload function tried to upload the same file twice, although it probably wouldn’t have resulted in any data being overwritten. Nevertheless, we thought it safest to add Java’s `synchronized` keyword to the file upload function, which prevents two calls of the function from running at the same time. Now, if the function is called twice, the first run must complete before the second run can start.
- Redid file upload timeout logic: the file upload function has a built-in timeout to keep it from running for too long. Previously, this timeout was set to be the file upload frequency (which is customizable for each study) minus 2.5 seconds, the idea being to prevent one call of the file upload function from overlapping with the next call. However, if the file upload frequency was set to a very short interval, like 10 seconds, each call of the file upload function would time out after 7.5 seconds. If a large file took more than 7.5 seconds to upload, it would never actually be uploaded to the server, and it would remain stuck on the phone. After adding the Java `synchronized` keyword to the file upload function, we no longer needed to prevent the file upload function from being called while it was still running, since even if it’s called, it shouldn’t actually start until the previous call has finished. Therefore, we hardcoded the file upload timeout to one hour, which should be long enough for the largest data files on the slowest WiFi connections. We also added functionality to log errors in both Sentry and the Android App Log file if the app ever does hit the upload timeout.
All three of these fixes are in commit 971312b, released with version 2.3.3 of the Android app.
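To make the synchronization and timeout changes concrete, here is a minimal sketch in plain Java. It is not the app's upload code: `UploadSketch`, `uploadAll`, and the helper names are hypothetical, and the upload itself is replaced by a short sleep. The only values taken from the text are the 2.5-second margin and the one-hour timeout.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class UploadSketch {
    static final long ONE_HOUR_MS = 60L * 60L * 1000L;

    /** Old rule: timeout = upload interval minus 2.5 seconds. */
    static long oldTimeoutMs(long uploadIntervalMs) {
        return uploadIntervalMs - 2500L;
    }

    /** New rule: a fixed one-hour timeout, independent of the upload interval. */
    static long newTimeoutMs() {
        return ONE_HOUR_MS;
    }

    private final AtomicInteger active = new AtomicInteger(0);
    private final AtomicInteger maxActive = new AtomicInteger(0);

    /** synchronized: a second caller on the same object blocks until the first call returns. */
    synchronized void uploadAll() {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(50); // placeholder for the real upload work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            active.decrementAndGet();
        }
    }

    /** Calls uploadAll() from two threads at once; returns the peak number of overlapping calls. */
    static int peakConcurrentUploads() {
        UploadSketch uploader = new UploadSketch();
        Thread a = new Thread(uploader::uploadAll);
        Thread b = new Thread(uploader::uploadAll);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return uploader.maxActive.get();
    }

    public static void main(String[] args) {
        System.out.println("old timeout at a 10 s interval: " + oldTimeoutMs(10_000L) + " ms");
        System.out.println("new timeout: " + newTimeoutMs() + " ms");
        System.out.println("peak overlapping uploads: " + peakConcurrentUploads());
    }
}
```

Note that a `synchronized` instance method only serializes calls made on the same object; the sketch assumes a single shared uploader, which may or may not match how the app invokes its upload function.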
We discovered that the Android app's survey notifications weren’t reliably appearing because the BackgroundService wasn’t running. In our testing, we found that if a survey notification alarm fires when the BackgroundService isn't running, the app doesn't get woken up to handle it, and therefore the app doesn't make a notification appear. Our testing showed that missed survey notifications coincided with periods when the BackgroundService was not running. The next time the app does wake up, it looks at survey notification schedules and it shows any notifications that it missed showing (we observed the app reliably do this during testing).
(FYI, when the Beiwe Android app stops running, any survey notifications that were already showing in the Notification Bar at the top of the screen will continue to show. The Beiwe app tells the Android operating system to display a notification, but once that notification exists, the operating system maintains the notification until the phone shuts down, or until the Beiwe app tells the operating system to remove the notification.)
Our fixes to make the BackgroundService run more consistently (described above) should make Android survey notifications appear reliably. In our tests after making the BackgroundService run more consistently, survey notifications have always appeared close to their scheduled times, at most 4 minutes late.
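The catch-up behavior described above (showing missed notifications the next time the app wakes up) can be pictured as follows. This is a sketch under assumptions, not the app's scheduling code: `MissedNotificationSketch` and its method are hypothetical, and survey alarms are reduced to plain epoch-second timestamps.

```java
import java.util.ArrayList;
import java.util.List;

final class MissedNotificationSketch {
    /**
     * Returns the scheduled times that fell while the service was down:
     * later than lastAliveAt and no later than wokeUpAt.
     */
    static List<Long> missedWhileStopped(long[] scheduledTimes, long lastAliveAt, long wokeUpAt) {
        List<Long> missed = new ArrayList<>();
        for (long scheduled : scheduledTimes) {
            if (scheduled > lastAliveAt && scheduled <= wokeUpAt) {
                missed.add(scheduled);
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        long[] schedule = {100, 200, 300, 400};
        // The service stopped after t=150 and woke up again at t=320.
        System.out.println(missedWhileStopped(schedule, 150, 320));
    }
}
```

In this model, the delay before a missed notification appears is bounded by how long the service stays down, which is why waking the service more often also makes notifications more punctual.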
We considered making survey notification alarms themselves wake up the BackgroundService if it's not running, but we decided against it. In order to have an alarm wake up the BackgroundService, we would have to register it with a hard-coded ID in the Android Manifest file. Furthermore, every alarm needs a unique ID, which means it would be difficult to create multiple survey alarms. We thought of two implementation options: either have one alarm schedule all surveys, or create a fixed number of unique IDs that could be allocated to survey alarms. Both solutions would be complicated and brittle, so we decided to leave the architecture as-is, and not make survey notification alarms wake up the BackgroundService. Instead, we'll rely on other features to keep the BackgroundService running and to wake it up more frequently if it dies.
We also went through several other theories about why survey notifications weren't reliably coming through at the scheduled times, including:
- Errors in the survey scheduling code. But we inspected and tested the survey scheduling logic on the Android app, and we believe it works correctly.
- Issues caused by survey branching logic. We've looked, but we haven't found any way that survey branching logic could affect the survey schedules. Survey branching logic only seems to affect the app when the survey is open, which at least in the Android app is after and separate from the notification scheduling logic.
- Surveys getting rescheduled too frequently. We briefly considered that one of the problems might be the phone downloading new surveys from the server too frequently, and overwriting previously-scheduled survey notification alarms. But after inspecting the code, we verified that when the app downloads surveys from the server, it checks whether the survey schedules have changed, and it only updates the scheduled survey notifications if the schedules are different from the last time the app downloaded surveys.
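The check described in the last item can be sketched like this. The class and method names are hypothetical, and the schedule is simplified to a list of strings; the app's real data model is not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ScheduleUpdateSketch {
    private final Map<String, List<String>> schedulesBySurveyId = new HashMap<>();
    private int reschedules = 0;

    /** Stores the downloaded schedule and reschedules alarms only if it changed. */
    boolean onSurveyDownloaded(String surveyId, List<String> downloadedSchedule) {
        List<String> previous = schedulesBySurveyId.get(surveyId);
        if (downloadedSchedule.equals(previous)) {
            return false; // unchanged: leave the existing alarms alone
        }
        schedulesBySurveyId.put(surveyId, downloadedSchedule);
        reschedules++; // placeholder for re-registering the notification alarms
        return true;
    }

    int reschedules() { return reschedules; }

    /** Downloads the same survey three times; only the first and the changed one reschedule. */
    static int simulate() {
        ScheduleUpdateSketch app = new ScheduleUpdateSketch();
        app.onSurveyDownloaded("survey-1", Arrays.asList("MON 09:00", "THU 09:00"));
        app.onSurveyDownloaded("survey-1", Arrays.asList("MON 09:00", "THU 09:00")); // no change
        app.onSurveyDownloaded("survey-1", Arrays.asList("MON 10:00", "THU 09:00")); // changed
        return app.reschedules();
    }

    public static void main(String[] args) {
        System.out.println("alarms rescheduled " + simulate() + " times");
    }
}
```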
We have not figured out any cause of the problems with iOS survey notifications not appearing on time.
- 0.003% of files are corrupt. Over a 2-month period when 2 million data files were uploaded, we have records of 60 files being impossible to decrypt or crashing the batching/indexing code. These 60 files are from both Android and iOS phones. Because the number of files is so small, we don't think this is a major cause of missing data.
  - Some of these corrupt files were created on phones that ran out of storage space. For one user, we got the crash report `write failed: ENOSPC (No space left on device)` several times over a period of a few hours, and several files created during that period were impossible to decrypt. We investigated the Android file-writing code, and found out that when the phone ran out of storage space, it was possible to create a file that didn't include a decryption key. We fixed the Android code so that if it can't write the decryption key, it won't create a file without a key; instead, it keeps trying to create the file until it successfully writes the key. We made this change in commit e092597, app version 2.3.3.
  - Some of these corrupt files are iOS survey timings files. There may be some problem with how those files get written, but we're not sure what it is.
- Sometimes Survey Timings data is in the wrong folder. We found the root cause of this bug, but decided it would be too much work to fix right now. This affects only Survey Timings data and no other data stream, because Survey Timings is the only data that is both grouped into folders by survey ID and batched into hour-long files. This bug doesn't cause data loss; it just puts Survey Timings data into the wrong files. We made some abortive attempts at fixing it and left them on the `surveytimeings_interleaveing_fixes` branch in case we decide to resume work on it in the future.
- Sometimes there are more Survey Timings files than Survey Answers files. We've received reports of this, although we haven't observed it directly. There are two reasons why this could happen:
  - If someone begins taking a survey but doesn't submit it, that should create Survey Timings data but no Survey Answers data.
  - The Android app only generates Survey Timings data when users take questionnaire surveys, but we believe that the iOS app also generates Survey Timings data when a user takes an audio survey.
- We fixed several other bugs that may have either been crashing the BackgroundService or preventing the app from writing one type of data or another:
  - A permissions issue that sometimes crashed the app when trying to record the WiFi log: https://github.com/onnela-lab/beiwe-android/issues/34
  - Two separate issues that sometimes crashed the app during registration: https://github.com/onnela-lab/beiwe-android/issues/29 and https://github.com/onnela-lab/beiwe-android/issues/30
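Returning to the out-of-storage fix described earlier in this list (never leaving behind a file that lacks its decryption key), the idea can be sketched as below. This is an illustration under assumptions, not the code from the commit: `KeyedFileSketch`, `KeyWriter`, and `createWithKey` are hypothetical names, the key is a placeholder string, and the sketch gives up after a fixed number of attempts, whereas the text describes retrying until the key is written.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class KeyedFileSketch {
    /** Hypothetical hook so that a full disk can be simulated. */
    interface KeyWriter {
        void writeKey(Path file, byte[] keyLine) throws IOException;
    }

    /**
     * Tries to create a file whose first line is the key. A failed attempt
     * deletes whatever was written, so a key-less file is never left behind.
     */
    static boolean createWithKey(Path file, String key, int maxAttempts, KeyWriter writer) {
        byte[] keyLine = (key + "\n").getBytes(StandardCharsets.UTF_8);
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                writer.writeKey(file, keyLine);
                return true;
            } catch (IOException writeFailed) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // nothing more to do in this sketch
                }
            }
        }
        return false;
    }

    /** Simulates two ENOSPC failures followed by success; returns the final file content. */
    static String simulate() {
        try {
            Path file = Files.createTempFile("keyed", ".csv");
            int[] calls = {0};
            boolean created = createWithKey(file, "KEY", 5, (target, keyLine) -> {
                Files.write(target, new byte[0]); // an empty file appears first
                if (++calls[0] < 3) {
                    throw new IOException("write failed: ENOSPC (No space left on device)");
                }
                Files.write(target, keyLine);
            });
            String content = created
                    ? new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                    : null;
            Files.deleteIfExists(file);
            return content;
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("file content after retries: " + simulate());
    }
}
```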