On Thursday, November 26th, one of Woopra’s non-critical sub-systems, which is responsible for making user data accessible to the real-time triggers and actions engine, suffered a failure that caused outages in the Woopra website and web app. All tracking data that made it to the Woopra servers was stored and queued, but many Woopra users found that data was not displayed properly in the Woopra web app, and that analytics reports were failing to load or were incorrect. This post-mortem describes, at a high level, the failure, the solution, and the path from here back to peak operation.
The failed subsystem is a cluster of data nodes that holds a secondary copy of user profile data in a data structure that allows for fast real-time queries on individual users. These queries determine whether a user should be added to a label, or if a trigger should be fired, etc.
While the profiles system is integral to Woopra users’ experience of their data, and to real-time triggers in Woopra, the system is not mission-critical because it is not trusted to be the final source of truth for the data. This means that even under a total failure of this system (which did not occur), no data is lost: the system can be replaced and profiles can be rebuilt using the tracking data from the tracking database, the much more stable and reliable critical system that is the source of truth.
The profiles system crashed due to a confluence of events. First, a bug in the database software exposed a memory leak under high load, as well as during data rebalancing. Second, an expected spike in traffic over Thanksgiving week put the system under higher load. Finally, the failure of one node triggered rebalancing while the system was already under high load, setting off a chain reaction of further node failures as the load on each remaining node grew.
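To make the chain reaction concrete, here is a minimal sketch of how load per node grows as nodes drop out. It is a deliberately simplified model, not a description of our actual cluster: the function name `simulate_cascade`, the even load split, and every number used are assumptions for illustration, and it ignores the memory leak and rebalancing cost entirely.

```python
def simulate_cascade(total_load, node_count, capacity_per_node):
    """Return (alive_nodes, load_per_node) for each step of a cascade.

    Hypothetical model: load is split evenly across surviving nodes, and
    one node fails whenever the per-node load exceeds its capacity.
    """
    history = []
    alive = node_count
    while alive > 0:
        per_node = total_load / alive
        history.append((alive, per_node))
        if per_node <= capacity_per_node:
            break      # the surviving nodes can absorb the load
        alive -= 1     # an overloaded node fails; its share is redistributed
    return history


if __name__ == "__main__":
    # Illustrative numbers only: a cluster that is stable with 10 nodes
    # collapses completely once it starts with 9.
    print(simulate_cascade(950, 10, 100))
    print(simulate_cascade(950, 9, 100))
```

With these made-up numbers, ten nodes carry the load comfortably, but with nine the per-node load never drops back under capacity, so each failure makes the next one more likely.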
Woopra’s engineering team has been hard at work this last week upgrading the profiles database, and designing a more robust architecture for the whole profiles sub-system. The solution will contain two parts.
First, and more simply, the database software has been upgraded to a more stable and reliable version, and the cluster itself has been scaled up so that it can better handle spikes in traffic. Additionally, node operating system settings have been fine-tuned to allow for more graceful slowdowns.
The second part of the solution introduces a new module to the profiles sub-system. This new module is a caching layer that sits in front of the profiles database. Caching the most accessed data prevents the same query from running twice on the database system when the results would be the same. This has the effect of smoothing out spikes in traffic so that the database system itself does not feel these spikes as acutely. Also, latency is greatly reduced because the caching system can respond much faster than the database itself.
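The idea behind the caching layer can be sketched as a small read-through cache with a time-to-live. This is only an illustration of the general technique, not our implementation: the class name `ProfileCache`, the `fetch_from_db` callable, and the TTL value are all hypothetical, and a production cache would also bound its size and handle invalidation.

```python
import time


class ProfileCache:
    """Hypothetical read-through cache in front of a slower profiles store."""

    def __init__(self, fetch_from_db, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch_from_db   # assumed callable: user_id -> profile
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}            # user_id -> (expires_at, profile)
        self.db_calls = 0             # counts queries that reached the database

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]           # fresh hit: the database is not touched
        # Miss or expired entry: query the database once and remember the result.
        self.db_calls += 1
        profile = self._fetch(user_id)
        self._entries[user_id] = (now + self._ttl, profile)
        return profile


if __name__ == "__main__":
    cache = ProfileCache(lambda user_id: {"id": user_id})
    for _ in range(1000):
        cache.get("visitor-1")
    print(cache.db_calls)  # repeated lookups within the TTL reach the database once
```

The trade-off in a design like this is freshness: a longer TTL shields the database more but serves slightly older profile data.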
The tracking system, while connected to the profiles system, is not dependent on it. For a short period, from the beginning of the chain reaction in the profiles database until the profiles system crashed entirely and the backup system came online, the tracking servers themselves may have been briefly overloaded by the number of hanging requests to the struggling profiles system combined with the high load of incoming traffic.
Despite this, another failure recovery and mitigation system, in place to prevent any loss of data in the event of a primary system failure, began queueing failed or questionably successful tracking requests. This system has shown no evidence of any problem. When the new profiles subsystem came online, alongside the data migration and recovery, the Woopra system began digesting the queued requests from this secondary failure recovery queue. This system performed well, mitigating most, and potentially all, loss of data.
While it is possible that all of this caused network-level and operating-system-level queues to overflow, permanently dropping some tracking requests, we have not yet been able to find any examples or evidence of this occurring.
It is also possible that the system slowdown aggravated an ever-present problem with client-side web tracking. If a user leaves a page before the TCP connection for an HTTP request is established with our servers, the browser will kill the request, and it will not reach our servers. While our servers were under heavy load, the slowdown likely exacerbated this problem by causing requests to take a little longer to establish a connection, widening the window in which a visitor could navigate away from the page and cause the browser to kill the request. This would have occurred between the beginning of the slowdown and the time that the backup system began queueing requests.
The only aspect of Woopra known to have been lost during the outage is the set of real-time AppConnect triggers and label actions that would have occurred between the time the system came back online in safe mode and the time it became stable enough to exit safe mode. This occurred because safe mode stopped the emergency enqueuing of incoming tracking requests and resumed accepting them in a way that ensures we get the tracking data while allowing the profiles system to recover, which means that certain actions and triggers that rely on the profiles system were lost during this time. The system was in safe mode for approximately 24 hours.
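The queue-and-replay pattern described above can be sketched roughly as follows. This is a simplified in-memory illustration under assumed names (`RecoveryQueue`, `deliver`), not the actual recovery system, which would need durable storage to survive restarts.

```python
from collections import deque


class RecoveryQueue:
    """Hypothetical fallback queue: hold requests the primary path rejected."""

    def __init__(self, deliver):
        self._deliver = deliver   # assumed callable that raises on failure
        self._pending = deque()

    def submit(self, request):
        """Try the primary path; queue the request if delivery fails."""
        try:
            self._deliver(request)
            return True
        except Exception:
            self._pending.append(request)
            return False

    def replay(self):
        """Re-deliver queued requests in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._deliver(self._pending[0])
            except Exception:
                break             # primary is still unhealthy, retry later
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

Because "questionably successful" requests are queued too, a replay like this can deliver the same request twice, so the consumer has to tolerate duplicates.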
The process of getting back up to full operation entails assembling the new profiles database cluster, migrating the profiles data to the new database, and implementing the caching layer system.
As of this writing, the new cluster is up and running, and we are about 75% of the way through migrating and rebuilding the profiles data. This migration process should last a little over a week in total, so we expect the profiles system to be fully operational by Thursday or Friday this week (December 4th or 5th).
While the migration is in progress, Woopra users can expect occasional delays in triggers and other actions firing in the system, as well as some missing data or inconsistencies caused by duplicates coming from other parts of our system and from retry attempts made by your system. Rest assured that these inconsistencies will be resolved automatically by the Woopra system as it gets closer to having all data, and as it runs cleanups after the migration has completed.
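As a rough picture of what a cleanup of duplicates can look like, the sketch below drops repeated events by a unique identifier while preserving order. The `event_id` field and the `deduplicate` function are assumptions made for this example; they are not the names or the method used by the Woopra system.

```python
def deduplicate(events):
    """Keep the first occurrence of each event, preserving arrival order.

    Assumes every event is a dict carrying a unique "event_id" key
    (a hypothetical field used only for this illustration).
    """
    seen = set()
    unique = []
    for event in events:
        key = event["event_id"]
        if key in seen:
            continue   # a retry or replayed copy of an event already kept
        seen.add(key)
        unique.append(event)
    return unique
```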
Once this process is complete and verified, the Woopra engineering team will begin testing and deploying the new caching layer. We do not anticipate this process having any effect on Woopra users.
We apologize for the inconvenience this has caused. If you have any questions or concerns, our team is here to help at email@example.com.
The post Post-mortem of Profiles sub-system failure of November 2015 appeared first on Woopra.