On 2/23/2017 at 12:08pm Pacific Time (PT), TokBox began deploying a new version of our API server codebase, starting with one server in the SJC datacenter. The new API server code generated and stored session data for session IDs in a manner that was not backward compatible with the servers that had not yet been updated. This resulted in a service disruption for some partner applications.
Failures occurred when the API call from the first client to connect to a session hit the updated API server, and subsequent attempts by other clients to join that session hit a non-updated server, which could not handle the session data the updated server had stored.
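The failure mode described above can be illustrated with a minimal sketch. The data formats, field names, and functions below are hypothetical and chosen only for illustration; they do not reflect TokBox's actual session data or server code.

```python
import json

# Hypothetical session-data formats; neither is TokBox's real format.
def write_v1(session_id):
    # Format written by non-updated servers.
    return json.dumps({"id": session_id, "media_server": "ms-1"})

def write_v2(session_id):
    # Format written by the updated server: the field is now nested.
    return json.dumps({"id": session_id, "routing": {"media_server": "ms-1"}})

def read_v1(raw):
    # Non-updated reader: raises KeyError on v2 data.
    return json.loads(raw)["media_server"]

def read_v2(raw):
    # Updated reader: understands only the new layout in this sketch.
    return json.loads(raw)["routing"]["media_server"]

store = {}
store["s1"] = write_v2("s1")   # first client's call hits the updated server

try:
    read_v1(store["s1"])       # a later client's call hits a non-updated server
    joined = True
except KeyError:
    joined = False             # the second client cannot connect
```

In this sketch the session is readable only by the server version that wrote it, which is why the outcome depended on which server each client's request happened to reach.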
The following is a timeline of what occurred:
At 12:08 PT, TokBox pushed the new API server codebase to one API server instance in the SJC datacenter.
At 1:05 PT, the decision was made to roll back the server instance codebase due to rising error rates. New sessions generated after this time would not have experienced the connection problem.
At 2:10 PT, TokBox pushed a compatibility fix for all session data. All server instances now interoperate correctly, regardless of which server generated the session data.
Failure Windows Experienced:
After TokBox pushed the new API server codebase and before it was rolled back (12:08 PT to 1:05 PT), customers may have experienced a rising rate of client errors, with client connection error code 1026.
After the rollback and before the compatibility fix (1:05 PT to 2:10 PT), newly generated sessions were successful. Sessions that had already been generated between 12:08 PT and 1:05 PT would still have had issues with clients connecting.
Future Mitigation of the Issue:
To mitigate issues caused by rolling deployments to our API servers, TokBox will add server capacity to its testing environment to better simulate rolling deployments. TokBox will also extend test coverage to more thoroughly test backward compatibility during deployments.
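As a rough illustration of what a backward-compatibility check during a rolling deployment could look like, the sketch below tries every combination of writing and reading server version and reports the pairs that fail. All names and formats here (WRITERS, strict_old_reader, tolerant_reader, incompatible_pairs, the "host" field) are hypothetical assumptions, not TokBox's test suite or data model.

```python
import itertools
import json

# Hypothetical writers, one per server version present during a rollout.
WRITERS = {
    "old": lambda sid: json.dumps({"id": sid, "host": "ms-1"}),
    "new": lambda sid: json.dumps({"id": sid, "routing": {"host": "ms-1"}}),
}

def strict_old_reader(raw):
    # Understands only the old layout.
    return json.loads(raw)["host"]

def tolerant_reader(raw):
    # Accepts either layout, whichever version wrote the data.
    data = json.loads(raw)
    if "routing" in data:
        return data["routing"]["host"]
    return data["host"]

def incompatible_pairs(readers):
    """Return (writer_version, reader_version) pairs where reading fails."""
    failures = []
    pairs = itertools.product(WRITERS.items(), readers.items())
    for (w_name, write), (r_name, read) in pairs:
        try:
            ok = read(write("s1")) == "ms-1"
        except (KeyError, TypeError):
            ok = False
        if not ok:
            failures.append((w_name, r_name))
    return failures
```

Calling incompatible_pairs({"old": strict_old_reader, "new": tolerant_reader}) flags the pair where a new server writes and an old server reads, which is the mixed-version state that exists partway through a rolling deployment. Passing tolerant_reader for both versions returns an empty list.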
If you have any further questions about this incident, please email firstname.lastname@example.org.