Uptime and recovery from outages are critical to our success as a company. Our customers expect the highest availability possible, and our policies and actions are oriented around ensuring the highest possible uptime and availability for customer data.
In the case of a serious production or security issue, we follow our internal on-call escalation procedures, with a target response time of 15 minutes.
For any information security issues, in addition to escalating to the operations person on call, the issue will be immediately escalated to the CEO or Head of Engineering.
After any serious security or production issue, the DevOps team will conduct a post-mortem of the issue, which the on-call and engineering staff will review. Any results of the review will be shared with the customer in question if the problem related to customer data.
Please see the Information Security Policy for more details.
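The escalation rules above can be sketched in a few lines of Python. This is only an illustration of the routing logic, not our actual paging tooling: the `Incident` type, the function names, and the role labels are hypothetical, and only the 15-minute target and the escalation recipients come from this policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Target response time stated in this policy.
TARGET_RESPONSE = timedelta(minutes=15)


@dataclass
class Incident:
    """Hypothetical record of a serious production or security issue."""
    summary: str
    is_security: bool
    reported_at: datetime


def escalation_targets(incident: Incident) -> list[str]:
    """Return who is notified. Role labels are illustrative, not real pager IDs."""
    targets = ["operations on-call"]
    if incident.is_security:
        # Security issues are also escalated immediately to executive staff.
        targets.append("CEO or Head of Engineering")
    return targets


def response_deadline(incident: Incident) -> datetime:
    """Latest time at which a first response still meets the 15-minute target."""
    return incident.reported_at + TARGET_RESPONSE
```

For example, a security incident reported at 12:00 would be routed to both the operations on-call and the CEO or Head of Engineering, with a response deadline of 12:15.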
Restoring Customer Data
Quip maintains multiple off-site backups of all user data. We maintain both incremental backups (no more than 5 minutes old) and daily snapshots.
See Backup Policy for more information on backup policies and the availability of customer data.
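As a rough illustration of the backup age limits described above, the check below flags a backup that is older than the policy allows. It is a minimal sketch, not our monitoring code: the function name and the `kind` labels are hypothetical, and treating "daily" as "at most 24 hours old" is an assumption.

```python
from datetime import datetime, timedelta

# Maximum allowed age per backup kind.
MAX_AGE = {
    "incremental": timedelta(minutes=5),  # stated in this policy
    "snapshot": timedelta(days=1),        # assumption: "daily" read as <= 24 hours
}


def backup_is_fresh(kind: str, taken_at: datetime, now: datetime) -> bool:
    """Return True if a backup of the given kind is within its allowed age."""
    if kind not in MAX_AGE:
        raise ValueError(f"unknown backup kind: {kind!r}")
    return now - taken_at <= MAX_AGE[kind]
```

Under these limits, restoring from the most recent incremental backup would lose at most about 5 minutes of data.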
The entire site can be restored to a serving state from any backup (incremental or snapshot).
To handle serving outages and disasters, we have implemented a multi-tier system:
1. All frontends on the system have multiple replicas, so no single frontend can cause the site to be unavailable.
2. All databases are sharded across multiple instances, so no single customer or data issue can cause the site to be unavailable to all customers (i.e., isolation).
3. All significant reads on the system are transactionally consistent failover reads to our slave databases held in another datacenter, meaning that a short-term serving issue in a single datacenter does not affect access to data.
4. All databases are Multi-AZ RDS databases, which automatically fail over if an availability zone becomes unavailable or suffers an outage.
5. In the case of a failure of RDS across multiple availability zones, rendering the RDS service unavailable, we will manually fail over to Quip's own slave replica servers, which are independent instances and can withstand a complete outage of our normal RDS system.
6. In the case of a catastrophic failure of both RDS and our own independent slave instances, we will restore from a snapshot or incremental backup to a new series of database servers.
In the above disaster recovery system, steps 1 through 5 can be accomplished in less than 10 minutes. If all 5 of those layers fail, the final step (step 6) can be accomplished in approximately 1 hour.
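The layered fallback above can be sketched as a simple lookup: given the layers that have already failed, find the first remaining layer and its recovery estimate. This is an illustrative model only. The function name and the short layer labels are hypothetical, the layers are numbered in the order listed above, and the estimates are the upper bounds quoted in this section.

```python
# The six layers, in the order they are listed in this policy.
LAYERS = [
    "frontend replicas",
    "database sharding",
    "failover reads from slave databases in another datacenter",
    "Multi-AZ RDS automatic failover",
    "manual failover to independent slave replicas",
    "restore from snapshot or incremental backup",
]


def next_recovery_step(failed_layers: set[int]) -> tuple[int, str, int]:
    """Return (step number, description, estimated minutes) for the first
    layer that has not failed. Steps 1-5 use the 10-minute bound from the
    policy; step 6 uses the approximate 1-hour figure."""
    for number, description in enumerate(LAYERS, start=1):
        if number not in failed_layers:
            minutes = 10 if number <= 5 else 60
            return number, description, minutes
    raise RuntimeError("all recovery layers have failed")
```

Calling `next_recovery_step({1, 2, 3, 4, 5})` models the worst case described above, where only the restore from backup remains.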
Ongoing Recovery Testing
To ensure that our recovery systems are maintained and work well, our Operations team tests the failover steps listed above quarterly.
To further test the above systems, we maintain a separate internal set of Quip servers that run in “disaster recovery” mode. These internal testing servers simulate an outage of the primary database. We test these servers regularly (weekly to monthly) to ensure that the site can withstand a primary DB outage while still serving all user data.
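One way to keep these test cadences honest is a small overdue check, sketched below. The test names and the exact day counts are assumptions: this policy says "quarterly" and "weekly to monthly" without fixing a number of days, so 92 and 31 days are used here as outer limits.

```python
from datetime import date, timedelta

# Assumed maximum gap between runs of each recovery test.
MAX_GAP = {
    "failover_steps": timedelta(days=92),          # "quarterly"
    "disaster_recovery_mode": timedelta(days=31),  # "weekly to monthly"
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Return the names of tests that have never run or whose last run
    is older than the assumed maximum gap, in alphabetical order."""
    return sorted(
        name
        for name, gap in MAX_GAP.items()
        if name not in last_run or today - last_run[name] > gap
    )
```

A test with no recorded run is treated as overdue, which errs on the side of running it.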
Review of Procedures
All disaster recovery systems and procedures will be reviewed by Quip Operations employees quarterly.
All disaster recovery policies will be reviewed and updated by the Head of Engineering each quarter.