Last week, we updated the Backupify Service Level Agreement (SLA) to cover 100 percent of our customers’ restores and exports in addition to our 100 percent backup SLA. The update reflects our proven track record and our confidence that our customers’ data is always available as well as secure. (If you’re interested in learning more about cloud SLAs and what you should look for in a reputable one, check out our June post, 5 Key Elements of a Reputable Cloud SLA.)

In this, my inaugural blog post, I’m happy to write on a topic that’s near and dear to my heart: building reliable web services. I joined Backupify two months ago, impressed with the company’s technology choices: the service is built from the ground up to be secure, easily manageable, and resilient. Since joining, I’ve come to see that our reliability and security track record is a direct result of our team’s dedication, smarts, and commitment to efficient, effective development and validation processes.

A reliable SaaS service begins with monitoring. I’m in heaven at Backupify, where I can view tons of end-to-end operational data in real time and quickly drill down into the details. While we use a variety of custom and third-party tools, our cornerstones are Splunk, for its incredible log analysis (we gather over 110 gigabytes of logs a day), and StackDriver, for its holistic service displays, proactive performance alerts, and system drill-down. We monitor the third-party services we back up as well as our own. Our monitoring wall displays show this data to everyone in engineering, support, and devops, which lets us spot issues easily and address them quickly.

We strive for reliability in an uncertain world with a deep commitment to scalability and resiliency. By resiliency, I mean our ability to stay running even in the midst of failures outside of our control (server crashes, network hiccups, partner API outages, and so on). This drives technology choices like using the hyper-scalable Cassandra database and geographically distributing our servers and storage. Just as important, but harder to implement, is our ability to work around problems at the source, such as temporary SaaS outages, partial outages, and request throttling. We do not stop until we’re able to successfully back up and restore your data. We believe our ability to provide these services reliably and securely is second to none.
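
For readers curious what working around throttling and temporary outages can look like in code, one common pattern is retrying with exponential backoff and jitter. The sketch below is a minimal illustration of that general technique, not Backupify's actual implementation; the names `TransientError` and `call_with_backoff`, and all of the default values, are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (outage, throttling)."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the ceiling on each attempt, cap it at max_delay, then
            # wait a random time below it so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

A caller would wrap each request to a third-party API in a function that raises `TransientError` on a throttling or outage response and pass that function to `call_with_backoff`. The `sleep` parameter exists only so the waiting can be replaced in tests.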

I’m proud to have joined a team that is not only fully committed to security and reliability but also has the experience, processes, and technical roadmap to deliver on that promise. And I’m pleased that our SLA now explicitly demonstrates that commitment.