Monday's incident report

On Friday morning at 7:40 AM, we experienced a global outage across all services. Clearbit was down for approximately 29 minutes. For that, we’re extremely sorry. We understand that Clearbit plays a large role in many sales, marketing, and product processes, and we deeply apologize for any inconvenience caused.

Issue Summary

From 7:41 AM to 8:10 AM PT, requests to Clearbit APIs returned unauthorized errors. The root cause was a misconfigured developer machine that had direct access to the production database while running automated tests. Because those tests expect an empty database, the test setup deleted Clearbit’s account records, briefly removing access for all accounts.
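
For API consumers, the failure surfaced as authentication errors on otherwise valid requests. Below is a minimal illustration of what an affected call looked like; the endpoint, parameters, and key shown are placeholder assumptions for the sketch, not an exact reproduction of our error payload.

    # Illustrative only: roughly what an affected request returned during the
    # incident window. The endpoint and API key are placeholder assumptions.
    import requests

    resp = requests.get(
        "https://person.clearbit.com/v2/people/find",  # assumed Enrichment endpoint
        params={"email": "alex@example.com"},
        headers={"Authorization": "Bearer sk_live_placeholder"},  # a normally valid key
        timeout=10,
    )

    # With the account records gone, valid keys could not be matched to an account,
    # so requests were rejected as unauthorized.
    print(resp.status_code)  # 401 instead of 200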

Timeline (all times Pacific Time)

  • 7:40 AM: Test run finished
  • 7:41 AM: Pagers alerted team
  • 7:45 AM: Complete outage after our cache was invalidated
  • 7:48 AM: Restore of the accounts database from a backup snapshot started
  • 7:55 AM: Root cause discovered
  • 8:08 AM: Accounts database restored from backup
  • 8:10 AM: 100% of traffic back online

Root Cause

At 7:40 AM PT, one of our developers started a test run from a machine whose development environment was inadvertently configured to use our production cluster. While preparing a test run, our code drops all rows from the authentication database; because the environment pointed at production, this deleted our production account records.
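
We haven’t published the test harness itself, so the snippet below is only a minimal sketch of the failure mode, with placeholder names (DATABASE_URL, an accounts table) standing in for our real configuration: a “clean database” setup step that trusts whatever connection string is configured becomes destructive as soon as that string points at production.

    # Minimal sketch, not our actual code: a test-suite setup step that wipes
    # whatever database DATABASE_URL points at. On the misconfigured developer
    # machine, that variable resolved to the production cluster.
    import os
    import psycopg2  # assumed Postgres client, for illustration

    def reset_database_for_tests():
        dsn = os.environ["DATABASE_URL"]  # no check that this is a test database
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                # Tests expect an empty database, so setup drops every row,
                # which in this incident meant every production account.
                cur.execute("TRUNCATE TABLE accounts CASCADE")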

Resolution and recovery

At 7:41 AM PT, the monitoring systems alerted our engineers, who investigated and quickly escalated the issue.

By 7:48 AM, the team had started restoring accounts from a backup, though the root cause was still unclear.

At 7:55 AM, the root cause was identified by the developer who had triggered the issue, after noticing that the timeline of events matched the test run.

These problems were addressed, and we successfully recovered from the backup, with all traffic back online at 8:10 AM.

Corrective and Preventative Measures

The following are actions we are taking to address the underlying causes of the issue and to help prevent them from happening again:

  1. Improve checks in the testing environment to prevent use of production settings (a sketch of one such check follows this list).
  2. Make the database recovery process more time-efficient.
  3. Increase backups to an hourly cadence.
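
As an example of what measure 1 could look like in practice, here is a minimal sketch of a guard that refuses to run destructive test setup unless the configured database clearly belongs to a test environment. The variable names and allowed hosts are illustrative assumptions, not our actual configuration.

    # Sketch of an environment guard for the test suite. All names here
    # (CLEARBIT_ENV, DATABASE_URL, the allowed hosts) are illustrative assumptions.
    import os
    from urllib.parse import urlparse

    ALLOWED_TEST_HOSTS = {"localhost", "127.0.0.1", "db.test.internal"}

    def assert_safe_test_environment():
        env = os.environ.get("CLEARBIT_ENV", "")
        host = urlparse(os.environ.get("DATABASE_URL", "")).hostname or ""
        if env != "test" or host not in ALLOWED_TEST_HOSTS:
            raise RuntimeError(
                "Refusing to reset the database: env=%r, host=%r does not "
                "look like a test environment." % (env, host)
            )

    # Called at the top of test setup, before any destructive statement runs.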

We appreciate your patience and again apologize for any inconvenience. We thank you for your business and continued support.
