Web agencies occupy a unique position in the web development market. We're not talking about enterprise-level application development. Instead, agencies focus on high-quality marketing websites with basic to intermediate functionality.

Many of the tips, tricks, and best practices you'll find online target the enterprise market rather than the agency market. It's good info, but it's often not practical, applicable, or cost-effective for introducing reliability and disaster recovery into an agency.

First, The Risk

In the agency world, you face a somewhat unique risk profile - customers. Let's take a realistic scenario as an example. Your customer paid you to design and build their website. You spent 2 weeks meticulously designing the layout, the content pages, the contact page, and the article layouts, and crafting a top-tier SEO strategy. The customer loved their website - it blew the competition out of the water. You then spent 2 hours training them on how to use the backend to edit content. Exactly 1 month later, they frantically call and email you to tell you their website is broken. If you run an agency, you already know who broke the website, and it almost certainly wasn't you. Even in the simplest CMS, customers can easily get lost and make mistakes, creating downtime.

This is a somewhat different paradigm from the one faced in the enterprise development world, where infrastructure issues and developer mistakes are more likely to be at fault for outages. Your agency faces infrastructure and developer-driven risks too. The best way to handle downtime is to prevent it altogether, but that's not realistic. Outages will almost certainly occur, so you need a plan for recovering from them.

Handling Backups

Having a set of solid backups is critical to downtime recovery. Unlike enterprise-level application development, the heart of an agency's product exists in the website itself and the database that powers it. A good amount of your backup strategy is going to depend on your hosting strategy. It's common in the agency world to use managed control panels like WHM/cPanel and Plesk. These systems are a lifesaver for an agency, providing the ability to easily onboard, offboard, and manage client websites.

A huge benefit of these management platforms is that they offer robust backup handling. Ensure that you have backups enabled AND that you've set up an offsite destination like S3 or Azure Blob Storage. This is critical in the event that you have a server-wide outage.

You should assume that your backups don't work until you've actually tested them. This is a critical point. Even if you're confident that your backups are configured correctly, you have absolutely zero reason to believe that until you've run a full end-to-end test on them. That means that you should build a website, point a domain name at it, wait for the backup to run, and then destroy that website and attempt to recover it from the backups.

Seriously, don't skip that step. This is going to be your quickest pathway to a full recovery from an outage.
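One way to make that end-to-end test less of an eyeball check is to record a checksum of every file before you destroy the test site, then compare the restored copy against that record. Below is a minimal sketch of that idea, not a complete verification tool. The function names checksum_tree and restore_problems are made up for this example, and it only covers files on disk. The database needs its own check (for example, comparing row counts before and after).

```python
import hashlib
from pathlib import Path


def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_problems(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """List files from the pre-destroy manifest that are missing or changed."""
    restored = checksum_tree(restored_root)
    problems = []
    for name, digest in sorted(manifest.items()):
        if name not in restored:
            problems.append(f"missing: {name}")
        elif restored[name] != digest:
            problems.append(f"changed: {name}")
    return problems
```

Run checksum_tree on the test site's document root after the backup has completed, save the result somewhere off the server, and run restore_problems against the restored site. An empty list means every file came back byte-for-byte identical.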

Fallbacks & Error Pages

In a perfect world, each website you build would have a custom, well-designed placeholder page to show in the event of, for example, widespread 500s. The site would sit behind a load balancer that evaluates target health and, when the site is unhealthy, redirects visitors to that static error page hosted in object storage (S3, etc.). Let's be real, though, you probably don't have time for that.

If you do take the time to set that up at the server level, avoid the urge to brand that error page with your company's brand. Your customers have zero interest in advertising your company when their own website is down. You also don't want a potential client's first engagement with your company to be a page detailing how the website you built and host is currently hard down. You are only going to worsen the situation. Instead, go with unbranded error pages when server-level outages occur. Ask your design team to design a nice, professional, and straightforward error screen notifying visitors that "we are currently experiencing an issue and are working to solve it".
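If you're curious what the "evaluate target health, then fall back" logic at the start of this section boils down to, here is a rough sketch. In practice your load balancer handles this through its own health check configuration. The function names and the default threshold of 3 consecutive failures are assumptions chosen for illustration, not settings from any particular product.

```python
def is_healthy(status_codes: list[int], unhealthy_threshold: int = 3) -> bool:
    """Treat the site as unhealthy once the most recent checks all returned 5xx."""
    recent = status_codes[-unhealthy_threshold:]
    if len(recent) < unhealthy_threshold:
        # Not enough checks yet to call it an outage.
        return True
    return not all(500 <= code <= 599 for code in recent)


def pick_destination(
    status_codes: list[int],
    origin_url: str,
    fallback_url: str,
    unhealthy_threshold: int = 3,
) -> str:
    """Send visitors to the origin when healthy, else to the static error page."""
    if is_healthy(status_codes, unhealthy_threshold):
        return origin_url
    return fallback_url
```

Requiring several consecutive failures keeps a single slow or failed request from flipping every visitor over to the error page.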

Messaging

How you message the outage is going to be critical, particularly if you have some, ahem, difficult customers.

You have a decision to make at this point - do you notify the customers that their website is down or do you avoid "poking the bear"? This is going to be up to you and dependent upon how much attention you want to bring to the issue. A general rule of thumb, though, is that you should avoid notification if the website has been and will only be down for 15 minutes or less. Any more than that and the likelihood of the customers becoming aware goes up significantly.

Should you be facing an extended outage, you should use a few strategies in your messaging of the outage. Below is an example downtime notification email template for web agencies to use. Read through it and then we'll break it down.

Dear [Customer Name],

We wanted to let you know that your website, [Website Name], has been experiencing downtime since [Time and Date of Outage]. Our monitoring systems detected this issue and the team jumped on it immediately.

What We Are Doing:

  • Our technical team is already investigating the cause of this downtime.

  • We are taking all necessary steps to restore the website as quickly and safely as possible. It is our top priority.

What You Can Do:

  • Please refrain from making any changes to your website's backend during this period.

  • Check your email for updates as we will continue to provide you with the latest information regarding this issue.

We absolutely understand the criticality of having your website running and are committed to resolving this issue as quickly as we can. Our entire team is working to bring it back online.

Estimated Time to Resolution:

  • While we are currently unable to provide an exact timeline, we will keep you updated on the progress of our efforts.

We apologize for any inconvenience this may have caused and appreciate your patience and understanding. If you have any immediate concerns or need further assistance, please do not hesitate to call or email me.

[Your Name]
[Your Position]
[Your Contact Information]
[Web Agency Name]

A couple of key things are happening in this email template.

We want to convey that we absolutely understand the criticality of the issue. Avoid robotic, overly-corporate language. Don't say, for example, "we are writing to inform you" and instead say "I wanted to let you know." Robotic language may only worsen the issue; it shows no empathy and no understanding of the very real impact that the outage could have on the client's business.

We want to assure the client that this is our top priority. Don't try to introduce overly-technical language into this. Don't attempt to explain to clients the actual technical root-cause of the issue at this point. Doing so will only open you up to further questions in the middle of the outage.

We want to make sure they don't try to make changes. Some clients will attempt to log in to their backend. Maybe they want to attempt to fix it, maybe they're just curious, or maybe they were already attempting to make a content update. In the event that you need to restore from backups, you do not want a bad situation to get worse by also introducing data loss.

We do not want to commit to a resolution time unless explicitly asked. Even the best resolution time estimate has a very high likelihood of being incorrect in either direction. If you quote too high, you're setting the expectation that the site is going to be down a long time when it may, in fact, not be. If you undershoot on the outage duration estimate then you will, with 100% certainty, end up with angry clients. If a customer specifically asks you for an estimated time to recovery, tell them that "the team believes that it will be back up within the next X + 15-30 minutes." Point being - whatever your estimate is, pad it.

We want to make sure they know they can contact us. Most clients will understand that outages occur at times. They should feel like they can respond to you and ask questions. You may not want further communication from clients as you attempt to mitigate the outage, but you should prioritize responding even if it takes you away from the downtime briefly. Ignoring their responses is a bad, bad, bad idea. Aside from restoration, your other job is going to be empathy and understanding.
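If you send these notifications from a script, two of the points above are easy to automate: never send an email with a placeholder still in it, and always pad the estimate. The sketch below assumes the bracketed placeholder style used in the template and reads "X + 15-30 minutes" as the team's estimate plus 15 to 30 minutes. The names fill_template and padded_estimate are made up for this example, and TEMPLATE is a shortened stand-in for the full email.

```python
import re

# Shortened stand-in for the full template; placeholders use the same brackets.
TEMPLATE = (
    "Dear [Customer Name],\n\n"
    "We wanted to let you know that your website, [Website Name], is currently "
    "experiencing downtime as of [Time and Date of Outage].\n"
)


def fill_template(template: str, values: dict[str, str]) -> str:
    """Fill each [Placeholder]; refuse to return text with any left unfilled."""
    filled = template
    for name, value in values.items():
        filled = filled.replace(f"[{name}]", value)
    leftover = re.findall(r"\[[^\]]+\]", filled)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return filled


def padded_estimate(team_estimate_minutes: int, pad_low: int = 15, pad_high: int = 30) -> str:
    """Turn the team's internal estimate into the padded range quoted to clients."""
    low = team_estimate_minutes + pad_low
    high = team_estimate_minutes + pad_high
    return f"the team believes that it will be back up within the next {low}-{high} minutes"
```

With an internal estimate of 45 minutes, padded_estimate(45) quotes the client 60-75 minutes. Only use it when a client explicitly asks for a timeline.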

The Entire Internet Is Down

Well, that's not good. When AWS, Cloudflare, or other major hosting provider outages occur, the downtime is obviously completely out of your hands. This can be challenging to message correctly to your customers. They don't really care whose fault it is - they pay you to host the website, not AWS. The distinction between those two things is clear to you, the agency, but is not necessarily clear to your client. Don't assume they know how modern, cloud-based hosting works or what AWS really even is.

In these instances, you should always let the customer know that the outage is with the underlying provider (AWS, DigitalOcean, etc.) that you use to host their website. This doesn't make the client magically say "oh, ok, no big deal then" but it does reinforce that you're not necessarily in direct control.

You should also not attempt to absolve yourself of all blame. Again, the customer pays you for hosting - not your underlying infrastructure provider. Using the email template above as an example, you can delicately message this as:

Our underlying hosting provider, [Hosting Provider Name], is experiencing widespread technical issues. They have assured us that they are on top of the issue and that it will be resolved quickly. We'll continue to monitor their progress and report back to you as the situation changes.

Detecting Downtime

As an uptime monitoring service, we can't help you mitigate and recover, but we can help you detect! We're obviously biased, but we believe that our website monitoring is the best on the market for web agencies. Sign up for free and see how we perform for your agency.