The SDL Cloud delivers technology solutions that support the whole customer journey: the pre-purchase stages (SDL Web and SDL Campaigns), the purchase stage (SDL Ecommerce), and the post-sale and repeat-business stages (SDL Knowledge and SDL Campaigns). All of these are delivered as SaaS and managed by SDL on behalf of our customers, supported specifically by the SDL Cloud Services team. The team is nearly 100 strong and spread across Europe, the US and Australia, with our Central Operations Centre in Bangalore, India. Collectively, we’re responsible for the Service Level Agreement and performance of the different software services.
’Tis the season
During peak purchasing times like Black Friday and the run-up to Christmas, these Software as a Service (SaaS) products become even more business critical to our customers as consumer purchasing peaks across the globe. With deal hunting and Christmas shopping in full swing, this activity adds a lot more load to the infrastructure and applications, and any minor issue can quickly become service affecting.
Black Friday and its ilk are receiving greater attention and focus from retailers. Whilst some are eschewing the PR nightmare of in-store fisticuffs over discounted televisions, the online battle for hearts, minds and wallet share very much continues. In the UK alone, paid search spend increased by 88% year-on-year, according to data from Kenshoo, and Cyber Monday is the day to watch for record-breaking sales in the US. Finally, owing to the nature of ‘flash sales’, it is ‘commuter commerce’, and mobile over desktop, that will win out.
The problem is ‘change’
Combine the peak load with a high percentage of staff taking time off around the Thanksgiving and Christmas period and you potentially have a critical situation on your hands, with staff spread too thinly. So, how does a SaaS provider like SDL mitigate this?
For many problems, change is often the solution. Within the IT industry, however, it’s a known fact that the biggest cause of incidents is change. Change, however well it’s planned and architected, brings risk and uncertainty to the service. It can also shift our monitoring benchmarks for what is ‘normal’, which in turn reduces our monitoring and detection capability. SDL’s change record is exceptional and we run at a 99%+ successful change rate, but by reducing the rate of change on the production platforms we can mitigate risk and reduce the likelihood of service-impacting issues arising.
The notion of ‘technical debt’
This known ‘change’ fact causes a dilemma in the SDL Development and Cloud Services teams. As predominantly technical people we are generally in favour of progressive change, and in the fast-paced IT industry, simply not changing means building technical debt. Like any debt, technical debt attracts interest: the longer you leave it unpaid, the more effort and resources it eventually takes to pay down. Conversely, there’s a generally accepted view that making changes regularly and safely earns a kind of positive compound interest. That is to say, the faster you evolve and adapt and the less technical debt you carry, the faster you can deliver positive change to your customers in the form of improved services or features that they find useful.
How does SDL balance the need for stability and reduced change with the desire to continue delivering positive change to our customers?
First, it’s important to differentiate between different types of changes. Within SDL, our change control process follows the ITIL guidelines and we have four different types:
- Standard changes – Changes which are routinely executed, well documented as standard procedures and have proven themselves to carry extremely low risk.
- Normal changes – Any change that does not fall into the other categories.
- Emergency changes – Any change which is necessary to recover a service which is degraded or unavailable, or which protects against an imminent threat that could result in a security breach or an impact to service availability.
- UAC (user access change) – Any request for new or additional privileges to be granted to users’ infrastructure or application accounts.
By splitting changes like this, you can then start to assess their potential to have an impact on a running service.
- UAC and Standard changes are extremely low risk and unlikely to result in any impact.
- Normal changes often include the riskiest changes, as these are not regularly undertaken and are more likely to have unexpected consequences.
- Emergency changes should not be common. They are your mechanism for resolving an imminent issue, and the category is deliberately restrictive in terms of which changes qualify.
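As a rough illustration, the four change types and the risk notes above could be captured in a few lines of Python. This is only a sketch: the `classify_change` function, its three yes/no inputs and the order in which it checks them are assumptions made for the example, not SDL’s actual change control tooling.

```python
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"
    NORMAL = "normal"
    EMERGENCY = "emergency"
    UAC = "uac"


# Risk notes paraphrased from the list above. Emergency changes are not
# given a risk rating in the text, so they are deliberately left out here.
RISK_BY_TYPE = {
    ChangeType.UAC: "extremely low",
    ChangeType.STANDARD: "extremely low",
    ChangeType.NORMAL: "highest",
}


def classify_change(is_urgent_recovery: bool,
                    is_access_request: bool,
                    is_routine_procedure: bool) -> ChangeType:
    """Hypothetical triage helper; the order of the checks is an assumption."""
    if is_urgent_recovery:
        return ChangeType.EMERGENCY
    if is_access_request:
        return ChangeType.UAC
    if is_routine_procedure:
        return ChangeType.STANDARD
    # Anything that does not fall into the other categories is Normal.
    return ChangeType.NORMAL


change = classify_change(is_urgent_recovery=False,
                         is_access_request=False,
                         is_routine_procedure=False)
print(change.value, "->", RISK_BY_TYPE.get(change, "unrated"))
```

Treating Normal as the fall-through case mirrors the definition above: it is whatever is left once the other three categories have been ruled out.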
There are further considerations. Review the services you are running, look at historical load and consumption metrics, and identify peak times. Consider the type of service you are running and whether your customers are likely to be in a period of higher or lower usage, based on upcoming events in the calendar.
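To make the idea of identifying peak times concrete, here is a minimal sketch that flags days whose load sits well above the typical level. The figures are invented purely for illustration, and both the `find_peak_days` helper and the 1.5x-median threshold are assumptions; a real analysis would use your own monitoring data and a threshold that suits your service.

```python
from statistics import median


def find_peak_days(daily_load, factor=1.5):
    """Return the days whose load exceeds `factor` times the median load."""
    baseline = median(daily_load.values())
    return sorted(day for day, load in daily_load.items()
                  if load > factor * baseline)


# Invented request counts (in thousands) for one week; not real SDL data.
sample_load = {
    "2014-11-26": 100,
    "2014-11-27": 110,
    "2014-11-28": 240,  # Black Friday
    "2014-11-29": 150,
    "2014-11-30": 120,
    "2014-12-01": 210,  # Cyber Monday
    "2014-12-02": 105,
}

print(find_peak_days(sample_load))  # ['2014-11-28', '2014-12-01']
```

The median is used as the baseline rather than the mean so that the peaks themselves do not drag the ‘normal’ level upwards.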
When SDL Cloud Services made an assessment for H2 2015, we identified the following:
- All of our services need to remain stable over the Christmas period, even when we have reduced staffing levels.
- Some specific services like SDL Campaigns and SDL Ecommerce will be highly utilised by our customers in the run-up to and during Black Friday, through to Cyber Monday.
For the SDL Ecommerce and SDL Campaigns products, we therefore implemented change freezes from a couple of days before Black Friday until after the Thanksgiving weekend, during which no Normal or Standard changes take place. After 1st December our change pattern continues as normal.
For ALL SDL SaaS services in the run-up to and throughout Christmas, we implement additional change freezes: Normal changes are halted from 11th December until the New Year, and Standard changes, which carry a reduced level of risk, are halted from a week before Christmas until very early in the New Year.
By doing so, you balance the need for positive change against the risk of adverse effects: you keep the ability to execute emergency changes to avoid imminent service impacts, while maintaining stability for customers at their time of peak demand. Additionally, you not only avoid incidents adding strain at a time of year with lower staffing levels; by deferring the changes themselves to a later time, you further reduce the workload on the team during that period.
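The freeze windows described above can be expressed as a simple calendar check. The sketch below is illustrative only: the `Freeze` class, the `is_change_allowed` function and the service names are hypothetical, and several dates are assumptions, because the text says only ‘a couple of days before Black Friday’, ‘until the New Year’ and ‘very early in the New Year’. Each assumed date is marked in the comments.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional

# Assumed placeholder for both "until the New Year" and "very early in the
# New Year"; the text does not give exact end dates.
CHRISTMAS_FREEZE_END = date(2016, 1, 3)


@dataclass(frozen=True)
class Freeze:
    start: date                                # first frozen day (inclusive)
    end: date                                  # last frozen day (inclusive)
    blocked_types: FrozenSet[str]              # change types halted
    services: Optional[FrozenSet[str]] = None  # None means all services

    def blocks(self, service: str, change_type: str, day: date) -> bool:
        in_window = self.start <= day <= self.end
        in_scope = self.services is None or service in self.services
        return in_window and in_scope and change_type in self.blocked_types


FREEZES = [
    # Black Friday 2015 fell on 27 November; the 25th as "a couple of days
    # before" is an assumption. Normal service resumes after 1 December.
    Freeze(date(2015, 11, 25), date(2015, 12, 1),
           frozenset({"normal", "standard"}),
           frozenset({"campaigns", "ecommerce"})),
    # All services: Normal changes halted from 11 December.
    Freeze(date(2015, 12, 11), CHRISTMAS_FREEZE_END, frozenset({"normal"})),
    # All services: Standard changes halted from a week before Christmas.
    Freeze(date(2015, 12, 18), CHRISTMAS_FREEZE_END, frozenset({"standard"})),
]


def is_change_allowed(service: str, change_type: str, day: date) -> bool:
    """Emergency changes are never listed in a freeze, so they always pass."""
    return not any(f.blocks(service, change_type, day) for f in FREEZES)


print(is_change_allowed("ecommerce", "normal", date(2015, 11, 27)))     # False
print(is_change_allowed("web", "standard", date(2015, 12, 14)))         # True
print(is_change_allowed("ecommerce", "emergency", date(2015, 12, 25)))  # True
```

Keeping the windows as data rather than hard-coded conditions makes it easy to review the freeze calendar each year and to keep each window no longer than it needs to be.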
Key Takeaways and considerations for your own IT Operations
- Segment your changes into different types and be aware of which types attract the highest risks.
- Consider your customer usage profile AND staffing levels when planning when changes (or change freezes) should be in play.
- Don’t make your change freezes longer than required; doing so builds backlog, change peaks and technical debt which you will have to pay down.
- Change freezes reduce incidents and so relieve pressure on staff working during the festive periods. Not having to execute changes during times of lower staff attendance removes pressure as well, with the added benefit of allowing staff to pivot to intensified monitoring of customer solutions during their time of need.
Image: Elliot Brown