Yes, another example of an incident: in this case, a very visible global outage caused by infrastructure misconfiguration. If you are not familiar with the Salesforce outage, read more.
Before you sound the trumpet for more DevOps automation, of which I am a strong supporter, I would like to highlight the need for visibility into the configuration changes (or lack thereof) that result from completed change requests.
Visibility into all configuration changes across your entire infrastructure (networks, servers, apps, containers, and cloud) is essential for effective incident response, because it lets you quickly narrow down and isolate the root cause. Change monitoring, however, can also be used as a proactive tool to prevent incidents: correlating configuration changes with planned change requests makes it easier to see what configuration was actually changed by the completed work.
During the post-implementation review stage of the change process, reviewers can examine whether the resulting configuration changes contain errors or whether expected changes are missing. This may provide the small window needed to prevent an incident or outage from occurring.
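To make the idea of correlating changes with change requests more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular product works: the `ChangeRequest` and `ConfigChange` types, the `correlate` function, the matching rule (same target, detected inside the change window), and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeRequest:
    """A planned change (hypothetical shape): which targets, and when."""
    request_id: str
    targets: frozenset  # hosts or devices the request is expected to touch
    start: datetime     # start of the approved change window
    end: datetime       # end of the approved change window


@dataclass(frozen=True)
class ConfigChange:
    """A configuration change observed by monitoring (hypothetical shape)."""
    target: str
    path: str
    detected_at: datetime


def correlate(requests, changes):
    """Group observed changes by change request.

    Assumed matching rule: a change belongs to a request when its target is
    listed on the request and it was detected inside the change window.
    Returns (matched, unplanned, missing):
      matched   - request_id -> list of observed changes
      unplanned - changes that match no request
      missing   - request_id -> sorted targets with no observed change
    """
    matched = {r.request_id: [] for r in requests}
    unplanned = []
    for change in changes:
        owners = [
            r for r in requests
            if change.target in r.targets and r.start <= change.detected_at <= r.end
        ]
        if owners:
            for r in owners:
                matched[r.request_id].append(change)
        else:
            unplanned.append(change)
    missing = {
        r.request_id: sorted(r.targets - {c.target for c in matched[r.request_id]})
        for r in requests
    }
    return matched, unplanned, missing


# Made-up sample data: one request covering two DNS servers.
requests = [
    ChangeRequest(
        request_id="CR-1",
        targets=frozenset({"dns-1", "dns-2"}),
        start=datetime(2024, 1, 1, 10, 0),
        end=datetime(2024, 1, 1, 11, 0),
    )
]
changes = [
    ConfigChange("dns-1", "/etc/named.conf", datetime(2024, 1, 1, 10, 15)),
    ConfigChange("web-9", "/etc/nginx/nginx.conf", datetime(2024, 1, 1, 10, 20)),
]

matched, unplanned, missing = correlate(requests, changes)
```

With this sample data, a reviewer would see one change matched to `CR-1`, one unplanned change on `web-9`, and `dns-2` listed as a target that was supposed to change but did not. A real system would likely need a richer matching rule than target plus time window; this one is chosen only to keep the sketch short.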
Regarding the Salesforce DNS misconfiguration outage, the change was indeed automated, but a bug in the automation caused the problem. Making configuration changes easy to review enables reviewers to check for errors, especially in manual changes.
If you’re responsible for your company’s infrastructure operations, come visit us at SIFF.IO to find out “What the #%&$ changed?!”