Learning from Safety – Concepts (1)
If you follow me on Twitter, you’ll have seen that I’ve recently been doing a lot of reading on Safety and how its practices have evolved over the past 30 years. There are 3 authors in particular that I’m finding fascinating: Erik Hollnagel, Sidney Dekker and Nancy Leveson.
This post, like the rest of this series, will mostly rely on work by Hollnagel. We’ll be exploring what Hollnagel refers to as Safety-I and Safety-II and how we can apply many of these ideas to our Cyber Security practices. I have no doubt that you’ll find many similarities and that you’ll find, as I did, that in its overall approaches and literature the Cyber Security discipline is circa 30 years behind many of the developments in Safety, and that there’s a LOT we can learn from it.
In this blog post, we’ll focus on some concepts that are useful for the discussion I’ll attempt to start in the next 3-4 blog posts about this subject.
I’ll start off by quoting Hollnagel:
What we mean in general by ‘being safe’ is that the outcome of whatever is being done will be as expected. In other words, that things will go right, that the actions or activities we undertake will meet with success. But strangely enough, this is not how we assess or measure safety. – Hollnagel
This is a true paradigm shift, both in Safety and in Cyber Security, as we tend to judge how ‘secure’ we are by the absence of incidents, breaches or clear indications of vulnerabilities or known weaknesses. (I’ll leave aside the issue that if you ask 100 people what secure means you’ll certainly get at least 90 different versions, which is a problem in itself.) This is compounded by the fact that too many security professionals have no idea “what it takes to get the job done” from the point of view of our Engineering teams, or even business teams for that matter, and this leads to a failure to appreciate what goes right, and what actually happens when it goes right. Our security programme management approaches reflect this: we focus on preventing things from going wrong, not necessarily on supporting the organisation in making operations go right. This may look like a subtle difference, but I hope to make a clear case that it isn’t, and that it’s a huge insight.
On feeling secure
One of the main reasons for the dominant interpretation of safety as the absence of harm is that humans, individually and collectively, have a practical need to be free from harm as well as a psychological need to feel free from harm
But we also need to feel free from harm because a constant preoccupation or concern with what might go wrong is psychologically harmful – in addition to the fact that it prevents us from focusing on the activities at hand
This has great implications for Information Security as well, and it’s something we often neglect. When our company’s engineers are writing app and infrastructure code, a constant preoccupation with security would prevent them from focusing on the task at hand. Yes, they shouldn’t neglect security, and it needs to be integral to the work they do, but a constant preoccupation with it distracts them from their craft, and we can’t realistically expect everyone to be a security expert due to cognitive load considerations. Their jobs are already hard as they are, without adding 100s of lines of security requirements on top.
This hints that we should be focusing on providing easy methodologies to identify and treat vulnerabilities (lightweight threat modelling) coupled with easy to consume capabilities (such as libraries or integrated tools in pipelines) so they can keep focusing on what their job actually is.
But another problem posed by “understanding what goes right” is the fair question of “how can we focus on what does not happen?” How and why would we focus on what we currently pay no attention to, when we already have a day job trying to prevent bad stuff? Or, phrased differently, “how do we count or even notice a non-event?”
Hollnagel calls this “the measurement problem”, and addressing it starts with changing our definition of what “secure” means: we start treating our events as activities which succeed and go well, and consider ourselves secure when they happen. Here are a few quotes which provide more clarity on what this means in practical terms:
it must be possible for different individuals to talk about safety in such a manner that they can confirm that they understand it in the same way.
others can confirm or verify that their experiences of the phenomenon, their understanding of safety, ‘fits’ the description
intersubjective verification means going beyond the lack of disagreement (‘I don’t know what this means’) to an explicit act of communication in order to establish that the term is not just recognised but that it actually means the same to two or more people
This is why I stress a small number of activities (like threat modelling plus consumable baselines) as the minimum viable product of security for most organisations, focusing on easy consumption and a shared vocabulary to support it. I’m confident that will do more for security awareness than your annual programme no one wants to be part of.
This is likely not going to address all possible threats you may be exposed to, but if that’s what you’re aiming for, I’d argue that’s already outside the realm and focus of managing cyber risk.
The Regulator Paradox
Another important concept is that of the Regulator Paradox. It becomes an issue whenever we define being ‘secure’ in terms of ‘badness’ and ‘avoidance of badness’ (think incident counts), and it is a big part of why in many companies people don’t focus on cyber security. Hollnagel explains it as such:
But quantifying safety by measuring what goes wrong will inevitably lead to a paradoxical situation. The paradox is that the safer something (an activity or system) is, the less there will be to measure.
We may, for instance, in a literal or metaphorical sense, be on the right track but also precariously close to the limits. Yet, if there is no indication of how close, it is impossible to improve performance
This lack of indications is then typically used to divert scarce resources away from cyber security activities, or as justification for reducing them.
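To make the paradox a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption of mine rather than anything from Hollnagel: the `Quarter` record, the made-up figures, and the choice of “share of deployments built on a consumable baseline” as one possible indicator drawn from work that goes right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quarter:
    name: str
    deployments: int   # everyday work that was attempted
    incidents: int     # the part of it that went wrong
    on_baseline: int   # deployments built on the consumable baseline


# Hypothetical figures, invented purely for illustration.
QUARTERS = [
    Quarter("Q1", deployments=200, incidents=4, on_baseline=40),
    Quarter("Q2", deployments=220, incidents=1, on_baseline=110),
    Quarter("Q3", deployments=250, incidents=0, on_baseline=200),
    Quarter("Q4", deployments=240, incidents=0, on_baseline=156),
]


def incident_count(q: Quarter) -> int:
    """Safety-I style indicator: count only what went wrong."""
    return q.incidents


def baseline_adoption(q: Quarter) -> float:
    """Indicator taken from normal work: share of deployments on the baseline."""
    return q.on_baseline / q.deployments


for q in QUARTERS:
    print(f"{q.name}: incidents={incident_count(q)} "
          f"adoption={baseline_adoption(q):.0%}")
```

With these invented numbers, the incident count reads zero in both Q3 and Q4, so it has nothing left to say, while the adoption figure drops from 80% to 65% and at least hints that something in everyday work has changed.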
Where many of us miss it is that these measurements don’t necessarily have to be negative events, or even quantitative, but can instead be seen through the lens of an “efficiency-thoroughness trade-off”: we could “do all the security things and have all the security tools”, which would make it thorough but not efficient, or focus on “a limited set of indicators or signs [which] can be recognised and tallied”, which would make it efficient but not very thorough. This again, IMO, strengthens the argument for a threat modelling plus consumable baseline approach as “the way we do things”.
There are 2 more concepts that I think are important to introduce as a primer for the blog posts that will follow, and in the context of discussing Safety-I (which we’ll define at the end of this blog post).
A big problem with associating cyber security with things that go wrong (incidents and breaches) is that, to adapt Hollnagel, the “unintended but unavoidable consequence of associating [cybersecurity] with things that go wrong is a lack of attention to things that go right”.
But in order to understand how “things go right”, we need to make a conscious effort to understand how others actually make things work, and the type of variability which is not only present in everyday work, but which is responsible for successes, operational stability but also, when unmanaged, failures. Understanding what goes right means understanding the practices as performed by others and their habituation.
Habituation is a form of adaptive behaviour that can be described scientifically as non-associative learning. Through habituation we learn to disregard things that happen regularly simply because they happen regularly.
Habituation is then ‘a response decrement as a result of repeated stimuli’. Anyone involved in incidents, or in Safety and Resilience Engineering, will tell you that many incidents have been years in the making, and that over time people stopped paying attention to certain types or aspects of “normal work” because “we’ve always done it like this” and it doesn’t typically lead to catastrophic failure. This also relates to the concept of “normalization of deviance”, for which the Space Shuttle Challenger accident is the usual “poster child” of the catastrophic effects it can lead to over time.
There will likely be activities which are part of normal work in our organisations which can one day lead to a “perfect storm” of conditions or sequence of events resulting in serious incidents or breaches. But in order to even begin to understand and manage those, we need to understand what normal work looks like, and that’s unfortunately not something most security professionals do (whether through lack of skill or training or, at the most basic level, because they/we don’t see it as our job).
Let me illustrate this with an example of not understanding “normal work”. I’ve seen companies spend loads of resources on licensing and training to add lots of security checks at the IDE level, where engineers are developing code, committing significant resources (money and attention) upfront to make such a transformation successful. But they failed to realise that many engineers have a habit of hitting “Ctrl+S”, or whatever key combo their IDE of choice has, to save their code every time they write a couple of lines, and that those security checks then get triggered every 10-30 seconds, with a SIGNIFICANT impact on their productivity. What WILL happen is that engineers will, over time, disable those security plugins, because that security control is not considerate of what normal work looks like. In Complexity terms, as Dave Snowden (creator of the Cynefin framework) usually says, “when you over constrain a system that is not naturally constrainable, you’ll get gaming behaviour”, and people will find ways around it.
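As a rough illustration of what a control that respects this save habit could look like, here is a minimal Python sketch of debouncing: the expensive check only runs once saves have been quiet for a while. This is not how any particular IDE or security plugin works. The `DebouncedCheck` class, its `on_save` and `on_tick` hooks and the 120 second quiet period are all hypothetical names and values I’ve made up for the example.

```python
from typing import Callable, Optional


class DebouncedCheck:
    """Run an expensive check only once saves have been quiet for a while."""

    def __init__(self, run_check: Callable[[], None], quiet_seconds: float) -> None:
        self._run_check = run_check
        self._quiet_seconds = quiet_seconds
        self._last_save: Optional[float] = None
        self._dirty = False  # True when there are saves not yet checked

    def on_save(self, now: float) -> None:
        """Record a save; the check is never run directly from here."""
        self._last_save = now
        self._dirty = True

    def on_tick(self, now: float) -> bool:
        """Called periodically (hypothetical hook); returns True if the check ran."""
        if not self._dirty or self._last_save is None:
            return False
        if now - self._last_save < self._quiet_seconds:
            return False  # the engineer is still in the middle of saving/typing
        self._dirty = False
        self._run_check()
        return True


# Simulated session with made-up timestamps (in seconds).
runs = []
check = DebouncedCheck(lambda: runs.append("scan"), quiet_seconds=120)
for t in (0, 15, 30, 45):       # a save every 15 seconds
    check.on_save(now=t)
    check.on_tick(now=t + 1)    # too soon after a save, nothing runs
check.on_tick(now=200)          # engineer pauses, the scan runs once
print(runs)
```

Four saves in quick succession lead to a single scan once the engineer pauses, instead of four interruptions. The time is passed in explicitly rather than read from a clock only to keep the sketch easy to follow and test.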
Sharp-end and Blunt-end
If we’re serious about understanding how “things go right”, we need to look at the sharp end (sometimes referred to as work-as-done). People at the sharp end (engineers) know that in order to get that feature shipped, they have to continuously adjust what they do to the situation (variability is understood and expected). But in our world of security processes and procedures (the blunt end), the work is seen quite differently: there’s an assumption that all pieces of work are discretely repeatable and that there’s no need for continuous daily adaptation to what the situation requires (this is work as imagined), and we tend to assume what people do or should do. This also leads to the lazy habit of attributing incidents to human error or to not following procedures as “the mother of all problems”. Our processes and procedures cannot possibly “anticipate all the possible conditions that can exist”, so when we find differences between work as imagined and work as done, we have a convenient way to explain any and all possible things that went wrong.
As such, we need different ways of framing our challenges and problems so it can lead us to different solutions and approaches.
the traditional human factors perspective is based on comparing humans to machines […]
It describes the world as seen from the blunt end, many steps removed from the actual workplace – both geographically and in time.
If these assumptions are correct then any variability in human performance is clearly a liability and the resulting inability to perform in a precise (or machine-like) manner is a disturbance and potential threat
And this won’t get better until we stop blaming our users for security outcomes. That’s not because there aren’t malicious users out there (they do exist), but because blame is not a useful frame for improving the resilience of our organisations. As Nancy Leveson says, “a system which fails on human error is a system in need of re-design”.
Safety-I - Avoiding things that go wrong (Cyber-I?)
To finish this blog post: Hollnagel uses the term ‘Safety-I’ for defining safety as a condition where the number of adverse outcomes (accidents/incidents/breaches) is as low as possible, which in practical terms means “as low as we can afford it” considering cost, ethics, public opinion, contractual and regulatory obligations etc. With this framing, we have but 2 states: everything works or something fails. So we have 2 strategies to deal with Safety or Cyber Security: find and fix what can go wrong, or prevent transitions from “normal” to “abnormal” states, typically by adding controls or mitigations.
This leads us to reactive management of Cyber Security or Safety, where we first measure harms (by way of risk assessments), understand their causes, identify solutions, evaluate impacts and then improve the care being provided.
The unintended consequence of this framing is that Cyber Security is but a cost, as it will be seen as competing for resources with other areas (production, development, marketing, etc), and “investment in safety [and cyber] are seen as necessary but unproductive costs” which we often find hard to justify or sustain. If we haven’t been breached, why invest to prevent breaches? This accounting attitude is somewhat justifiable, because in Safety-I we focus on preventing adverse outcomes and, as such, compete with productivity.
I think this is a good situational assessment of where we currently are as an industry, and that understanding it is the first step to do better.
There are better frames (such as Safety-II and even beyond it, which we’ll discuss in the next few blog posts), and as I mentioned, Safety literature is decades ahead of us, so I hope you’ll keep following this series to learn more about it.