CensusFail 2016 – Planning for Failure and Failure to Plan

To err is human, but to really foul things up requires a computer. —Bill Vaughan


In my book “Chaos to Success” I devote a full chapter to the special kind of failure hell reserved for IT projects – the kind no doubt being endured at this very moment in the hallowed halls of the ABS, Canberra, IBM and Revolution IT among others. (Disclaimer – I was a consultant / analyst for IBM between 1997 and 2012, but never worked on the eCensus).

There are at least three common root causes behind this type of problem – and they’re most likely NOT what you think. Why? Because at their root the problems stem from people and processes, rather than technology. Each root cause is very fixable, but to be addressed they have to be acknowledged and treated with the priority they deserve.

1. High-Speed Idiots

I was told many years ago that computers are “high-speed idiots.” They are inanimate objects. They are very good at following pre-programmed instructions very fast (or occasionally not so fast, much to our annoyance!). Computerised systems and processes have only as much flexibility, resilience, performance and stability as was designed into them in the first place. If these aspects were not a big consideration for those who put the system together in the first place, then, without further change to the system, it will remain inflexible.

You’d have to believe that stability, resilience, performance, and ability to fend off the type of attacks being blamed were a consideration from the outset. There is talk now of attacks originating from the USA – this seems hard to believe given that geo-blocking (blocking Internet traffic from outside of Australia) should have been a baseline security consideration. No-one outside of Australia had any cause to be completing the Census. This should not have been an optional “switch” – it should have been a fundamental design element of the solution. If this came down to one piece of failed hardware – which hearsay would suggest is the case – then this was certainly NOT treated as the high priority concern it should have been.

2. Some Solution Development Approaches Make Flexibility Difficult and Expensive

Software and systems solutions are all about design: figuring out how to solve a problem. Once the problem is solved in the designer’s head, it’s usually a fairly straightforward – not to say simple – matter to convert that solution into the reality itself. This used to apply largely only to software. However, with virtualised “cloud” solutions to hardware, this reality also increasingly applies to the platforms on which the software runs. Using a scalable cloud-provided platform, processing capacity can be expanded dramatically with little or no impact on or visibility to users.

What I’m saying—and this is supported by industry experts—is that changing the way a system works is not, in and of itself, a complex and necessarily disproportionately expensive exercise. So the root source of the problem, the complexity, and the expense in making a change to an industry-grade IT system is not fundamentally a technical one. It is a process-related issue. The source of the problem is that the approaches that many groups use are convoluted, rigid, onerous, and frankly, outdated.

In the case of the Census solution, some investigations to date suggest that, at least on the Web-facing end of the solution, scalability may have been capped to no more than 10 servers. Early statements also suggested that the ABS was happy that they could process up to 500,000 transactions per hour, with a stretch up to 1,000,000. That might have been adequate had the public been encouraged to get online any time over a one week or one month period. The message that came through loud and clear to the general population was, however, that August 9 is the night we pause and fill in the Census, or risk a fine of $180 per day. Yes, I know there was fine print further clarifying these messages, but that was the message that came through the loudest – as I expect was intended.

So, could there have been 2,000,000 to 6,000,000 households attempting to submit their Census around 7.00-7.30 on Tuesday night August 9, 2016? It seems perfectly likely to me.
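
A rough back-of-envelope check makes the gap concrete. The sketch below uses the capacity figures quoted above, and it rests on two assumptions of mine rather than anything published by the ABS: each household submission counts as one transaction, and households arrive evenly across a 30-minute window. The function names are purely illustrative.

```python
# Back-of-envelope load check. Assumptions (mine, not ABS figures):
# one transaction per household, arrivals spread evenly over the window.

def required_hourly_rate(households: int, window_minutes: float) -> float:
    """Hourly throughput needed to serve `households` within the window."""
    return households * 60 / window_minutes


def overload_factor(households: int, window_minutes: float,
                    capacity_per_hour: int) -> float:
    """How many times the demanded rate exceeds the stated capacity."""
    return required_hourly_rate(households, window_minutes) / capacity_per_hour


STRETCH_CAPACITY = 1_000_000  # transactions per hour, the quoted stretch figure

for households in (2_000_000, 6_000_000):
    factor = overload_factor(households, 30, STRETCH_CAPACITY)
    print(f"{households:,} households in 30 min -> {factor:.0f}x stretch capacity")
```

Under these assumptions, even the low estimate demands four times the stretch capacity, and the high estimate twelve times.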

3. Business People Don’t “Get” IT, and IT People Don’t “Get” Business

Communication—real communication—involves at least two parties having an effective exchange of ideas, such that each really grasps the meaning, intent, and reality of what the other is saying. When is the last time you saw this occur with business and IT?

There are stories emerging in the press of third party network providers offering DDoS protection services to IBM and ABS but this offer being declined. Who made these decisions and were their implications fully understood?

How to Avoid a Repeat of this Problem

These issues can be overcome through addressing each of the three sources of the problem.

1. Implement Flexible-First Systems and Processes

In business, government and technology, as in life in general, one of the few certainties is change. So why wouldn’t you build flexibility into your systems and processes? That is the amazingly simple (yet notoriously difficult to achieve) solution to the problem of inflexible systems. Flexibility must become a core, underlying requirement from the ground up. To support your people and your organisation to provide the best service to clients, you should always provide for exceptions—there should be a process for handling the situation when standard scenarios don’t apply. That way, excellence in client service will be supported, and no one will have to work outside of the systems and processes to get the job done.

The expected load was 500,000 transactions per hour, with 1,000,000 as the maximum peak. Did anyone consider what would happen if, after months of publicity channelling around 10,000,000 households towards August 9, they all, outrageously, decided to do what was asked and attempted to submit their Census that evening? If it was considered and raised, as I would expect those on the technology side would have done, then who made the decision that this did not need to be addressed?

2. Use Modern, Iterative, High-Transparency Implementation Approaches

I find that often businesspeople treat IT as a black box. That is, they put money and requirements in one end and hope to get the results they want out of the other end, with close to zero visibility of the process in between. I have seen IT organisations that prefer to keep things that way.

Modern, efficient, and effective IT organisations, however, embrace transparency in their processes with the business and, in doing so, build trust and client satisfaction with their businesspeople. This is not about the business micromanaging IT; rather, it is about trust, transparency, and flexibility. Key to these modern approaches is that both the business and IT embrace change as a catalyst for improvement as opposed to an inconvenience to be resisted. Modern IT processes minimise archaic “change-resistant” processes to better support the business.

The lack of transparency in the information flowing to the public after the massive #censusfail suggests there is little real business or political understanding of the solution, the events that unfolded, and their root causes. Will this change following the promised investigation? Time will tell.

Will this be the last time such an event occurs – not likely. Is it acceptable – definitely not.

3. Prioritise Deep Business-to-IT Communications and Understanding to Minimise Risk

Associated with effective and efficient change processes is the imperative to ensure that the business and IT understand each other. Using modern, transparent IT processes certainly aids that communication process. There is, however, a need to ensure that the business-IT divide is bridged by an effective translator. I’ve spent much of my last twenty years performing precisely that role, and I can tell you that both business-focused people and IT-focused people need someone who can do (and enjoy) the work of translating between those two worlds.

Blind trust is no substitute for deep mutual understanding. IT solutions need to be understood in business terms, and defined in that way, from the ground up.

Finally – Plan for worst case scenarios.

What is the worst case scenario in terms of number of transactions on Census night? 10,000,000 submissions in one hour? What would it cost to cope with that? How should the system respond? If that is unfeasible to achieve for the fixed budget provided, how else can the problem be solved? Assign different households a time-slot and date for their Census submission to attempt to spread load evenly? A little communication and social engineering like this can sometimes be part of an overall solution which is more than technology-only.
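
To illustrate the time-slot idea, here is a minimal sketch of one way households could be spread across a submission window. Everything in it is hypothetical: the seven-day window, the four slots per day, the household identifier format and the function names are my own assumptions, not part of any actual Census design. It simply hashes an identifier so that each household always lands in the same slot and slots fill roughly evenly.

```python
import hashlib
from datetime import date, timedelta

# Hypothetical scheme: every name and number below is illustrative only.
WINDOW_START = date(2016, 8, 9)
WINDOW_DAYS = 7      # assumed length of the submission window
SLOTS_PER_DAY = 4    # assumed number of slots offered each day


def assign_slot(household_id: str) -> tuple[date, int]:
    """Deterministically map a household to a (date, slot index) pair."""
    digest = hashlib.sha256(household_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % (WINDOW_DAYS * SLOTS_PER_DAY)
    day, slot = divmod(bucket, SLOTS_PER_DAY)
    return WINDOW_START + timedelta(days=day), slot


def average_load_per_slot(households: int) -> float:
    """Households per slot if assignments are spread evenly."""
    return households / (WINDOW_DAYS * SLOTS_PER_DAY)


print(assign_slot("HOUSEHOLD-0001"))
print(round(average_load_per_slot(10_000_000)))
```

With these made-up parameters, 10,000,000 households come to roughly 357,000 per slot. If each slot were an hour long and every household kept to its slot, that would sit below the 500,000 transactions per hour the ABS reportedly expected. Real behaviour would be far messier, which is exactly why the communication side matters as much as the technology.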

At the end of the day, the success of solutions like the Census is about managing risk. If you want to commit yourself all-or-nothing to a single evening on a single day, for a first-of-a-kind (in terms of “digital-first”) eCensus, you are channeling all of your risk down to that one moment.

History says that is a very dangerous approach.
