Resilience, Part Eleven (Questions)

“We, as a community, need to identify the core values of our field.”

The talk’s theme was health care, and it had begun with a history of medical error.  Human error, it turns out, is a relatively recent phenomenon in medicine.  For years, the presenter explained, the term used had been ‘risk’, and risk had been framed as the price we pay for medical progress.  There was also a shared understanding that the dangers of new methods must be accepted, that benefits and risks are inseparable, and that loss rates were remarkably low.  These views had persisted even after the 1974 publication of Ivan Illich’s Limits to Medicine, when the author’s comparison of medical morbidity rates to those of traffic and industrial accidents was considered an attack on the field.  But slowly, beginning in the late 1980s, this belief began to shift.

The shift, the ER physician and researcher explained, had begun for several reasons.  The first, of course, was the profession’s desire to improve.  Another was the annexation of processes from other domains: total quality management (TQM) from business, crew resource management (CRM) from aviation, root cause analysis methods from engineering and design.  But, the presenter theorized, one of the main drivers of the shift from ‘risk’ to ‘error’ was the increasing industrialization of medicine, and the resulting shift from physician-led hospitals, seen as serving a greater good, to MBA-led institutions and their focus on shareholder value.  “Technocratic, ‘scientific-bureaucratic’ managers” (as he called them) strove to use standardization to improve scheduling and economic efficiency, and used ‘error’ to enhance their authority by undermining clinical expertise.

But, he continued, hospital operations are surprisingly complex.  These operations, for the most part, succeed; and succeed in no small part due to the everyday adaptations of doctors, surgeons, nurses and other expert caregivers.  These experts, he suggested, know how to improve the system, and health professionals should find a way to maintain control of healthcare and enable the evolution of improved practices.  Key to this, he stated, is that health professionals, as a community, need to identify the core values of their field.

I perked up in my seat.  I had been a safety professional for ten years, and this was the first time, other than the occasional professional group’s mission statement, that I had considered what the core values of the safety field might be.  To do no harm?  To maximize good?  To advance the field?  Do they include day-to-day actions, such as individual courage, or treating others with dignity?  And how does this work in fields such as mining, military operations, or autonomous entities (including artificial intelligence), where proper use of the product fundamentally and permanently alters the environment?  My mind swirled in delight.

After a quick break and a talk on defining the resilience engineering (RE) problem space and developing a framework for RE tools, we moved on to an open forum to share our views on the Symposium.  The week’s presentations had generated many trains of thought:  How do we engineer systems to be resilient?  Do we need control of our systems for them to be resilient?  How does resilience work in systems that already exist?  How do changes to existing systems impact resilience?  How can we take advantage of existing resilience?  We batted these questions around, and more, in a thoroughly engaging conversation.  Then came my favorite idea, from a lion of the field: “If we, as system designers, are relying on emergency procedures as the last control between system instability and failure, we need to give more respect to our operators.”

For those not familiar with design and build processes, there is a structure used when safety analyses identify issues with a design.  The structure, known as the system safety order of precedence, states that when a hazard is identified during the design process, the preferred option is to revise the design to eliminate the risk.  If this can’t be done, the risk should be reduced by selecting the least risky design option, or by designing in redundancies or fail-safe features that reduce the probability the hazard will occur.  When this does not eliminate the risk, barriers or controls designed to limit the spread or escalation of the risk are included, and periodic checks are added to operating instructions to ensure these features are working effectively.  Finally, if design changes or safety devices cannot be counted on to effectively reduce a risk (or are considered impractical), warning signals, placards, training, and routine and emergency procedures are implemented to counter the shortcomings of the design.  Thus, the adaptability of front-line operators becomes the key design feature counted on to prevent disaster.  Usually this strategy is successful, but in some cases it is not.  And when a front-line operator (or a team of front-line operators) cannot adapt at the pace required to fill gaps in engineering or design imagination, a program manager’s budget, or an operational schedule, they, and not the underlying design or operating processes, are blamed for the accident.  So I sat, grateful that someone else had said out loud the thought I had been pondering for the last few years.
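For readers who think in code, a minimal sketch of that order of precedence as a decision procedure may help.  This is my own illustration, not something from the Symposium; the names and the boolean inputs are hypothetical simplifications.

```python
from enum import IntEnum

class Mitigation(IntEnum):
    """System safety order of precedence, most preferred first."""
    ELIMINATE_BY_DESIGN = 1    # revise the design to remove the hazard
    REDUCE_BY_DESIGN = 2       # least-risky option, redundancy, fail-safes
    CONTAIN_WITH_BARRIERS = 3  # barriers/controls plus periodic checks
    WARN_AND_PROCEED = 4       # warnings, placards, training, procedures

def select_mitigation(can_eliminate: bool, can_reduce: bool,
                      can_contain: bool) -> Mitigation:
    """Walk the precedence from most to least preferred option."""
    if can_eliminate:
        return Mitigation.ELIMINATE_BY_DESIGN
    if can_reduce:
        return Mitigation.REDUCE_BY_DESIGN
    if can_contain:
        return Mitigation.CONTAIN_WITH_BARRIERS
    # Last resort: operator adaptability becomes the design feature
    # counted on to prevent disaster.
    return Mitigation.WARN_AND_PROCEED
```

Note how the structure bottoms out in WARN_AND_PROCEED: when everything above it is ruled out, the front-line operator is what remains.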

Unfortunately, there were no quick and easy answers.  We spoke of adding buffers to increase flexibility, the efficiency versus thoroughness trade-off, and the notion of optimizing a system for recovery.  We drifted back to problem domains, and considered the perspective of solution domains.  Someone pointed out that to solve these questions, or even have the opportunity to solve them, we would need to increase the credibility of resilience engineering as a field.  And then our time was up, and the Symposium was over.  With a promise to meet again, we broke for lunch in the courtyard one last time.

Week 18: Resilience, Part Ten (Third Morning)

The Resilience Engineering Association has, in concert with the Symposium, a Young Talents program.*  This program invites graduate-level students pursuing research related to system resilience to share their work with the wider resilience community.  On the day prior to the Symposium, the ten selected Young Talents had met with thought leaders in the field, presented their work, and received feedback.  This morning, the final morning of the Symposium, the students shared their work with our greater audience.

Still punch-drunk from my erratic and insufficient sleep, I sat, taking in their presentations.  They varied in domain: some in transportation, some in medicine, one in disaster response, another in social services.  The presentations also varied in topic: how to create forward-looking accountability (that is, how to hold leaders accountable in the future for the effects of decisions made today), how to reconcile the different goals within an organization, and how moments of success can create the obstacles and challenges of our next adaptive cycle(s).  And then a gem, from a female student from Japan: are some organizations luckier than others?

I had first noticed luck as a component of successful (or at least less disastrous) outcomes while reading about United Flight 232.  The accident sequence began with an uncontained engine failure, the debris from which severed lines serving all three hydraulic systems, rendering the flight controls of the DC-10 unresponsive.  The cockpit crew, led by Captain Al Haynes, could have been overwhelmed by the situation.  One of the factors that contributed to the survival of so many on board was that a senior instructor pilot with the airline, one who just happened to practice and teach the use of differential thrust (turning an aircraft by reducing or adding power on one side and not the other), was riding as a passenger.  His assistance managing the engine controls is cited as the critical factor that enabled the crew to triage the situation, control the aircraft, and guide the disabled airliner to the runway at Sioux City, Iowa.  Over the years I had heard other stories that suggested luck: an engineering student who asked the right question of the right engineer; an oil platform manager who happened to notice an odd combination of readings in a control room; a pilot or disaster manager who happened to have heard a story (or otherwise learned some tribal wisdom) that provided the key to averting an adverse outcome.  I had thought I was the only one to consider this, and was encouraged by its mention.

The lecture led to lively conversation as the next student set up his presentation: Do the components of resilience generate luck as a by-product?  If so, which components are key, and can they be measured, designed in, or taught?  Is it the extra resources available when an organization relaxes efficiency, so that a measure of slack is available during non-standard operations?  Is it the comprehensive mental models (developed over long periods of time, and including the key dependencies and interrelations between systems) that experts use to predict, prepare for, and call on in times of unease?  Is it the presence of (and the culture that supports the presence of) requisite imagination (defined by Westrum as the fine art of anticipating what might go wrong), which can see failure paths that others cannot?  Or a combination of these three, or others we had not considered?  I sat, grateful that someone else had given voice to a question I had been pondering for the last few years.

By then, the next speaker was ready to begin, and we returned to the Young Talents program.

More Soon!

* Instructors and mentors take note:  additional information about the Young Talents program, including application requirements and deadlines, is available here: http://www.resilience-engineering-association.org/blog/2016/11/15/rea-talent-program-2017-now-open/  The deadline for submissions for the June symposium is 26 January 2017.

Week 18: Resilience, Part Nine (Belém)

A cool breeze from the Atlantic wafted over us, a welcome change from the heat earlier in the day.  We were on a rooftop garden, sipping drinks and discussing research, the banquet room to our west providing shade from the setting sun.  Below us, people explored the geometric gardens and fountains of the Jardim da Praça do Império (Empire Square).  Jesus, in the form of a 260-foot stone monument, watched over us from just past the 25 de Abril Bridge.  And to our south were the Padrão dos Descobrimentos (Monument to the Discoveries) and the Tagus River.  We were at the Cultural Center, a combined performing arts, exhibition and conference center originally constructed to accommodate Portugal’s European Union Presidency, for the Symposium’s banquet.

The Cultural Center is located at the mouth of the Tagus, in the parish of Santa Maria de Belém.  A natural lagoon, the harbor historically provided safe anchorage for mariners, and the lowland fishing and agriculture fed the nearby city of Lisbon.  Construction of the Jerónimos Monastery, a blocks-long complex just north of the Cultural Center, began in the early sixteenth century, and, as the Manueline-style structure matured, it came to represent Portuguese expansionism.  After the 1755 earthquake, the royal family evacuated to a large estate on the hills above the monastery (and much of Lisbon to its grounds).  Over time, the buildings were expanded and renovated and, in the late 19th century, became the royal palace.  Today it is the President’s residence; and this night, the cabbie who drove us to the Cultural Center had explained, its salmon-pink walls were guarded by extra police and soldiers due to a state dinner.

As the sun set, we drifted inside, again organizing by geography and shared language.  The American table quickly filled and I ended up a stray at the Scandinavian table.  There, a kind gentleman from the cruise ship industry took pity on my language skills and included me in the conversations.  Dinner was a local fish, rice and veg dish, followed by an ice cream confection.  And, of course, this all came with a hearty offering of local wines and port.

Fortified with good food and budding friendships, we broke for the evening to make our way back to the hotel.  We were told that if there were no taxis outside the building, we should make our way to the far side of the square, where a taxi line was available to take late evening revelers to their destinations.  The near sidewalks bare, we made our way around the garden’s water feature to a lonely ‘taxi’ sign on the designated praça (square).  Here also, no taxis, just a long line of black town cars.  Worse, as we waited, the taxis we did see would not stop.  Our hotel was a good ten miles away; what were we to do?

We stood, tired, on the corner, considering our options.  Should we continue walking?  Should we ask one of the palace guards?  Then it hit me: the limos were waiting for the dignitaries at the state dinner.  The locals knew this, and were staying away.  I also knew security would not want a bunch of tipsy tourists loitering in the area, distracting them from their mission.  I pantomimed to one of the drivers, asking him to call cars for us.  A few moments later they appeared like magic: three taxis for our group of twelve.  We piled in according to our destinations (four in the backseat, two in the front of the one headed to my hotel) and were on our way.

It was with bittersweet relief my head hit the pillow: relief because I was so, so tired; bittersweet because tomorrow would be the last day of the conference.

Week 18: Resilience, Part Eight (Late Afternoon)

If I were to ask you to name some safety-critical domains, what fields come to mind?  Airline operations?  Nuclear power, spaceflight or submersibles?  Do disaster preparation and response, hospitals, power transmission, rail systems, pipelines or the roads and bridges that make up our surface transportation system make the list?  These are the types of domains that resilience engineering is most interested in: high-tempo, complex systems with a high cost of failure.  The afternoon plenary reviewed one that I had, until then, not considered.

The internet was initially conceived during the Eisenhower era as a way to interconnect the main computers at Cheyenne Mountain, the Pentagon, and Strategic Air Command Headquarters (Strategic Air Command was the Air Force major command in charge of strategic bombers and intercontinental ballistic missiles).  This would allow information sent from any one of these terminals to remain available to the others in the event of nuclear war.  The first ten years were slow going, as the DOD-funded teams encountered many technical challenges, including the development of a common language for terminals to use with one another, and of message blocks to increase the survivability of data over reduced-bandwidth connections.  Finally, an initial connection was made between computers at UCLA and the Stanford Research Institute.  Terminals at UC Santa Barbara and the University of Utah were quickly added to the array, and similar projects emerged in the UK, France and the state of Michigan.  Over the next dozen years, the arrays grew and combined into the ARPANET.  In 1976, after several years of debate regarding operating rules, the International Telecommunication Union standard X.25 was approved.  The internet, as we know it, grew from there.

The two afternoon presenters were both Chief Technical Officers of internet service providers, one an online marketplace, the other a stock exchange.  Their talk began with an introduction to software development and operations (DevOps), a domain that has gradually broadened its focus from developing and shipping code to maintaining online operations.  The reason was simple: system outages of more than a few minutes not only increase customer frustration, but also become national news (think Facebook, Netflix, Amazon).  The online marketplace CTO then shared an experience in which, after loading new software, nine of the company’s ten servers failed.  This required rapidly identifying and fixing the code that had caused the failure while valiantly maintaining degraded service with the one operational server.  (I say valiantly because part of the code that kept the system online involved a ‘handshake’ between servers.  Since there was only one, the handshake was accomplished by having an intern press ‘Enter’ on a keyboard every five seconds for the duration.)  In the end, the failure was traced to a flaw in the upgrade; new code was written, and the software was successfully loaded and deployed.  The event lasted two hours, and thousands of customers were affected.
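To make the ‘handshake’ concrete, here is a minimal, hypothetical sketch of a keep-alive loop in the spirit of the story; it is my own guess at the shape of the mechanism, not the speaker’s actual code.  With no healthy peer left to answer, a human pressing Enter plays the part of the missing server.

```python
import time

KEEPALIVE_INTERVAL = 5  # seconds between handshakes

def human_handshake() -> bool:
    """Stand-in for the missing peer: a person pressing Enter supplies
    the acknowledgment the nine failed servers can no longer send."""
    input("Press Enter to acknowledge keep-alive: ")
    return True

def keepalive_loop() -> None:
    """Keep the lone server 'in conversation' so its cluster logic
    believes a healthy peer is still out there."""
    while True:
        if not human_handshake():
            break  # no acknowledgment: treat the peer as down
        time.sleep(KEEPALIVE_INTERVAL)
```

An intern as a human keep-alive daemon is exactly the kind of improvised adaptation the Symposium kept circling back to.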

The talk then turned to financial markets, where transactions are executed at the millisecond level and software or server outages can ripple through the entire system.  Until recently, when folks thought of Wall Street, they pictured traders scrambling on the trading floor.  These days, almost all transactions are performed by computers executing high-frequency conditional algorithms.  The exchange CTO then told the story of the 2010 ‘Flash Crash’.  On May 6, 2010, a young man, reportedly working out of his parents’ suburban house, initiated a series of ‘spoofed’ stock trades.  These transactions were so frequent and vast that they triggered the selling algorithms of major mutual funds.  Within nine minutes major equity markets had dropped 300 points, and by the time trading was halted five minutes later, the market was down nearly 1,000 points.  Once trading resumed, the market regained some, but not all, of these losses.  The event lasted 2.16 billion microseconds (36 minutes) and affected every mutual fund investor on earth.
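As a toy illustration of how conditional selling algorithms can cascade (my own sketch, not anything shown in the talk; the prices, triggers, and fixed per-sale impact are invented), each fund sells when the price falls below its trigger, and each sale pushes the price low enough to trip the next trigger.

```python
def cascade(price: float, triggers: list[float],
            impact_per_sale: float = 5.0) -> tuple[float, list[float]]:
    """Toy model: each triggered sale pushes the price down,
    which can trip the next fund's sell threshold."""
    fired = []
    for trigger in sorted(triggers, reverse=True):
        if price < trigger:
            fired.append(trigger)
            price -= impact_per_sale  # selling pressure moves the market
    return price, fired

# A spoof-driven dip below the first threshold sets off the rest.
final_price, fired = cascade(price=99.0, triggers=[100.0, 96.0, 92.0, 88.0])
print(final_price, fired)  # 79.0 [100.0, 96.0, 92.0, 88.0]
```

A one-point dip trips the first trigger, and the mechanical selling does the rest; no single algorithm is wrong, yet together they amplify a small perturbation into a rout.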

These events, combined with the U.S. State Department’s request that Twitter delay routine maintenance to help anti-government protesters during the 2009 Iran uprising, had led the two speakers to realize that the internet has become a critical resource, and that more attention should be paid to supporting operations once a site is online.  They described the challenges of maintaining and upgrading software across deep server structures based on 1985 technology while those structures were running live.  In the end, in addition to scaring the pants off the investors in the room, we came to appreciate the internet as more than just a way to order books and watch cat videos.  “Computers are awful,” one of the presenters stated, “and this is why we drink.”  So after some closing comments, we broke for the day to do just that.

Related Links:

John Allspaw: http://www.kitchensoap.com/2015/06/26/reflections-on-the-6th-resilience-engineering-symposium/

Zoran Perkov: https://www.youtube.com/watch?v=wVtpZgn9_W4

Laura Bell (not related to the talk directly, but it has informed my systems and safety thinking): https://www.youtube.com/watch?v=r2IX9QvmDIM&index=1&list=PL055Epbe6d5Y86GSg3nhUH3o_v62FGpCI