If I were to ask you to name some safety-critical domains, which fields come to mind? Airline operations? Nuclear power, spaceflight, or submersibles? Do disaster preparation and response, hospitals, power transmission, rail systems, pipelines, or the roads and bridges that make up our surface transportation system make the list? These are the types of domains that resilience engineering is most interested in: high-tempo, complex systems with a high cost of failure. The afternoon plenary reviewed one that I had not, until then, considered.
The internet was initially conceived during the Eisenhower era as a way to interconnect the main computers at Cheyenne Mountain, the Pentagon, and Strategic Air Command headquarters (Strategic Air Command was the Air Force major command in charge of strategic bombers and intercontinental ballistic missiles). Information sent from any one of these terminals would then be available to the others in the event of nuclear war. The first ten years were slow going, as the DOD-funded teams encountered many technical challenges, including the development of a common language for the terminals to use with one another and the use of message blocks to increase the survivability of data over reduced-bandwidth connections. Finally, an initial connection was made between computers at UCLA and the Stanford Research Institute. Terminals at UC Santa Barbara and the University of Utah were quickly added to the array, and similar projects emerged in the UK, France, and the state of Michigan. Over the next dozen years, the arrays grew and combined into the ARPANET. In 1976, after several years of debate regarding operating rules, the International Telecommunication Union's X.25 standard was approved. The internet, as we know it, grew from there.
The two afternoon presenters were both Chief Technical Officers at companies whose services run over the internet: one an online marketplace, the other a stock exchange. Their talk began with an introduction to software development and operations (DevOps), a domain that has gradually broadened its focus from developing and shipping code to maintaining online operations. The reason is simple: system outages of more than a few minutes not only frustrate customers, they become national news (think Facebook, Netflix, Amazon). The online marketplace CTO then shared an experience in which, after loading new software, nine of the company’s ten servers failed. This required rapidly identifying and fixing the code that had caused the failure while valiantly maintaining degraded service with the one operational server. (I say valiantly because part of the code that kept the system online involved a ‘handshake’ between servers. Since there was only one server left, the handshake was accomplished by having an intern press ‘Enter’ on a keyboard every five seconds for the duration.) In the end, the failure was traced to a flaw in the upgrade, new code was written, and the software was successfully loaded and deployed. The event lasted two hours, and thousands of customers were affected.
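The talk itself didn’t include code, but to make that ‘handshake’ a little more concrete, here is a minimal sketch of what a periodic peer check might look like. The peer addresses and the five-second interval are hypothetical, and a plain TCP connection stands in for whatever protocol the servers actually used:

```python
# Illustrative sketch only -- not the company's actual code. Each server
# pings its peers on a fixed interval and notes when no peer answers,
# which is the situation the intern was manually papering over.
import socket
import time

PEERS = [("10.0.0.2", 9000), ("10.0.0.3", 9000)]  # hypothetical peer addresses
INTERVAL_SECONDS = 5

def ping(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a peer accepts a TCP connection (a stand-in handshake)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    alive = [peer for peer in PEERS if ping(*peer)]
    if not alive:
        print("no peers responding -- running degraded on a single server")
    time.sleep(INTERVAL_SECONDS)
```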
The talk then turned to financial markets, where transactions are executed at the millisecond level and even brief software or server outages can ripple through the entire market. Until recently, when folks thought of Wall Street, they pictured traders scrambling on the trading floor. These days, almost all transactions are performed by computers executing high-frequency conditional algorithms. The stock exchange CTO then told the story of the 2010 ‘Flash Crash’. On May 6, 2010, a young man, reportedly working out of his parents’ suburban house, initiated a series of ‘spoofed’ stock trades. These transactions were so frequent and so vast that they triggered the selling algorithms of major mutual funds. Within nine minutes, major equity markets had dropped 300 points, and by the time trading was halted five minutes later, the market was down nearly 1,000 points. Once trading resumed, the market regained some, but not all, of these losses. The event lasted 2.16 million milliseconds (36 minutes) and affected every mutual fund investor on earth.
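The presenters didn’t describe any fund’s actual logic, but a toy example helps show what a ‘conditional’ selling algorithm means and why many firms running similar rules can cascade: each sell pushes the price down further and trips the same rule elsewhere. The window size and threshold below are invented purely for illustration:

```python
# Toy sketch of a conditional sell rule -- not any real fund's algorithm.
# If the observed price falls more than a threshold within a short window
# of recent ticks, the rule sells.
from collections import deque

WINDOW = 500           # number of recent ticks to consider (hypothetical)
DROP_THRESHOLD = 0.03  # sell on a 3% drop within the window (hypothetical)

recent_prices = deque(maxlen=WINDOW)

def on_tick(price: float) -> str:
    """Return 'SELL' when the windowed drawdown exceeds the threshold."""
    recent_prices.append(price)
    peak = max(recent_prices)
    drawdown = (peak - price) / peak
    return "SELL" if drawdown >= DROP_THRESHOLD else "HOLD"

# A falling tape eventually crosses the threshold and triggers a sell.
for p in [100.0, 99.5, 98.0, 96.5]:
    print(p, on_tick(p))
```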
These events, combined with the U.S. State Department’s request that Twitter delay scheduled maintenance to help anti-government protesters during the 2009 Iran uprising, led the two speakers to realize that the internet has become a critical resource, and that more attention should be paid to supporting operations once a site is online. They described the challenges of maintaining and upgrading software across deep server structures built on 1985-era technology while it was running live. In the end, in addition to scaring the pants off the investors in the room, we came to appreciate the internet as more than a way of ordering books and watching cat videos. “Computers are awful,” one of the presenters stated, “and this is why we drink.” So after some closing comments, we broke for the day to do just that.
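To make that live-upgrade challenge a little more concrete, here is a minimal sketch of a rolling upgrade across a server pool: each server is drained, upgraded, health-checked, and only then returned to service, so the site stays up throughout. The server names and helper functions are hypothetical stand-ins, not the speakers’ actual tooling:

```python
# Assumed rolling-upgrade workflow, sketched for illustration only.
SERVERS = [f"app{i:02d}" for i in range(1, 11)]  # hypothetical ten-server pool

def drain(server: str) -> None:
    print(f"removing {server} from the load balancer")

def upgrade(server: str, version: str) -> None:
    print(f"installing {version} on {server}")

def healthy(server: str) -> bool:
    print(f"health-checking {server}")
    return True  # stand-in for a real health check

def restore(server: str) -> None:
    print(f"returning {server} to the load balancer")

def rolling_upgrade(version: str) -> None:
    for server in SERVERS:
        drain(server)
        upgrade(server, version)
        if not healthy(server):
            raise RuntimeError(f"{server} failed health check; halting rollout")
        restore(server)

rolling_upgrade("v2.4.1")
```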
- You can learn more about the flash crash here: http://www.nanex.net/20100506/FlashCrashAnalysis_Intro.html and trading algorithms here: https://www.ted.com/talks/kevin_slavin_how_algorithms_shape_our_world
Related Links:
John Allspaw: http://www.kitchensoap.com/2015/06/26/reflections-on-the-6th-resilience-engineering-symposium/
Zoran Perkov: https://www.youtube.com/watch?v=wVtpZgn9_W4
Laura Bell (not directly related to the talk, but it has informed my systems and safety thinking): https://www.youtube.com/watch?v=r2IX9QvmDIM&index=1&list=PL055Epbe6d5Y86GSg3nhUH3o_v62FGpCI