Why 40-Year Old Tech Is Still Running America's Air Traffic Control

wglb · on March 3, 2015

Duplicate of https://news.ycombinator.com/item?id=9105363

Someone1234 · on March 3, 2015

> The agency is primarily a regulatory body, responsible for keeping the national airspace safe, and yet it is also in charge of operating air traffic control, an inherent conflict that causes big issues when it comes to upgrades.

There's no conflict there.

Now NSA meant to keep US infrastructure safe, and attack US companies/people is a conflict. But the FAA being responsible for both safety and ATC is far from a conflict.

I'd prefer the FAA moved slowly and got it right. This isn't an area where young-hip 20 somethings re-invented the tech' in the latest web-2.0 fad. This is an area where you want mathematically provable correctness and unit testing up the ying-yang.

Get it right slow, don't get it wrong fast.

ufmace · on March 3, 2015

Based on the article, it sounds like they're getting the worst of both worlds - getting it wrong slow. They've been working on it for how many years, and it can't handle seeing a plane at 60k ft without crashing the whole thing?

I have some sympathy for the idea that most modern software companies, especially VC-funded startups, tend to prioritize getting to market fast over making everything exactly right. I'm not even saying it's wrong for them and their market. But if we're looking for something that's well-designed and carefully tested before going to market, along with a process designed to produce that, this doesn't sound like it at all.

Fuxy · on March 3, 2015

I completely agree with you safety should never be compromised just for the sake of modernization however when it takes 40 years to upgrade you're doing it wrong.

There should be ways of introducing new tech and features while slowly fading out the old ones however you don't switch from one to the other until it has been reasonably field tested.

One problem is they went to a defense contractor to build software instead of a software company.

I'm sure their software is great in their spy airplanes where it will never see the light of day and you can cover it up if it fails majorly however I wouldn't want it in the air traffic control computer.

nfoz · on March 3, 2015

> One problem is they went to a defense contractor to build software instead of a software company.

That's not a problem. That's why it ever worked properly in the first place, let alone working correctly for 40 years. Why shouldn't your system last for 40 years?

FTA:

> Modernization, a struggle for any federal agency, is practically antithetical to the FAA's operational culture, which is risk-averse, methodical, and bureaucratic.

I don't understand the tone of this article. Risk-averse, methodical, bureaucratic.... that's exactly what I want them to be!!

Someone1234 · on March 3, 2015

> One problem is they went to a defense contractor to build software instead of a software company.

Devil's advocate here:

Most software companies simply aren't up to the task. They're just not suited to building provable software from design through implementation often requiring the use of functional languages.

If you look at who does most aircraft software (or similar safety systems) it is mostly by certified engineers who just happen to "engineer" in software, rather than programmers who try to be engineers.

I'm a programmer, I am not ashamed of that. But realistically the way I work is in no way transferable to a safety critical system. In that situation every single function needs to have a unit test, it needs to have reliable input and output, and ideally it needs to self-identify hardware-caused undefined state and fail-safe.

forgottenpass · on March 3, 2015

I'm a programmer, I am not ashamed of that. But realistically the way I work is in no way transferable to a safety critical system.

I'm a programmer on a safety critical system. It doesn't require getting a PE, it requires a good quality system and a sizable amount of scorn towards the forces that want to reduce our profession to nothing more than a trendhopping slapdash self-congratulatory shitshow.

0xdeadbeefbabe · on March 3, 2015

I'll guess that you don't use "provably correct" functional programming out of scorn for trendhopping etc. Am I right?

foldr · on March 4, 2015

It's possible to prove the correctness of non-functional code too. It also tends to be much easier to reason about the time and space behavior of code written in, say, C. In a system with hard real time requirements that's very important.

You also have to consider the compiler. Proving that your Haskell code is correct is largely pointless if you then compile it with a compiler that hasn't been proven correct. At a minimum that's going to mean compiling the code with a lot of the fancier optimizations turned off, which can lead to a rather severe degradation in performance in the case of a pure functional language.

My guess is that there are not currently any functional languages that (i) are significantly easier to formally verify than C or Ada, (ii) have predictable time and space behavior, and (iii) have implementations that are both extremely well-verified and reasonably performant.

frik · on March 3, 2015

Most such critical software is written in Ada or C and run on an real time operating system.

A functional programming style or (almost) pure functional language would help to with (semi) automatic static code proofing.

viccuad · on March 3, 2015

I studied CS, and mathematical software verification and functional languages where mandatory in my uni.

Unless you are an engineer by accident, I would say you should be as prepared as you could be (albeit new to the field).

Honestly, the stance you share in this topic is why normal people historically has not seen computers and software as a reliable thing and continue to use half-assed software such as Windows for example.

frik · on March 3, 2015

True.

But a bad example. Microsoft uses automatic proofing for their (and third party) device drivers. Windows NT series is very reliable and has one of the best OS designs. The Win32 subsystem and shell comes with a lot of legacy and is a different story.

viccuad · on March 3, 2015

That's a bad example, true. In my defense I have to say that I jumped ship before NT, but I remember how people looked at NT's Blue Screen of Death(™) as if it were a normal thing to have in computers and software.

emodendroket · on March 3, 2015

Windows, as it exists today, has stronger engineering and reliability than the vast majority of software on the market, despite being saddled with legacy baggage. You definitely could have chosen a better example.

mikeash · on March 3, 2015

Isn't safety the whole point of ATC? Planes don't need ATC to get from place to place. They need ATC to keep them from bumping into each other while they do it. How is it anything but safety?

jbza · on March 3, 2015

I don't see why we can't move fast and get it right. My idea is to use formal methods to have verified communication protocols and plane operations. While we are at it, we might as well get rid of pilots as well.

Still a humongous undertaking, don't get me wrong. But at least it is a 5 year plan versus a 20-30 year plan which is the current thought plan as I see it.

frobozz · on March 3, 2015

Agreed. Even more, how can you keep the national airspace safe without operating ATC? Here are two options:

* Don't bother, just let the planes fly wherever they want, but levy a fine when they crash.

* Put it out to tender every 5 years and fine commercial ATC operators when crashes occur.

Both of them shut the stable door after the horse has bolted, and would probably require them to operate at least a passive sort of ATC themselves in order to assess blame.

steego · on March 3, 2015

If you don't feel like reading the whole thing, here's the gist: Another government contractor (Lockhead Martin) fucks up now it's the FAA's fault for having standards.

My one experience working on a government contract was positive. Basically the government set a deadline, some basic performance metrics and added clauses that fined us if we didn't meet those performance metrics. If were early and/or exceeded those metrics, we got a bonus. The end result was simple: Our company did everything it could to make sure we met the deadline and we delivered a quality solution.

How much do you want to bet that a big company like Lockhead Martin is really good at using its political leverage to make sure the FAA doesn't impose fines in the contracts when projects don't perform up to key metrics?

Jtsummers · on March 3, 2015

> How much do you want to bet that a big company like Lockhead Martin is really good at using its political leverage to make sure the FAA doesn't impose fines in the contracts when projects don't perform up to key metrics?

Ding ding ding!

I watched Northrop do exactly that on a system. They used their clout to persuade the program office on a system to rate them more favorably than they deserved (truly, they were consistently delivering late but ended up with positive marks across the board). Fortunately, this resulted in a big shakeup once it was discovered and reported. Unfortunately, NG continued to hold the contract (no one else could, practically speaking) and all that really changed was the personnel in the program office. I was gone by the time thing got "fixed" so I have no idea if it's actually improved on that particular project.

aburan28 · on March 3, 2015

The project specifications demanded by the FAA are absolutely ludicrous. This is not the F35 program.

steego · on March 3, 2015

I honestly can't comment on the specifications, but seeing how you're familiar with them, can you elaborate? What makes them ludicrous? How would you structure the specifications to be more reasonable? How would you layout the incentive structure to ensure the contractor delivers a safe, secure and reliable system?

strictnein · on March 3, 2015

My experience with trying to get even small tech things approved by the FAA: It will take at least a year and you will never know what stage you're at. Basically, once enough people have breathed on your request, it will be approved, assuming they can find absolutely nothing wrong. If they do, you get to start over.

If you build enough rapport with some of the workers there then they will actually respond to your emails in a relatively timely fashion, otherwise you'll be completely in the dark. Once every so often you find someone who actually responds quickly and does their job well. You guard that person's contact information and only give it to people you trust.

To be quite blunt, almost everyone you encounter intends to be a lifer at the FAA and knows that simply sticking around will almost guarantee them higher pay and a fairly decent retirement package. There is zero upside for them.

drzaiusapelord · on March 3, 2015

This is the problem with making government jobs unionized or making them overly-cush jobs. I was talking to someone about benefits and he listed his federal benefits. I was floored. He also claimed he's more or less unfireable. I don't know how the federal government gets anything done and when it does get done, it gets done over-budget and incorrectly.(Obamacare website is probably a good example here)

They're playing retirement min-max'er and I'm playing honest citizen. They're winning at the expense of us all.

undergrowth54 · on March 3, 2015

Obamacare's site wasn't built by federal employees--it was built by contractors.

drzaiusapelord · on March 3, 2015

That's still the same problem. These federal employees chose and managed those contractors. The ultimate responsibility falls on them.

logfromblammo · on March 3, 2015

No. No, it doesn't.

They get the credit when things go right. Whenever things go wrong, it's all on the contractors.

Based on my anecdotal experience, the government employees are simply untouchable. Even when they blatantly break the law, they still keep their jobs, and all the contractor and subcontractor employees lose theirs. Maybe they get a letter in their personnel file that keeps them from certain transfers and promotions, but they don't get fired.

The worst thing that typically happens to them is a furlough, which is a lot like a layoff in the private sector, except the workers can actually expect to be called back to work at some point. Sometimes they even get paid something for their inconvenience.

I have never seen even a single one of them assume personal responsibility for anything even remotely negative. It does not happen. The buck never stops anywhere.

emodendroket · on March 3, 2015

I don't think so. The very structure of contract work encourages slapdash products (after all, if you've checked off the boxes, you've "delivered" and can be paid, regardless of whether you meet the customer's actual needs). I bet that if the US government hired a team of programmers who answered only to their government bosses quality would go up (I understand the UK did this and that was the result they found).

emodendroket · on March 3, 2015

It's not entirely clear to me that they're wrong to move slowly on this.

fennecfoxen · on March 3, 2015

Moving somewhat slowly is fine, but 40 years (going on 50) may be excessive. And as long as it adds hassle and expense to air travel, it means people will be more likely to drive and in turn to get in car accidents. (Of course, those lives aren't on the FAA scorecard, so they obviously don't matter. They probably don't even fly first class. </snark>)

And it would be very handy in a variety of cases for "one air traffic control center to take over for another with the flip of a switch".

khuey · on March 3, 2015

Those sorts of issues are on the FAA radar. Convincing families to fly on vacation instead of driving is basically the only justification for the infant-in-arms rule.

collyw · on March 3, 2015

Agreed. If it ain't broken don't fix it.

darksaints · on March 3, 2015

It's not broken in the "regularly kills people" sense, which is good. It is broken in the "inefficient and capacity limited" sense, which means flights cost more, have fewer departures when demand is highest, and we are still stuck with human air traffic controllers doing the most stressful aspects of the job, causing high turnover and, when they walk out on strike, airport shutdowns.

crpatino · on March 3, 2015

And what's exactly wrong with that???

All that means that it's crystal clear to Upper-Management that air traffic control is Hard!, and it must be properly resourced. It makes me feel safer (even if probably more risky due to human error).

The minute automation comes in, an army of Crazy! PMs will begin to demand that everything has to be done due Thursday evening, so QA can test over the weekend, and then the Bean-Counters come in and slash every specification of hardware because its not quote-Efficient!-unquote.

Would you really want to fly in a world like that????

unreal37 · on March 3, 2015

Also, the original article mentions a contractor who committed suicide by taking down the Chicago ATC system. Isn't suicide a leading cause of death in air traffic controllers as well? Perhaps the systems aren't working so great now given the real human cost.

somerandomone · on March 3, 2015

I would argue it is broken. >It can handle a limited amount of traffic, and controllers can't see anything outside of their own airspace—when they hand off a plane to a contiguous airspace, it vanishes from their radar.

To me there's no way to tell if the plane flies out of their space or drops out of the sky at that moment. If that's the case the search and rescue won't be better had MH370 happened in the US.

Someone1234 · on March 3, 2015

Even with this new system MH370 could happen to an aircraft leaving the US. Both the pacific and atlantic oceans have giant radar black spots where aircraft aren't trackable and instead they rely on HF to guestimate where aircraft are [0].

[0] http://www.technologyreview.com/news/533871/could-passenger-...

foldr · on March 3, 2015

>To me there's no way to tell if the plane flies out of their space or drops out of the sky at that moment.

The problem there is primary radar coverage, not the ATC system. (MH370 turned off its transponder so the only way to detect it was via primary radar.) There's no point in spending huge amounts of money expanding primary radar coverage just to get a marginal benefit in an enormously unlikely scenario.

drzaiusapelord · on March 3, 2015

Who gets to define broke? How about the tax dollars paying for legacy equipment, legacy labor, etc? How about the lack of an uprade path and current limitations being permanent?

I inherited a system from a guy with this mentality. I had a rack of 12 year old Solaris boxes in production running software almost as old. It was a complete nightmare. Part of things not being "broke" is making sure they work well into the future as well. You're just building up a massive technical debt with a "aint broke" mentality.

emodendroket · on March 3, 2015

Something like air traffic control has a much, much higher potential cost for software bugs than most applications do and, unfortunately, that also means a much slower upgrade cycle.

Spooky23 · on March 3, 2015

Why is it there? It exists, and it works. Replacing big systems like this is incredibly difficult and expensive.

Personally, I think some of the old, expensive tech like ground based radar is a better solution than using GPS. End of the day, we're over-dependent on GPS, the system is very vulnerable to attack, and any disruption of it would global and catastrophic, while ground-based stuff is at least somewhat localized.

The MH370 case is a red herring IMO. Say we knew where that plane crashed. What does it mean? Everyone's dead in any case, and the details of precisely what happened doesn't necessarily offer a great deal of value.

jpalawaga · on March 3, 2015

In fact, the simplicity of the older systems usually lead to greater stability. Not to mention that systems based on pieces of different ages (as is the case during upgrades) is likely much more unreliable, because you do not know how the equipment will interact with each other.

Take for example Vancouver's Skytrain. Last year, there was a failure of some equipment that resulted in a system-wide shutdown. Media got in and was appauled that Skytrain still uses equipment run on floppies. The old software/hardware wasn't what broke down, though--it was some NEW equipment more recently installed to manage the PA system.

Why fix what ain't broke? ESPECIALLY when people's lives are on the line.

drzaiusapelord · on March 3, 2015

>In fact, the simplicity of the older systems usually lead to greater stability.

Remember when that control tower in Chicago got attacked by some crazed employee? Do you know why there isn't just a simple DR switch over? Because the system is so antiquated, something like that is too problematic to implement. Thus all the cancelled flights for a week at the world's busiest airport.

>ESPECIALLY when people's lives are on the line.

Its more risky to have a pig with lipstick system run by COBOL graybeards one foot in the grave than to run things properly in a modern system with modern redundancy and modern features. Last I heard, the code that runs many of these systems is non-auditable because its just a pile of spaghetti code on top of spaghetti code. Its "works" in the same way an 80 year lady on her third hip can "walk."

logfromblammo · on March 3, 2015

If government contracting is involved in the replacement, the new code is all spaghetti too, because the requirements are "make it work exactly like the old system", and the work is spread out to 5 different subcontractors: JoeCo, LarryCo, MoeCo, CurlyCo, and ShempCo.

mikeash · on March 3, 2015

The point of accident investigation is to be able to learn and improve from what went wrong, to reduce the risk of the same type of accident happening again.

One could certainly make the point that occasional mysteries cost less than sealing up all the holes, but it's silly to say that knowing what happened to MH370 would be pointless because everybody is dead anyway.

sosborn · on March 3, 2015

> details of precisely what happened doesn't necessarily offer a great deal of value.

Maybe the do, maybe they don't. I'd rather find the plane and figure this out than assume.

ufmace · on March 3, 2015

Sounds like a lot of the conflict is the old Agile vs Waterfall debate. Waterfall isn't very well regarded in the software engineering community, but it seems to work much better in the Aerospace community, especially for stuff like flight control avionics.

I think the difference is that Waterfall is actually a better solution - as long as you actually know all of your requirements ahead of time, and they will actually never change. This may be the case in for embedded stuff like avionics, but is almost never actually the case for more business-like software that automates or assists with human processes, which is what Air Traffic Control actually is.

So it kinda sounds like they used a traditional aerospace Waterfall project plan, but the requirements gathering was a total fail. Sounds like they need a team more experienced in Agile development, even though it has more of a reputation for putting things into production before they've been really carefully tested. Of course, this is based off a lot of shaky assumptions from a vague article, so who knows.

kens · on March 3, 2015

According to Wikipedia [1], the FAA's Host system was upgraded in 1999 to IBM 9672 computers, so the computers aren't that old. It seems inaccurate to call this a 40 year old system when the computers are less than 20 years old.

As for the original IBM 9020 computers running Host, these were IBM 360 variants. The technology in these computers is interesting - instead of newfangled ICs, the computers were built from SLT modules. These are metal modules kind of like an integrated circuit, except there's a small circuit board inside with discrete transistors, diodes, and printed resistors. (I was looking at some SLT modules yesterday.)

[1] http://en.wikipedia.org/wiki/IBM_9020

Jtsummers · on March 3, 2015

My office has 30-40 year old systems running on 10 year old hardware. We don't claim that the system is 10 years old, because all we did was move from VAX hardware to VAX VMs.

stcredzero · on March 3, 2015

I suspect this article is saying lots of things that are true or mostly true, but with fluffy or nonexistent justifications. As pointed out by kens, the hardware is actually 20 years old, not 40.

Think of Uber, Lyft, and Airbnb: Outdated regulations slowed them down, but consumer demand is forcing the law to evolve. This back-and-forth is what lets tech companies move fast and break things without risking our safety.

As it is for those companies, it was also with early aerospace: Risks weren't eliminated. Rather, at first, risks were sequestered to advantageous subpopulations and taken by volunteers who, in theory, understood what they were about to do.

These issues screwed up HealthCare.gov and are screwing up the Department of Veterans Affairs and a dozen other agencies that need computers and software that work.

Probably true but lacks substantiation in this article.

To the author's credit, she's not a free market cultist in this article, and points out that other countries have better government managed air traffic control.

knz · on March 3, 2015

I didn't see anything in the article about how political concerns over technology such as RNAV (http://www.faa.gov/air_traffic/publications/atpubs/aim/aim01...) have held back the transition to a nextgen system. I work at an airport where community concern over any potential changes to where aircraft fly has held back implementation of new technology that would make the whole process more efficient. My understanding is that many other major airports in the US face similar community and political issues about RNAV. I suspect this has also had a major role in the delay of the nextgen roll out.

zb · on March 3, 2015

Note that this is the FAA's second attempt to replace the system. The first one, in the 1980s, went even worse - one participant called it "the greatest failure in the history of organised work": http://www.baselinemag.com/c/a/Projects-Processes/The-Ugly-H...

tezzer · on March 3, 2015

It's also the disaster that kicked off the field of software engineering. They looked at the mess and thought maybe there should be a formalized approach to doing something this vast in software.

aburan28 · on March 3, 2015

A close family member of mine is one of the project leads on this nextgen program and their is little to no concept of security built into this program. There are so many points of failure and data integrity is based upon non-crypto hash functions such as fletcher 32 bit.

lisa_henderson · on March 3, 2015

Robert L. Glass has compiled a book of entertaining and instructive stories regarding the massive software project failures of our time: http://www.amazon.com/Software-Runaways-Monumental-Disasters...

Software Runaways: Monumental Software Disasters.

Anyone with an interest in the management of software projects should read it.

The worst of these project failures, the most expensive and the most ambitious, was the attempt made by the Federal Aviation Authority to modernize the computer system it uses to keep track of what planes are in the air. The effort began in 1981 and ended in complete failure in 1994. The government hired IBM to do the actual work, and over the course of 14 years, IBM burned through $3.7 billion dollars. Nothing was accomplished. The project was finally shut down by Congress. Nothing came out of the project, not a single piece of software, nor even a line of code, was ever used for anything.

From the book:

It has been noted by everyone from the New York Times to the Vice-President of the United States that the main problem on the Advanced Automation System was "changing requirements". For those involved in large-scale computer systems, that is nothing new. No one can perfectly surmise the shape and feel of a system years in advance. Even replacing some aspect of a system you know by heart is not immune from thinking twice about it. ... [But] the requirements churn (it was called) on the Advanced Automation System was not normal. It was the result of our enchantment with the computer-human interface, the CHI. The new controller workstation, fronted by a 20" by 20" color display, because it was capable of seemingly endles variety of presentations, mesmerized the population of AAS like the O.J. Simpson trial mesmerized the nation...

The project was handed over to human factor pundits, who then drove the design. Requirements became synonymous with preferences. Thousands of labor-months were spent designing, discussing, and demonstrating the possibilities: colors, fonts, overlays, reversals, serpentine lists, toggling, zooming, opaque windows, the list is huge. It was something to see. (Virtually all of the marketing brochures - produced prematurely and in large numbers - sparkled with some rendition or other of the new controller console.) It just wasn\'t usable...

Rummaging through one of the closets at the far end of the hall on the fifth floor one day, looking for some standards document, I found an envelope left by someone who left the company - as many did after so many years advancing against stone, while the wheels of commerce were accelerating on what everyone referred to as "the outside". It contained "A Brief History Of The Advanced Automation System". It was printed by hand and left, perhaps inadvertantly, or perhaps with the hope that some anthropologist might some day discover it and make a pronouncement. In every important way, it is the truth:

"A young man, recently hired, devotes years to a specification written to the bit level for programs that will never be coded. Another, to a specification that will be replaced. Programmers marry one another, then divorce and marry someone in another subsystem. Program designs are written to severe formats, then forgotten. The formats endure. A man decides to become a woman and succeeds before system testing starts. As testing approaches, she begins a second career on local television, hosting a show on witchcraft. An architect chases a new technology, then another, then changes his mind and goes into management. A veteran programmer writes the same program a dozen times, then transfers. The price of money increases eight times. Programmers sleep in the halls. Committees convene for years to discuss keystroking. An ambitious training manager builds an encyclopedia of manuals no one will ever use. Decisions are scheduled weeks in advance. Workers sit in the hallways. Notions of computing begin in the epoch of A, edge toward B, then come down hard on A + B. Human factors experts achieve Olympian status. The Berlin Wall collapses. The map of Europe is redrawn. Everything is counted. Quality becomes mixed with quantity. Morale is reduced to a quotient, then counted. Dozens of men and women argue for thousands of hours: What is a requirement? A generation of workers retire. The very mission changes and only a few notice. Programming theories come and go. Managers cling to expectations, like a child to a blanket. Presentations are polished to create an impression, then curbed to cut costs. Then they are studied. The work spikes and spikes again. Offices are changed a dozen times. Management retires and returns. The contractor is sold. Software is blamed. Executives are promoted. The years rip by with no end in sight. A company president gets an idea: make large small. Turn methods over to each programmer. Dress down. Count on the inscrutability of programming. Promote good news. Turn a leaf away from the sun. Maybe start over.