Pips in Their Rows

Due to my shameful 2-11 record in fantasy football this season, I welcomed the league to my house on Sunday for the game (it’s been a tradition for the last place finisher to host the Super Bowl every year). Our league started 15 years ago with a bunch of recently-transplanted Boeing engineers straight out of college. Most of us have left the company or been laid off, but a few of the original crew are still at the company. In talking to one of them this weekend, it’s clear that the battery problems with the 787 are a serious headache for Boeing, but many of them were problems that a lot of folks (even going back to when I was there and it was still the 7E7) fully anticipated. And it sounds like some are starting to voice that a little more.

I left the company in 2000, but even prior to that, the vision of a global outsourcing model to build the next big Boeing plane had already been articulated. Some of the older folks in my group (I was a flight control test engineer) were nervous. Others were beyond nervous and predicting doom. One of them fired off a long, angry email to the all-company email list. He was reprimanded, but not fired. I’d love to read that email today, as I’d expect it would be like reading the scrolls of Nostradamus.

Outsourcing has occurred in a lot of different work environments. And in some of them, it’s arguably achieved its objectives – lower production costs with equal or near-equal production. But within the tech world, nearly all the attempts at outsourcing I’ve seen have been disasters. I’m not talking about just tech support or research or some specialized one-off skill, I’m talking about efforts to design, build, and test a large-scale development project with different project groups located around the globe. The logistical difficulties and communication issues involved quickly overwhelm your ability to move at the pace you need.

Let me give a simple example more related to the world I currently inhabit, the world of software services and big data. This’ll be familiar to more people than the innards of a jumbo jet, but I promise I’ll get back to Boeing afterwards.

Let’s say you’re a software tester on a small team. It’s Monday morning and you’re supposed to get a web application build from the developer to install on your test web server box. You get in at 8am and a link to the build is in your inbox from late Friday night. You begin to install it but it throws some kind of configuration error. You suspect the developer didn’t package it up right, so you grab some coffee for a bit and wait for the developer to get in. When he gets in at 9am, you stroll over to his desk and tell him. He makes a quick fix, you get it installed, and you start up your testing.

By 10am, you’ve done a good amount of poking around in the application, which is supposed to read from a database, do some calculations, and show some fancy graphs related to the data. Because you’re a diligent and prepared tester, you have a whole long list of test cases that you came up with in anticipation of the handoff. The first thing you notice is that when some fields in the database are zero, the graph isn’t displaying the data in a way that makes sense. At the 10am stand-up meeting, you bring this up to the program manager. The program manager looks at it and you convince him it’s wrong, but the developer isn’t convinced. A whiteboard is summoned and the program manager convinces the developer he needs to make a change. You file a bug, and the developer starts to fix the code.

In the meantime, you continue testing. You discover that in order to get to certain views in the web application with particularly sensitive data, you need to pass in some credentials. The program manager isn’t sure what they should be, and the developer was testing against a mock local version of the DB, so he never had to worry about this. The database admin is still hungover from the concert he went to last night, but he looks up the credentials and gets them to you. They work, but it’s 12:30pm and you need to grab some lunch.

After lunch, you start checking out how the application is showing the sensitive data (financial records from various countries). You notice that in a few of the countries, the numbers are blank, but you’re pretty sure the database has data. You ask the database admin for help, and he gives you direct read-only access to the database so that you can look everything up yourself and he can go back to playing foosball.

You discover that the data in the database looks totally fine, but the database is set up in a weird way in order to accommodate data coming from different places. You end up having to track down a different developer who wrote the ETLs (a fancy acronym for programs that extract data from one source, transform it, and load it into a database of some kind), and she realizes after some debugging that she’s not doing the encoding right, which is messing up how the other developer reads it. Her code was being tested separately, but that tester had little understanding of how encoding works and never knew it was wrong. You file a bug, and the developer starts coding the fix (it’s a simple one) so that it makes it into Wednesday’s production release. It’s 4:30pm now and you’ve had a pretty good Monday.
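
That encoding bug is a classic cross-team failure, so a concrete sketch may help. This is a hypothetical Python example (the file name, encodings, and records are all invented for illustration): the ETL writes text in a legacy encoding while the application assumes UTF-8, so non-ASCII values come back mangled.

```python
def etl_write(records, path):
    # Buggy ETL: writes the file in Latin-1 (a legacy encoding)
    with open(path, "w", encoding="latin-1") as f:
        for country, amount in records:
            f.write(f"{country}\t{amount}\n")

def app_read(path):
    # The web application assumes UTF-8; undecodable bytes are
    # replaced with U+FFFD instead of raising, so the bad data
    # slips through as mangled or blank-looking values
    rows = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            country, amount = line.rstrip("\n").split("\t")
            rows.append((country, amount))
    return rows

records = [("España", "1000"), ("Türkiye", "2000")]
etl_write(records, "finance.tsv")
print(app_read("finance.tsv"))
# The non-ASCII country names come back corrupted until the ETL
# is changed to write UTF-8, matching what the reader expects.
```

A tester with read access to the raw table, as in the story, sees correct data in the database but garbage in the application – exactly the symptom described.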

Now let’s imagine that same scenario in an environment where the development work has been outsourced to India and the data warehousing work is being done in Russia. You and the program manager are still in Seattle.

You get to your desk Monday morning and you have the link to the web application build in your inbox, sent two hours ago from India, where it’s Monday evening now. Normally, the program manager and the offshore dev lead will talk at the end of the Indian workday and the beginning of yours, but by the time you discover that the build is broken, it’s almost 9pm in India. So you spend your Monday catching up on some test documentation, surfing Reddit, and honing your foosball skills.

Tuesday morning in India, the developer corrects his mistake and fires off an updated build, which you’ll get in about 10 hours when it’s 8am in Seattle. You get in, install it, and you’re off and running. You were prepared for the test effort, maybe even more so now thanks to your extra day to prepare, so you quickly notice the error in the way the graph displays zero values and bring it up to the program manager. The program manager is convinced it’s a bug and adds some notes to the bug report you filed against the developer.

It’s now Tuesday at 1pm and you discover that you’re unable to see the sensitive data pages without additional credentials. The database is now being administered in Russia, where it’s the middle of the night. The program manager looks through his email to see if he can dig up the password you need. He can’t, so you file a work ticket in their system and take a second look at some test cases for the rest of the afternoon.

Wednesday morning in India, the developer looks at the bug report you filed, but doesn’t quite understand how the program manager wants it changed. He talks to his dev lead down the hall, who isn’t even convinced it’s a real bug. He tells the developer that he’ll talk to the program manager at the end of the day when it’s morning in Seattle.

In Russia, they receive the change ticket to provide your database credentials, but it’s a busy day there. All the new ETLs are going into production, so even though your change ticket is marked as a “Blocker”, it’s only blocking testing, so it doesn’t get acted on until late in the day. And then, because it involves providing credentials to a sensitive database, it has to go through additional approval that won’t happen until the next day.

Wednesday morning at 6am in Seattle, the program manager gets on the phone with the Indian dev lead and tries to explain how the graphs should be displayed. Without a whiteboard, and with the language barrier, this is difficult, but eventually the program manager gets the point across. By then the developer has already left for the day, though, so it won’t get implemented until Thursday.

You get in at 8am and discover that you’re still blocked from being able to do anything. You sigh, sit back in your chair, and wonder about all the other things you could’ve done today if you just had the balls to pretend to be sick. You spend enough time at the foosball table on Wednesday that you start feeling confident enough to take on Steve from marketing. He rolls you, 5-1.

On Thursday, the Indian developer fixes the graph bug, leaving you with an hour or two of regression testing since his change affects nearly every graph in the system. You also receive your test credentials from Russia. So you’re feeling pretty good about being able to make some progress as you settle in at your desk Thursday morning. You finish up the regression testing by about 11am and log into the secure section to look at the parts of the application you’ve been blocked from. You notice the problem where there’s blank data in places where you’re certain the data exists. You try to use the same credentials to look directly into the database, but it won’t let you. You file another ticket against the admins in Russia. You spend the rest of the day watching YouTube videos and playing Angry Birds on your phone.

By Friday morning, you have your updated read-only access directly to the database and you confirm what you suspected, that the data is correct in the database, but being displayed as empty values in the application. You suspect there might be a problem with the ETL’s, and to your delight, the Indian programmer who wrote the ETLs works a late shift and is still online. It’s Friday night in India, so you feel kind of bad making him work a little late, but you really want to get this figured out. He quickly discovers the problem with the encoding. But since they went to production with this broken code on Wednesday, he’s screwed. He can fix your test database, but not until he fixes production first.

So you continue to test whatever you can, but you’re looking at the same things 3 or 4 times and getting bored. After lunch, you’ve given up. You notice that the cute new office admin is hanging out at the foosball table, so you head in there and play a few games. You finally get a rematch with Steve and you take him to 4-4 before he shoots a laser past your hapless goalie from his back line. But you’re happy with your improvement and the admin reminds you that everyone’s going to happy hour, as it’s the outsourcing consultant’s last day before moving on to his next job. So you cut out a little early, do some shots with the admin and her boyfriend, and then head home for the weekend having made less progress in a week than a tester in a co-located work environment makes in a day.

Ok, so this is a bit of a jokey example, but anyone who’s worked in an environment like this can tell you how close to home it hits. And one thing that I wanted to make clear is that the problems with tech outsourcing often have nothing to do with the quality of the offshore employees. In the times I’ve had to work with offshore workers, they’ve all been very good. The problem is one of logistics and communication. Any project that involves a large number of integration points is going to require a lot of coordination and communication. And just as with my example, even small failures can expand your timeline exponentially. This is where there’s a lot of similarity between large-scale IT endeavors and building an airplane.

Boeing had teams all over the globe, each doing their own design for their own parts. There was a belief that if you work hard to define and understand the integration points, you can make this chaos work. But ever since the 777, modern airplanes are essentially large-scale flying computer systems. And this plan worked as poorly as it does when the development of any other large-scale computer system is spread out across the globe. A tech project requires people to be nimble, to be able to aggressively take charge and sometimes step out into different roles. To use a soccer metaphor, sometimes you need a midfielder to play striker for a bit, or for more people to come back into the box to defend a corner kick. It requires everyone to be flexible and work together. But offshored development teams turn the playing field into a foosball table, with everyone stuck in their rows kicking the ball around with limited control over where they go and what they do.

The rest of the story at Boeing is becoming well known. Delays started to mount as the complexity of testing all these integration points manifested itself. The normal rigorous oversight was largely bypassed as the FAA threw up its hands trying to do its usual fault analyses on foreign suppliers. And years after the first Dreamliner was rolled out for the world to see, they’re all currently sitting dormant around the world as engineers scramble to figure out why the lithium-ion batteries are catching fire.
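
For readers who haven’t seen one, those fault analyses are fault tree analyses: you take per-component failure probabilities and combine them up through AND/OR gates to show that a catastrophic combination of failures is vanishingly unlikely (roughly on the order of 1e-9 per flight hour). Here’s a toy Python sketch with made-up numbers, just to show the arithmetic:

```python
def and_gate(*probs):
    # AND gate: the event occurs only if every input fails.
    # Assuming independent failures, the probabilities multiply.
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(*probs):
    # OR gate: the event occurs if any input fails.
    # 1 minus the probability that every input survives.
    p = 1.0
    for q in probs:
        p *= 1.0 - q
    return 1.0 - p

# A system with three redundant channels, each failing with
# probability 1e-3 per flight hour, loses all three at once with
# probability of about 1e-9: the kind of number certification demands.
print(and_gate(1e-3, 1e-3, 1e-3))
```

The catch: without trustworthy failure rates for each outsourced component, the leaf probabilities feeding these gates are unknown, and the whole analysis stalls.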

A few years back, I was a test manager at a company that was in the process of outsourcing a lot of its development and testing work. I scheduled a lunch with a consultant who said she had expertise in making offshore workgroups more agile. During the lunch, I asked her if she knew of any companies that’ve been successful with their offshoring projects. She stopped for a second, thought about it, and said “they must be successful, everyone’s doing it”. Hopefully not for much longer.

Comments

  #1

    Roger Rabbit spews:

    Managements who try to build highly complex machines with outsourced parts and non-union labor get what they pay for. I’m surprised no wings have fallen off yet.

  #2

    rhp6033 spews:

    I couldn’t agree more. I’ve mentioned before my brother-in-law, who, while working at HP/EDS, spent the past two years as a project manager overseeing a group of developers in India. There were time-and-distance problems and communication issues, and while the developers were highly educated, there was a serious gap between their experience level and his. He ended up re-doing a lot of their work himself; it was just easier than trying to train them how to do it. (Of course, he also realized he was being required to train his eventual replacements.)

  #3

    rhp6033 spews:

    Of course, part of the reason why Boeing wanted to farm out much of its production work was that it wanted other companies to share in the financial risks. It also made its own financial statements look better: because earnings and expenses are compared against a corporation’s total assets, offshoring expenses (and a portion of the profit) while divesting itself of assets (tooling, production facilities) ends up producing higher stock prices (short-term) and big bonuses for the senior executives.

    Also, Boeing was unwilling to invest in the equipment needed to make large carbon-composite assemblies – a decision which made it entirely dependent upon the vendors of large assemblies.

    But even the most basic production processes were made under a flawed Boeing plan.

    (1) Boeing seemed to think that if a firm had made a portion of an aluminum & titanium structure before, it could scale that up and build a lot more, and larger, carbon-composite structures.

    (2) Boeing also imposed a schedule which was driven by market timetables, not engineering realities.

    (3) Boeing needed to have inspection teams at each of the vendors’ facilities with experienced Q.C. to buy off each major assembly before it was shipped to Boeing – something it didn’t do until well after it was clear that the plan had failed and the 787 deliveries were going to be years behind schedule.

    (4) Finally, management totally dismissed the amount of re-work required in the Everett factory to correct the vendor mistakes – it would have been much quicker to simply have Boeing machinists do all the work themselves.

    The fault lies primarily with Harry Stonecipher, the McDonnell Douglas CEO who went on to take over Boeing after the merger. Although he had an engineering degree, he was primarily a finance guy who had killed M-D by cost-cutting, deferring new airplane programs until their existing aircraft were obsolescent, and allowing the accountants to run the factory instead of the engineers and production managers. Stonecipher’s vision was that these outsourced components would arrive at Boeing, requiring only three days of assembly before the Boeing name tag was attached to the door frame and the aircraft was rolled out to the customers. The current CEO, McNerney, is from the same “Jack Welch” management school (named for the former head of G.E. who dismantled much of it), although McNerney is learning – a bit too late.

  #4

    rhp6033 spews:

    By the way, what’s a permissible failure rate of an aircraft part? 1%? 2%? Not hardly. Once the part passes Q.C., the failure rate of a critical aircraft part has to be much smaller than that. Of course, you have redundancies built in for critical systems, but you don’t want to have to rely upon that.

    Which is why outsourcing requires just as much on-site inspection and management as if the parts were built in-house.

  #6

    ArtFart spews:

    @3 Anybody know if Stonecipher was around (he’s certainly old enough) calling some of the shots at McD on the DC-10 project? I knew an engineer who had worked there at that time, and he described some very similar decisions that were made–particularly “market-driven” timetables and massive design changes du jour driven by guesses as to what the competition was doing. The DC-10, particularly in its early years in service, racked up a pretty dismal record for safety and serviceability.

  #7

    phil spews:

    Speaking of GE, here’s a great article about them bringing manufacturing back to the good old USofA.

    http://www.theatlantic.com/mag....._page=true

    So a funny thing happened to the GeoSpring on the way from the cheap Chinese factory to the expensive Kentucky factory: The material cost went down. The labor required to make it went down. The quality went up. Even the energy efficiency went up.

    GE wasn’t just able to hold the retail sticker to the “China price.” It beat that price by nearly 20 percent. The China-made GeoSpring retailed for $1,599. The Louisville-made GeoSpring retails for $1,299.

  #8

    spews:

    @6
    The engineer that I referenced in the second paragraph (who sent the company-wide email) had also worked at the McD offices in SoCal (post-merger). That experience was a major part of why he was so concerned.

    RHP,
    Thanks for the long and very thoughtful comments, as always.

  #9

    spews:

    @ 7

    The article describes several reasons that US production became cost-competitive. Among them:

    In dollars, wages in China are some five times what they were in 2000—and they are expected to keep rising 18 percent a year.
    American unions are changing their priorities. Appliance Park’s union was so fractious in the ’70s and ’80s that the place was known as “Strike City.” That same union agreed to a two-tier wage scale in 2005—and today, 70 percent of the jobs there are on the lower tier, which starts at just over $13.50 an hour, almost $8 less than what the starting wage used to be.
    U.S. labor productivity has continued its long march upward, meaning that labor costs have become a smaller and smaller proportion of the total cost of finished goods. You simply can’t save much money chasing wages anymore.

    I somehow think that isn’t what you intended to emphasize. But there it is.

  #10

    spews:

    @4
    By the way, what’s a permissible failure rate of an aircraft part? 1%? 2%?

    It’s not the failure of each part that has certain hard and fast thresholds. It’s the failure of systems. The FAA generally does fault tree analyses that look at the probability of several systems failing at once, and those probabilities have to be something like 1e-9, which is really, really low.

    But as I understand it, the FAA hasn’t been able to accurately determine the failure likelihoods of all of these outsourced components, so doing a fault tree analysis is impossible.

  #11

    rhp6033 spews:

    Speaking of market-driven timetables for the 787:

    I acknowledge (who wouldn’t?) that new aircraft have to hit the market at the right time. The 747 almost bankrupted Boeing, with deliveries hitting the market just as the recession of the early 1970s took hold and the SST program was canceled, leading to the famous “Will the last person to leave Seattle please turn out the lights?” billboard.

    The decision to proceed with the 787 program was delayed for too long under Stonecipher’s reign, as he always preferred to make profits selling existing aircraft and to put off expensive new aircraft programs indefinitely (a decision which was largely the cause of the M-D collapse). Boeing found itself in trouble when 767 orders dropped to a trickle and former Boeing customers chose the competing Airbus aircraft because it was newer.

    So by the time Boeing sought approval from the board of directors to launch the program, it was already running behind and needed to catch up fast. Its target date was set by the marketing people as July 8, 2007 (7/8/07), for P.R. purposes, with a first delivery by the following year.

    So how was Boeing going to meet this hurry-up deadline? What it really needed to do was hire and train thousands of new workers, build new production facilities for carbon-composite major assemblies, and expand its parts production. Instead, it farmed out almost all of this to sub-contractors, without providing for more buyers, inspectors, etc. to manage them.

    The end result was an airplane which was years late in delivery to the first customer. Boeing simply could not control the quality and schedules of its sub-contractors, and didn’t know how bad things were going until delivery schedules were already sliding to the right.

    It’s much the same strategy which IBM used to enter the PC market in the 1980s. It decided to use off-the-shelf parts (and license existing software from an infant Microsoft) to make its first PCs. But within just a few years, other companies were able to make clones of the IBM machines at a fraction of the cost, leaving IBM with only a fraction of the PC and laptop market.

  #12

    rhp6033 spews:

    # 10: I guess if a system fails, you can always pull the airplane over to the side of the flight path, park it, and wait for AAA to give you a tow!!!!

  #13

    rhp6033 spews:

    Cont. of # 11: One of the really ridiculous things about the 787 program was that as sub-contractor delivery dates slid, Boeing management was struggling to find a way to meet the original schedule. Each decision ended up being a costly mistake.

    (1) It authorized suppliers to send in incomplete assemblies for Boeing machinists to complete in the factory – a process which ultimately required the machinists to dis-assemble and re-assemble parts already installed on the airplane several times as new defects in the production process were discovered.

    (2) It relied upon existing software data for previous airplanes to predict the performance of the 787, even though it was made almost entirely of different materials, which reacted quite differently under stress than prior models.

    (3) It made a decision to start the assembly line for customer aircraft even before the static testing was started, much less any flight testing. When those tests revealed fundamental changes had to be made, the re-work was an expensive and time-consuming process on the dozens of planes which had already rolled out of the factory by that time.

  #14

    ArtFart spews:

    Boeing has always used subcontractors, but in the past took direct responsibility for what they did. When they were developing the 747 they contracted with Iron Fireman to make landing gear assemblies, and literally sent Boeing engineers and QA personnel to be resident in Gresham to make sure things were done right. (I personally knew one of those guys.) Eventually they bought the plant and ran it themselves.

    It seems in the case of the Dreamliner, Boeing didn’t do this so much as send the subs specifications (which might themselves be subject to change as the overall program moved along) and trust them not only to understand and comply with every detail, but to figure out by themselves how to do the stuff that had never been done before, and to comprehend the overall technical “context” through the filters of differing corporate and national cultures and languages.