Showing posts with label disaster recovery. Show all posts

Monday, January 16, 2023

Can Someone Give me a Call!


It is a familiar aphorism in intellectual circles that history does not repeat, but it often rhymes. Last week, I talked about how technical debt and poor maintenance helped ground Southwest Airlines during one of the year's busiest travel periods. This week, the FAA software, which helps maintain air traffic safety, crashed, causing planes to halt for the better part of the morning. 

The system in question is called NOTAM, which stands for "Notice to Air Missions." It is a general-purpose system that tells pilots about hazards they might encounter moving commercial flights from place to place. An air show could fill up the airspace around Wichita, or a flock of birds could clog up a flight path to Cleveland. Air traffic controllers and pilots count on this system to avoid accidents. When the system went down, air traffic came to a crawl, and customers were grounded.

As the crisis unfolded, it became clear that a software update caused the crash and upset the system. Someone uploaded a corrupt file, killing the central NOTAM system and the backup. The responsibility for uploading that file was with an outside contractor, not an FAA employee. My heart sank when I learned this because I made a mistake like this on a much smaller scale. I broke 56 credit card transactions out of 1800 for TOMS Shoes during the holiday shopping season in 2008. It was a fraction of a single day's business, but it was enough of a hassle that TOMS Shoes threatened to stop paying the consulting company for the trouble caused. TOMS withheld payment from our company for December even though we fixed the problem within twelve hours. A week before Christmas, my company fired me, and to this day, I still harbor some ill will toward TOMS and its brand.

I imagine a consultant is looking for a new job after making a similar mistake. Unfortunately, I think that mishaps like this happen because the FAA and the contractor created a NOTAM system that was fragile and easy to break. According to John Cox, an aviation safety expert, an outage like this has not happened in 53 years. Fifty-three years without an outage is a mighty impressive record in the IT industry. Reliability like this gets CIOs large bonus checks. Still, it looks terrible that air traffic controllers and pilots are grounded because a software update did not go well.

Delta Air Lines CEO Ed Bastian called the shutdown unacceptable but said that the incident was not the fault of the FAA but instead the result of a lack of funding. Senator Maria Cantwell of Washington said that Congress would hold a hearing on the subject. In my last blog post, I said moments like this were necessary to point out system problems and call attention to maintenance and technical debt. A moment of clarity should focus the minds of members of Congress and members of the executive branch toward improving the system.

So far, the FAA website has one update on January 11th about the outage, and the news has moved on to other subjects. Transportation Secretary Pete Buttigieg has said that they have made some changes to prevent this from happening again. Still, I suspect it will take more than changing the procedures to ensure something like this does not happen again. I am suspicious that decades of technical debt and outdated servers are at the heart of the NOTAM system. It looks like the perfect project for an Agile coach and change agent to make a difference. Secretary Buttigieg, give me a call, and we can talk about how I and CAPCO can help update your systems.

I must confess that I am a little glib about the subject, but technical debt is a big deal. We have spent so much time and energy making systems efficient that they are not resilient when bad things happen. Making systems faster, better, and cheaper is essential, but it undermines our trust in those systems if they are not resilient. Eventually, that lack of confidence will hurt the airline industry and the country. That is something that no one wants.

Until next time. 


Monday, June 26, 2017

Developing the professional scrum master

If you think this is ugly, try hiring an amateur plumber to fix it.
The business world has a saying, “If you think hiring a professional is expensive, wait until you hire an amateur.”  The obvious meaning is that a poorly trained amateur will cost the company more money than someone who is more expensive but better qualified.  This week I want to talk about the minimum standards of professionalism you should expect from a scrum master.

I am a big believer that with enough time and training anyone can develop a useful skill.  If I devoted ten years of my life learning to be a plumber I could become competent.  Unfortunately, I know myself well enough to know that I need to call a professional when my water heater breaks.  A bonded plumber is worth the time and expense for me to have hot water.

When you get into other activities, training is only a small part of the equation.  You can practice piano for years and still not be good enough to entertain an audience not composed of parents.  Jazz musicians refer to the quality of being able to improvise and perform in front of an unpredictable crowd as “chops.”  The idea is that anyone can learn to play the notes, but a real musician has chops.  Hard work combined with talent makes a jazz musician successful.

I feel the same way about scrum mastery.  Everyone can be trained to do the job, but only a minority can do the job well.  It is the difference between having a high school student perform at your night club and having Elton John set up a residency.  Fortunately, there are plenty of good programs to train scrum masters.  I am particularly fond of the Scrum Alliance Certified Scrum Master certification because it teaches the basics of the job along with the more touchy-feely skills which come with it.

Once scrum masters have received some training, they can lead a scrum team.  I recommend putting a rookie scrum master with an experienced product owner.  This way the scrum master can gain experience with someone who can show them the ropes of the business and the particulars of a project.  With a year or two of experience, a scrum master can help a product owner learn their trade.  Much like the ideas proposed in extreme programming, an experienced veteran should partner with a rookie so they both gain from each other’s experience.

With a little luck, you will find someone who is outgoing, communicates well, shows empathy and grace under pressure, and can act as team therapist.  Then and only then do you have a scrum master with chops who can take your team to the next level.  So take the time to train your scrum masters.  Next, pair them with experienced developers and product owners, so they gain confidence and experience.  Finally, make sure you find people who possess the talents which will make them successful in the job.  If you do this, you will not have to pay extra for an amateur managing your scrum teams.

Until next time.

Monday, February 13, 2017

When to Unleash the Kraken

Make sure you mean it!
I have called being a Scrum Master many things.  It is a coaching role.  Being a scrum master can resemble being a therapist.  I have even compared being a scrum master to being a parish priest.  The job is not easy.  A scrum master has plenty of late nights and early mornings.  One important responsibility discussed in Scrum Master training is that a scrum master can terminate a sprint.  Trainers don’t often dwell on this awesome responsibility, but this week I will.

When things go badly in spaceflight, mission control can terminate the mission.  When that happens, the astronauts turn the ship around and head back to Earth via the shortest route.  When things go wrong on a scrum team, we have what is called an abnormal termination.  The sprint ends immediately, and the team does a retrospective and plans a new sprint.

There is a contentious debate in the Agile community about who has the right to terminate a sprint.  I belong to the ideological camp that a scrum master should have this terrible duty.  I feel this way because the scrum master is the protector of process and improvement of the team.  When the Product Owner or the development team is in a hopeless situation, it is up to the Scrum Master to recognize it and take draconian measures.

Terminating a sprint is a huge deal.  Do not do this lightly.  It should be the plan “C” or “D” for any sprint.  Terminating a sprint draws all sorts of attention in an organization and is profoundly disruptive.  When a Scrum Master kills a sprint, it is the equivalent of Zeus saying, “Release the Kraken!”  Do not do it unless you are certain.  Otherwise, it is like pulling a fire alarm because you do not want to go to class.

The following are the rare reasons why a scrum master ought to terminate a sprint.

Project funding has changed drastically-

In an uncertain global economy, projects get canceled, and new ones are spun up.  These events impact your plans.  If this happens, the scrum master may want to terminate the sprint.  The termination will give the Product Owner, development team, and Scrum Master a chance to take stock and decide what next steps to pursue.

The Product Owner is fiddling with a sprint while it is in progress-

I worked with a product owner who was under so much pressure they could not prioritize work.  Sprints would begin, and stories would be swapped out with others.  Things which were a priority during sprint planning would be dropped days before the end of the sprint and replaced with stories of equal size.  The development team was forced to try to do the same amount of work in half the time.  Confronted with this bad faith behavior, I hit the self-destruct button and brought it to the attention of my Vice President.  Leadership replaced the Product Owner in the aftermath.  The development team resumed sprinting.

The team grossly underestimated the work they could do in the sprint-

We have all had the “ten-minute change” conversation.  A product owner or someone from the business asks for a change to the system.  The developer listens to the request absentmindedly and says the change will take approximately ten minutes.  A week later the developer admits the change affected other systems and will take longer than expected.  Soon the rest of the development team is sucked into correcting the situation.

As my mentor, Angela Duggan, would say, “developers are afraid of looking incompetent or unskilled.”  This fear pushes software developers to underestimate work.  The team commits to something and then realizes it is more complicated than first expected.  If your development team has six weeks of work in a three-week sprint, it is acceptable to hit the reset button.

These are three of the rare situations where you might terminate a sprint.  If you have other suggestions, I would love to hear them.

Until next time.



Monday, August 15, 2016

If it isn't broke you better fix it.

This didn't have to happen.
I have been offline for a week as I attended the Gen Con game fair in Indianapolis and tried to get back into the swing of things at work.  While I was away, I had a chance to recharge my batteries and have a good time playing board games with friends.  I also got to have a little fun with the people at Big Potato Games, which seems like a fun group of people who are making a big splash in the industry.  When I came back, two things got my attention because they illustrate the paradox of contemporary business and modern technology.

The first was the problem with Delta Air Lines and its reservation system, which grounded the company for two days.  The second was a small article in the technology press about Windows 10 updates.  Both stories illustrate to me that the business maxim, “…if it isn’t broke don’t fix it,” is seriously wrong.  If you are a company in the 21st century and you want to remain in business, it is your responsibility to upgrade your technology infrastructure and applications.

First, Delta Air Lines runs its reservation system on AS/400 systems and mainframes using the IBM Transaction Processing Facility software.  The software was last upgraded by IBM ten years ago, and the only people who can fix it when anything goes wrong are IBM consultants.  If something goes wrong, a CIO and their company are forced to call IBM to make changes and corrections.  In the same ten-year span, Microsoft has had four operating systems: Windows Vista, Windows 7, Windows 8, and Windows 10.  Presently, there is an entire ecosystem of developers outside of Microsoft who can alter, improve, or fix these systems.  So if an airline wants access to a larger labor pool and more up-to-date systems, it should go with a Microsoft solution.

This did not happen for a few reasons.  First, airlines, for all their talk of customer service and being high tech, are notoriously stingy with money to upgrade and improve their technology infrastructure.  So what they did was graft other technology systems onto their old IBM infrastructure.  If the AS/400 went down, it would create a cascading effect which would shut down the airline.  According to the news, that is exactly what happened as numerous technology professionals scrambled to get the systems back up and running.  It also led to the CEO of the company publicly admitting they were doing the best they could to fix the problem without knowing exactly what went wrong.  Next, the people who make the decisions about funding felt this risk was so unlikely that they decided the system was not broken, and so they did not need to make improvements.

This kind of thinking is foolish.  Software is like any other machine, but it is manufactured out of ones and zeros instead of steel.  Machinery needs to be maintained or it will break down.  Fail to change the oil in your car and see what happens after 100,000 miles.  That is exactly what happened at Delta.  The people driving the organization put off or ignored routine maintenance on its systems because it would cost money to do so.  As long as everything was working, they saw no need to do maintenance and upgrades.  As you can see, this cost the company millions of dollars when the system failed and hurt its reputation for quality service.

The other news item I saw this week was a brief blurb about how Windows 10 updates are not an ironclad guarantee that a system will not be compromised by hackers, because people generally do not upgrade the other software on their machines.  As a technology professional, I have seen people with Windows 10 machines with copies of Office 2007 on them.  This mixing and matching of software is common in the real world because people don’t have the money to upgrade everything.  This creates openings for hackers and people willing to do bad things.

This is shortsighted, like a person not changing the oil in their car.  When you upgrade an operating system, you should also be able to update the software which runs on that operating system.  This is why I am a big fan of Google Docs and Microsoft’s Office 365 software: these cloud-based systems update automatically and do not rely on the user purchasing and installing upgrades.  The burden is no longer on the consumer but on the company providing the software, which is how it should be.

So in one week the world witnessed an object lesson in why the phrase, “…if it isn’t broke don’t fix it,” is wrong.  Old and outdated software which was not maintained properly failed spectacularly.  The only people who could fix the software worked for a third-party vendor which was not responsive.  The pennies saved on upgrades and improvements became millions of dollars in technical debt which shut down the company.  Finally, the reputation of the company was hurt by this kind of thinking.

It is also clear that just upgrading operating systems is not enough; the applications which run on those operating systems need to be improved too.  I understand that in the world of technology, bragging about your new data center or software upgrades to your core business is not as glamorous as a web application or phone app, but it is just as important because when those systems fail, they fail in an embarrassing and spectacular fashion.  So it is up to everyone from the largest company to the personal consumer to pay attention to how they maintain their software.  If not, expect to be grounded.

Until next time.

Monday, June 1, 2015

Great Failure Yields Great Wisdom

This guy can teach us about Agile
It is nice to take time off.  It feels slightly decadent to have nothing to do but enjoy time passing.  This week, with the Memorial Day holiday behind me, I wanted to talk about how agile professionals can draw some inspiration from our armed forces to make our teams better.

The American Heroes Channel has been celebrating the end of the Second World War with the seventieth anniversary of VE Day.  I think what was more informative was the more muted commemoration of the fortieth anniversary of the fall of Saigon.  That tragic period of history has a deep resonance with me.  Instead of “Peace with Honor” we had the collapse of a nation.  In my mind, the panic, chaos, and missed opportunities of the fall of Saigon seem like a fairly good metaphor for a failing IT project.  People scrambling for the exits, leaders trying to make the best of a bad situation, and stories of commitment and courage all come to the forefront when talking about how to deal with a desperate situation.

I think what sticks out in my mind most is the story of Richard Armitage.  Many people might remember him as the Deputy Secretary of State with Colin Powell during George W. Bush’s presidency.  But by 1975 he had served three tours in Vietnam with the Republic of Vietnam Navy.  He was also involved with the CIA gathering intelligence.  When the end came for South Vietnam, Armitage helped save 30,000 refugees crammed on ships escaping Saigon.  He did this against the wishes of the Philippine and American governments.

What can an agile specialist learn from this story?  First, when the chips are down, doing what is “right” is more important than what your boss wants.  Armitage was supposed to liberate a few officers from the Republic of Vietnam Navy and make sure the frigates and destroyers docked in Saigon did not fall into communist hands.  Instead, he defied orders and saved not only the ships and officers but also the sailors and their families.  Next, when confronted with an impossible situation, make the impossible choice.  There was no way that Armitage was going to save everyone, but he did the best that he could, given the circumstances.  Not everyone was saved from the communists, but 30,000 people were able to breathe free thanks to the impossible choices Armitage made.  Finally, let the work and effort speak for themselves.  When the war ended, Armitage’s career with the Navy was over, but because of his reputation and experience he quickly found work with the Department of Defense.  Eventually, Armitage would work with a fellow Vietnam vet named Colin Powell in the State Department.

It is easy to look at military victories and find lessons in them.  I find defeat and failure to be much more informative.  By looking at those tragic efforts we can see where we can improve and how we can do better the next time.  Many of the lessons of Vietnam informed the officers who fought there and guided their decision making when it came to future efforts in Iraq and Afghanistan.

So as an agile practitioner, remember that doing what is right is more important than what your boss wants, make impossible choices in impossible situations, and finally let your efforts speak for themselves.

Until next time.


Monday, June 17, 2013

Zombie Proof your Business

Zombies are coming. Is your business ready?
This week Brad Pitt is putting his money and reputation on the line with the release of his blockbuster production World War Z.  I have been reading the book in anticipation of the movie, and I am looking forward to seeing how they are going to transform Max Brooks’ book into popular entertainment.  While I was reading along, it struck me that a great deal of wisdom can be gleaned from a fictional zombie apocalypse.  For a small business person, the unforeseen and the unexpected happen all the time.  If you are not careful, your business can wind up like the walking dead.  In this post, I want to talk about how cloud-based services and changes in the technology sphere can help you avoid this tragic fate.

The biggest challenge for any business is how to deal with disaster recovery.  If a fire, earthquake, or flood hits your business, how are you going to get back up and running?  Larger companies have decided that they are going to create large data centers resembling Fort Knox.  Massive batteries, multiple generators, and numerous systems are in place.  In fact, these facilities are so filled with backups and redundancies that the only way to really stop business is to have an asteroid directly strike the building or a massive civil disruption, like a zombie apocalypse, shut down the entire power grid.

As a software developer at the turn of the century, I thought all these precautions were an overreaction.  My opinion changed overnight in the aftermath of the September 11th terrorist attacks on the World Trade Center.  Merrill Lynch lost four employees to the attacks, but the terrorists nearly destroyed the company, causing it to lose over $98 million in the course of a tragic morning.  Data centers and disaster recovery became a very serious business.  No one anticipated terrorists flying a plane into the building with your data, but now it was a very real possibility.

Over the last decade, as data centers grew they had surplus space and power.  Being good capitalists, these companies began leasing out their extra space to smaller companies.  Cloud computing was born.  Soon companies like Amazon, Google, and Microsoft got into the act and became the principal suppliers of cloud services.  Prices fell and uptime increased.  This was technology and the marketplace doing what Adam Smith said they would.

As a small business owner myself, for a few hundred dollars a year I have computing power which would have cost several hundred thousand dollars during the giddy days of the dot-com boom.  It allows me to stay in business as I look for customers.  I also know that my data is secure because it exists in a data center in the Pacific Northwest, and that if something happens to me or my business, like a zombie attack, the survivors will be able to access the data and information with a few passwords.  You can have this security too.  At E3 systems we have inventory management and fleet management software safely hidden away on the cloud for you to use, so if something happens to your facility you will be ready to do business the next day.

Contact us today and we will show you how this is more exciting than a zombie apocalypse.

Until next time.