Not Production-Quality Software
A while ago I worked at a bank, doing stuff there, and I was exposed to their internal IT structure. As a result of that experience, I decided that I would never put any money in that bank. I am in no way naive enough to think that the situation is different in other banks, but at least I didn't know how bad it was elsewhere. In fact, that experience has led me to the following observation:
There is an inverse relationship between the amount of money a piece of code handles and its quality.
The biggest bank in Israel just had about 60 hours of downtime. Oh, and it also provides computing services for a couple of other banks, so we had three major banks down for over two days. The major bank, Hapoalim, happens to be my bank as well, and downtime in this scenario means that all of the bank's systems were down: from credit card processing to the internal systems, and from trading systems to their online presence and their customer service.
From what I was able to find out, they managed to mess up an upgrade and went down hard. I was personally affected by this: when I came to Israel on Sunday morning, I wasn't able to withdraw any money, and my credit cards weren't worth the plastic they are made of (a bit of a problem when I need a cab to go home). I am scared to think what would have happened if I were still abroad while my bank was basically in system meltdown and inaccessible.
I was at the bank yesterday, one of the few times that I actually had to physically go there, and I was told that this is the first time they have ever had such a problem, and the people I was speaking with have more than 30 years of working for the bank.
I am dying to know what exactly happened. Not that I expect I ever will, but professional curiosity is eating me up. My personal estimate of the damage to the bank is upward of 250 million, in addition to the damage to reputation and trust. That doesn't take into account the lawsuits that are going to be filed against the bank, nor the additional costs they are going to incur just from what the auditors are going to do to them.
Oh, conspiracy theories are flourishing, but the most damning piece, as far as I am concerned, is how little attention the media has paid to this issue overall.
Leaving aside the actual cause, I am now much more concerned about the disaster recovery procedures there...
Comments
I guess the disaster recovery procedure is something like: "collect all paperwork about this customer and try to reconstruct his/her account"
Um. What paperwork? Does anybody actually have any paperwork about customers' accounts these days? Especially transaction-related stuff?
Since Oren does not seem to complain about any of his money suddenly disappearing, I guess the bank did not actually lose the database. I guess 60 hours was basically them working around the clock to reinstall lots of stuff from scratch. Why they didn't have a DR site they could switch to when the primary one became messed up is a good question.
Just like with restaurants :)
What you don't know can't hurt you eh?
I had the exact same experience, except I went to work for an airline. Boy, that was scary!
I have opted to remain anonymous :-)
I can confirm your observation. In a project, I've seen a class of 5000 lines of pure business logic used to quote shares in real time. It was literally dealing with hundreds of millions of euros every day and had exactly zero tests. It's also worth mentioning that an error of a few cents could instantly result in a loss of a few million euros.
The other interesting side of it is that nobody on the team was able to say what the business rules were anymore. They were just there, and that's it! Good stuff!
As more and more aspects of our lives are moved to data centers, we become more vulnerable to various errors or criminal activity. Disaster recovery is very hard, because verifying and fixing huge amounts of data after a crash is almost beyond human ability. If no one knows what the business rules are, how are they supposed to tell whether data is corrupt or not? I'm afraid the complexity and amount of data processed each day will some day make people slaves to technology: no human will be able to understand or reverse-engineer all the systems, and virtual information will become the only reality. I mean, the truth will be what the database says, not the other way around.
This started quite innocently, with Google, remember? "If it's not in Google, it doesn't exist." Yes, I know it sounds like the intro to The Matrix, but I don't want to live at some corporation's mercy...
Some credit cards worked, mind you; the ones whose total purchases per month are under the limit your credit company (MasterCard, Visa, etc.) explicitly set when you got the card do not need bank confirmation. Luckily, I have a card I don't normally use, so I was way under that limit. Also, money could be withdrawn and transactions could be made by going to the bank, which was open (relatively) late.
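Roughly speaking, that is the "floor limit" idea: purchases that keep the month's running total under the card network's preset limit can be approved offline, without contacting the issuing bank. Here is a minimal sketch of that decision in Python, with a made-up limit and hypothetical function names; real card-network authorization rules are considerably more involved:

    # Illustrative floor-limit check (hypothetical names and numbers).
    FLOOR_LIMIT = 400.0  # made-up monthly offline limit

    def authorize(amount, monthly_total_so_far, bank_reachable):
        """Decide whether a purchase can be approved."""
        if monthly_total_so_far + amount <= FLOOR_LIMIT:
            return True  # still under the floor limit: approve offline
        if bank_reachable:
            return ask_issuing_bank(amount)  # would be an online authorization round-trip
        return False  # over the limit and the bank's systems are down: decline

    def ask_issuing_bank(amount):
        # Placeholder for the real online check; during the outage it would fail anyway.
        return False

    # A rarely used card stays under the limit and works even with the bank down.
    print(authorize(120.0, 50.0, bank_reachable=False))   # True
    print(authorize(120.0, 500.0, bank_reachable=False))  # False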
I'd like to point out that, as far as I know, all transactions are printed as soon as they occur. I've seen a folder in my bank that had every bank transaction (not credit card transactions, as those are with the credit company), so they could know my exact balance without any computer aid; that much is a little comforting.
And now I'll stop my worthless jabber and say what I wanted to say: as I heard it, the problem was caused by a flood in the server room. I could be dead wrong, but do you have any evidence or trustworthy leads to the contrary?
configurator,
To the best of my knowledge, the cause was a system upgrade. No mention of flooding.
Also, when I was at the bank, at my own branch, they weren't able to tell me my current balance, only what it was several days ago, which was grossly inaccurate.
Maybe that's because I haven't had any transactions in the last few days :)
How many lines of code does it take to stuff money under the mattress? Now I am worried that US banks might have the same problem.
I'd like to suggest that you rephrase your observation, and name it Ayende's Law. My recommendation for the new phrasing would be:
"The quality of a software system is inversely proportional to the amount of money it handles."
I'd also like to offer Xenolinguist's Corollary to Ayende's Law:
"The quality of a software system is inversely proportional to the size of the company owning it when it was written."
It seems to me that the larger an organization, and certainly the more critical a system, the less willingness there is to refactor code.
It's a case of "don't fix it if it ain't broken." So a league of programmers, over many years, does nothing but change the one little piece they were instructed to change. Very little attention is paid to looking at the behavior of the code and realizing that it could be refactored into something shared, etc., because that might break things.
This goes on for several years until you have a big mess nobody can understand, and everybody fears to change because they don't know what they might break.
Usually at this point some new technology comes along and everybody advocates it will solve all their problems and they should rewrite.
Good reason to make sure you always have money available from multiple banks, and carry some cash -- although whenever I carry cash, I spend it :-)