Code base size, complexity and language choice

time to read 7 min | 1276 words

Via Frans, I got into these two blog posts:

In both posts, Steve & Jeff attack code size as the #1 issue that they have with projects. I read the posts with more or less disbelieving eyes. Some choice quotes from them are:

Steve: If you have a million lines of code, at 50 lines per "page", that's 20,000 pages of code. How long would it take you to read a 20,000-page instruction manual?

Steve: We know this because twenty-million line code bases are already moving beyond the grasp of modern IDEs on modern machines.

Jeff: If you personally write 500,000 lines of code in any language, you are so totally screwed.

I strongly suggest that you'll go over them (Steve's posts is long, mind you), and then return here to my conclusions. 

Frans did a good job discussing why he doesn't believe this to be the case, he takes a different tack than mine, however, but that is mostly business in usual between us. I think that the difference is more a matter of semantics and overall approach than the big gulf it appears at time.

I want to focus on Steve's assertion that at some point, code size makes project exponentially harder. 500,000 LOC is the number he quotes for the sample project that he is talking about. Jeff took that number and asserted that at that point you are "totally screwed".

Here are a few numbers to go around:

  • Castle: 386,754
  • NHibernate: 245,749
  • Boo: 212,425
  • Rhino Tools: 142,679

Total LOC: 987,607

I think that this is close enough to one million lines of code to make no difference.

This is the stack on top of which I am building my projects. I am often in & out of those projects.

1 million lines of code.

I am often jumping into those projects to add a feature or fix a bug.

1 million lines of code.

I somehow manage to avoid getting "totally screwed", interesting, that.

Having said that, let us take a look at the details of Steve's post. As it turn out, I fully agree with a lot of the underlying principals that he base his conclusion on. 

Duplication patterns - Java/C# doesn't have the facilities to avoid duplication that other languages do. Let us take the following trivial example. I run into it a few days ago, I couldn't find a way to remove the duplication without significantly complicating the code.

DateTime start = DateTime.Now;
// Do some lengthy operation
DateTime duration = DateTime.Now - start;
if (duration > MaxAllowedDuration)
{
    SlaViolations.Add(duration, MaxAllowedDuration, "When performing XYZ with A,B,C as parameters");
}

I took this example to Boo and extended the language to understand what SLA violation means. Then I could just put the semantics of the operations, without having to copy/paste this code.

Design patterns are a sign of language weakness - Indeed, a design pattern is, most of the time, just a structured way to handle duplication. Boo's [Singleton] attribute demonstrate well how I would like to treat such needs. Write it once and apply it everywhere. Do not force me to write it over and over again, then call it a best practice.

There is value in design patterns, most assuredly. Communication is a big deal, and having a structured way to go about solving a problem is important. That doesn't excuse code duplication, however.

Cyclomatic complexity is not a good measure of the complexity of a system - I agree with this as well. I have seen unmaintainable systems with very low CC scores. It was just that changing anything in the system require a bulldozer to move the mountains of code required. I have seen very maintainable systems that had a high degree of complexity at parts. CC is not a good indication.

 Let us go back to Steve's quotes above. It takes too long to read a million lines of code. IDE breaks down at 20 millions lines of code.

Well, of the code bases above, I can clearly and readily point outs many sections that I have never read, have no idea about how they are written or what they are doing. I never read those million lines of code.

As for putting 20 millions lines of code in the IDE...

Why would I want to do that?

The secret art of having to deal with large code bases is...

To avoid dealing with large code bases.

Wait, did I just agree with Steve? No, I still strongly disagree with his conclusions. It is just that I have a very different approach than he seems to have for this.

Let us look at a typical project structure that I would have:

image

Now, I don't have the patience (or the space) to do it in a true recursive manner, but imagine that each of those items is also composed of smaller pieces, and each of those are composed of smaller parts, etc.

The key hole is that you only need to understand a single part of the system at a time. You will probably need to know some of the infrastructure, obviously, but you don't have to deal with it.

Separation of concerns is the only way to create maintainable software. If your code base doesn't have SoC, it is not going to scale. What I think that Steve has found was simply the scaling limit of his approach in a particular scenario. That approach, in another language, may increase the amount of time it takes to hit that limit, but it is there nevertheless.

Consider it the inverse of the usual "switch the language for performance" scenario, you move languages to reduce the amount of things you need to handle, but that scalability limit is there, waiting. And a language choice is only going to matter about when you'll hit it.

I am not even sure that the assertion that 150,000 lines of dynamic language code would be that much better than the 500,000 lines of Java code. I think that this is utterly the wrong way to look at it.

Features means code, no way around it. If you state that code size is your problem, you also state that you cannot meet the features that the customer will eventually want.

My current project is ~170,000 LOC, and it keeps growing as we add more features. We haven't even had a hitch in our stride so far in terms of the project complexity. I can go in and figure out what each part of the system does in isolation. If I can't see this in isolation, it is time to refactor it out.

On another project, we have about 30,000 LOC, and I don't want to ever touch it again.

Both projects, to be clear, uses NHiberante, IoC, DDD (to a point). The smaller project has much higher test coverage as well and much higher degree of reuse.

The bigger project is much more maintainable (as a direct result of learning what made the previous one hard to maintain).

To conclude, I agree with many of the assertions that Steve makes. I agree that C#/Java encourage duplication, because there is no way around it. I even agree that having to deal with a large amount of code at a time is bad. What I don't agree is saying that the problem is with the code. The problem is not with the code, the problem is with the architecture. That much code has no business being in your face.

Break it up to manageable pieces and work from there. Hm... I think I have heard that one before...