Zipf's Law and Software Engineering
Sunday, January 31, 2010
All well designed software applications are alike; each badly designed software application is badly designed in its own way.
For some time now I've been looking at Zipf's Law and wondering if it applies to computer programs written in any modern computer language. In other words, is Zipf's Law relevant when analyzing computer code? And if it is relevant, what does it say about the structure and the correctness of software?
Wentian Li summarizes Zipf's Law as "the observation that frequency of occurrence of some event (P), as a function of the rank (i) when the rank is determined by the above frequency of occurrence, is a power-law function P_i ~ 1/i^a with the exponent a close to unity (1)."
For the sake of argument, let P (a random variable) represent the frequency of occurrence of a keyword in a program listing.
On the surface, Zipf's Law is meaningless when talking about any program written in any contemporary programming language because every programming language has a limited number of keywords and some keywords are used more than others. For example, the keyword goto exists in many modern languages, but we shy away from its use, as we've been taught that gotos are evil--and they are, though they have their place. In all likelihood, this keyword has a low frequency of occurrence in most programs.
On the other hand, the keyword for is common in algorithms dealing with data, and is likely to have a higher frequency of occurrence in source code. Therefore, I conclude, without empirical proof because it seems an obvious finding, that any computer program written in any contemporary programming language has a skewed keyword distribution that resembles a power law, i.e., some keywords are used far more than others.
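The counting itself is easy to try. What follows is a minimal Java sketch, not a real experiment: the class name KeywordRank, the small keyword subset, and the sample source string are all assumptions made for illustration. It counts keywords, ranks them by frequency, and prints rank times count, which would stay roughly constant if the counts followed Zipf's Law with an exponent near 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordRank {
    // Illustrative subset of keywords only; this is not the full Java list.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "for", "if", "else", "while", "return", "int", "goto"));

    // Count whole-word keyword occurrences. Naive on purpose: comments and
    // string literals are not skipped, so a real analysis would need a lexer.
    public static Map<String, Integer> countKeywords(String source) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = Pattern.compile("[A-Za-z_]\\w*").matcher(source);
        while (m.find()) {
            String word = m.group();
            if (KEYWORDS.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Order keywords from most to least frequent; ties break alphabetically.
    public static List<Map.Entry<String, Integer>> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; substitute the text of a real source file.
        String source =
            "for (int i = 0; i < n; i++) { if (a[i] > 0) { sum += a[i]; } }\n"
            + "for (int j = 0; j < m; j++) { for (int k = 0; k < j; k++) { t++; } }\n"
            + "while (t > 0) { t--; }\n"
            + "return sum;";

        List<Map.Entry<String, Integer>> ranked = rank(countKeywords(source));
        for (int i = 0; i < ranked.size(); i++) {
            int rank = i + 1;
            int count = ranked.get(i).getValue();
            // Under Zipf's Law with exponent 1, rank * count is about constant.
            System.out.printf("%d %-8s count=%d rank*count=%d%n",
                rank, ranked.get(i).getKey(), count, rank * count);
        }
    }
}
```

With only a handful of distinct keywords, the ranking says very little either way, which is the limitation discussed next.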
However interesting the frequency distribution of computer language keywords is, it has little practical value. Of much more interest to software engineering is the context and combination of all the keywords--entire stories can be told in computer programs. For instance, we create entities that don't exist except in computer memory at run time; we create logic nodes that will never be tested because it's impossible to test every logic branch; we create information flows in quantities that are humanly impossible to analyze at a glance; in sum, we create chaos and order at the same time. The second law of thermodynamics is at play everywhere: the more combinations of keywords we add, the more the entropy of the state machine increases.
Because I'm arguing that what matters in software applications is the combination of keywords within the context of a solution and not how many times they are used in a program, I need to explain what context means. This is not a trivial task because the context of an application is attached to the problem being solved, and every problem is different and needs a specific program to solve it. (Don't confuse this with reusable code; even if a framework is used to solve a problem, the solution will be unique in its own way.)
Although a program could be syntactically correct, it doesn't mean that the algorithms implemented solve the problem at hand. What's more, a correct program can solve the wrong problem. Let's say we have the simple requirement of printing "Hello, World!" A syntactically correct solution in Java looks as follows:
public class SayHello {
    public static void main(String[] args) {
        System.out.println("Jose Sandoval!");
    }
}
This solution is obviously wrong because it doesn't solve the original requirement. This means that the context of the solution within the problem being solved needs to be determined to ensure its quality. In other words, we need to verify that the output matches the original requirement.
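Verifying that the output matches the requirement can itself be automated. Here is a minimal sketch in Java, assuming the requirement is exactly the text "Hello, World!" on standard output. The class RequirementCheck and its helper methods are hypothetical names of my own, and sayHello simply mirrors the program above, wrong text included.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class RequirementCheck {
    // Program under test: mirrors SayHello above, including the wrong text.
    public static void sayHello() {
        System.out.println("Jose Sandoval!");
    }

    // Run a task while capturing everything it prints to System.out.
    public static String captureStdout(Runnable task) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer, true));
        try {
            task.run();
        } finally {
            // Always restore the real stream, even if the task throws.
            System.setOut(original);
        }
        return buffer.toString().trim();
    }

    // True only when the printed text equals the stated requirement.
    public static boolean meetsRequirement(Runnable task, String required) {
        return captureStdout(task).equals(required);
    }

    public static void main(String[] args) {
        boolean ok = meetsRequirement(RequirementCheck::sayHello, "Hello, World!");
        System.out.println(ok ? "requirement met" : "requirement NOT met");
    }
}
```

Running it prints "requirement NOT met": the program compiles and runs cleanly, yet fails the only check that matters.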
You could argue that the scenario I've presented here doesn't happen in real life; however, you would be surprised at how often it actually does happen.
In the past, I have coded a perfect solution to a non-existent problem, and the issue wasn't the requirement gathering phase. The issue was that once the application became an executable program, the problem I was originally solving wasn't a problem anymore--the requirements had evolved. It took another iteration to get the right solution.
This is a valuable lesson, however, and one that I wish every developer I come in contact with had learned the hard way (by coding the correct solution to a non-existent problem). There's no finger pointing in the process; what's interesting to me is to know what he or she does now to prevent the same mistake.
It's becoming clear that Zipf's Law, as I'm using it here to count keywords, has nothing to say about my application above. Zipf's Law can't even say much about larger systems I've written if all I'm doing is grouping program statements. Interesting patterns, however, begin to emerge when you look at an aggregate body of source files. The most important are reusability, coupling, and proper encapsulation. The length of classes and functions becomes important and it's easy to argue for short versus long functions--typically, a long function is problematic to debug and maintain, and almost impossible to extend.
So modern software engineering university courses preach the three main tenets of object oriented development as being the silver bullet for software woes. Zipf's Law seems to support this practice, as explained in the paper Understanding the Shape of Java Software. This research looks at large software projects to try to understand what makes a system successful, and the authors seem to have found commonalities in the things that we've come to accept as good design principles of software systems.
We seem to take the three tenets of object oriented engineering at face value; however, there are solid theoretical reasons why they hold true in large scale systems. Yes, proper OOD is the way to go.
Coming back to my paraphrasing of Anna Karenina's first paragraph: successful projects do seem to have common processes; however, unsuccessful projects will be unsuccessful in their own ways.
Java vs. AS3 coding styles
Friday, January 01, 2010
Coding styles evolve with the times and are as varied as the developers and programming languages that use them. Where do coding styles come from?
We all have our ingrained way of writing and formatting code. The coding styles I've followed throughout these years come from my time in university while an undergrad student (University of Waterloo). Our programming assignments had specific requirements to follow (remember those pre/post statements for every function?). Depending on the programming class and project, skeletons of code were given to be filled with the real assignment. Because of this, I learned to write and comment code like my TAs and professors.
It was a vicious cycle: because everyone came from the same school, we all did the same things, for example, where the curly braces went, how many spaces between each line were required, where class members went. Coincidentally, the cycle continued into our co-op terms and first jobs around the Waterloo area: a large percentage of the senior developers in the software companies where we started our careers were also from Waterloo--and I believe that hasn't changed. We were a happy, uniformly trained coding family (mind you that this was a good thing).
I now have my own coding style and I notice when other developers do different things from what I do. Lately, I've been going through a lot of AS3 code and I have noticed, among other things, that curly braces are placed on their own lines. For example, a class definition may look as follows:
public class Button extends UIComponent
{
    public function Button()
    {
        super();
    }
}
There's nothing wrong with this class definition or the syntax. However, most, if not all, Flash developers follow this style. Why? Where did it come from? Do they use it because every other Flash developer codes with this style and all the code samples they got a hold of when learning the language looked like this?
I think this is it. Code is cheap on the internet: someone will post a piece of code and someone else will use, borrow, and steal it (it's the way of the modern developer).
I've used a few languages in the past and I never liked this particular coding style--a whole line for a curly brace. Because most of the apps I've worked on are Java enterprise apps, I adhere, almost religiously, to Sun's Code Conventions for the Java Programming Language. I'm not a zealot, though I like to know that there are other programmers out there who adhere to the same style standards I do and, therefore, I will know how to navigate their code when the time for maintenance comes--and that time will come.
By the way, the Button class above should be written as:
public class Button extends UIComponent {
    public function Button() {
        super();
    }
}
If the code works, does it make a difference where the curly braces go? As with everything, it depends on how you look at it. I think my way is a more concise way of writing code. I can have a few more lines in front of me per screen page. I know many developers and they like to have a lot of code on the screen at one time as well. What's more, vertical real estate keeps shrinking, depending on the display you use. I'm a laptop user (ThinkPad X200), and the form factor keeps getting wider but shorter. Who said that a larger horizontal form factor is better? Maybe it's the way we're using our computers--entertainment systems--and we need a wider aspect ratio for better movie quality.
So, in the world of AS3 development, I doubt I will convince these programmers to write more compact code a la C, C++, Java, C#. I've asked a couple of them why they code that way, and the answer I get sounds the same every time: "it's the right way of doing it." Furthermore, like me, they can only guess where their coding style comes from--school, books, code samples.
Most samples of AS3 and Flex code I have found on the net have this elongated, vertically wasteful style. And because of our need to borrow and steal code from other developers, I think future AS3 and Flex code will continue to waste screen real estate with those extra carriage returns.
I'll do my best to change this habit, and this I promise: every sample of AS3 or Flex code I'll ever publish will have the more compact coding style familiar to C, C++, Java, C# developers.
On a final note, please use curly braces around one-line statements inside for, while, if, or else blocks. Yes to this:
if (true) {
    doSomethingAwesome();
} else {
    doSomethingAwful();
}
No to this:
if (true)
    doSomethingAwesome();
else
    doSomethingAwful();
And never to this:
if (true) doSomethingAwesome(); else doSomethingAwful();
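One common argument for the braces rule, offered here as an illustration rather than as the only reason, is that a statement added later can look like part of the branch without actually being in it. A minimal Java sketch, with the hypothetical class BraceHazard and a simple counter standing in for real work:

```java
public class BraceHazard {
    // Without braces, only the first statement belongs to the if.
    public static int withoutBraces(boolean enabled) {
        int calls = 0;
        if (enabled)
            calls++;
            calls++; // indented like the body, but it always runs
        return calls;
    }

    // With braces, both statements are guarded by the condition.
    public static int withBraces(boolean enabled) {
        int calls = 0;
        if (enabled) {
            calls++;
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(withoutBraces(false)); // prints 1
        System.out.println(withBraces(false));    // prints 0
    }
}
```

Both versions compile without complaint by default, so the only thing protecting the first one is the reader's eye.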
What about scripting languages that don't require braces or semicolons? That's a whole different post.
Finally, yes, I know other C, C++, Java, C# developers use the elongated style. To them, I say, stop it.