I am working on a thesis about measuring the quality of a product. The product in this case is a website. I have identified several quality attributes and measurement techniques.
One quality attribute is "robustness". I want to measure it somehow, but I can't find any useful information on how to do this in an objective manner.
Is there any static or dynamic metric that could measure robustness? That is, just as there is unit test coverage, is there a comparable way to measure robustness? If so, is there any (free) tool that can do such a thing?
Does anyone have any experience with such tooling?
Last but not least, perhaps there are other ways to determine robustness; if you have any ideas about that, I am all ears.
Thanks a lot in advance.
In our Fuzzing book (by Takanen, DeMott, Miller) we have several chapters dedicated to metrics and coverage in negative testing (robustness testing, reliability testing, grammar testing, fuzzing: many names for the same thing). I also tried to summarize the most important aspects in our company whitepaper here:
http://www.codenomicon.com/products/coverage.shtml
Snippet from there:
Coverage can be seen as the sum of two features, precision and accuracy. Precision is concerned with protocol coverage: how well the tests cover the different protocol messages, message structures, tags and data definitions. Accuracy, on the other hand, measures how accurately the tests can find bugs within different protocol areas, so it can be regarded as a form of anomaly coverage. Precision and accuracy are fairly abstract terms, however, so we need to look at more specific metrics for evaluating coverage.
The first coverage analysis aspect is related to the attack surface. Test requirement analysis always starts off by identifying the interfaces that need testing. The number of different interfaces and the protocols they implement in various layers set the requirements for the fuzzers. Each protocol, file format, or API might require its own type of fuzzer, depending on the security requirements.
The second coverage metric is related to the specification that a fuzzer supports. This type of metric is easy to use with model-based fuzzers: the tool is built from the specifications used to create it, so those specifications are easy to list. A model-based fuzzer should cover the entire specification. Mutation-based fuzzers, by contrast, do not necessarily cover the specification fully, as implementing or including one sample message exchange from a specification does not guarantee that the entire specification is covered. Typically, when a mutation-based fuzzer claims specification support, it means it is interoperable with test targets that implement the specification.
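The specification-coverage idea can be made concrete with a toy calculation. The message names below are invented for illustration; a real measurement would enumerate the message types actually defined in the specification:

```python
# Message types defined by a hypothetical protocol specification.
SPEC_MESSAGES = {"HELLO", "AUTH", "DATA", "PING", "BYE"}

def specification_coverage(modeled_messages: set) -> float:
    """Fraction of the specification's message types the fuzzer models.

    A model-based fuzzer built from the specification should approach 1.0;
    a mutation-based fuzzer seeded with a single sample exchange often
    covers far less, even though it interoperates with real targets."""
    return len(modeled_messages & SPEC_MESSAGES) / len(SPEC_MESSAGES)
```

A mutation-based fuzzer seeded with one HELLO/DATA exchange would score 0.4 here despite being fully interoperable with compliant servers.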
Especially in protocol fuzzing, the third critical metric is the statefulness of the selected fuzzing approach. An entirely random fuzzer will typically only exercise the first messages in a complex stateful protocol. The more state-aware the fuzzing approach, the deeper the fuzzer can go into complex protocol exchanges. Statefulness is a difficult requirement to define for fuzzing tools, as it is really a metric of the quality of the protocol model used, and can thus only be verified by running the tests.
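To illustrate the statefulness point, here is a minimal Python sketch contrasting a stateless fuzzer with a state-aware one against a hypothetical three-message protocol. The protocol and the byte-flipping mutation are made up for illustration; real fuzzers use far richer anomaly libraries:

```python
import random

# Hypothetical three-message protocol: HELLO -> AUTH -> DATA.
# A conforming server only accepts AUTH after HELLO, and DATA after AUTH.
VALID_SEQUENCE = [b"HELLO guest", b"AUTH secret", b"DATA payload"]

def mutate(message: bytes) -> bytes:
    """Flip one random byte -- a trivial stand-in for real anomaly injection."""
    pos = random.randrange(len(message))
    flipped = bytes([message[pos] ^ 0xFF])
    return message[:pos] + flipped + message[pos + 1:]

def random_fuzz_cases(n: int):
    """Stateless fuzzing: every case is a single mutated first message,
    so the states behind HELLO are never reached."""
    return [[mutate(VALID_SEQUENCE[0])] for _ in range(n)]

def stateful_fuzz_cases(n: int):
    """State-aware fuzzing: replay the valid message prefix, then mutate
    the message at a chosen depth, reaching deeper protocol states."""
    cases = []
    for i in range(n):
        depth = i % len(VALID_SEQUENCE)
        cases.append(VALID_SEQUENCE[:depth] + [mutate(VALID_SEQUENCE[depth])])
    return cases
```

The stateless fuzzer only ever tests the server's HELLO handler; the state-aware one also drives the target into the AUTH and DATA states before injecting anomalies, which is exactly the depth difference the metric tries to capture.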
I hope this was helpful. We also have studies of other metrics, such as looking at code coverage and other more or less useless data. ;) Metrics is a great topic for a thesis. Email me at ari.takanen@codenomicon.com if you are interested in getting access to our extensive research on this topic.
Robustness is very subjective, but you could have a look at FindBugs, Cobertura and Hudson, which, when correctly combined, could give you a sense of security over time that the software is robust.
Well, the short answer is "no." "Robust" can mean a lot of things, but the best definition I can come up with is "performing correctly in every situation." If you send a bad HTTP header to a robust web server, it shouldn't crash. It should return exactly the right kind of error, and it should log the event somewhere, perhaps in a configurable way. If a robust web server runs for a very long time, its memory footprint should stay the same.
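As a concrete illustration of that kind of robustness check, the Python sketch below sends a deliberately malformed request line to a toy HTTP server over a raw socket and then checks what comes back. The handler, port, and request bytes are all illustrative; the same probe idea applies to any web server:

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Toy server standing in for the system under test."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the demo output quiet
        pass

def probe(host: str, port: int, raw_request: bytes) -> bytes:
    """Send raw (possibly malformed) bytes, return whatever the server replies."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(raw_request)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]

    # A misspelled protocol version -- a robust server should answer with
    # a well-formed error, not crash or hang.
    reply = probe("127.0.0.1", port, b"GET / HTP/1.1\r\n\r\n")
    print(reply[:60])
    server.shutdown()
```

The interesting assertion is not just that the malformed request gets an error back, but that the server remains healthy afterwards and still serves valid requests.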
A lot of what makes a system robust is its handling of edge cases. Good unit tests are a part of that, but it's quite likely that there will not be unit tests for any of the problems that a system has (if those problems were known, the developers probably would have fixed them and only then added a test).
Unfortunately, it's nearly impossible to measure the robustness of an arbitrary program because in order to do that you need to know what that program is supposed to do. If you had a specification, you could write a huge number of tests and then run them against any client as a test. For example, look at the Acid2 browser test. It carefully measures how well any given web browser complies with a standard in an easy, repeatable fashion. That's about as close as you can get, and people have pointed out many flaws with such an approach (for instance, is a program that crashes more often but does one extra thing according to spec more robust?)
There are, though, various checks that you could use as a rough numerical estimate of the health of a system. Unit test coverage is a pretty standard one, as are its siblings: branch coverage, function coverage, statement coverage, and so on. Another good choice is a "lint" program like FindBugs, which can indicate the potential for problems. Open source projects are often judged by how frequently and recently commits and releases are made. If a project has a bug tracker, you can measure how many bugs have been fixed and what percentage of reported bugs that represents. If there's a specific instance of the program you're measuring, especially one with a lot of activity, MTBF (Mean Time Between Failures) is a good measure of robustness (see Philip's answer).
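As a sketch of what such a rough estimate could look like, here is a toy scoring function in Python. The weights and the choice of signals are entirely made up; the point is only that each input is objective and repeatable, not that this particular combination is authoritative:

```python
def health_score(statement_coverage: float,
                 branch_coverage: float,
                 lint_warnings_per_kloc: float,
                 bugs_fixed: int,
                 bugs_reported: int) -> float:
    """Fold several rough health signals into one number in [0, 1].

    The weights are arbitrary: the value is that every input is an
    objective, repeatable measurement, not that the combination itself
    is meaningful."""
    fix_ratio = bugs_fixed / bugs_reported if bugs_reported else 1.0
    # Fewer lint warnings per thousand lines of code -> closer to 1.
    lint_penalty = 1.0 / (1.0 + lint_warnings_per_kloc)
    return (0.3 * statement_coverage
            + 0.3 * branch_coverage
            + 0.2 * lint_penalty
            + 0.2 * fix_ratio)
```

For a thesis, the interesting work would be arguing which signals correlate with robustness at all, rather than picking the weights.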
These measurements, though, don't really tell you how robust a program is. They're merely ways to guess at it. If it were easy to figure out if a program was robust, we'd probably just make the compiler check for it.
Good luck with your thesis! I hope you come up with some cool new measurements!
You could look into mean time between failures as a robustness measure. The problem is that it is a theoretical quantity which is difficult to measure, particularly before you have deployed your product to a real-world situation with real-world loads. Part of the reason for this is that testing often does not cover real-world scalability issues.
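If you do have deployment data, the arithmetic itself is simple. A minimal sketch in Python, assuming you have logged total operating hours, the number of observed failures, and (for availability) a mean time to repair:

```python
def mtbf(operating_hours: float, failures: int) -> float:
    """Mean time between failures: observed operating time per failure."""
    if failures == 0:
        # No observed failures: the true MTBF is at least the whole
        # observation window, so return that as an optimistic bound.
        return operating_hours
    return operating_hours / failures

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

The hard part, as noted above, is not the formula but collecting failure counts under realistic load rather than in a lab.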
The problem with MTBF is that it is usually measured with positive (expected) traffic, whereas failures often happen in unexpected situations, so measured that way it gives no real indication of robustness or reliability. Even if a web site stays up indefinitely in a lab environment, it can still be hacked in seconds on the Internet if it has a weakness.