I'm looking for this definition to make my HTML renderer conform a bit better. Currently it's guessing which whitespace to keep, which to collapse and which to throw away. The SGML standard is hard to find, and the HTML standard doesn't seem to treat the subject in the depth I need.
Currently my renderer parses the HTML into a tree and then does a recursive layout pass to position all the elements and their content. I'm experimenting with throwing some whitespace out in the parse stage, i.e. not emitting whitespace-only text chunks in certain circumstances. This kinda works for the majority of cases, but there are a fair few edge cases that are getting hard to deal with.
(I'm also working on an editor subclass of the HTML control, and layout-time solutions are proving to be a bit of a problem in the editor, hence me working on getting them into the parse stage. The layout information isn't available until reflow time, which is some time after you have edited the document.)
Fire away with linkage/flames.
So I think the closest I'm going to get for an answer on this is here: http://www.w3.org/TR/CSS2/text.html#white-space-model
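For anyone else reading, here is a rough Python sketch of how I read that section for a single text run with 'white-space: normal'. The function name collapse_normal and the prev_ends_with_space flag are made up for illustration, and it assumes newlines have already been normalized to "\n". The spec lets the user agent turn a linefeed into a space, a zero-width space or nothing depending on the script; this sketch always uses a space. Stripping spaces at the start and end of each laid-out line still has to happen at layout time and is not covered here.

```python
import re

def collapse_normal(text, prev_ends_with_space=False):
    """Collapse white space in one text run, assuming 'white-space: normal'.

    prev_ends_with_space is a hypothetical flag: pass True when the
    preceding collapsible text in the same line ended with a space.
    """
    # Remove tabs, carriage returns and spaces surrounding a linefeed.
    text = re.sub(r"[ \t\r]*\n[ \t\r]*", "\n", text)
    # Assumption: every remaining linefeed is rendered as a space.
    text = text.replace("\n", " ")
    # Convert tabs to spaces, then drop any space that follows a space.
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)
    # A leading space also collapses into a space that ended the previous run.
    if prev_ends_with_space and text.startswith(" "):
        text = text[1:]
    return text
```

The flag matters because a space can collapse across an element boundary, which is why doing this purely per text chunk in the parse stage gets awkward.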
I think the section 9.1 White space in the HTML 4 specification is what you’re looking for.
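As a small illustration, that section lists the white space characters as space, tab, form feed and zero-width space, and treats line breaks (CR, LF, or a CR/LF pair) as white space too. A minimal Python sketch of a "whitespace-only chunk" test built on that list could look like the following; HTML4_WHITESPACE and is_whitespace_only are hypothetical names, not anything from the spec or from your renderer.

```python
# Characters HTML 4.01 section 9.1 treats as white space:
# space, tab, form feed, zero-width space, plus the line break characters.
HTML4_WHITESPACE = frozenset(" \t\f\u200b\r\n")

def is_whitespace_only(text):
    """Return True if text is non-empty and contains only HTML 4 white space."""
    return len(text) > 0 and all(ch in HTML4_WHITESPACE for ch in text)
```

An explicit set is used rather than Python's str.isspace() because the two disagree: isspace() accepts the no-break space (U+00A0), which is not in the HTML 4 list, and rejects the zero-width space (U+200B), which is.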
If you're writing your own HTML parser, then I strongly recommend you use the parsing algorithm in the HTML 5 spec (http://www.whatwg.org/html5). It covers a large number of edge and corner cases, and general browser weirdness. Browsers don't follow SGML rules, but they are all homing in on either doing what the HTML 5 spec says, or the functional equivalent of it. There are several open source parsers available that implement the algorithm, so it should have everything you need.