If you plan on allowing absolutely no tags, htmlentities will do the trick [EDIT: in most cases].
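To make the "no tags at all" case concrete, here is a minimal sketch in Python rather than PHP. It assumes html.escape as a rough stand-in for htmlentities (it only escapes the five markup-significant characters, so it is not an exact equivalent), and escape_all is just a name I made up for the example.

```python
import html


def escape_all(user_input: str) -> str:
    """Escape every markup-significant character so no tag survives.

    Hypothetical helper: html.escape is used here as an approximate
    analogue of PHP's htmlentities, not as an exact equivalent.
    """
    # quote=True also escapes double and single quotes, which matters
    # when the value ends up inside an HTML attribute.
    return html.escape(user_input, quote=True)


if __name__ == "__main__":
    print(escape_all('<script>alert("x")</script>'))
```

With that in place the browser shows the markup as text instead of interpreting it, which is why this is enough as long as you never need to let any tag through.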
Believe it or not, almost all browsers out there today are “seriously flawed”. They make a lot of assumptions to render sloppy code.
So once you start allowing some tags (or transform input to create tags, e.g. BBCode), you run into the lax browser rendering engines. Under the right conditions, browsers will accept entities without semicolons. Also, many accept the javascript: directive in places where JavaScript simply does not belong.
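As a rough illustration of why those lax rules hurt, the sketch below (Python, purely for demonstration) decodes entities the way an HTML5 parser would, including ones with no trailing semicolon, and then checks for a javascript: scheme. The function name looks_like_script_url is my own, and this is not how xss_clean is implemented; it only shows the kind of input a filter has to anticipate.

```python
import html
import re


def looks_like_script_url(value: str) -> bool:
    """Return True if value resolves to a javascript: URL after decoding.

    Hypothetical helper for illustration only.
    """
    # html.unescape follows HTML5 rules, so a numeric entity such as
    # "&#106" is decoded even though the semicolon is missing.
    decoded = html.unescape(value)
    # Drop whitespace and control characters, since lenient browsers
    # may ignore them inside the scheme.
    compact = re.sub(r"[\x00-\x20]+", "", decoded)
    return compact.lower().startswith("javascript:")


if __name__ == "__main__":
    print(looks_like_script_url("&#106avascript:alert(1)"))
    print(looks_like_script_url("java\tscript:alert(1)"))
    print(looks_like_script_url("http://example.com/"))
```

A naive check for the literal string "javascript:" would miss the first two inputs, which is the whole problem with matching on what the user typed rather than on what the browser will see.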
A positive security policy is always the best approach (specify what is allowed). Always strip as much as you can and validate where possible. That means that a URL field should be checked to contain a URL. (! and if someone tells you it doesn’t, for god’s sake fix it !)
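A positive policy for a URL field could look something like the following sketch. It is Python for illustration, and the allowed schemes and the name is_allowed_url are assumptions on my part; adjust the list to whatever your application actually needs.

```python
from urllib.parse import urlsplit

# Assumption for this example: only plain web links are acceptable.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_url(value: str) -> bool:
    """Accept a value only if it is an absolute http(s) URL with a host.

    Hypothetical helper: everything not explicitly allowed is rejected.
    """
    try:
        parts = urlsplit(value.strip())
    except ValueError:
        # urlsplit raises on some malformed inputs (e.g. a broken
        # IPv6 host); treat those as invalid.
        return False
    # urlsplit lowercases the scheme, so "HTTP" and "http" compare equal.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


if __name__ == "__main__":
    print(is_allowed_url("https://example.com/page"))
    print(is_allowed_url("javascript:alert(1)"))
```

The point of doing it this way round is that you never have to enumerate the bad inputs: anything that is not recognisably an http or https URL is simply refused.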
The xss_clean function primarily uses a very strict negative security policy (recognize script elements and other attack vectors and remove them). In some cases, there is a little bit of whitelisting, such as validating image and anchor tags. Since browser rendering is so lax and unstandardized, and blacklisting is such a tough thing to do, the only way for any of this to work effectively is to be very broad. That does create some false positives that we just have to live with.
As for those test cases, the SVN version allows one character behind the &, because there is no entity that looks like that. I would agree, though, that the whitespace character should be reinserted after the tag (at least tab, line feed, and carriage return, i.e. \09, \10, and \13; the other invisibles are stripped anyway).
