Astute observers of the recent Firefox 3.0.11 release will note that this release gains a point on the Acid3 test, specifically fixing test 68, involving a difference in how Mozilla processes UTF-16 from what the Unicode specification requires. (Indeed, it seems Wikipedia’s Acid3 article was updated to reflect this on the very day of the 3.0.11 release, and that edit has even been reverted and un-reverted once. One does wonder at times whether these people don’t have better things to do with their time. 🙂 ) What did test 68 check, why was it fixed, and why was it fixed in a dot release of a stable version? If you’re curious about this and are willing to dive into technical details at some depth, keep reading. If you aren’t, I write here for the pleasure of a presumptively-interested audience whether or not any such thing exists for the topic at hand, so you’ll forgive me for not caring too much what you do. 🙂
Representing text through Unicode and UTF-16
Without descending into too much detail, modern computer programs represent text as a sequence of numbers, one number for each “character” in the text. Unicode defines each such number as a code point and assigns each number a particular meaning (e.g. 99 is “LATIN SMALL LETTER C”, that is, the letter ‘c’). Such an arbitrary sequence can’t simply be represented as itself partly because computers can only efficiently handle numbers whose values are within small, fixed ranges, and code points cover too large a range (0 to 1114111) to handle directly. Unicode therefore defines a small number of well-known, widely adopted processes to convert an idealized sequence of code points into computer-friendly sequences.
UTF-16 is one way to represent a sequence of code points, using a sequence of 16-bit numbers. Broadly speaking, code points in the range 0 to 65535 are represented using the identical number; code points 65536 and greater are represented by a pair of 16-bit numbers.  There’s an obvious problem: how do you determine if a 16-bit number encodes a single code point or half of one?  Basically, not every 16-bit number is a code point; there are intentionally no code points from hexadecimal 0xD800 through 0xDFFF (55296 through 57343).  With a little care, we can use two (exactly two, no more, no fewer) numbers from this range to represent a code point.  For example, the hexadecimal number 0xD863, 55395 in decimal, does not correspond to a valid Unicode code point, nor does 0xDDDD, 56797 in decimal.  The sequence 0xDDDD 0xD863 also corresponds to no Unicode code point or sequence of code points.  However, 0xD863 0xDDDD represents the Unicode code point with the hexadecimal value 0x28DDD, 167389 in decimal.
Acid3 test 68 and its fix
The failing test 68 examined how to interpret a purportedly UTF-16 value which actually was in error. Specifically, consider a sequence of 16-bit values like so, to be interpreted as UTF-16:
| Index | Value description | Example | 
|---|---|---|
| 0 | Value appearing to be the first in a valid pair (a high surrogate) | 0xD863 (decimal 55395) | 
| 1 | Value not appearing to be the second in a valid pair (instead appearing to be a value in range [0, 65536) validly represented as itself) | 0x61 (decimal 97) | 
| 2 | Value in range [0, 65536) validly represented as itself | 0x61 (decimal 97) | 
How should such “UTF-16” data be interpreted? First, note that Unicode includes the concept of a “replacement character”, used to fill in for the malformed parts of ostensibly-UTF-16 data when interpreting it. With that concept in mind, we have three plausible ways to interpret this sequence of values:
- complain that the data wasn’t UTF-16 and refuse to interpret it
- interpret as:
- replacement character, identical value at index 1, identical value at index 2
- replacement character, identical value at index 2
 
In fact if you look at test 68 itself, you’ll see a variety of responses are (to put it as conservatively as possible, since the specs are currently insufficiently precise to admit only one correct behavior) “not prohibited” by the relevant specifications. Mozilla at the time chose to interpret such data as the last of the three possibilities, thinking that the pair of 16-bit values was invalid rather than merely the first of them (in which case the second of the “pair” would be interpreted as its own value or start of a pair). However, the Unicode standard didn’t permit this choice, and neither did Acid3; further considering that other browsers had correct (and just as important, different) behavior, it made sense to change our interpretation in this case. I pushed a fix to the mozilla-central repository very nearly a year ago, and I made no attempt to get it in 3.0, for two reasons. First, the bug didn’t matter in the real world (web developers are very unlikely to have relied on the previous behavior, and the fix enables no new, desirable functionality); second, I knew we didn’t have enough time before the release to sniff out all potential regressions from such a low-level change and be confident nothing had been broken.
Acid3 refldux
Fast-forward a year later, however, and suddenly the fix for test 68 is in 3.0.11. The test, while important to fix eventually for a number of reasons, is not especially important for real-world behavior (a flaw of many aspects of Acid3, notwithstanding its many useful tests of desirable functionality, but I digress), so why fix it now rather than in the 3.5 release? Surely such a non-essential bugfix is better left to 3.5 as it already had been, right? That’s certainly true enough if our assumptions are correct — but in this case, curiously, they’re not.
One of the things that makes changing how character encoding works so exciting is that it affects a lot of other code. A small bugfix to such code usually has no effect on properly formed input, because such input is the normal case that receives regular testing. Improperly formed input, however, may cause immediate problems (which are usually simple to smoke out through careful creation and use of automated tests) or problems in other code at miles of distance (which are far more difficult to discover). Suppose data is interpreted by two different decoding implementations, producing two different idealized representations — what are the consequences? Maybe the lengths will differ. If those implementations are used in code written by memory nazis, perhaps a string will be copied incorrectly and result in a buffer overflow. Algorithmic differences might also throw off hashing schemes that use string length in computing a hash key. Of course, if lengths differ the characters must differ as well. That might cause a CSS selector to not apply correctly, or it might introduce a vector for slipping forbidden characters through an anti-XSS filter. Character encoding and decoding is fundamental to any number of other systems which rely on precise, correct behavior in order to work properly. If you don’t have that behavior, all bets are off.
In this case, as it turns out, two separate bugs were uncovered which my proposed patch fixed: bug 490513 and bug 489041. If you read the details in each bug, you’ll note that there’s very little in either bug to suggest why my change might have any effect. To be sure, both testcases deal with strings containing problematic sequences as above, but nothing in the testcase explicitly suggests UTF-16 decoding is happening.
A useful first step in examining any bug in a bug database is to look at its ancestry. Curiously, bug 490513 is a Bugzilla clone of bug 439206, with the same steps to reproduce and the same testcase. From there we proceed to the fix for the bug. The patch is small, and if one understands the relevant code it’s similarly easy to understand, but —
The comment in the patch looks strangely familiar.
In fact I seem to remember exactly that comment in my patch for bug 421576 that fixed test 68. (Lest I be misconstrued, this comment-cribbing is perfectly acceptable in open source code, indeed is even stylistically better for demonstrating consistent intent in the code. However, your mileage may vary in other code or if perchance you happen to work for Microsoft.)
Also suspicious: this bug was filed roughly a week after I pushed 421576 to mozilla-central. A little more investigation confirms the obvious conclusion: bug 439206 was a regression from bug 421576, vindicating my initial thoughts in bug 421576 that “we couldn’t reasonably take this now and expect to be able to sniff out all possible regressions”. It seems that I updated most of our decoding code to handle lone high surrogates correctly, but I missed the spot being fixed in this patch. The code being patched here handles string hashing within Mozilla’s hash tables, and if you read the other bug comments you can see the testcase is causing a string to be hashed using two different algorithms (a failure mode I mentioned earlier), and as the computed hashes differed things went awry.
Here’s where Denmark turned rotten: this change was deemed important to fix for 3.0 point releases, but because no one noticed it was a trunk-only regression and thus didn’t need to be fixed in a security updates, it was backported to 3.0. The problem on trunk was that I only updated half the decoding algorithms in bug 421576, so fixing bug 439206 fixed the other half and brought them into sync. What did fixing bug 439206 in 3.0 do? It updated the other half of the decoding algorithms and took that half out of sync with what the other half was before 421576! Our trunk problem — that we changed one decoding algorithm but not a second that needed to be synchronized with it — had been ported in mirror image to 3.0 point releases. This problem, then, triggered the filing of bug 490513 (incidentally regressing bug 489041 as well), and for precisely the same reasons bug 439206 was marked as security-sensitive until being fixed, bug 490513 was marked as security-sensitive.
Let’s recap:
- I fix bug 421576 (and Acid3 test 68) in mozilla-central.
- This causes the security-sensitive regression bug 439206.
- Bug 439206 is investigated and fixed in mozilla-central.
- This fixes the potential vulnerability I introduced.
- People recognize 439206 as potentially dangerous but not as a regression, so the same fix is added to 3.0 point release code.
- This causes the security-sensitive regression bug 490513 (and bug 489041 as well), because we’re missing the first half of the code (in 421576) that caused 439206.
- That bug is investigated by the reviewer of my fix for 421576, who correctly hypothesizes that my fix will fix that bug without determining exactly why.
- 421576 is fixed in 3.0.11.
- This fixes the potential security vulnerability in 3.0.11 (and less importantly, Acid3 test 68).
- Both halves (421576 and 439206) are now fixed in mozilla-central and in 3.0 point release code.
…and people wonder why we’re so hesitant to fix anything except security bugs in updates to stable releases, instead deferring to the subsequent major release. The potential for error, even in code that’s been written and reviewed by four different people across the two bugs, introduces a far higher cost than can be offset by the value in fixing nearly any non-security, non-stability bug.
Lessons for the future
All this mess is now behind us with the 3.0.11 release. However, a number of the causes for failure here are not peculiar to the precise bugs fixed here. What can we learn from this that can be applied in the future?
Make string encoding/decoding code simpler
First, and most pertinent to the case at hand, string encoding and decoding are complex, and we need to do everything we can to make this code simpler.  A large part of the reason 439206 was missed when fixing 421576 was that the relevant code was not part of Mozilla’s string code — instead of residing in xpcom/string, it was in xpcom/ds.  It’s difficult to argue that a data structure used to hold atomized strings (that is, strings which uniquely identify a sequence of characters, making a comparison of two atomized strings as fast as comparing two numbers) shouldn’t reside in a data structures (ds) directory.  However, the code to compute the hash of the string (a single number that summarizes a string’s contents, hopefully uniquely but possibly non-uniquely) must simultaneously decode the string, and that code should be in xpcom/string with all other string decoding code.
In response to this state of relative disarray, I’ve filed bug 497204 to reorganize and consolidate Mozilla’s string code, with the primary goal of getting it all in one location (so that even if you’re unfamiliar with it, you at least know all the code you might need to read) and with the secondary goal of organizing it in a clearer and simpler fashion (to make the consolidated code easier to read). We may still end up making mistakes in how we handle string encoding and decoding even with that work complete, but those mistakes will be easier to find, diagnose, and completely fix.
Write more automated tests, and make it possible to include them in security fixes
Second, we need to move as much testing like this out of human hands and into “computer hands” as possible. One reason we introduced a regression was that we relied on fallible human testing to varying degrees throughout this whole process. I relied on some informal testing of my original patch in deeming it complete; the regression fix did likewise as manual QA testing verified the problem had been fixed on trunk and then later discovered the bug not to be fixed in 3.0.x releases (after the unnecessary backport). Yet I didn’t uncover my omission, and QA didn’t notice the backport caused the branch regression rather than fixed it. Some amount of manual testing, formalized through QA or otherwise through masses of nightly testers, is both desirable and unavoidable. Most of the time, however, we are much better off with automated tests that can be run with much less effort, on a much shorter time scale, with much greater rigor, and with much less chance of mistakes in their execution.
Now, to be fair, automated tests have less value here than in many other case. I included automated tests in my original patch, but they only addressed the bug at hand and not the regression (else that regression would never have occurred). It’s true that the followup would have been helped by an automated test, because when it was ported to branch it would have failed, and the backport would have been reverted pending further investigation. (Whether this investigation would have led to discovering bug 439206 to be a trunk-only regression is unclear, but it certainly seems plausible.) Here, however, we encounter the large problem that we currently can’t include automated tests in security bug fixes because we don’t want to tip our hands before a release with the fix is available. (In cases such as this one this worry is perhaps more paranoid than well-founded, but in many other cases where a testcase is half a step from a full exploit it’s vital; mrbkap can elaborate on this at length.) The testcase is often committed after the release, particularly when the problem is memory corruption in ways that appear difficult to control (the apparent case here), but by that time the damage has already been done.
Automated tests are not a panacea, as my original fix shows. Nevertheless, consistent use of them here would have at least eliminated the 3.0.11 regression if not the trunk regression. For this to happen we must have security bugs include automated tests that run, essentially, as soon as the fix lands. The solution that I believe best meets this need as I envision it consists of one separate, private Mecurial repository per actively maintained release to which security bug tests may be committed. Access to these repositories would be strictly limited to developers with access to security bugs plus accounts for use by tinderboxen. We would then add additional steps to the build process on those tinderboxen to pull and run the security tests, reporting either PASS or FAIL for the lot of them. Detailed information would not be publicly displayed but would be available in some unspecified manner to developers who cause security bug tests to fail. Then, as security bugs are fixed, we would require two-phase commits for security bugs: first the fix to the public repository, then the test to the restricted repository. As failures would turn the main tinderbox orange, this ensures regressions get attention between the initial commit to the security repository and their eventual migration to the main repository — early enough to make the difference in cases like this. Another plus: I believe this would dovetail nicely with work in bug 383136 to make it possible to run tests against prepackaged builds.
There may be a different solution which meets the needs I specify here; I’m less concerned about the process than about results. However, I haven’t heard another proposal that I believe would work well enough.
Manual testing: actually, I don’t know what the lesson is here
Third, we have manual testing, imperfect but still necessary to some degree. I don’t know what the lesson for QA and manual testing is. Sure, they could be more diligent about checking that a bug exists pre-commit and is fixed post-commit, but in isolation every such action is reasonable. Is the decrease in available time worth the benefit of potentially catching a problem like here every so often? I don’t know the exact processes they follow these days or how they otherwise spend their time, so I really can’t evaluate this. I’ll let QA consider this situation and decide how to adjust, because I’m certainly not qualified to do so.
Conclusion
The particular bugs at issue here are now fixed, so for the moment we’re back to steady state. As explained above, however, it’s possible that further errors might happen for the same basic reasons unless we make an effort to eliminate those reasons, and we still have work to do to fix the root problems. I have hope that this article may spur improvements in processes that will make mistakes like this much harder to make, but it will take more than just me to make these changes happen. In the meantime, to Firefox 3.0.11 users, enjoy the gift of an unintended Acid3 point; it came at a much higher cost than we’d have been willing to pay if we had known the full story and never made either 3.0.x backport.
If bug 439206 had been identified as a regression when it was filed, and that the culprit had been identified by finding the changeset that caused it and backing out the patch, this issue would have been avoided, right? It would then be clear that if that patch had to be backported, so did the first patch. Perhaps another lesson is if a bug is found on trunk, the bug that regressed it must be identified, or it should be shown not to be a regression.
[Also a reasonable conclusion. Again, I defer to people with more QA experience to say if following those steps is an overall win or not, because that is an awful lot of time tracking down regressions whose origin may not matter. Perhaps it should merely be a requirement for security bugs — that might reduce the cost enough to make it more obviously a win.]
Comment by Dan — 15.06.09 @ 12:25
Very interesting post. I’d say the key failure was that nobody realized 439206 was a regression. Though, if you don’t know that a bug is a regression, it is non-trivial to find out whether it is in fact a regression. For every bug you would need to at least check all the major releases. Theoretically, you would have to check every changeset to ensure that something is not a regression.
[Exactly what I alluded to in the previous comment, regarding the costs of doing so.]
Comment by anonymous — 15.06.09 @ 12:31
Indeed, perhaps enforcing what I suggested for security bugs only would be manageable. Another way to make it more manageable would be to have a maximum time period that you need to go back to (when deciding whether the bug is a regression or not). I don’t think it’s entirely unmanageable, many people already follow this “binary search” procedure when filing bugs.
Comment by Dan — 16.06.09 @ 12:33
[…] A bug in UTF-16 processing […]
Pingback by Where's Walden? » Whole-text DOM functionality and Acid3 redux — 18.03.10 @ 11:19