07.06.11

Introducing mozilla/RangedPtr.h: a smart pointer for members of buffers

Tags: , , , , — Jeff @ 15:17

Introduction

Suppose you’re implementing a method which searches for -- as part of an HTML parser:

bool containsDashDash(const char* chars, size_t length)
{
  for (size_t i = 0; i < length; i++)
  {
    if (chars[i] != '-')
      continue;
    if (chars[i + 1] == '-')
      return true;
  }
  return false;
}

But your method contains a bug! Can you spot it?

The buffer bounds-checking problem

The problem with the above implementation is that the bounds-checking is off. If chars doesn’t contain -- but ends with a -, then chars[i + 1] will read past the end of chars. Sometimes that might be harmless; other times it might be an exploitable security vulnerability.

The most obvious way to fix this in C++ is to pass an object which encapsulates both characters and length together, so you can’t use the characters except in accordance with the length. If you were using std::string, for example, accessing characters by [] or at() would generally assert or throw an exception for out-of-bounds access.

For one reason or another, however, you might not want to use an encapsulated type. In a parser, for example, you probably would want to use a pointer to process the input string, because the compiler might not be able to optimize an index into the equivalent pointer.

Is there a way to get “safety” via debug assertions or similar without giving up a pointer interface?

Introducing RangedPtr

We’re talking C++, so of course the answer is yes, and of course the answer is a smart pointer class.

The Mozilla Framework Based on Templates in mozilla-central now includes a RangedPtr<T> class. It’s defined in mfbt/RangedPtr.h and can be #included from mozilla/RangedPtr.h. RangedPtr stores a pointer, and in debug builds it stores start and end pointers fixed at construction time. Operations on the smart pointer — indexing, deriving new pointers through addition or subtraction, dereferencing, &c. — assert in debug builds that they don’t exceed the range specified by the start and end pointers. Indexing and dereferencing are restricted to the half-open range [start, end); new-pointer derivation is restricted to the range [start, end] to permit sentinel pointers. (It’s possible for start == end, although you can’t really do anything with such a pointer.)

The RangedPtr interface is pretty straightforward, supporting these constructors:

#include "mozilla/RangedPtr.h"

int nums[] = { 1, 2, 5, 3 };
RangedPtr<int> p1(nums, nums, nums + 4);
RangedPtr<int> p2(nums, nums, 4); // short for (nums, nums, nums + 4)
RangedPtr<int> p3(nums, 4); // short for (nums, nums, nums + 4)
RangedPtr<int> p4(nums); // short for (nums, length(nums))

RangedPtr<T> supports all the usual actions you’d expect from a pointer — indexing, dereferencing, addition, subtraction, assignment, equality and comparisons, and so on. All methods assert self-correctness as far as is possible in debug builds. RangedPtr<T> differs from T* only in that it doesn’t implicitly convert to T*: use get() method to get the corresponding T*. In addition to being explicit and consistent with nsCOMPtr and nsRefPtr, this will serve as a nudge to consider changing the relevant code to use RangedPtr instead of a raw pointer. But in essence RangedPtr is a pretty easy drop-in replacement for raw pointers in buffers. For example, adjusting containsDashDash to use it to assert in-rangeness is basically a single-line change:

#include "mozilla/RangedPtr.h"

bool containsDashDash(const char* charsRaw, size_t length)
{
  RangedPtr<const char> chars(charsRaw, length);
  for (size_t i = 0; i < length; i++)
  {
    if (chars[i] != '-')
      continue;
    if (chars[i + 1] == '-')
      return true;
  }
  return false;
}

(And to resolve all loose ends, if you wanted containsDashDash to be correct, you’d change the loop to go from 1 rather than 0 and would check chars[i - 1] and chars[i]. Thanks go to Neil in comments for noting this.)

A minor demerit of RangedPtr

RangedPtr is extremely lightweight and should almost always be as efficient as a raw pointer, even as it provides debug-build correctness checking. The sole exception is that, for sadmaking ABI reasons, using RangedPtr<T> as an argument to a method may be slightly less efficient than using a T* (passed-on-the-stack versus passed-in-a-register, to be precise). Most of the time the cost will be negligible, and if the method is inlined there probably won’t be any cost at all, but it’s worth pointing out as a potential concern if performance is super-critical.

Bottom line

Raw pointers into buffers bad, smart RangedPtrs into buffers good. Go forth and use RangedPtr throughout Mozilla code!

06.06.11

I feel the need…the need for JSON parsing correctness and speed!

JSON and SpiderMonkey

JSON is a handy serialization format for passing data between servers and browsers and between independent, cooperating web pages. It’s increasingly the format of choice for website APIs.

ECMAScript 5 (the standard underlying JavaScript) includes built-in support for producing and parsing JSON. SpiderMonkey has included such support since before ES5 added it.

SpiderMonkey’s support, because it predated ES5, hasn’t always agreed with ES5. Also, because JSON support was added before it became ubiquitous on the web, it wasn’t written with raw speed in mind.

Improving JSON.parse

We’ve now improved JSON parsing in Firefox 5 to be fast and fully conformant with ES5. For awhile we’ve made improvements to JSON by piecemeal change. This worked for small bug fixes, and it probably would have worked to fix the remaining conformance bugs. But performance is different: to improve performance we needed to parse in a fundamentally different way. It was time for a rewrite.

What parsing bugs got fixed?

The bugs the new parser fixes are quite small and generally shouldn’t affect sites, in part because other browsers overwhelmingly don’t have these bugs. We’ve had no compatibility reports for these fixes in the month and a half they’ve been in the tree:

  • The number syntax is properly stricter:
    • Octal numbers are now syntax errors.
    • Numbers containing a decimal point must now include a fractional component (i.e. 1. is no longer accepted).
  • JSON.parse("this") now throws a SyntaxError rather than evaluate to true, due to a mistake reusing our keyword parser. (Hysterically, because we used our JSON parser to optimize eval in certain cases, this change means that eval("(this)") will no longer evaluate to true.)
  • Strings can’t contain tab characters: JSON.parse('"\t"') now properly throws a SyntaxError.

This list of changes should be complete, but it’s possible I’ve missed others. Parsing might be a solved problem in the compiler literature, but it’s still pretty complicated. I could have missed lurking bugs in the old parser, and it’s possible (although I think less likely) that I’ve introduced bugs in the new parser.

What about speed?

The new parser is much faster than the old one. Exactly how fast depends on the data you’re parsing. For example, on Opera’s simple parse test, I get around 156000 times/second in Firefox 4, but in Firefox 5 with the new JSON parser I get around 339000 times/second (bigger is better). On a second testcase, Kraken’s JSON.parse test (json-parse-financial, to be precise), I get a 4.0 time of around 140ms and a 5.0 time of around 100ms (smaller is better). (In both cases I’m comparing builds containing far more JavaScript changes than just the new parser, to be sure. But I’m pretty sure the bulk of the performance improvements in these two cases are due to the new parser.) The new JSON parser puts us solidly in the center of the browser pack.

It’ll only get better in the future as we wring even more speed out of SpiderMonkey. After all, on the same system used to generate the above numbers, IE gets around 510000 times/second. I expect further speedup will happen during more generalized performance improvements: improving the speed of defining new properties, improving the speed with which objects are allocated, improving the speed of creating a property name from a string, and so on. As we perform such streamlining, we’ll parse JSON even faster.

Side benefit: better error messages

The parser rewrite also gives JSON.parse better error messages. With the old parser it would have been difficult to provide useful feedback, but in the new parser it’s easy to briefly describe the reason for syntax errors.

js> JSON.parse('{ foo: 17 }'); // unquoted property name
(old) typein:1: SyntaxError: JSON.parse
(new) typein:1: SyntaxError: JSON.parse: expected property name or '}'

We can definitely do more here, perhaps by including context for the error from the provided string, but this is nevertheless a marked improvement over the old parser’s error messages.

Bottom line

JSON.parse in Firefox 5 is faster, follows the spec, and tells you what went wrong if you give it bad data. ’nuff said.

16.03.11

JavaScript change for Firefox 5 (not 4): class, enum, export, extends, import, and super are once again reserved words per ECMAScript 5

Most programming languages have keywords or reserved words: names which can’t be used to name variables. Keywords have special meaning, so using them as variable names would conflict with such use. Reserved words are keywords of the future: names which might eventually be given special meaning, so they can’t be used now to ease future adoption.

JavaScript and the ECMAScript standard that underlie it historically have had an excessively large set of keywords and reserved words “inherited” from Java. ES5 partially loosened ES3‘s past keyword restrictions. For example, byte, char, and int were reserved in ES3 but aren’t in ES5.

Many years ago, before work started on ECMAScript after ES3, a few browsers stopped reserving all of ES3’s reserved words. In response browsers generally started to un-reserve many of these names. As it turned out this un-reservation went too far: ES5 un-reserved many of these words, but it didn’t un-reserve all of them. In particular, while some implementations un-reserved the names class, enum, export, extends, import, and super, ES5 did not.

Firefox un-reserved these names then along with some other browsers. But as ES5 corrects the over-reservation of ES3 without un-reserving these names, we are moving to align with ES5 by re-reserving class, enum, export, extends, import, and super in all code. (Firefox 4 reserves these names only in strict mode code.)

You can experiment with a version of Firefox with these changes by downloading a TraceMonkey nightly build. Trunk’s still locked down for Firefox 4, so it hasn’t picked up these changes just yet. (Don’t forget to use the profile manager if you want to keep the settings you use with your primary Firefox installation pristine.)

06.03.11

JavaScript change in Firefox 5 (not 4), and in other browsers: regular expressions can’t be called like functions

Callable regular expressions

Way back in the day when Netscape implemented regular expressions in JavaScript, it made them callable. If you slapped an argument list after a regular expression, it’d act as if you called RegExp.prototype.exec on it with the provided arguments.

var r = /abc/, res;

res = r("abc");
assert(res.length === 1);

res = r("def");
assert(res === null);

Why? Beats me. I’d have thought .exec was easy enough to type and clearer to boot, myself. Hopefully readers familiar with the history can explain in comments.

Problems

Callable regular expressions present one immediate problem to a “naive” implementation: their behavior with typeof. According to ECMAScript, the typeof for any object which is callable should be "function", and Netscape and Mozilla for a long time faithfully implemented this. This tended to cause much confusion in practice, so browsers that implemented callable regular expressions eventually changed typeof to arguably “lie” for regular expressions and return "object". In SpiderMonkey the “fix” was an utterly inelegant hack which distinguished callables as either regular expressions or not, to determine typeof behavior.

Past this, callable regular expressions complicate implementing callability and optimizations of it. Implementations supporting getters and setters (once purely as an extension, now standardized in ES5) must consider the case where the getter or setter is a regular expression and do something appropriate. And of course they must handle regular old calls, qualified (/a/()) and unqualified (({ p: /a/ }).p()) both. Mozilla’s had a solid trickle of bugs involving callable regular expressions, almost always filed as a result of Jesse‘s evil fuzzers (and not due to actual sites breaking).

It’s also hard to justify callable regular expressions as an extension. While ECMAScript explicitly permits extensions, it generally prefers extensions to be new methods or properties of existing objects. Regular expression callability is neither of these: instead it’s adding an internal hook to regular expressions to make them callable. This might not technically be contrary to the spec, but it goes against its spirit.

Regular expressions won’t be callable in Firefox 5

No one’s ever really used callable regular expressions. They’re non-standard, not all browsers implement them, and they unnecessarily complicate implementations. So, in concert with other browser engines like WebKit, we’re making regular expressions non-callable in Firefox 5. (Regular expressions are callable in Firefox 4, but of course don’t rely on this.)

You can experiment with a version of Firefox with these changes by downloading a TraceMonkey nightly build. Trunk’s still locked down for Firefox 4, so it won’t pick up the change until Firefox 4 branches and trunk reopens for changes targeted at the next release. (Don’t forget to use the profile manager if you want to keep the settings you use with your primary Firefox installation pristine.)

26.02.11

The proper way to call parseInt (tl;dr: parseInt(str, radix))

Introduction

Allen Wirfs-Brock recently discussed the impedance mismatch when functions accepting optional arguments are incompatibly combined, considering in particular combining parseInt and Array.prototype.map. In doing so he makes this comment:

The most common usage of parseInt passes only a single argument

Code-quality systems like JSLint routinely warn about parseInt without an explicit radix. Most uses might well pass only a single argument, but I could easily imagine this going the other way.

This raises an interesting question: why do lints warn about using parseInt without a radix?

parseInt and radixes

Like much of JavaScript, parseInt tries to be helpful when called without an explicitly specified radix. It attempts to guess a suitable radix:

assertEq(parseInt("+17"), 17);
assertEq(parseInt("42"), 42);
assertEq(parseInt("-0x42"), -66);
// assertEq(parseInt("0755"), ???); // we'll get back to this

If the string (after optional leading whitespace and + or -) starts with a non-zero decimal digit, it’s parsed as decimal. But if the string begins with 0, things get wacky. If the next character is x or X, the number is parsed in base-16: hexadecimal. Last, if the next character isn’t x or X…hold that thought. I’ll return to it in a moment.

Thus the behavior of parseInt without a radix depends not just on the numeric contents of the string but also upon its internal structure, entirely separate from its contents. This alone is reason enough to always specify a radix: specify a radix 2 ≤ r ≤ 36 and it will be used, no guessing, no uncertain behavior in the face of varying strings. (Although, to be sure, there’s still a very slight wrinkle: if r === 16 and your string begins with 0x or 0X, they’ll be skipped when determining the integer to return. But this is a pretty far-out edge case where you might want to parse a hexadecimal string without a prefix and would also want to process one with a prefix as just 0.)

But wait! There’s more

Beyond cuteness lies another concern. Let’s return to the leading-zero-but-not-hexadecimal case:

parseInt("0755");

In some programming languages a leading zero (that’s not part of a hexadecimal prefix) means the number is base-8: octal. So maybe JavaScript infers this to be an octal number, returning 7 × 8 × 8 + 5 × 8 + 5 === 493.

On the other hand, as I noted in Mozilla’s ES5 strict mode documentation, there’s some evidence that people use leading zeroes as alignment devices, thinking they have no semantic effect. So maybe leading zero is decimal instead.

The wonderful thing about standards is that there are so many of them to choose from

According to ES3, a leading zero with no explicit radix is either decimal or, if octal extensions have been implemented, octal. So what happens depends on who wrote the ES3 implementation, and what choice they made. But what if it’s an ES5 implementation? ES5 explicitly forbids octal and says this is interpreted as decimal. Therefore, parseInt("0755") is 755 in bog-standard ES3 implementations, 493 in ES3 implementations which have implemented octal extensions, and 755 in conforming ES5 implementations. Isn’t it great?

What do browsers actually do?

On the web everyone implements the octal extensions, so you’ll have to look hard to find an ES3 browser that doesn’t make parseInt("0755") === 493. But ES3 is old and busted, and ES5 is the new hotness. What do ES5 implementations do, especially as the change in ES5 isn’t backwards-compatible?

Surprisingly browsers aren’t all playing chicken here, waiting to see that they can change without breaking sites. On this particular point IE9 leads the way (in standards mode code only), implementing parseInt("0755") === 755 despite having implemented parseInt("0755") === 493 in the past. Before I saw IE9 implemented this (although I hasten to note they have not shipped a release with it yet), I expected no browser would implement it due to the possibility of breaking sites. After seeing IE9’s example, I’m less certain. Hopefully their experience will shed light on the decision for the other browser vendors.

Conclusion

Precise details of browser and specification inconsistencies aside, the point remains that parseInt(str) tries to be cute when parsing str. That cuteness can make parseInt(str) unpredictable if inputs vary sufficiently. Avoid edge-case bugs and use parseInt(str, radix) instead.

« NewerOlder »