27.02.17

A pitfall in C++ low-level object creation and storage, and how to avoid it

While doing a couple recent reviews, I’ve read lots of code trying to solve the same general problem. Some code wants to store an object of some type (more rarely, an object from among several types), but it can’t do so yet. Some series of operations must occur first: a data structure needing to be put into the right state, or a state machine needing to transition just so. Think of this, perhaps, as the problem of having a conditionally initialized object.

Possibly-unsatisfactory approaches

One solution is to new the object at the proper time, storing nullptr prior to that instant. But this imposes the complexity of a heap allocation and handling its potential failure.

Another solution is to use mozilla::Maybe<T>. But Maybe must track whether the T object has been constructed, even if that status is already tracked elsewhere. And Maybe is potentially twice T’s size.

The power approach

The most common solution is to placement-new the object into raw storage. (This solution is particularly common in code written by especially-knowledgeable developers.) mfbt has historically provided mozilla::AlignedStorage and mozilla::AlignedStorage2 to implement such raw storage. Each class has a Type member typedef implementing suitable aligned, adequately-sized storage. Bog-standard C++11 offers similar functionality in std::aligned_storage.

Unfortunately, this approach is extremely easy to subtly misuse. Misuse occurs regularly in Mozilla code, and it can trigger crashes and has done so.

A detour into the C++ object model

C++ offers very fine-grained, low-level control over memory. It’s absurdly easy to access memory representations with the right casts.

But just because it’s easy to access memory representations, doesn’t mean it’s easy to do it safely. The C++ object model generally restricts you to accessing memory according to the actual type stored there at that instant. You can cast a true float* to uint32_t*, but C++ says undefined behavior — literally any behavior at all — may occur if you read from the uint32_t*. These so-called strict aliasing violations are pernicious not just because anything could happen, but because often exactly what you wanted to happen, happens. Broken code often still “works” — until it unpredictably misbehaves, in C++’s view with absolute justification.

[Image: a dragon lays waste to men attacking it. Caption: Here be dragons (CC-BY-SA, by bagogames)]

There’s a big exception to the general by-actual-type rule: the memcpy exception. (Technically it’s a handful of rules that, in concert, permit memcpy. And there are other, non-memcpy-related exceptions.) memcpy(char*, const char*, size_t) has always worked to copy C-compatible objects around without regard to types, and C++’s object model permits this by letting you safely interpret the memory of a T as chars or unsigned chars. If T is trivially copyable, two guarantees follow. First:

T t1; 
char buf[sizeof(T)];
memcpy(buf, &t1, sizeof(T)); // stash bytes away
// ...time elapses during execution...
memcpy(&t1, buf, sizeof(T)); // restore them
// t1 safely has its original value

You can safely copy a T by the aforementioned character types elsewhere, then back into a T, and things will work. And second:

T t1, t2;
memcpy(&t1, &t2, sizeof(T)); // t1 safely has t2's value

You can safely copy a T into another T, and the second T will have the first T’s value.

A C++-compatible placement-new approach

The placement-new-into-storage approach looks like this. (Real code would almost always use something more interesting than double, but this is the gist of it.)

#include <new> // for placement new

struct ContainsLazyDouble
{
    // Careful: align the storage consistent with the type it'll store.
    alignas(double) char lazyData[sizeof(double)];
    bool hasDouble_;

    // Indirection through these functions, rather than directly casting
    // lazyData to double*, evades a buggy GCC -Wstrict-aliasing warning.
    void* data() { return lazyData; }
    const void* data() const { return lazyData; }

  public:
    ContainsLazyDouble() : hasDouble_(false) {}

    void init(double d) {
      new (data()) double(d);
      hasDouble_ = true;
    }

    bool hasDouble() const { return hasDouble_; }
    double d() const {
      return *reinterpret_cast<const double*>(data());
    }
};

ContainsLazyDouble c;
// c.d(); // BAD, not initialized as double
c.init(3.141592654);
c.d(); // OK

This is safe. c.lazyData was originally char data, but we allocated a new double there, so henceforth that memory contains a double (even though it wasn’t declared as double), not chars (even though it was declared that way). The actual type stored in lazyData at that instant is properly respected.

A C++-incompatible extension of the placement-new approach

It’s safe to copy a T by the aforementioned character types. But it’s only safe to do so if 1) T is trivially copyable, and 2) the copied bytes are interpreted as T only within an actual T object. Not into a location to be (re)interpreted as T, but into a T. It’s unsafe to copy a T into a location that doesn’t at that instant contain a T, then reinterpret it as T.

So what happens if we use ContainsLazyDouble’s implicitly-defined default copy constructor?

ContainsLazyDouble c2 = c;

This default copy constructor copies ContainsLazyDouble member by member, according to their declared types. So c.lazyData is copied as a char array that contains the object representation of c.d(). c2.lazyData therefore contains the same char array. But it doesn’t contain an actual double. It doesn’t matter that those chars encode a double: according to C++, that location does not contain a double.

Dereferencing reinterpret_cast<const double*>(data()) therefore mis-accesses an array of chars by the wrong type, triggering undefined behavior. c2.d() might seem to work if you’re lucky, but C++ doesn’t say it must work.

This is extraordinarily subtle. SpiderMonkey hackers missed this issue in their code until bug 1269319 was debugged and a (partly invalid, on other grounds) GCC compiler bug was filed. Even (more or less) understanding the spec intricacy, I missed this issue in some of the patchwork purporting to fix that bug. (Bug 1341951 provides an actual fix for one of these remaining issues.) Another SpiderMonkey hacker almost introduced another instance of this bug; fortunately I was reviewing the patch and flagged this issue.

Using AlignedStorage<N>::Type, AlignedStorage2<T>::Type, or std::aligned_storage<N>::type doesn’t avoid this problem. We mitigated the problem by deleting AlignedStorage{,2}::Type’s copy constructors and assignment operators that would always do actual-type-unaware initialization. (Of course we can’t modify std::aligned_storage.) But only careful scrutiny prevents other code from memcpying those types. And memcpy will copy without respecting the actual type stored there at that instant, too. And in practice, developers do try to use memcpy for this when copy construction and assignment are forbidden, and reviewers can miss it.

What’s the solution to this problem?

As long as memcpy and memmove exist, this very subtle issue can’t be eradicated. There is no silver bullet.

The best solution is don’t hand-roll raw storage. This problem doesn’t exist in Maybe, mozilla::Variant, mozilla::MaybeOneOf, mozilla::Vector, and other utility classes designed to possibly hold a value. (Sometimes because we just fixed them.)

But if you must hand-roll a solution, construct an object of the actual type into your raw storage. It isn’t enough to copy the bytes of an object of the actual type into raw storage, then treat the storage as that actual type. For example, in ContainsLazyDouble, a correct copy constructor that respects C++ strict aliasing rules would be:

#include <string.h> // for memcpy

// Add this to ContainsLazyDouble:
ContainsLazyDouble(const ContainsLazyDouble& other)
  : hasDouble_(other.hasDouble_)
{
  if (hasDouble_)
  {
    // The only way to allocate a free-floating T, is to
    // placement-new it, usually also invoking T's copy
    // constructor.
    new (data()) double(other.d());

    // This would also be valid, if almost pointlessly bizarre — but only
    // because double is trivially copyable.  (It wouldn't be safe
    // to do this with a type with a user-defined copy constructor, or
    // virtual functions, or that had to do anything at all to initialize
    // the new object.)
    new (data()) double; // creates an uninitialized double
    memcpy(lazyData, other.lazyData, sizeof(lazyData)); // sets to other.d()
  }
}

// ...and this to the using code:
ContainsLazyDouble c2 = c; // invokes the now-safe copy constructor

The implicitly-generated copy assignment operator will usually require similar changes — or it can be = deleted.

Final considerations

AlignedStorage seems like a good idea. But it’s extremely easy to run afoul of a copy operation that doesn’t preserve the actual object type, by the default copy constructor or assignment operator or by memcpy in entirely-separate code. We’re removing AlignedStorage{,2} so these classes can’t be misused this way. (The former has just been removed from the tree — the latter has many more users and will be harder to kill.) It’s possible to use them correctly, but misuse is too easy to leave these loaded guns in the tree.

If it’s truly necessary to hand-roll a solution, you should hand-roll all of it, all the way down to the buffer of unsigned char with an alignas() attribute. Writing this correctly is expert-level C++. But it was expert-level C++ even with the aligned-storage helper. You should have to know what you’re doing, and you shouldn’t need the std::aligned_storage crutch to do it.

Moreover, I hope that the extra complexity of hand-rolling discourages non-expert reviewers from reviewing such code. I’d feel significantly more confident Mozilla won’t repeat this mistake if I knew that every use of (for example) alignas were reviewed by (say) froydnj or me. Perhaps we can get some Mercurial tooling in place to enforce a review requirement along those lines.

In the meantime, I hope I’ve made the C++ developers out there who read this, at least somewhat aware of this pitfall, and at best competent to avoid it.

27.05.16

Using Yubikeys with Fedora 24, for example for Github two-factor authentication

Jeff @ 17:17

My old laptop’s wifi went on the fritz, so I got a new Lenovo P50. Fedora 23 wouldn’t work with the Skylake architecture, so I had to jump headfirst into the Fedora 24 beta.

I’ve since hit one new issue: Yubikeys wouldn’t work for FIDO U2F authentication. Logging into a site using a Yubikey (inserting a Yubikey USB device and tapping the button when prompted) wouldn’t work. Attempting this on Github would display the error message, “Something went really wrong.” Nor would registering Yubikeys with sites work. On Github, attempting to register Yubikeys would give the error message, “This device cannot be registered.”

Interwebs sleuthing suggests that Yubikeys require special udev configuration to work on Linux. The problem is that udev doesn’t grant access to the Yubikey, so when the browser tries to access the key, things go Bad. A handful of resources pointed me toward a solution: tell udev to grant access to the device.

As root, go to the directory /etc/udev/rules.d. It contains files with names of the form *.rules, specifying rules for how to treat devices added and removed from the system. In that directory create the file 70-u2f.rules. Its contents should be those of 70-u2f.rules, from Yubico’s libu2f-host repository. (Most of this file is just selecting various Yubikey devices to apply rules against. The important part of this file is the TAG+="uaccess" ending the various lines. This adds the “uaccess” tag to those devices; systemd-logind recognizes this tag and will grant access to the device to the current logged-in user.) Finally, run these two commands to refresh udev state:

udevadm control --reload
udevadm trigger

Yubikeys should now work for authentication.

These steps work for me, and they appear to me to be a sensible way to solve the problem. But I can’t say for sure that they’re the best way to solve it. (Nor am I sure why Fedora doesn’t handle this for me.) If anyone knows a better way that doesn’t involve modifying the root file system, I’d love to hear it in comments.

20.06.15

New changes to make SpiderMonkey’s (and Firefox’s) parsing of destructuring patterns more spec-compliant

Destructuring in JavaScript

One new feature in JavaScript, introduced in ECMAScript 6 (formally ECMAScript 2015, but it’ll always be ES6 in our hearts), is destructuring. Destructuring is syntactic sugar for assigning sub-values within a single value — nested properties, iteration results, &c., to arbitrary depths — to a set of locations (names, properties, &c.).

// Declarations
var [a, b] = [1, 2]; // a = 1, b = 2
var { x: c, y: d } = { x: 42, y: 17 }; // c = 42, d = 17

function f([z]) { return z; }
print(f([8675309])); // 8675309


// Assignments
[b, f.prop] = [3, 15]; // b = 3, f.prop = 15
({ p: d } = { p: 33 }); // d = 33

function Point(x, y) { this.x = x; this.y = y; }

// Nesting's copacetic, too.
// a = 2, b = 4, c = 8, d = 16
[{ x: a, y: b }, [c, d]] = [new Point(2, 4), [8, 16]];

Ambiguities in the parsing of destructuring

One wrinkle to destructuring is its ambiguity: reading start to finish, is a “destructuring pattern” instead a literal? Until any succeeding = is observed, it’s impossible to know. And for object destructuring patterns, could the “pattern” just be a block statement? (A block statement is a list of statements inside {}, e.g. many loop bodies.)
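To make the ambiguity concrete, here is a minimal sketch (the variable names are only illustrative). In each pair below the statement starts with exactly the same tokens, and only the presence of a following = decides whether those tokens form a literal or a pattern.

```javascript
var a = 1, b = 2;

// Same leading tokens "[a, b]" in both statements.
[a, b];            // no "=" follows: an array literal, evaluated and discarded
[a, b] = [b, a];   // "=" follows: a destructuring pattern (this swaps a and b)

// Same leading tokens "({ x: a }" in both statements.
({ x: a });           // an object literal whose x property holds a's value
({ x: a } = { x: 10 }); // an object pattern that assigns 10 to a

console.log(a, b); // 10 1
```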

How ES6 handles the potential parser ambiguities in destructuring

ES6 says an apparent “pattern” could be any of these possibilities: the only way to know is to completely parse the expression/statement. There are more elegant and less elegant ways to do this, although in the end they amount to the same thing.

Object destructuring patterns present somewhat less ambiguity than array patterns. In expression context, { may begin an object literal or an object destructuring pattern (just as [ does for arrays, mutatis mutandis). But in statement context, { since the dawn of JavaScript only begins a block statement, never an object literal — and now, never an object destructuring pattern.

How then to write object destructuring pattern assignments not in expression context? For some time SpiderMonkey has allowed destructuring patterns to be parenthesized, incidentally eliminating this ambiguity. But ES6 chose another path. In ES6 destructuring patterns must not be parenthesized, at any level of nesting within the pattern. And in declarative destructuring patterns (but not in destructuring assignments), declaration names also must not be parenthesized.
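The statement-versus-expression distinction can be observed directly. The following sketch uses eval only because it exposes the completion value of a statement; the variable names are illustrative.

```javascript
// In statement context "{" opens a block. Here the block contains the
// labeled expression statement `x: 1`, so its completion value is 1.
var asStatement = eval("{ x: 1 }");

// Wrapped in parentheses, the same text sits in expression context
// and is an object literal.
var asExpression = eval("({ x: 1 })");

// An object destructuring assignment likewise has to be an expression,
// so the whole assignment (not the pattern) is parenthesized.
var d;
({ p: d } = { p: 33 });

console.log(asStatement, asExpression.x, d); // 1 1 33
```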

SpiderMonkey now adheres to ES6 in requiring no parentheses around destructuring patterns

As of several hours ago on mozilla-inbound, SpiderMonkey conforms to ES6’s parsing requirements for destructuring, with respect to parenthesization. These examples are all now syntax errors:

// Declarations
var [(a)] = [1]; // BAD, a parenthesized
var { x: (c) } = {}; // BAD, c parenthesized
var { o: ({ p: p }) } = { o: { p: 2 } }; // BAD, nested pattern parenthesized

function f([(z)]) { return z; } // BAD, z parenthesized

// Top level
({ p: a }) = { p: 42 }; // BAD, pattern parenthesized
([a]) = [5]; // BAD, pattern parenthesized

// Nested
[({ p: a }), { x: c }] = [{}, {}]; // BAD, nested pattern parenthesized

Non-array/object patterns in destructuring assignments, outside of declarations, can still be parenthesized:

// Assignments
[(b)] = [3]; // OK: parentheses allowed around non-pattern in a non-declaration assignment
({ p: (d) } = {}); // OK: ditto
[(parseInt.prop)] = [3]; // OK: parseInt.prop not a pattern, assigns parseInt.prop = 3
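One way to check these rules against whatever engine you are running is to ask it to parse each snippet without executing it. The sketch below assumes a hypothetical helper, parses(), built on the Function constructor, which compiles its argument as a function body and throws if the body contains an early error.

```javascript
// Hypothetical helper: true if `source` is accepted as a function body.
// The body is compiled but never run.
function parses(source)
{
  try {
    new Function(source);
    return true;
  } catch (e) {
    return false;
  }
}

var rejected = [
  "var [(a)] = [1];",
  "var { x: (c) } = {};",
  "({ p: a }) = { p: 42 };",
  "([a]) = [5];",
  "[({ p: a }), { x: c }] = [{}, {}];"
];

var accepted = [
  "[(b)] = [3];",
  "({ p: (d) } = {});",
  "[(parseInt.prop)] = [3];"
];

rejected.forEach(function (src) { console.log("BAD?", !parses(src), src); });
accepted.forEach(function (src) { console.log("OK? ", parses(src), src); });
```

An engine that follows the ES6 rules described here should report true on every line.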

Conclusion

These changes shouldn’t much disrupt anyone writing JS. Parentheses around array patterns are unnecessary and are easily removed. For object patterns, instead of parenthesizing the object pattern, parenthesize the whole assignment. No big deal!

// Assignments
([b]) = [3]; // BAD: parentheses around array pattern
[b] = [3]; // GOOD

({ p: d }) = { p: 2 }; // BAD: parentheses around object pattern
({ p: d } = { p: 2 }); // GOOD

One step forward for SpiderMonkey standards compliance!

11.12.14

Introducing the JavaScript Internationalization API

(also cross-posted on the Hacks blog — comment over there if you have anything to say)

Firefox 29 issued half a year ago, so this post is long overdue. Nevertheless I wanted to pause for a second to discuss the Internationalization API first shipped on desktop in that release (and passing all tests!). Norbert Lindenberg wrote most of the implementation, and I reviewed it and now maintain it. (Work by Makoto Kato should bring this to Android soon; b2g may take longer due to some b2g-specific hurdles. Stay tuned.)

What’s internationalization?

Internationalization (i18n for short — i, eighteen characters, n) is the process of writing applications in a way that allows them to be easily adapted for audiences from varied places, using varied languages. It’s easy to get this wrong by inadvertently assuming one’s users come from one place and speak one language, especially if you don’t even know you’ve made an assumption.

function formatDate(d)
{
  // Everyone uses month/date/year...right?
  var month = d.getMonth() + 1;
  var date = d.getDate();
  var year = d.getFullYear();
  return month + "/" + date + "/" + year;
}

function formatMoney(amount)
{
  // All money is dollars with two fractional digits...right?
  return "$" + amount.toFixed(2);
}

function sortNames(names)
{
  function sortAlphabetically(a, b)
  {
    var left = a.toLowerCase(), right = b.toLowerCase();
    if (left > right)
      return 1;
    if (left === right)
      return 0;
    return -1;
  }

  // Names always sort alphabetically...right?
  names.sort(sortAlphabetically);
}

JavaScript’s historical i18n support is poor

i18n-aware formatting in traditional JS used the various toLocaleString() methods. The resulting strings contained whatever details the implementation chose to provide: no way to pick and choose (did you need a weekday in that formatted date? is the year irrelevant?). Even if the proper details were included, the format might be wrong, e.g. decimal when percentage was desired. And you couldn’t choose a locale.

As for sorting, JS provided almost no useful locale-sensitive text-comparison (collation) functions. localeCompare() existed but with a very awkward interface unsuited for use with sort. And it too didn’t permit choosing a locale or specific sort order.

These limitations are bad enough that — this surprised me greatly when I learned it! — serious web applications that need i18n capabilities (most commonly, financial sites displaying currencies) will box up the data, send it to a server, have the server perform the operation, and send it back to the client. Server roundtrips just to format amounts of money. Yeesh.

A new JS Internationalization API

The new ECMAScript Internationalization API greatly improves JavaScript’s i18n capabilities. It provides all the flourishes one could want for formatting dates and numbers and sorting text. The locale is selectable, with fallback if the requested locale is unsupported. Formatting requests can specify the particular components to include. Custom formats for percentages, significant digits, and currencies are supported. Numerous collation options are exposed for use in sorting text. And if you care about performance, the up-front work to select a locale and process options can now be done once, instead of once every time a locale-dependent operation is performed.
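As a rough illustration, the three naive helpers shown earlier could be rewritten along these lines. The function names and their extra parameters are hypothetical, and the exact strings produced are up to the implementation.

```javascript
// Hypothetical locale-aware counterparts of formatDate, formatMoney
// and sortNames. `locales` is a language tag or an array of tags.
function formatDateFor(d, locales)
{
  var options = { year: "numeric", month: "numeric", day: "numeric" };
  return new Intl.DateTimeFormat(locales, options).format(d);
}

function formatMoneyFor(amount, locales, currency)
{
  var options = { style: "currency", currency: currency };
  return new Intl.NumberFormat(locales, options).format(amount);
}

function sortNamesFor(names, locales)
{
  // compare is a bound function, so it can be handed straight to sort.
  names.sort(new Intl.Collator(locales).compare);
}

var names = ["b", "A", "c"];
sortNamesFor(names, "en");
console.log(names.join(","));
console.log(formatMoneyFor(1234.5, "en-US", "USD"));
console.log(formatDateFor(new Date(0), "en-US"));
```

Creating a formatter on every call throws away the caching benefit mentioned above; real code would usually create each object once and reuse it.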

That said, the API is not a panacea. The API is “best effort” only. Precise outputs are almost always deliberately unspecified. An implementation could legally support only the oj locale, or it could ignore (almost all) provided formatting options. Most implementations will have high-quality support for many locales, but it’s not guaranteed (particularly on resource-constrained systems such as mobile).

Under the hood, Firefox’s implementation depends upon the International Components for Unicode library (ICU), which in turn depends upon the Unicode Common Locale Data Repository (CLDR) locale data set. Our implementation is self-hosted: most of the implementation atop ICU is written in JavaScript itself. We hit a few bumps along the way (we haven’t self-hosted anything this large before), but nothing major.

The Intl interface

The i18n API lives on the global Intl object. Intl contains three constructors: Intl.Collator, Intl.DateTimeFormat, and Intl.NumberFormat. Each constructor creates an object exposing the relevant operation, efficiently caching locale and options for the operation. Creating such an object follows this pattern:

var ctor = "Collator"; // or the others
var instance = new Intl[ctor](locales, options);

locales is a string specifying a single language tag or an arraylike object containing multiple language tags. Language tags are strings like en (English generally), de-AT (German as used in Austria), or zh-Hant-TW (Chinese as used in Taiwan, using the traditional Chinese script). Language tags can also include a “Unicode extension”, of the form -u-key1-value1-key2-value2..., where each key is an “extension key”. The various constructors interpret these specially.

options is an object whose properties (or their absence, by evaluating to undefined) determine how the formatter or collator behaves. Its exact interpretation is determined by the individual constructor.

Given locale information and options, the implementation will try to produce the closest behavior it can to the “ideal” behavior. Firefox supports 400+ locales for collation and 600+ locales for date/time and number formatting, so it’s very likely (but not guaranteed) the locales you might care about are supported.

Intl generally provides no guarantee of particular behavior. If the requested locale is unsupported, Intl allows best-effort behavior. Even if the locale is supported, behavior is not rigidly specified. Never assume that a particular set of options corresponds to a particular format. The phrasing of the overall format (encompassing all requested components) might vary across browsers, or even across browser versions. Individual components’ formats are unspecified: a short-format weekday might be “S”, “Sa”, or “Sat”. The Intl API isn’t intended to expose exactly specified behavior.
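Since behavior is best-effort, it can help to ask the implementation what it actually chose instead of assuming. A small sketch, using the standard supportedLocalesOf and resolvedOptions functions (the requested tags are just examples):

```javascript
var requested = ["de-AT", "en"];

// Which of the requested tags does this implementation support
// for date/time formatting? Unsupported tags are simply dropped.
var supported = Intl.DateTimeFormat.supportedLocalesOf(requested);

// Which locale was picked in the end? If none of the requested tags
// is supported, this is the implementation's default locale.
var chosen = new Intl.DateTimeFormat(requested).resolvedOptions().locale;

console.log(supported, chosen);
```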

Date/time formatting

Options

The primary options properties for date/time formatting are as follows:

weekday, era
"narrow", "short", or "long". (era refers to typically longer-than-year divisions in a calendar system: BC/AD, the current Japanese emperor’s reign, or others.)
month
"2-digit", "numeric", "narrow", "short", or "long"
year, day, hour, minute, second
"2-digit" or "numeric"
timeZoneName
"short" or "long"
timeZone
Case-insensitive "UTC" will format with respect to UTC. Values like "CEST" and "America/New_York" don’t have to be supported, and they don’t currently work in Firefox.

The values don’t map to particular formats: remember, the Intl API almost never specifies exact behavior. But the intent is that "narrow", "short", and "long" produce output of corresponding size — “S” or “Sa”, “Sat”, and “Saturday”, for example. (Output may be ambiguous: Saturday and Sunday both could produce “S”.) "2-digit" and "numeric" map to two-digit number strings or full-length numeric strings: “70” and “1970”, for example.

The final used options are largely the requested options. However, if you don’t specifically request any weekday/year/month/day/hour/minute/second, then year/month/day will be added to your provided options.
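This defaulting can be seen through resolvedOptions(). The helper below is a hypothetical sketch that lists which of the basic components ended up in use for a given options object.

```javascript
// Hypothetical helper: names of the date/time components the
// formatter actually resolved to, for the en-US locale.
function resolvedComponents(options)
{
  var resolved = new Intl.DateTimeFormat("en-US", options).resolvedOptions();
  var names = ["weekday", "year", "month", "day", "hour", "minute", "second"];
  return names.filter(function (name) {
    return resolved[name] !== undefined;
  });
}

// Nothing requested: year/month/day are filled in.
var byDefault = resolvedComponents({});

// A weekday was requested, so no defaults are added.
var weekdayOnly = resolvedComponents({ weekday: "long" });

console.log(byDefault.join(","), "|", weekdayOnly.join(","));
```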

Beyond these basic options are a few special options:

hour12
Specifies whether hours will be in 12-hour or 24-hour format. The default is typically locale-dependent. (Details such as whether midnight is zero-based or twelve-based and whether leading zeroes are present are also locale-dependent.)

There are also two special properties, localeMatcher (taking either "lookup" or "best fit") and formatMatcher (taking either "basic" or "best fit"), each defaulting to "best fit". These affect how the right locale and format are selected. The use cases for these are somewhat esoteric, so you should probably ignore them.

Locale-centric options

DateTimeFormat also allows formatting using customized calendaring and numbering systems. These details are effectively part of the locale, so they’re specified in the Unicode extension in the language tag.

For example, Thai as spoken in Thailand has the language tag th-TH. Recall that a Unicode extension has the format -u-key1-value1-key2-value2.... The calendaring system key is ca, and the numbering system key is nu. The Thai numbering system has the value thai, and the Chinese calendaring system has the value chinese. Thus to format dates in this overall manner, we tack a Unicode extension containing both these key/value pairs onto the end of the language tag: th-TH-u-ca-chinese-nu-thai.
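Whether an implementation honored the extension keys can again be checked with resolvedOptions(). This is only a sketch: an implementation without Thai locale data, or without these calendar and numbering systems, is free to fall back to something else.

```javascript
var thaiTag = "th-TH-u-ca-chinese-nu-thai";
var thaiResolved = new Intl.DateTimeFormat(thaiTag).resolvedOptions();

// When the extension is honored, calendar is "chinese" and
// numberingSystem is "thai".
console.log(thaiResolved.locale,
            thaiResolved.calendar,
            thaiResolved.numberingSystem);
```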

For more information on the various calendaring and numbering systems, see the full DateTimeFormat documentation.

Examples

After creating a DateTimeFormat object, the next step is to use it to format dates via the handy format() function. Conveniently, this function is a bound function: you don’t have to call it on the DateTimeFormat directly. Then provide it a timestamp or Date object.
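Because format is bound, it can be passed around as a bare callback. A minimal sketch (the timestamps are arbitrary):

```javascript
var timeOnly =
  new Intl.DateTimeFormat("en-US",
                          { hour: "numeric", minute: "numeric",
                            timeZone: "UTC" });

// Timestamps and Date objects are both accepted.
var stamps = [0, 3600000, new Date(7200000)];

// No .bind(timeOnly) or wrapper function is needed here.
var viaMap = stamps.map(timeOnly.format);

console.log(viaMap);
```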

Putting it all together, here are some examples of how to create DateTimeFormat options for particular uses, with current behavior in Firefox.

var msPerDay = 24 * 60 * 60 * 1000;

// July 17, 2014 00:00:00 UTC.
var july172014 = new Date(msPerDay * (44 * 365 + 11 + 197));

Let’s format a date for English as used in the United States. Let’s include two-digit month/day/year, plus two-digit hours/minutes, and a short time zone to clarify that time. (The result would obviously be different in another time zone.)

var options =
  { year: "2-digit", month: "2-digit", day: "2-digit",
    hour: "2-digit", minute: "2-digit",
    timeZoneName: "short" };
var americanDateTime =
  new Intl.DateTimeFormat("en-US", options).format;

print(americanDateTime(july172014)); // 07/16/14, 5:00 PM PDT

Or let’s do something similar for Portuguese — ideally as used in Brazil, but in a pinch Portugal works. Let’s go for a little longer format, with full year and spelled-out month, but make it UTC for portability.

var options =
  { year: "numeric", month: "long", day: "numeric",
    hour: "2-digit", minute: "2-digit",
    timeZoneName: "short", timeZone: "UTC" };
var portugueseTime =
  new Intl.DateTimeFormat(["pt-BR", "pt-PT"], options);

// 17 de julho de 2014 00:00 GMT
print(portugueseTime.format(july172014));

How about a compact, UTC-formatted weekly Swiss train schedule? We’ll try the official languages from most to least popular to choose the one that’s most likely to be readable.

var swissLocales = ["de-CH", "fr-CH", "it-CH", "rm-CH"];
var options =
  { weekday: "short",
    hour: "numeric", minute: "numeric",
    timeZone: "UTC", timeZoneName: "short" };
var swissTime =
  new Intl.DateTimeFormat(swissLocales, options).format;

print(swissTime(july172014)); // Do. 00:00 GMT

Or let’s try a date in descriptive text by a painting in a Japanese museum, using the Japanese calendar with year and era:

var jpYearEra =
  new Intl.DateTimeFormat("ja-JP-u-ca-japanese",
                          { year: "numeric", era: "long" });

print(jpYearEra.format(july172014)); // 平成26年

And for something completely different, a longer date for use in Thai as used in Thailand — but using the Thai numbering system and Chinese calendar. (Quality implementations such as Firefox’s would treat plain th-TH as th-TH-u-ca-buddhist-nu-latn, imputing Thailand’s typical Buddhist calendar system and Latin 0-9 numerals.)

var options =
  { year: "numeric", month: "long", day: "numeric" };
var thaiDate =
  new Intl.DateTimeFormat("th-TH-u-nu-thai-ca-chinese", options);

print(thaiDate.format(july172014)); // ๒๐ 6 ๓๑

Calendar and numbering system bits aside, it’s relatively simple. Just pick your components and their lengths.

Number formatting

Options

The primary options properties for number formatting are as follows:

style
"currency", "percent", or "decimal" (the default) to format a value of that kind.
currency
A three-letter currency code, e.g. USD or CHF. Required if style is "currency", otherwise meaningless.
currencyDisplay
"code", "symbol", or "name", defaulting to "symbol". "code" will use the three-letter currency code in the formatted string. "symbol" will use a currency symbol such as $ or £. "name" typically uses some sort of spelled-out version of the currency. (Firefox currently only supports "symbol", but this will be fixed soon.)
minimumIntegerDigits
An integer from 1 to 21 (inclusive), defaulting to 1. The resulting string is front-padded with zeroes until its integer component contains at least this many digits. (For example, if this value were 2, formatting 3 might produce “03”.)
minimumFractionDigits, maximumFractionDigits
Integers from 0 to 20 (inclusive). The resulting string will have at least minimumFractionDigits, and no more than maximumFractionDigits, fractional digits. The default minimum is currency-dependent (usually 2, rarely 0 or 3) if style is "currency", otherwise 0. The default maximum is 0 for percents, 3 for decimals, and currency-dependent for currencies.
minimumSignificantDigits, maximumSignificantDigits
Integers from 1 to 21 (inclusive). If present, these override the integer/fraction digit control above to determine the minimum/maximum significant figures in the formatted number string, as determined in concert with the number of decimal places required to accurately specify the number. (Note that in a multiple of 10 the significant digits may be ambiguous, as in “100” with its one, two, or three significant digits.)
useGrouping
Boolean (defaulting to true) determining whether the formatted string will contain grouping separators (e.g. “,” as English thousands separator).

NumberFormat also recognizes the esoteric, mostly ignorable localeMatcher property.
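Here is a short sketch of several of these options in the en-US locale. The fmt helper is hypothetical, and the results in the comments are what ICU-backed engines produce for en-US; the specification itself does not promise them.

```javascript
// Hypothetical helper: format `value` for en-US with the given options.
function fmt(options, value)
{
  return new Intl.NumberFormat("en-US", options).format(value);
}

var padded = fmt({ minimumIntegerDigits: 2 }, 3);             // "03"
var ungrouped = fmt({ useGrouping: false }, 1234567);         // "1234567"
var defaultDecimal = fmt({}, 1234567.8912);                   // "1,234,567.891"
var percent = fmt({ style: "percent" }, 0.438);               // "44%"
var threeSig = fmt({ maximumSignificantDigits: 3 }, 123456);  // "123,000"

console.log(padded, ungrouped, defaultDecimal, percent, threeSig);
```

The default decimal case shows the default maximum of three fractional digits, and the percent case shows the default maximum of zero.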

Locale-centric options

Just as DateTimeFormat supported custom numbering systems in the Unicode extension using the nu key, so too does NumberFormat. For example, the language tag for Chinese as used in China is zh-CN. The value for the Han decimal numbering system is hanidec. To format numbers for these systems, we tack a Unicode extension onto the language tag: zh-CN-u-nu-hanidec.

For complete information on specifying the various numbering systems, see the full NumberFormat documentation.

Examples

NumberFormat objects have a format function property just as DateTimeFormat objects do. And as there, the format function is a bound function that may be used in isolation from the NumberFormat.

Here are some examples of how to create NumberFormat options for particular uses, with Firefox’s behavior. First let’s format some money for use in Chinese as used in China, specifically using Han decimal numbers (instead of much more common Latin numbers). Select the "currency" style, then use the code for Chinese renminbi (yuan), grouping by default, with the usual number of fractional digits.

var hanDecimalRMBInChina =
  new Intl.NumberFormat("zh-CN-u-nu-hanidec",
                        { style: "currency", currency: "CNY" });

print(hanDecimalRMBInChina.format(1314.25)); // ¥ 一,三一四.二五

Or let’s format a United States-style gas price, with its peculiar thousandths-place 9, for use in English as used in the United States.

var gasPrice =
  new Intl.NumberFormat("en-US",
                        { style: "currency", currency: "USD",
                          minimumFractionDigits: 3 });

print(gasPrice.format(5.259)); // $5.259

Or let’s try a percentage in Arabic, meant for use in Egypt. Make sure the percentage has at least two fractional digits. (Note that this and all the other RTL examples may appear with different ordering in RTL context, e.g. ٤٣٫٨٠٪ instead of ٤٣٫٨٠٪.)

var arabicPercent =
  new Intl.NumberFormat("ar-EG",
                        { style: "percent",
                          minimumFractionDigits: 2 }).format;

print(arabicPercent(0.438)); // ٤٣٫٨٠٪

Or suppose we’re formatting for Persian as used in Afghanistan, and we want at least two integer digits and no more than two fractional digits.

var persianDecimal =
  new Intl.NumberFormat("fa-AF",
                        { minimumIntegerDigits: 2,
                          maximumFractionDigits: 2 });

print(persianDecimal.format(3.1416)); // ۰۳٫۱۴

Finally, let’s format an amount of Bahraini dinars, for Arabic as used in Bahrain. Unusually compared to most currencies, Bahraini dinars divide into thousandths (fils), so our number will have three places. (Again note that apparent visual ordering should be taken with a grain of salt.)

var bahrainiDinars =
  new Intl.NumberFormat("ar-BH",
                        { style: "currency", currency: "BHD" });

print(bahrainiDinars.format(3.17)); // د.ب.‏ ٣٫١٧٠

Collation

Options

The primary options properties for collation are as follows:

usage
"sort" or "search" (defaulting to "sort"), specifying the intended use of this Collator. (A search collator might want to consider more strings equivalent than a sort collator would.)
sensitivity
"base", "accent", "case", or "variant". This affects how sensitive the collator is to characters that have the same “base letter” but have different accents/diacritics and/or case. (Base letters are locale-dependent: “a” and “ä” have the same base letter in German but are different letters in Swedish.) "base" sensitivity considers only the base letter, ignoring modifications (so for German “a”, “A”, and “ä” are considered the same). "accent" considers the base letter and accents but ignores case (so for German “a” and “A” are the same, but “ä” differs from both). "case" considers the base letter and case but ignores accents (so for German “a” and “ä” are the same, but “A” differs from both). Finally, "variant" considers base letter, accents, and case (so for German “a”, “ä”, and “A” all differ). If usage is "sort", the default is "variant"; otherwise it’s locale-dependent.
numeric
Boolean (defaulting to false) determining whether complete numbers embedded in strings are considered when sorting. For example, numeric sorting might produce "F-4 Phantom II", "F-14 Tomcat", "F-35 Lightning II"; non-numeric sorting might produce "F-14 Tomcat", "F-35 Lightning II", "F-4 Phantom II".
caseFirst
"upper", "lower", or "false" (the default). Determines how case is considered when sorting: "upper" places uppercase letters first ("B", "a", "c"), "lower" places lowercase first ("a", "c", "B"), and "false" ignores case entirely ("a", "B", "c"). (Note: Firefox currently ignores this property.)
ignorePunctuation
Boolean (defaulting to false) determining whether to ignore embedded punctuation when performing the comparison (for example, so that "biweekly" and "bi-weekly" compare equivalent).

And there’s that localeMatcher property that you can probably ignore.
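
Here's a small sketch of the numeric and sensitivity options in action, reusing the aircraft names and the German/Swedish letters from the descriptions above. The variable names are mine, and console.log stands in for print.

```javascript
var planes = ["F-14 Tomcat", "F-35 Lightning II", "F-4 Phantom II"];

var numericOrder = new Intl.Collator("en-US", { numeric: true });
var plainOrder = new Intl.Collator("en-US");

// Copy before sorting, since sort works in place.
var byNumber = planes.slice().sort(numericOrder.compare);
var byCharacter = planes.slice().sort(plainOrder.compare);

console.log(byNumber.join(", "));    // F-4 Phantom II, F-14 Tomcat, F-35 Lightning II
console.log(byCharacter.join(", ")); // F-14 Tomcat, F-35 Lightning II, F-4 Phantom II

var germanBase = new Intl.Collator("de", { sensitivity: "base" });
var germanAccent = new Intl.Collator("de", { sensitivity: "accent" });
var swedishBase = new Intl.Collator("sv", { sensitivity: "base" });

console.log(germanBase.compare("a", "ä") === 0);   // true: same base letter in German
console.log(germanAccent.compare("a", "A") === 0); // true: case is ignored
console.log(germanAccent.compare("a", "ä") === 0); // false: accents matter
console.log(swedishBase.compare("a", "ä") === 0);  // false: different letters in Swedish
```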

Locale-centric options

The main Collator option specified as part of the locale’s Unicode extension is co, selecting the kind of sorting to perform: phone book (phonebk), dictionary (dict), and many others.

Additionally, the keys kn and kf may, optionally, duplicate the numeric and caseFirst properties of the options object. But they’re not guaranteed to be supported in the language tag, and options is much clearer than language tag components. So it’s best to only adjust these options through options.

These key-value pairs are included in the Unicode extension the same way they’ve been included for DateTimeFormat and NumberFormat; refer to those sections for how to specify these in a language tag.
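
If you're curious whether a particular implementation honors kn in the language tag, resolvedOptions() will tell you. This is only a sketch with an en-US locale I picked for illustration (console.log stands in for print); the options-object route is the one you can rely on.

```javascript
// Numeric collation requested through the options object.
var viaOptions = new Intl.Collator("en-US", { numeric: true });

// The same request made through the Unicode extension; support isn't guaranteed.
var viaTag = new Intl.Collator("en-US-u-kn-true");

var optionsHonored = viaOptions.resolvedOptions().numeric;
var tagHonored = viaTag.resolvedOptions().numeric;

console.log(optionsHonored); // true
console.log(tagHonored);     // true only where the kn key is supported
```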

Examples

Collator objects have a compare function property. This function accepts two arguments x and y and returns a number less than zero if x compares less than y, 0 if x compares equal to y, or a number greater than zero if x compares greater than y. As with the format functions, compare is a bound function that may be extracted for standalone use.
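
A minimal sketch of that contract, with sample words of my own choosing (console.log stands in for print). Only the sign of the result is meaningful, so Math.sign normalizes it here.

```javascript
var collator = new Intl.Collator("en-US");

var less = Math.sign(collator.compare("apple", "banana"));    // -1
var equal = collator.compare("apple", "apple");               // 0
var greater = Math.sign(collator.compare("banana", "apple")); // 1

// compare is bound, so it can be passed directly as a sort callback.
var sorted = ["pear", "apple", "banana"].sort(collator.compare);

console.log(less, equal, greater);
console.log(sorted.join(", ")); // apple, banana, pear
```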

Let’s try sorting a few German surnames, for use in German as used in Germany. There are actually two different sort orders in German, phonebook and dictionary. Phonebook sort emphasizes sound, and it’s as if “ä”, “ö”, and so on were expanded to “ae”, “oe”, and so on prior to sorting.

var names =
  ["Hochberg", "Hönigswald", "Holzman"];

var germanPhonebook = new Intl.Collator("de-DE-u-co-phonebk");

// as if sorting ["Hochberg", "Hoenigswald", "Holzman"]:
//   Hochberg, Hönigswald, Holzman
print(names.sort(germanPhonebook.compare).join(", "));

Some German words conjugate with extra umlauts, so in dictionaries it’s sensible to order ignoring umlauts (except when ordering words differing only by umlauts: schon before schön).

var germanDictionary = new Intl.Collator("de-DE-u-co-dict");

// as if sorting ["Hochberg", "Honigswald", "Holzman"]:
//   Hochberg, Holzman, Hönigswald
print(names.sort(germanDictionary.compare).join(", "));

Or let’s sort a list of Firefox versions with various typos (different capitalizations, random accents and diacritical marks, extra hyphenation), in English as used in the United States. We want to sort respecting version number, so do a numeric sort so that numbers in the strings are compared, not considered character-by-character.

var firefoxen =
  ["FireFøx 3.6",
   "Fire-fox 1.0",
   "Firefox 29",
   "FÍrefox 3.5",
   "Fírefox 18"];

var usVersion =
  new Intl.Collator("en-US",
                    { sensitivity: "base",
                      numeric: true,
                      ignorePunctuation: true });

// Fire-fox 1.0, FÍrefox 3.5, FireFøx 3.6, Fírefox 18, Firefox 29
print(firefoxen.sort(usVersion.compare).join(", "));

Last, let’s do some locale-aware string searching that ignores case and accents, again in English as used in the United States.

// Comparisons work with both composed and decomposed forms.
var decoratedBrowsers =
  [
   "A\u0362maya",  // A͢maya
   "CH\u035Brôme", // CH͛rôme
   "FirefÓx",
   "sAfàri",
   "o\u0323pERA",  // ọpERA
   "I\u0352E",     // I͒E
  ];

var fuzzySearch =
  new Intl.Collator("en-US",
                    { usage: "search", sensitivity: "base" });

function findBrowser(browser)
{
  function cmp(other)
  {
    return fuzzySearch.compare(browser, other) === 0;
  }
  return cmp;
}

print(decoratedBrowsers.findIndex(findBrowser("Firêfox"))); // 2
print(decoratedBrowsers.findIndex(findBrowser("Safåri")));  // 3
print(decoratedBrowsers.findIndex(findBrowser("Ãmaya")));   // 0
print(decoratedBrowsers.findIndex(findBrowser("Øpera")));   // 4
print(decoratedBrowsers.findIndex(findBrowser("Chromè")));  // 1
print(decoratedBrowsers.findIndex(findBrowser("IË")));      // 5

Odds and ends

It may be useful to determine whether support for some operation is provided for particular locales, or to determine whether a locale is supported. Intl provides supportedLocalesOf() functions on each constructor, and resolvedOptions() functions on each prototype, to expose this information.

var navajoLocales =
  Intl.Collator.supportedLocalesOf(["nv"], { usage: "sort" });
print(navajoLocales.length > 0
      ? "Navajo collation supported"
      : "Navajo collation not supported");

var germanFakeRegion =
  new Intl.DateTimeFormat("de-XX", { timeZone: "UTC" });
var usedOptions = germanFakeRegion.resolvedOptions();
print(usedOptions.locale);   // de
print(usedOptions.timeZone); // UTC

Legacy behavior

The ES5 toLocaleString-style and localeCompare functions previously had no particular semantics, accepted no particular options, and were largely useless. So the i18n API reformulates them in terms of Intl operations. Each method now accepts additional trailing locales and options arguments, interpreted just as the Intl constructors would do. (Except that for toLocaleTimeString and toLocaleDateString, different default components are used if options aren’t provided.)

For brief use where precise behavior doesn’t matter, the old methods are fine to use. But if you need more control or are formatting or comparing many times, it’s best to use the Intl primitives directly.
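
As a rough sketch of the reformulated legacy methods, with locales, options, and values I picked for illustration (console.log stands in for print):

```javascript
var currencyOptions = { style: "currency", currency: "USD" };

// The legacy method now takes the same trailing arguments as the constructor.
var viaMethod = (1234.5).toLocaleString("en-US", currencyOptions);
var viaIntl = new Intl.NumberFormat("en-US", currencyOptions).format(1234.5);

console.log(viaMethod);             // $1,234.50
console.log(viaMethod === viaIntl); // true

// localeCompare accepts Collator-style locales and options.
var sameBase = "a".localeCompare("ä", "de", { sensitivity: "base" });
console.log(sameBase); // 0

// With no components requested, toLocaleDateString defaults to date fields only.
var release = new Date(Date.UTC(2012, 11, 20));
var dateText = release.toLocaleDateString("en-US", { timeZone: "UTC" });
console.log(dateText); // 12/20/2012
```

Each legacy call behaves as if it built a fresh Intl object from its arguments, which is why reusing one Intl object is the better choice when formatting or comparing many values.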

Conclusion

Internationalization is a fascinating topic whose complexity is bounded only by the varied nature of human communication. The Internationalization API addresses a small but quite useful portion of that complexity, making it easier to produce locale-sensitive web applications. Go use it!

(And a special thanks to Norbert Lindenberg, Anas El Husseini, Simon Montagu, Gary Kwong, Shu-yu Guo, Ehsan Akhgari, the people of #mozilla.de, and anyone I may have forgotten [sorry!] who provided feedback on this article or assisted me in producing and critiquing the examples. The English and German examples were the limit of my knowledge, and I’d have been completely lost on the other examples without their assistance. Blame all remaining errors on me. Thanks again!)

(and to reiterate: comment on the Hacks post if you have anything to say)

03.12.14

Working on the JS engine, Episode V

From a stack trace for a crash:

20:12:01     INFO -   2  libxul.so!bool js::DependentAddPtr<js::HashSet<js::ReadBarriered<js::UnownedBaseShape*>, js::StackBaseShape, js::SystemAllocPolicy> >::add<JS::RootedGeneric<js::StackBaseShape*>, js::UnownedBaseShape*>(js::ExclusiveContext const*, js::HashSet<js::ReadBarriered<js::UnownedBaseShape*>, js::StackBaseShape, js::SystemAllocPolicy>&, JS::RootedGeneric<js::StackBaseShape*> const&, js::UnownedBaseShape* const&) [HashTable.h:3ba384952a02 : 372 + 0x4]

If you can figure out where in that mess the actual method name is without staring at this for at least 15 seconds, I salute you. (Note that when I saw this originally, it wasn’t line-wrapped, making it even less readable.)

I’m not sure how this could be presented better, given the depth and breadth of template use in the class, in the template parameters to that class, in the method, and in the method arguments here.
