11.12.14

Introducing the JavaScript Internationalization API

(also cross-posted on the Hacks blog — comment over there if you have anything to say)

Firefox 29 issued half a year ago, so this post is long overdue. Nevertheless I wanted to pause for a second to discuss the Internationalization API first shipped on desktop in that release (and passing all tests!). Norbert Lindenberg wrote most of the implementation, and I reviewed it and now maintain it. (Work by Makoto Kato should bring this to Android soon; b2g may take longer due to some b2g-specific hurdles. Stay tuned.)

What’s internationalization?

Internationalization (i18n for short — i, eighteen characters, n) is the process of writing applications in a way that allows them to be easily adapted for audiences from varied places, using varied languages. It’s easy to get this wrong by inadvertently assuming one’s users come from one place and speak one language, especially if you don’t even know you’ve made an assumption.

function formatDate(d)
{
  // Everyone uses month/date/year...right?
  var month = d.getMonth() + 1;
  var date = d.getDate();
  var year = d.getFullYear();
  return month + "/" + date + "/" + year;
}

function formatMoney(amount)
{
  // All money is dollars with two fractional digits...right?
  return "$" + amount.toFixed(2);
}

function sortNames(names)
{
  function sortAlphabetically(a, b)
  {
    var left = a.toLowerCase(), right = b.toLowerCase();
    if (left > right)
      return 1;
    if (left === right)
      return 0;
    return -1;
  }

  // Names always sort alphabetically...right?
  names.sort(sortAlphabetically);
}

JavaScript’s historical i18n support is poor

i18n-aware formatting in traditional JS uses the various toLocaleString() methods. The resulting strings contained whatever details the implementation chose to provide: no way to pick and choose (did you need a weekday in that formatted date? is the year irrelevant?). Even if the proper details were included, the format might be wrong e.g. decimal when percentage was desired. And you couldn’t choose a locale.

As for sorting, JS provided almost no useful locale-sensitive text-comparison (collation) functions. localeCompare() existed but with a very awkward interface unsuited for use with sort. And it too didn’t permit choosing a locale or specific sort order.

These limitations are bad enough that — this surprised me greatly when I learned it! — serious web applications that need i18n capabilities (most commonly, financial sites displaying currencies) will box up the data, send it to a server, have the server perform the operation, and send it back to the client. Server roundtrips just to format amounts of money. Yeesh.

A new JS Internationalization API

The new ECMAScript Internationalization API greatly improves JavaScript’s i18n capabilities. It provides all the flourishes one could want for formatting dates and numbers and sorting text. The locale is selectable, with fallback if the requested locale is unsupported. Formatting requests can specify the particular components to include. Custom formats for percentages, significant digits, and currencies are supported. Numerous collation options are exposed for use in sorting text. And if you care about performance, the up-front work to select a locale and process options can now be done once, instead of once every time a locale-dependent operation is performed.

That said, the API is not a panacea. The API is “best effort” only. Precise outputs are almost always deliberately unspecified. An implementation could legally support only the oj locale, or it could ignore (almost all) provided formatting options. Most implementations will have high-quality support for many locales, but it’s not guaranteed (particularly on resource-constrained systems such as mobile).

Under the hood, Firefox’s implementation depends upon the International Components for Unicode library (ICU), which in turn depends upon the Unicode Common Locale Data Repository (CLDR) locale data set. Our implementation is self-hosted: most of the implementation atop ICU is written in JavaScript itself. We hit a few bumps along the way (we haven’t self-hosted anything this large before), but nothing major.

The Intl interface

The i18n API lives on the global Intl object. Intl contains three constructors: Intl.Collator, Intl.DateTimeFormat, and Intl.NumberFormat. Each constructor creates an object exposing the relevant operation, efficiently caching locale and options for the operation. Creating such an object follows this pattern:

var ctor = "Collator"; // or the others
var instance = new Intl[ctor](locales, options);

locales is a string specifying a single language tag or an arraylike object containing multiple language tags. Language tags are strings like en (English generally), de-AT (German as used in Austria), or zh-Hant-TW (Chinese as used in Taiwan, using the traditional Chinese script). Language tags can also include a “Unicode extension”, of the form -u-key1-value1-key2-value2..., where each key is an “extension key”. The various constructors interpret these specially.

options is an object whose properties (or their absence, by evaluating to undefined) determine how the formatter or collator behaves. Its exact interpretation is determined by the individual constructor.

Given locale information and options, the implementation will try to produce the closest behavior it can to the “ideal” behavior. Firefox supports 400+ locales for collation and 600+ locales for date/time and number formatting, so it’s very likely (but not guaranteed) the locales you might care about are supported.

Intl generally provides no guarantee of particular behavior. If the requested locale is unsupported, Intl allows best-effort behavior. Even if the locale is supported, behavior is not rigidly specified. Never assume that a particular set of options corresponds to a particular format. The phrasing of the overall format (encompassing all requested components) might vary across browsers, or even across browser versions. Individual components’ formats are unspecified: a short-format weekday might be “S”, “Sa”, or “Sat”. The Intl API isn’t intended to expose exactly specified behavior.

Date/time formatting

Options

The primary options properties for date/time formatting are as follows:

weekday, era
"narrow", "short", or "long". (era refers to typically longer-than-year divisions in a calendar system: BC/AD, the current Japanese emperor’s reign, or others.)
month
"2-digit", "numeric", "narrow", "short", or "long"
year
day
hour, minute, second
"2-digit" or "numeric"
timeZoneName
"short" or "long"
timeZone
Case-insensitive "UTC" will format with respect to UTC. Values like "CEST" and "America/New_York" don’t have to be supported, and they don’t currently work in Firefox.

The values don’t map to particular formats: remember, the Intl API almost never specifies exact behavior. But the intent is that "narrow", "short", and "long" produce output of corresponding size — “S” or “Sa”, “Sat”, and “Saturday”, for example. (Output may be ambiguous: Saturday and Sunday both could produce “S”.) "2-digit" and "numeric" map to two-digit number strings or full-length numeric strings: “70” and “1970”, for example.

The final used options are largely the requested options. However, if you don’t specifically request any weekday/year/month/day/hour/minute/second, then year/month/day will be added to your provided options.

Beyond these basic options are a few special options:

hour12
Specifies whether hours will be in 12-hour or 24-hour format. The default is typically locale-dependent. (Details such as whether midnight is zero-based or twelve-based and whether leading zeroes are present are also locale-dependent.)

There are also two special properties, localeMatcher (taking either "lookup" or "best fit") and formatMatcher (taking either "basic" or "best fit"), each defaulting to "best fit". These affect how the right locale and format are selected. The use cases for these are somewhat esoteric, so you should probably ignore them.

Locale-centric options

DateTimeFormat also allows formatting using customized calendaring and numbering systems. These details are effectively part of the locale, so they’re specified in the Unicode extension in the language tag.

For example, Thai as spoken in Thailand has the language tag th-TH. Recall that a Unicode extension has the format -u-key1-value1-key2-value2.... The calendaring system key is ca, and the numbering system key is nu. The Thai numbering system has the value thai, and the Chinese calendaring system has the value chinese. Thus to format dates in this overall manner, we tack a Unicode extension containing both these key/value pairs onto the end of the language tag: th-TH-u-ca-chinese-nu-thai.

For more information on the various calendaring and numbering systems, see the full DateTimeFormat documentation.

Examples

After creating a DateTimeFormat object, the next step is to use it to format dates via the handy format() function. Conveniently, this function is a bound function: you don’t have to call it on the DateTimeFormat directly. Then provide it a timestamp or Date object.

Putting it all together, here are some examples of how to create DateTimeFormat options for particular uses, with current behavior in Firefox.

var msPerDay = 24 * 60 * 60 * 1000;

// July 17, 2014 00:00:00 UTC.
var july172014 = new Date(msPerDay * (44 * 365 + 11 + 197));

Let’s format a date for English as used in the United States. Let’s include two-digit month/day/year, plus two-digit hours/minutes, and a short time zone to clarify that time. (The result would obviously be different in another time zone.)

var options =
  { year: "2-digit", month: "2-digit", day: "2-digit",
    hour: "2-digit", minute: "2-digit",
    timeZoneName: "short" };
var americanDateTime =
  new Intl.DateTimeFormat("en-US", options).format;

print(americanDateTime(july172014)); // 07/16/14, 5:00 PM PDT

Or let’s do something similar for Portuguese — ideally as used in Brazil, but in a pinch Portugal works. Let’s go for a little longer format, with full year and spelled-out month, but make it UTC for portability.

var options =
  { year: "numeric", month: "long", day: "numeric",
    hour: "2-digit", minute: "2-digit",
    timeZoneName: "short", timeZone: "UTC" };
var portugueseTime =
  new Intl.DateTimeFormat(["pt-BR", "pt-PT"], options);

// 17 de julho de 2014 00:00 GMT
print(portugueseTime.format(july172014));

How about a compact, UTC-formatted weekly Swiss train schedule? We’ll try the official languages from most to least popular to choose the one that’s most likely to be readable.

var swissLocales = ["de-CH", "fr-CH", "it-CH", "rm-CH"];
var options =
  { weekday: "short",
    hour: "numeric", minute: "numeric",
    timeZone: "UTC", timeZoneName: "short" };
var swissTime =
  new Intl.DateTimeFormat(swissLocales, options).format;

print(swissTime(july172014)); // Do. 00:00 GMT

Or let’s try a date in descriptive text by a painting in a Japanese museum, using the Japanese calendar with year and era:

var jpYearEra =
  new Intl.DateTimeFormat("ja-JP-u-ca-japanese",
                          { year: "numeric", era: "long" });

print(jpYearEra.format(july172014)); // 平成26年

And for something completely different, a longer date for use in Thai as used in Thailand — but using the Thai numbering system and Chinese calendar. (Quality implementations such as Firefox’s would treat plain th-TH as th-TH-u-ca-buddhist-nu-latn, imputing Thailand’s typical Buddhist calendar system and Latin 0-9 numerals.)

var options =
  { year: "numeric", month: "long", day: "numeric" };
var thaiDate =
  new Intl.DateTimeFormat("th-TH-u-nu-thai-ca-chinese", options);

print(thaiDate.format(july172014)); // ๒๐ 6 ๓๑

Calendar and numbering system bits aside, it’s relatively simple. Just pick your components and their lengths.

Number formatting

Options

The primary options properties for number formatting are as follows:

style
"currency", "percent", or "decimal" (the default) to format a value of that kind.
currency
A three-letter currency code, e.g. USD or CHF. Required if style is "currency", otherwise meaningless.
currencyDisplay
"code", "symbol", or "name", defaulting to "symbol". "code" will use the three-letter currency code in the formatted string. "symbol" will use a currency symbol such as $ or £. "name" typically uses some sort of spelled-out version of the currency. (Firefox currently only supports "symbol", but this will be fixed soon.)
minimumIntegerDigits
An integer from 1 to 21 (inclusive), defaulting to 1. The resulting string is front-padded with zeroes until its integer component contains at least this many digits. (For example, if this value were 2, formatting 3 might produce “03”.)
minimumFractionDigits, maximumFractionDigits
Integers from 0 to 20 (inclusive). The resulting string will have at least minimumFractionDigits, and no more than maximumFractionDigits, fractional digits. The default minimum is currency-dependent (usually 2, rarely 0 or 3) if style is "currency", otherwise 0. The default maximum is 0 for percents, 3 for decimals, and currency-dependent for currencies.
minimumSignificantDigits, maximumSignificantDigits
Integers from 1 to 21 (inclusive). If present, these override the integer/fraction digit control above to determine the minimum/maximum significant figures in the formatted number string, as determined in concert with the number of decimal places required to accurately specify the number. (Note that in a multiple of 10 the significant digits may be ambiguous, as in “100” with its one, two, or three significant digits.)
useGrouping
Boolean (defaulting to true) determining whether the formatted string will contain grouping separators (e.g. “,” as English thousands separator).

NumberFormat also recognizes the esoteric, mostly ignorable localeMatcher property.

Locale-centric options

Just as DateTimeFormat supported custom numbering systems in the Unicode extension using the nu key, so too does NumberFormat. For example, the language tag for Chinese as used in China is zh-CN. The value for the Han decimal numbering system is hanidec. To format numbers for these systems, we tack a Unicode extension onto the language tag: zh-CN-u-nu-hanidec.

For complete information on specifying the various numbering systems, see the full NumberFormat documentation.

Examples

NumberFormat objects have a format function property just as DateTimeFormat objects do. And as there, the format function is a bound function that may be used in isolation from the NumberFormat.

Here are some examples of how to create NumberFormat options for particular uses, with Firefox’s behavior. First let’s format some money for use in Chinese as used in China, specifically using Han decimal numbers (instead of much more common Latin numbers). Select the "currency" style, then use the code for Chinese renminbi (yuan), grouping by default, with the usual number of fractional digits.

var hanDecimalRMBInChina =
  new Intl.NumberFormat("zh-CN-u-nu-hanidec",
                        { style: "currency", currency: "CNY" });

print(hanDecimalRMBInChina.format(1314.25)); // ¥ 一,三一四.二五

Or let’s format a United States-style gas price, with its peculiar thousandths-place 9, for use in English as used in the United States.

var gasPrice =
  new Intl.NumberFormat("en-US",
                        { style: "currency", currency: "USD",
                          minimumFractionDigits: 3 });

print(gasPrice.format(5.259)); // $5.259

Or let’s try a percentage in Arabic, meant for use in Egypt. Make sure the percentage has at least two fractional digits. (Note that this and all the other RTL examples may appear with different ordering in RTL context, e.g. ٤٣٫٨٠٪ instead of ٤٣٫٨٠٪.)

var arabicPercent =
  new Intl.NumberFormat("ar-EG",
                        { style: "percent",
                          minimumFractionDigits: 2 }).format;

print(arabicPercent(0.438)); // ٤٣٫٨٠٪

Or suppose we’re formatting for Persian as used in Afghanistan, and we want at least two integer digits and no more than two fractional digits.

var persianDecimal =
  new Intl.NumberFormat("fa-AF",
                        { minimumIntegerDigits: 2,
                          maximumFractionDigits: 2 });

print(persianDecimal.format(3.1416)); // ۰۳٫۱۴

Finally, let’s format an amount of Bahraini dinars, for Arabic as used in Bahrain. Unusually compared to most currencies, Bahraini dinars divide into thousandths (fils), so our number will have three places. (Again note that apparent visual ordering should be taken with a grain of salt.)

var bahrainiDinars =
  new Intl.NumberFormat("ar-BH",
                        { style: "currency", currency: "BHD" });

print(bahrainiDinars.format(3.17)); // د.ب.‏ ٣٫١٧٠

Collation

Options

The primary options properties for collation are as follows:

usage
"sort" or "search" (defaulting to "sort"), specifying the intended use of this Collator. (A search collator might want to consider more strings equivalent than a sort collator would.)
sensitivity
"base", "accent", "case", or "variant". This affects how sensitive the collator is to characters that have the same “base letter” but have different accents/diacritics and/or case. (Base letters are locale-dependent: “a” and “ä” have the same base letter in German but are different letters in Swedish.) "base" sensitivity considers only the base letter, ignoring modifications (so for German “a”, “A”, and “ä” are considered the same). "accent" considers the base letter and accents but ignores case (so for German “a” and “A” are the same, but “ä” differs from both). "case" considers the base letter and case but ignores accents (so for German “a” and “ä” are the same, but “A” differs from both). Finally, "variant" considers base letter, accents, and case (so for German “a”, “ä, “ä” and “A” all differ). If usage is "sort", the default is "variant"; otherwise it’s locale-dependent.
numeric
Boolean (defaulting to false) determining whether complete numbers embedded in strings are considered when sorting. For example, numeric sorting might produce "F-4 Phantom II", "F-14 Tomcat", "F-35 Lightning II"; non-numeric sorting might produce "F-14 Tomcat", "F-35 Lightning II", "F-4 Phantom II".
caseFirst
"upper", "lower", or "false" (the default). Determines how case is considered when sorting: "upper" places uppercase letters first ("B", "a", "c"), "lower" places lowercase first ("a", "c", "B"), and "false" ignores case entirely ("a", "B", "c"). (Note: Firefox currently ignores this property.)
ignorePunctuation
Boolean (defaulting to false) determining whether to ignore embedded punctuation when performing the comparison (for example, so that "biweekly" and "bi-weekly" compare equivalent).

And there’s that localeMatcher property that you can probably ignore.

Locale-centric options

The main Collator option specified as part of the locale’s Unicode extension is co, selecting the kind of sorting to perform: phone book (phonebk), dictionary (dict), and many others.

Additionally, the keys kn and kf may, optionally, duplicate the numeric and caseFirst properties of the options object. But they’re not guaranteed to be supported in the language tag, and options is much clearer than language tag components. So it’s best to only adjust these options through options.

These key-value pairs are included in the Unicode extension the same way they’ve been included for DateTimeFormat and NumberFormat; refer to those sections for how to specify these in a language tag.

Examples

Collator objects have a compare function property. This function accepts two arguments x and y and returns a number less than zero if x compares less than y, 0 if x compares equal to y, or a number greater than zero if x compares greater than y. As with the format functions, compare is a bound function that may be extracted for standalone use.

Let’s try sorting a few German surnames, for use in German as used in Germany. There are actually two different sort orders in German, phonebook and dictionary. Phonebook sort emphasizes sound, and it’s as if “ä”, “ö”, and so on were expanded to “ae”, “oe”, and so on prior to sorting.

var names =
  ["Hochberg", "Hönigswald", "Holzman"];

var germanPhonebook = new Intl.Collator("de-DE-u-co-phonebk");

// as if sorting ["Hochberg", "Hoenigswald", "Holzman"]:
//   Hochberg, Hönigswald, Holzman
print(names.sort(germanPhonebook.compare).join(", "));

Some German words conjugate with extra umlauts, so in dictionaries it’s sensible to order ignoring umlauts (except when ordering words differing only by umlauts: schon before schön).

var germanDictionary = new Intl.Collator("de-DE-u-co-dict");

// as if sorting ["Hochberg", "Honigswald", "Holzman"]:
//   Hochberg, Holzman, Hönigswald
print(names.sort(germanDictionary.compare).join(", "));

Or let’s sort a list Firefox versions with various typos (different capitalizations, random accents and diacritical marks, extra hyphenation), in English as used in the United States. We want to sort respecting version number, so do a numeric sort so that numbers in the strings are compared, not considered character-by-character.

var firefoxen =
  ["FireFøx 3.6",
   "Fire-fox 1.0",
   "Firefox 29",
   "FÍrefox 3.5",
   "Fírefox 18"];

var usVersion =
  new Intl.Collator("en-US",
                    { sensitivity: "base",
                      numeric: true,
                      ignorePunctuation: true });

// Fire-fox 1.0, FÍrefox 3.5, FireFøx 3.6, Fírefox 18, Firefox 29
print(firefoxen.sort(usVersion.compare).join(", "));

Last, let’s do some locale-aware string searching that ignores case and accents, again in English as used in the United States.

// Comparisons work with both composed and decomposed forms.
var decoratedBrowsers =
  [
   "A\u0362maya",  // A͢maya
   "CH\u035Brôme", // CH͛rôme
   "FirefÓx",
   "sAfàri",
   "o\u0323pERA",  // ọpERA
   "I\u0352E",     // I͒E
  ];

var fuzzySearch =
  new Intl.Collator("en-US",
                    { usage: "search", sensitivity: "base" });

function findBrowser(browser)
{
  function cmp(other)
  {
    return fuzzySearch.compare(browser, other) === 0;
  }
  return cmp;
}

print(decoratedBrowsers.findIndex(findBrowser("Firêfox"))); // 2
print(decoratedBrowsers.findIndex(findBrowser("Safåri")));  // 3
print(decoratedBrowsers.findIndex(findBrowser("Ãmaya")));   // 0
print(decoratedBrowsers.findIndex(findBrowser("Øpera")));   // 4
print(decoratedBrowsers.findIndex(findBrowser("Chromè")));  // 1
print(decoratedBrowsers.findIndex(findBrowser("IË")));      // 5

Odds and ends

It may be useful to determine whether support for some operation is provided for particular locales, or to determine whether a locale is supported. Intl provides supportedLocales() functions on each constructor, and resolvedOptions() functions on each prototype, to expose this information.

var navajoLocales =
  Intl.Collator.supportedLocalesOf(["nv"], { usage: "sort" });
print(navajoLocales.length > 0
      ? "Navajo collation supported"
      : "Navajo collation not supported");

var germanFakeRegion =
  new Intl.DateTimeFormat("de-XX", { timeZone: "UTC" });
var usedOptions = germanFakeRegion.resolvedOptions();
print(usedOptions.locale);   // de
print(usedOptions.timeZone); // UTC

Legacy behavior

The ES5 toLocaleString-style and localeCompare functions previously had no particular semantics, accepted no particular options, and were largely useless. So the i18n API reformulates them in terms of Intl operations. Each method now accepts additional trailing locales and options arguments, interpreted just as the Intl constructors would do. (Except that for toLocaleTimeString and toLocaleDateString, different default components are used if options aren’t provided.)

For brief use where precise behavior doesn’t matter, the old methods are fine to use. But if you need more control or are formatting or comparing many times, it’s best to use the Intl primitives directly.

Conclusion

Internationalization is a fascinating topic whose complexity is bounded only by the varied nature of human communication. The Internationalization API addresses a small but quite useful portion of that complexity, making it easier to produce locale-sensitive web applications. Go use it!

(And a special thanks to Norbert Lindenberg, Anas El Husseini, Simon Montagu, Gary Kwong, Shu-yu Guo, Ehsan Akhgari, the people of #mozilla.de, and anyone I may have forgotten [sorry!] who provided feedback on this article or assisted me in producing and critiquing the examples. The English and German examples were the limit of my knowledge, and I’d have been completely lost on the other examples without their assistance. Blame all remaining errors on me. Thanks again!)

(and to reiterate: comment on the Hacks post if you have anything to say)

03.12.14

Working on the JS engine, Episode V

From a stack trace for a crash:

20:12:01     INFO -   2  libxul.so!bool js::DependentAddPtr<js::HashSet<js::ReadBarriered<js::UnownedBaseShape*>, js::StackBaseShape, js::SystemAllocPolicy> >::add<JS::RootedGeneric<js::StackBaseShape*>, js::UnownedBaseShape*>(js::ExclusiveContext const*, js::HashSet<js::ReadBarriered<js::UnownedBaseShape*>, js::StackBaseShape, js::SystemAllocPolicy>&, JS::RootedGeneric<js::StackBaseShape*> const&, js::UnownedBaseShape* const&) [HashTable.h:3ba384952a02 : 372 + 0x4]

If you can figure out where in that mess the actual method name is without staring at this for at least 15 seconds, I salute you. (Note that when I saw this originally, it wasn’t line-wrapped, making it even less readable.)

I’m not sure how this could be presented better, given the depth and breadth of template use in the class, in the template parameters to that class, in the method, and in the method arguments here.

13.11.14

Ten years

It was summer 2002, and I was on a bike tour across Michigan. For downtime reading I’d brought a bunch of unread magazines. One of the magazines I brought was a semi-recent PC Magazine with a review of the various browsers of the time. The major browsers were the focus, but a sidebar mentioned the Mozilla Suite and noted its being open source and open to downloads and contributions from anyone. It sounded unique and piqued my interest, so I filed the information away for future reference.

Sometime after I got home I downloaded a version of the Mozilla Suite: good, certainly better than Internet Explorer, but ponderous in its UI. I recommended it to a few people, but because of the UI somewhat half-heartedly, in an “if you can tolerate the UI, it’s better” sort of sense. Somehow I stumbled into downloading betas and later nightlies, and I began reading and triaging bugs, even reporting a few bugs (duplicates!) of my own.

Sometime later, probably through the MozillaZine default bookmark, I learned about Phoenix, the then-current name of the browser whose ultimate name would be Firefox. (Very shortly after this it acquired the Firebird name, which stuck for only a couple releases until the ultimate rename.) It seemed to do everything the Suite did (or at least everything I cared about) without the horrible UI. I began recommending it unreservedly to people, surreptitiously installing it on high school computers, and so on.

One notable lack in Firebird of the time was its lack of help documentation. Firebird was good stuff. I wanted to see it succeed. I could fix this. So I began contributing to the Firebird Help project that wrote built-in help documentation for the browser. At the time this was an external project whose contents were occasionally imported into the main tree. (I believe it later moved directly into the tree, although I’m not certain. Ultimately the entire system was replaced with fully-online documentation, which fixed a whole bunch of problems around ease of contribution — not least that I’d lost time to contribute to that particular aspect, mostly having moved onto other things in the project.) Thus began the start of several years of work writing help documentation describing various parts of the UI, including a late-breaking October 2004 weekend spent documenting the new preferences UI in 1.0 — in just before the buzzer!

I observed release day from a distance in my dorm room, but Air Mozilla made that experience more immediate than it might have been. (218 users on the IRC channel! How times have changed. Our main developer channel as I write this contains 448 people, and that seems pretty typical.) Air Mozilla wasn’t nearly as polished or streamlined as it is now. Varying-quality Creative Commons music as interludes between interviews, good times. But it was a start.

Air Mozilla on the Firefox 1.0 launch day (Ogg Vorbis). My first inclination is to say this was recorded by people on the ground in the release itself, and I downloaded it later. But on second thought, I can’t be certain I didn’t stream-capture live using VLC.

Ten years (and two internships, one proto-summit and two summits, and fulltime employment) later, I think it’s both surprising and unsurprising just how far Mozilla and Firefox have come. Surprising, in that entering a market where the main competitor has 95% of the market usually isn’t a winning strategy. (But you can’t win if you don’t try.) Yet also unsurprising, as Internet Explorer was really bad compared to its competition. (When released, IE6 was a really good browser, just as Tinderbox [now doubly-replaced] was once a really good continuous-integration system. But it’s not enough to be good at one instant.) And a serendipitously-timed wave of IE security vulnerabilities over summer 2004 helped, too. 🙂

Here’s to another ten years and a new round of challenges.

08.09.14

Quote of the day

Tags: , , , , — Jeff @ 15:56

Snipped from irrelevant context:

<jorendorff> In this case I see nearby code asserting that IsCompiled() is true, so I think I have it right

Assertions do more than point out mistakes in code. They also document that code’s intended behavior, permitting faster iteration and modification to that code by future users. Assertions are often more valuable as documentation, than they are as a means to detect bugs. (Although not always. *eyes fuzzers beadily*)

So don’t just assert the tricky requirements: assert the more-obvious ones, too. You may save the next person changing the code (and the person reviewing it, who could be you!) a lot of time.

31.07.14

mfbt now has UniquePtr and MakeUnique for managing singly-owned resources

Managing dynamic memory allocations in C++

C++ supports dynamic allocation of objects using new. For new objects to not leak, they must be deleted. This is quite difficult to do correctly in complex code. Smart pointers are the canonical solution. Mozilla has historically used nsAutoPtr, and C++98 provided std::auto_ptr, to manage singly-owned new objects. But nsAutoPtr and std::auto_ptr have a bug: they can be “copied.”

The following code allocates an int. When is that int destroyed? Does destroying ptr1 or ptr2 handle the task? What does ptr1 contain after ptr2‘s gone out of scope?

typedef auto_ptr<int> auto_int;
{
  auto_int ptr1(new int(17));
  {
    auto_int ptr2 = ptr1;
    // destroy ptr2
  }
  // destroy ptr1
}

Copying or assigning an auto_ptr implicitly moves the new object, mutating the input. When ptr2 = ptr1 happens, ptr1 is set to nullptr and ptr2 has a pointer to the allocated int. When ptr2 goes out of scope, it destroys the allocated int. ptr1 is nullptr when it goes out of scope, so destroying it does nothing.

Fixing auto_ptr

Implicit-move semantics are safe but very unclear. And because these operations mutate their input, they can’t take a const reference. For example, auto_ptr has an auto_ptr::auto_ptr(auto_ptr&) constructor but not an auto_ptr::auto_ptr(const auto_ptr&) copy constructor. This breaks algorithms requiring copyability.

We can solve these problems with a smart pointer that prohibits copying/assignment unless the input is a temporary value. (C++11 calls these rvalue references, but I’ll use “temporary value” for readability.) If the input’s a temporary value, we can move the resource out of it without disrupting anyone else’s view of it: as a temporary it’ll die before anyone could observe it. (The rvalue reference concept is incredibly subtle. Read that article series a dozen times, and maybe you’ll understand half of it. I’ve spent multiple full days digesting it and still won’t claim full understanding.)

Presenting mozilla::UniquePtr

I’ve implemented mozilla::UniquePtr in #include "mozilla/UniquePtr.h" to fit the bill. It’s based on C++11’s std::unique_ptr (not always available right now). UniquePtr provides auto_ptr‘s safety while providing movability but not copyability.

UniquePtr template parameters

Using UniquePtr requires the type being owned and what will ultimately be done to generically delete it. The type is the first template argument; the deleter is the (optional) second. The default deleter performs delete for non-array types and delete[] for array types. (This latter improves upon auto_ptr and nsAutoPtr [and the derivative nsAutoArrayPtr], which fail horribly when used with new[].)

UniquePtr<int> i1(new int(8));
UniquePtr<int[]> arr1(new int[17]());

Deleters are callable values, that are called whenever a UniquePtr‘s object should be destroyed. If a custom deleter is used, it’s a really good idea for it to be empty (per mozilla::IsEmpty<D>) so that UniquePtr<T, D> is as space-efficient as a raw pointer.

struct FreePolicy
{
  void operator()(void* ptr) {
    free(ptr);
  }
};

{
  void* m = malloc(4096);
  UniquePtr<void, FreePolicy> mem(m);
  int* i = static_cast<int*>(malloc(sizeof(int)));
  UniquePtr<int, FreePolicy> integer(i);

  // integer.getDeleter()(i) is called
  // mem.getDeleter()(m) is called
}

Basic UniquePtr construction and assignment

As you’d expect, no-argument construction initializes to nullptr, a single pointer initializes to that pointer, and a pointer and a deleter initialize embedded pointer and deleter both.

UniquePtr<int> i1;
assert(i1 == nullptr);
UniquePtr<int> i2(new int(8));
assert(i2 != nullptr);
UniquePtr<int, FreePolicy> i3(nullptr, FreePolicy());

Move construction and assignment

All remaining constructors and assignment operators accept only nullptr or compatible, temporary UniquePtr values. These values have well-defined ownership, in marked contrast to raw pointers.

class B
{
    int i;

  public:
    B(int i) : i(i) {}
    virtual ~B() {} // virtual required so delete (B*)(pointer to D) calls ~D()
};

class D : public B
{
  public:
    D(int i) : B(i) {}
};

UniquePtr<B> MakeB(int i)
{
  typedef UniquePtr<B>::DeleterType BDeleter;

  // OK to convert UniquePtr<D, BDeleter> to UniquePtr<B>:
  // Note: For UniquePtr interconversion, both pointer and deleter
  //       types must be compatible!  Thus BDeleter here.
  return UniquePtr<D, BDeleter>(new D(i));
}

UniquePtr<B> b1(MakeB(66)); // OK: temporary value moved into b1

UniquePtr<B> b2(b1); // ERROR: b1 not a temporary, would confuse
                     // single ownership, forbidden

UniquePtr<B> b3;

b3 = b1;  // ERROR: b1 not a temporary, would confuse
          // single ownership, forbidden

b3 = MakeB(76); // OK: return value moved into b3
b3 = nullptr;   // OK: can't confuse ownership of nullptr

What if you really do want to move a resource from one UniquePtr to another? You can explicitly request a move using mozilla::Move() from #include "mozilla/Move.h".

int* i = new int(37);
UniquePtr<int> i1(i);

UniquePtr<int> i2(Move(i1));
assert(i1 == nullptr);
assert(i2.get() == i);

i1 = Move(i2);
assert(i1.get() == i);
assert(i2 == nullptr);

Move transforms the type of its argument into a temporary value type. Move doesn’t have any effects of its own. Rather, it’s the job of users such as UniquePtr to ascribe special semantics to operations accepting temporary values. (If no special semantics are provided, temporary values match only const reference types as in C++98.)

Observing a UniquePtr‘s value

The dereferencing operators (-> and *) and conversion to bool behave as expected for any smart pointer. The raw pointer value can be accessed using get() if absolutely needed. (This should be uncommon, as the only pointer to the resource should live in the UniquePtr.) UniquePtr may also be compared against nullptr (but not against raw pointers).

int* i = new int(8);
UniquePtr<int> p(i);
if (p)
  *p = 42;
assert(p != nullptr);
assert(p.get() == i);
assert(*p == 42);

Changing a UniquePtr‘s value

Three mutation methods beyond assignment are available. A UniquePtr may be reset() to a raw pointer or to nullptr. The raw pointer may be extracted, and the UniquePtr cleared, using release(). Finally, UniquePtrs may be swapped.

int* i = new int(42);
int* i2;
UniquePtr<int> i3, i4;
{
  UniquePtr<int> integer(i);
  assert(i == integer.get());

  i2 = integer.release();
  assert(integer == nullptr);

  integer.reset(i2);
  assert(integer.get() == i2);

  integer.reset(new int(93)); // deletes i2

  i3 = Move(integer); // better than release()

  i3.swap(i4);
  Swap(i3, i4); // mozilla::Swap, that is
}

When a UniquePtr loses ownership of its resource, the embedded deleter will dispose of the managed pointer, in accord with the single-ownership concept. release() is the sole exception: it clears the UniquePtr and returns the raw pointer previously in it, without calling the deleter. This is a somewhat dangerous idiom. (Mozilla’s smart pointers typically call this forget(), and WebKit’s WTF calls this leak(). UniquePtr uses release() only for consistency with unique_ptr.) It’s generally much better to make the user take a UniquePtr, then transfer ownership using Move().

Array fillips

UniquePtr<T> and UniquePtr<T[]> share the same interface, with a few substantial differences. UniquePtr<T[]> defines an operator[] to permit indexing. As mentioned earlier, UniquePtr<T[]> by default will delete[] its resource, rather than delete it. As a corollary, UniquePtr<T[]> requires an exact type match when constructed or mutated using a pointer. (It’s an error to delete[] an array through a pointer to the wrong array element type, because delete[] has to know the element size to destruct each element. Not accepting other pointer types thus eliminates this class of errors.)

struct B {};
struct D : B {};
UniquePtr<B[]> bs;
// bs.reset(new D[17]()); // ERROR: requires B*, not D*
bs.reset(new B[5]());
bs[1] = B();

And a mozilla::MakeUnique helper function

Typing out new T every time a UniquePtr is created or initialized can get old. We’ve added a helper function, MakeUnique<T>, that combines new object (or array) creation with creation of a corresponding UniquePtr. The nice thing about MakeUnique is that it’s in some sense foolproof: if you only create new objects in UniquePtrs, you can’t leak or double-delete unless you leak the UniquePtr‘s owner, misuse a get(), or drop the result of release() on the floor. I recommend always using MakeUnique instead of new for single-ownership objects.

struct S { S(int i, double d) {} };

UniquePtr<S> s1 = MakeUnique<S>(17, 42.0);   // new S(17, 42.0)
UniquePtr<int> i1 = MakeUnique<int>(42);     // new int(42)
UniquePtr<int[]> i2 = MakeUnique<int[]>(17); // new int[17]()


// Given familiarity with UniquePtr, these work particularly
// well with C++11 auto: just recognize MakeUnique means new,
// T means single object, and T[] means array.
auto s2 = MakeUnique<S>(17, 42.0); // new S(17, 42.0)
auto i3 = MakeUnique<int>(42);     // new int(42)
auto i4 = MakeUnique<int[]>(17);   // new int[17]()

MakeUnique<T>(...args) computes new T(...args). MakeUnique of an array takes an array length and constructs the correspondingly-sized array.

In the long run we probably should expect everyone to recognize the MakeUnique idiom so that we can use auto here and cut down on redundant typing. In the short run, feel free to do whichever you prefer.

Update: Beware! Due to compiler limitations affecting gcc less than 4.6, passing literal nullptr as an argument to a MakeUnique call will fail to compile only on b2g-ics. Everywhere else will pass. You have been warned. The only alternative I can think of is to pass static_cast<T*>(nullptr) instead, or assign to a local variable and pass that instead. Love that b2g compiler!

Conclusion

UniquePtr was a free-time hacking project last Christmas week, that I mostly finished but ran out of steam on when work resumed. Only recently have I found time to finish it up and land it, yet we already have a couple hundred uses of it and MakeUnique. Please add more uses, and make our existing new code safer!

A final note: please use UniquePtr instead of mozilla::Scoped. UniquePtr is more standard, better-tested, and better-documented (particularly on the vast expanses of the web, where most unique_ptr documentation also suffices for UniquePtr). Scoped is now deprecated — don’t use it in new code!

« NewerOlder »