20.10.11

Computing the length or end of a compile-time constant-length array: {JS,NS}_ARRAY_{END,LENGTH} are out, mozilla::Array{Length,End} are in

Tags: , , , , , , , — Jeff @ 16:03

Determining the length of a fixed-length array

Suppose in C++ you want to perform variations of some simple task several times. One way to do this is to loop over the variations in an array to perform each task:

/* Defines the necessary envvars to 1. */
int setVariables()
{
  static const char* names[] = { "FOO", "BAR", "BAZ", "QUUX", "EIT", "GOATS" };
  for (int i = 0; i < 6; i++)
    if (0 > setenv(names[i], "1")) return -1;
  return 0;
}

Manually looping by index is prone to error. One particular issue with loop-by-index is that you must correctly compute the extent of iteration. Hard-coding a constant works, but what if the array changes? The constant must also change, which isn’t obvious to someone not looking carefully at all uses of the array.

The traditional way to get an array’s length is with a macro using sizeof:

#define NS_ARRAY_LENGTH(arr)  (sizeof(arr) / sizeof((arr)[0]))

This works but has problems. First, it’s a macro, which means it has the usual macro issues. For example, macros are untyped, so you can pass in “wrong” arguments to them and may not get type errors. This is the second problem: NS_ARRAY_LENGTH cheerfully accepts non-array pointers and returns a completely bogus length.

  const char* str = "long enough string";
  char* copy = (char*) malloc(NS_ARRAY_LENGTH(str)); // usually 4 or 8
  strcpy(copy, str); // buffer overflow!

Introducing mozilla::ArrayLength and mozilla::ArrayEnd

Seeing an opportunity to both kill a moderately obscure macro and to further improve the capabilities of the Mozilla Framework Based on Templates, I took it. Now, rather than use these macros, you can #include "mozilla/Util.h" and use mozilla::ArrayLength to compute the length of a compile-time array. You can also use mozilla::ArrayEnd to compute arr + ArrayLength(arr). Both these methods (not macros!) use C++ template magic to only accept arrays with compile-time-fixed length, failing to compile if something else is provided.

Limitations

Unfortunately, ISO C++ limitations make it impossible to write a method completely replacing the macro. So the macros still exist, and in rare cases they remain the correct answer.

The array can’t depend on an unnamed type (class, struct, union) or a local type

According to C++ §14.3.1 paragraph 2, “A local type [or an] unnamed type…shall not be used as a template-argument for a template type-parameter.” C++ makes this a compile error:

size_t numsLength()
{
  // unnamed struct, also locally defined
  static const struct { int i; } nums[] = { { 1 }, { 2 }, { 3 } };

  return mozilla::ArrayLength(nums);
}

It’s easy to avoid both limitations: move local types to global code, and name them.

// now defined globally, and with a name
struct Number { int i; };
size_t numsLength()
{
  static const Number nums[] = { 1, 2, 3 };
  return mozilla::ArrayLength(nums);
}

mozilla::ArrayLength(arr) isn’t a constant, NS_ARRAY_LENGTH(arr) is

Some contexts in C++ require a compile-time constant expression: template parameters, (in C++ but not C99) for local array lengths, for array lengths in typedefs, for the value of enum initializers, for static/compile-time assertions (which are usually bootstrapped off these other locations), and perhaps others. A function call, even one evaluating to a compile-time constant, is not a constant expression.

One other context doesn’t require a constant but strongly wants one: the values of static variables, inside classes and methods and out. If the value is a function call, even if it computes a constant, the compiler might make it a static initialization, delaying startup.

The long and short of it is that everything in the code below is a bad idea:

int arr[] = { 1, 2, 3, 5 };
static size_t len = ArrayLength(arr); // not an error, but don't do it
void t(JSContext* cx)
{
  js::Vector<int, ArrayLength(arr)> v(cx); // non-constant template parameter
  int local[ArrayLength(arr)]; // variadic arrays not okay in C++
  typedef int Mirror[ArrayLength(arr)]; // non-constant array length
  enum { L = ArrayLength(arr); }; // non-constant initializer
  PR_STATIC_ASSERT(4 == ArrayLength(arr)); // special case of one of the others
}

In these situations you should continue to use NS_ARRAY_LENGTH (or in SpiderMonkey, JS_ARRAY_LENGTH).

mozilla/Util.h is fragile with respect to IPDL headers, for include order

mozilla/Util.h includes mozilla/Types.h, which includes jstypes.h, which includes jsotypes.h, which defines certain fixed-width integer types: int8, int16, uint8, uint16, and so on. It happens that ipc/chromium/src/base/basictypes.h also defines these integer types — but incompatibly on Windows only. This header is, alas, included through every IPDL-generated header. In order to safely include any mfbt header in a file which also includes an IPDL-generated header, you must include the IPDL-generated header first. So when landing patches using mozilla/Util.h, watch out for Windows-specific bustage.

Removing the limitations

The limitations on the type of elements in arrays passed to ArrayLength are unavoidable limitations of C++. But C++11 removes these limitations, and compilers will likely implement support fairly quickly. When that happens we’ll be able to stop caring about the local-or-unnamed problem, not even needing to work around it.

The compile-time-constant limitation is likewise a limitation of C++. It too will go away in C++11 with the constexpr keyword. This modifier specifies that a function provided constant arguments computes a constant. The compiler must allow calls to the function that have constant arguments to be used as compile-time constants. Thus when compilers support constexpr, we can add it to the declaration of ArrayLength and begin using ArrayLength in compile-time-constant contexts. This is more low-hanging C++11 fruit that compilers will pick up soon. (Indeed, GCC 4.6 already implements it.)

Last, we have the Windows-specific #include ordering requirement. We have some ideas for getting around this problem, and we hope to have a solution soon.

A gotcha

Both these methods have a small gotcha: their behavior may not be intuitive when applied to C strings. What does sizeof("foo") evaluate to? If you think of "foo" as a string, you might say 3. But in reality "foo" is much better thought of as an array — and strings are '\0'-terminated. So actually, sizeof("foo") == 4. This was the case with NS_ARRAY_LENGTH, too, so it’s not new behavior. But if you use these methods without considering this, you might end up misusing them.

Conclusion

Avoid NS_ARRAY_LENGTH when possible, and use mozilla::ArrayLength or mozilla::ArrayEnd instead. And watch out when using them on strings, because their behavior might not be what you wanted.

(Curious how these methods are defined, and what C++ magic is used? See my next post.)

07.06.11

Introducing mozilla/RangedPtr.h: a smart pointer for members of buffers

Tags: , , , , — Jeff @ 15:17

Introduction

Suppose you’re implementing a method which searches for -- as part of an HTML parser:

bool containsDashDash(const char* chars, size_t length)
{
  for (size_t i = 0; i < length; i++)
  {
    if (chars[i] != '-')
      continue;
    if (chars[i + 1] == '-')
      return true;
  }
  return false;
}

But your method contains a bug! Can you spot it?

The buffer bounds-checking problem

The problem with the above implementation is that the bounds-checking is off. If chars doesn’t contain -- but ends with a -, then chars[i + 1] will read past the end of chars. Sometimes that might be harmless; other times it might be an exploitable security vulnerability.

The most obvious way to fix this in C++ is to pass an object which encapsulates both characters and length together, so you can’t use the characters except in accordance with the length. If you were using std::string, for example, accessing characters by [] or at() would generally assert or throw an exception for out-of-bounds access.

For one reason or another, however, you might not want to use an encapsulated type. In a parser, for example, you probably would want to use a pointer to process the input string, because the compiler might not be able to optimize an index into the equivalent pointer.

Is there a way to get “safety” via debug assertions or similar without giving up a pointer interface?

Introducing RangedPtr

We’re talking C++, so of course the answer is yes, and of course the answer is a smart pointer class.

The Mozilla Framework Based on Templates in mozilla-central now includes a RangedPtr<T> class. It’s defined in mfbt/RangedPtr.h and can be #included from mozilla/RangedPtr.h. RangedPtr stores a pointer, and in debug builds it stores start and end pointers fixed at construction time. Operations on the smart pointer — indexing, deriving new pointers through addition or subtraction, dereferencing, &c. — assert in debug builds that they don’t exceed the range specified by the start and end pointers. Indexing and dereferencing are restricted to the half-open range [start, end); new-pointer derivation is restricted to the range [start, end] to permit sentinel pointers. (It’s possible for start == end, although you can’t really do anything with such a pointer.)

The RangedPtr interface is pretty straightforward, supporting these constructors:

#include "mozilla/RangedPtr.h"

int nums[] = { 1, 2, 5, 3 };
RangedPtr<int> p1(nums, nums, nums + 4);
RangedPtr<int> p2(nums, nums, 4); // short for (nums, nums, nums + 4)
RangedPtr<int> p3(nums, 4); // short for (nums, nums, nums + 4)
RangedPtr<int> p4(nums); // short for (nums, length(nums))

RangedPtr<T> supports all the usual actions you’d expect from a pointer — indexing, dereferencing, addition, subtraction, assignment, equality and comparisons, and so on. All methods assert self-correctness as far as is possible in debug builds. RangedPtr<T> differs from T* only in that it doesn’t implicitly convert to T*: use get() method to get the corresponding T*. In addition to being explicit and consistent with nsCOMPtr and nsRefPtr, this will serve as a nudge to consider changing the relevant code to use RangedPtr instead of a raw pointer. But in essence RangedPtr is a pretty easy drop-in replacement for raw pointers in buffers. For example, adjusting containsDashDash to use it to assert in-rangeness is basically a single-line change:

#include "mozilla/RangedPtr.h"

bool containsDashDash(const char* charsRaw, size_t length)
{
  RangedPtr<const char> chars(charsRaw, length);
  for (size_t i = 0; i < length; i++)
  {
    if (chars[i] != '-')
      continue;
    if (chars[i + 1] == '-')
      return true;
  }
  return false;
}

(And to resolve all loose ends, if you wanted containsDashDash to be correct, you’d change the loop to go from 1 rather than 0 and would check chars[i - 1] and chars[i]. Thanks go to Neil in comments for noting this.)

A minor demerit of RangedPtr

RangedPtr is extremely lightweight and should almost always be as efficient as a raw pointer, even as it provides debug-build correctness checking. The sole exception is that, for sadmaking ABI reasons, using RangedPtr<T> as an argument to a method may be slightly less efficient than using a T* (passed-on-the-stack versus passed-in-a-register, to be precise). Most of the time the cost will be negligible, and if the method is inlined there probably won’t be any cost at all, but it’s worth pointing out as a potential concern if performance is super-critical.

Bottom line

Raw pointers into buffers bad, smart RangedPtrs into buffers good. Go forth and use RangedPtr throughout Mozilla code!

« Newer