07.06.11

Introducing mozilla/RangedPtr.h: a smart pointer for members of buffers

Tags: , , , , — Jeff @ 15:17

Introduction

Suppose you’re implementing a method which searches for -- as part of an HTML parser:

bool containsDashDash(const char* chars, size_t length)
{
  for (size_t i = 0; i < length; i++)
  {
    if (chars[i] != '-')
      continue;
    if (chars[i + 1] == '-')
      return true;
  }
  return false;
}

But your method contains a bug! Can you spot it?

The buffer bounds-checking problem

The problem with the above implementation is that the bounds-checking is off. If chars doesn’t contain -- but ends with a -, then chars[i + 1] will read past the end of chars. Sometimes that might be harmless; other times it might be an exploitable security vulnerability.

The most obvious way to fix this in C++ is to pass an object which encapsulates both characters and length together, so you can’t use the characters except in accordance with the length. If you were using std::string, for example, accessing characters by [] or at() would generally assert or throw an exception for out-of-bounds access.

For one reason or another, however, you might not want to use an encapsulated type. In a parser, for example, you probably would want to use a pointer to process the input string, because the compiler might not be able to optimize an index into the equivalent pointer.

Is there a way to get “safety” via debug assertions or similar without giving up a pointer interface?

Introducing RangedPtr

We’re talking C++, so of course the answer is yes, and of course the answer is a smart pointer class.

The Mozilla Framework Based on Templates in mozilla-central now includes a RangedPtr<T> class. It’s defined in mfbt/RangedPtr.h and can be #included from mozilla/RangedPtr.h. RangedPtr stores a pointer, and in debug builds it stores start and end pointers fixed at construction time. Operations on the smart pointer — indexing, deriving new pointers through addition or subtraction, dereferencing, &c. — assert in debug builds that they don’t exceed the range specified by the start and end pointers. Indexing and dereferencing are restricted to the half-open range [start, end); new-pointer derivation is restricted to the range [start, end] to permit sentinel pointers. (It’s possible for start == end, although you can’t really do anything with such a pointer.)

The RangedPtr interface is pretty straightforward, supporting these constructors:

#include "mozilla/RangedPtr.h"

int nums[] = { 1, 2, 5, 3 };
RangedPtr<int> p1(nums, nums, nums + 4);
RangedPtr<int> p2(nums, nums, 4); // short for (nums, nums, nums + 4)
RangedPtr<int> p3(nums, 4); // short for (nums, nums, nums + 4)
RangedPtr<int> p4(nums); // short for (nums, length(nums))

RangedPtr<T> supports all the usual actions you’d expect from a pointer — indexing, dereferencing, addition, subtraction, assignment, equality and comparisons, and so on. All methods assert self-correctness as far as is possible in debug builds. RangedPtr<T> differs from T* only in that it doesn’t implicitly convert to T*: use get() method to get the corresponding T*. In addition to being explicit and consistent with nsCOMPtr and nsRefPtr, this will serve as a nudge to consider changing the relevant code to use RangedPtr instead of a raw pointer. But in essence RangedPtr is a pretty easy drop-in replacement for raw pointers in buffers. For example, adjusting containsDashDash to use it to assert in-rangeness is basically a single-line change:

#include "mozilla/RangedPtr.h"

bool containsDashDash(const char* charsRaw, size_t length)
{
  RangedPtr<const char> chars(charsRaw, length);
  for (size_t i = 0; i < length; i++)
  {
    if (chars[i] != '-')
      continue;
    if (chars[i + 1] == '-')
      return true;
  }
  return false;
}

(And to resolve all loose ends, if you wanted containsDashDash to be correct, you’d change the loop to go from 1 rather than 0 and would check chars[i - 1] and chars[i]. Thanks go to Neil in comments for noting this.)

A minor demerit of RangedPtr

RangedPtr is extremely lightweight and should almost always be as efficient as a raw pointer, even as it provides debug-build correctness checking. The sole exception is that, for sadmaking ABI reasons, using RangedPtr<T> as an argument to a method may be slightly less efficient than using a T* (passed-on-the-stack versus passed-in-a-register, to be precise). Most of the time the cost will be negligible, and if the method is inlined there probably won’t be any cost at all, but it’s worth pointing out as a potential concern if performance is super-critical.

Bottom line

Raw pointers into buffers bad, smart RangedPtrs into buffers good. Go forth and use RangedPtr throughout Mozilla code!

9 Comments »

  1. Does this only assert in debug builds, or does it enforce bounds on release builds as well?

    Comment by Brion — 07.06.11 @ 18:21

  2. It only asserts in debug builds. In release builds it is functionally identical to the pointer value it represents. It’s designed to detect errors during the writing of a patch and during testing, not to be a hard guarantee of proper use or safe failure. We have a lot of good testers who I trust to find errors if they exist — this is just a matter of making sure those errors can’t be missed (say, because in practice the overread reads from a location with a predictable value).

    Comment by Jeff — 07.06.11 @ 20:15

  3. Does this assert fatally?

    Comment by Kyle Huey — 07.06.11 @ 21:23

  4. An assertion is by definition fatal. 😛

    Comment by Jeff — 07.06.11 @ 21:29

  5. length - 1 isn’t the complete fix, because length might be 0, but you could start the loop at 1 and use chars[i - 1].

    Comment by Neil Rashbrook — 08.06.11 @ 03:06

  6. I had a quick peek at the source, and you have gone overboard with the sanity checks; two methods actually duplicate work, and there was at least one other redundant check. Also the this->operator= stuff was confusing, but (for instance) operator--() really doesn’t have to be that complicated.

    Comment by Neil Rashbrook — 08.06.11 @ 03:48

  7. Sorry, I meant to refer to operator--() [there is of course no operator-()].

    Comment by Neil Rashbrook — 08.06.11 @ 03:54

  8. Oh, it’s not me, it’s your blog software, should I have written operator--()?

    Comment by Neil Rashbrook — 08.06.11 @ 03:55

  9. Fair point about non-zero length, although I could perhaps claim that the user shouldn’t be checking for dashes in a sequence of no characters, since I didn’t make the contract clear. 🙂

    Most of the duplication I remember is amongst the constructors, but that’s unavoidable if rangeStart and rangeEnd are to be const members. You’re probably right that not explicitly invoking operator= would be clearer. I’ll probably fix that in future adjustments (I’ll need operator! and operator bool in order to use RangedPtr as the return value of JSString::getChars, for starters).

    WordPress likes to convert double-dashes into em dashes, as you’ve discovered. I only discovered when writing this post that it won’t convert if it’s inside a <code> tag, which avoids the problem but is not particularly discoverable.

    Comment by Jeff — 08.06.11 @ 08:35

RSS feed for comments on this post. TrackBack URI

Leave a comment

HTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>