We implemented SmallFun, an alternative to std::function which implements fixed-size capture optimization (a form of small-buffer optimization). Whilst SmallFun is a bit less generic than std::function, it is 3–5x faster in some benchmarks.
Background
std::function is a convenient way to store lambdas with closures (also known as captures), whilst providing a unified interface. Before std::function and lambdas, we would create a hand-crafted functor object like this:
```cpp
struct Functor {
  // The context / capture
  int i;
  unsigned N;

  int operator()(int j) const {
    return i * j + N;
  }
};
```

The Missed Opportunity of std::function
std::function uses a PImpl pattern to provide a unified interface across all functors for a given signature. For example, these two instances f and g have the same size, despite having different captures:
```cpp
int x = 2, y = 9, z = 4;

// f captures nothing
std::function<int(int)> f = [](int i) { return i + 1; };

// g captures x, y and z
std::function<int(int)> g = [=](int i) { return (i * (x + z)) + y; };
```

This is because std::function stores the capture on the heap. This unifies the size of all instances, but it is also an opportunity for optimization!
How SmallFun Works
Instead of dynamically allocating memory on the heap, we place the function object (including its virtual table) into a preallocated location on the stack:
```cpp
// A SmallFun with a buffer size of 64 bytes
SmallFun<unsigned(int const j), 64> f = [i, N](int j) {
  return i * j + N;
};
```

Benchmarks
| Test | Time | Note |
|---|---|---|
| functor | 191 ns | Baseline: a hand-crafted functor |
| sf32 | 312 ns | Big enough to store 2 ints |
| sf64 | 369 ns | |
| sf128 | 346 ns | |
| sf256 | 376 ns | |
| sf512 | 503 ns | |
| sf1024 | 569 ns | |
| sf2048 | 870 ns | |
| std::function | 1141 ns | |
SmallFun with a 32-byte buffer is already ~3.7x faster than std::function.
The Test
To measure allocation and call overhead, instances are saved in a vector and executed in a loop:
```cpp
void stdFunction(benchmark::State& state) {
  unsigned N = 100;
  std::vector<std::function<unsigned(int const j)>> fs(N);
  std::vector<int> r(N);
  while (state.KeepRunning()) {
    for (unsigned i = 0; i < N; ++i) {
      fs[i] = [i, N](int j) { return i * j + N; };
    }
    int j = 0;
    std::transform(fs.begin(), fs.end(), r.begin(), [&](auto const& f) {
      return f(j++);
    });
  }
}
```

SmallFun Implementation Details
SmallFun combines three C++ patterns: type-erasure, PImpl, and placement-new.
Type-erasure
Type-erasure unifies many implementations into one interface. Our public concept:
```cpp
template<class ReturnType, class... Xs>
struct Concept {
  virtual ReturnType operator()(Xs...) const = 0;
  virtual ReturnType operator()(Xs...) = 0;
  virtual ~Concept() {}
};
```

For any callable type with a given signature:
```cpp
template<class F, class ReturnType, class... Xs>
struct Model final : Concept<ReturnType, Xs...> {
  F f;

  Model(F const& f) : f(f) {}

  ReturnType operator()(Xs... xs) const override { return f(xs...); }
  ReturnType operator()(Xs... xs) override { return f(xs...); }

  ~Model() override {}
};
```

PImpl
A naive std::function-style implementation allocates on the heap:
```cpp
template<class ReturnType, class... Xs>
class Function<ReturnType(Xs...)> {
  std::shared_ptr<Concept<ReturnType, Xs...>> pimpl; // heap allocation
public:
  template<class F>
  Function(F const& f)
    : pimpl(new Model<F, ReturnType, Xs...>(f)) {}
};
```

Placement-new
Placement-new allocates memory at a given address:
```cpp
// A raw buffer; alignas ensures it is suitably aligned for ints
alignas(int) char memorypool[64];

int* a = new (memorypool) int[4];
int* b = new (memorypool + sizeof(int) * 4) int[4];
```

Putting it All Together
By combining these patterns we eliminate the heap allocation:
```cpp
template<class Signature, unsigned SIZE = 64>
class SmallFun;

template<class ReturnType, class... Xs, unsigned SIZE>
class SmallFun<ReturnType(Xs...), SIZE> {
  char memory[SIZE];
public:
  template<class F>
  SmallFun(F const& f) {
    assert(sizeof(Model<F, ReturnType, Xs...>) < SIZE);
    new (memory) Model<F, ReturnType, Xs...>(f);
  }

  ~SmallFun() {
    ((Concept<ReturnType, Xs...>*)memory)->~Concept();
  }
};
```

If Model<...>'s size is greater than SIZE, bad things happen. Fortunately, this can be caught at compile-time, for example with enable_if_t.
Copy Constructor
Unlike std::function, we cannot just copy a std::shared_ptr. We cannot copy bitwise either, since the lambda may manage a resource with side-effects. We need the model to copy-construct itself for a given memory location:
```cpp
// Inside Model: copy-construct *this into the given memory location
virtual void copy(void* memory) const {
  new (memory) Model<F, ReturnType, Xs...>(f);
}
```

SmallFun's copy constructor then accepts any SmallFun whose buffer fits into its own:

```cpp
template<unsigned rhsSize, std::enable_if_t<(rhsSize <= SIZE), bool> = 0>
SmallFun(SmallFun<ReturnType(Xs...), rhsSize> const& rhs) {
  rhs.copy(memory);
}
```

Further Remarks
- We can verify at compile-time if a lambda will fit in memory. If it does not, we could provide a fallback to heap allocation.
- A more generic implementation would take a generic allocator.
- Using type-traits, we could check if the underlying type is POD and then copy bitwise.
Related
- SmallFunction — Fast std::function Alternative — The open-source SmallFun library.
- value_ptr — The Missing C++ Smart-Pointer — More C++ type-erasure and memory management patterns.
- Either vs Error-Codes — Compiler optimizations that eliminate abstraction overhead in C++.