We implemented SmallFun, an alternative to std::function which implements fixed-size capture optimization (a form of small-buffer optimization). Whilst SmallFun is a bit less generic than std::function, it is 3–5x faster in some benchmarks.
Background
std::function is a convenient way to store lambdas with closures (also known as captures), whilst providing a unified interface. Before std::function and lambdas, we would create a hand-crafted functor object like this:
```cpp
struct Functor {
  // The context / capture
  int i;
  unsigned N;

  int operator()(int j) const {
    return i * j + N;
  }
};
```

The Missed Opportunity of std::function
std::function uses a PImpl pattern to provide a unified interface across all functors for a given signature. For example, these two instances f and g have the same size, despite having different captures:
```cpp
int x = 2, y = 9, z = 4;

// f captures nothing
std::function<int(int)> f = [](int i) { return i + 1; };

// g captures x, y and z
std::function<int(int)> g = [=](int i) { return (i * (x + z)) + y; };
```

This is because std::function stores the capture on the heap. This unifies the size of all instances, but it is also an opportunity for optimization!
How SmallFun Works
Instead of dynamically allocating memory on the heap, we place the function object (including its virtual table) into a preallocated location on the stack:
```cpp
// A SmallFun with a buffer size of 64 bytes
SmallFun<unsigned(int const j), 64> f = [i, N](int j) {
  return i * j + N;
};
```

Benchmarks
| Test | Time | Note |
|---|---|---|
| functor | 191 ns | Baseline: a hand-crafted functor |
| sf32 | 312 ns | Big enough to store 2 ints |
| sf64 | 369 ns | |
| sf128 | 346 ns | |
| sf256 | 376 ns | |
| sf512 | 503 ns | |
| sf1024 | 569 ns | |
| sf2048 | 870 ns | |
| std::function | 1141 ns | |
SmallFun with a 32-byte buffer is already ~3.7x faster than std::function.
The Test
To measure allocation and call overhead, instances are saved in a vector and executed in a loop:
```cpp
void stdFunction(benchmark::State& state) {
  unsigned N = 100;
  std::vector<std::function<unsigned(int const j)>> fs(N);
  std::vector<int> r(N);
  while (state.KeepRunning()) {
    for (unsigned i = 0; i < N; ++i) {
      fs[i] = [i, N](int j) { return i * j + N; };
    }
    int j = 0;
    std::transform(fs.begin(), fs.end(), r.begin(), [&](auto const& f) {
      return f(j++);
    });
  }
}
```

SmallFun Implementation Details
SmallFun combines three C++ patterns: type-erasure, PImpl, and placement-new.
Type-erasure
Type-erasure unifies many implementations into one interface. Our public concept:
```cpp
template<class ReturnType, class... Xs>
struct Concept {
  virtual ReturnType operator()(Xs...) const = 0;
  virtual ReturnType operator()(Xs...) = 0;
  virtual ~Concept() {}
};
```

For any callable type with a given signature:
```cpp
template<class F, class ReturnType, class... Xs>
struct Model final : Concept<ReturnType, Xs...> {
  F f;

  Model(F const& f) : f(f) {}

  ReturnType operator()(Xs... xs) const override { return f(xs...); }
  ReturnType operator()(Xs... xs) override { return f(xs...); }

  ~Model() override {}
};
```

PImpl
A naive std::function-style implementation allocates on the heap:
```cpp
template<class ReturnType, class... Xs>
class Function<ReturnType(Xs...)> {
  std::shared_ptr<Concept<ReturnType, Xs...>> pimpl; // heap allocation
public:
  template<class F>
  Function(F const& f)
    : pimpl(new Model<F, ReturnType, Xs...>(f)) {}
};
```

Placement-new
Placement-new allocates memory at a given address:
```cpp
// A raw buffer; alignas ensures it is suitably aligned for ints
alignas(int) char memorypool[64];

int* a = new (memorypool) int[4];
int* b = new (memorypool + sizeof(int) * 4) int[4];
```

Putting it All Together
By combining these patterns we eliminate the heap allocation:
```cpp
template<class Signature, unsigned SIZE = 64>
class SmallFun;

template<class ReturnType, class... Xs, unsigned SIZE>
class SmallFun<ReturnType(Xs...), SIZE> {
  char memory[SIZE];
public:
  template<class F>
  SmallFun(F const& f) {
    assert(sizeof(Model<F, ReturnType, Xs...>) < SIZE);
    new (memory) Model<F, ReturnType, Xs...>(f);
  }

  ~SmallFun() {
    ((Concept<ReturnType, Xs...>*)memory)->~Concept();
  }
};
```

If Model<...>'s size is greater than SIZE, bad things happen. Fortunately, this can be caught at compile-time, for example with enable_if_t.
Copy Constructor
Unlike std::function, we cannot just copy a std::shared_ptr. We cannot copy bitwise either, since the lambda may manage a resource with side-effects. We need the model to copy-construct itself for a given memory location:
```cpp
// Inside Model: copy-construct *this into the given memory location
virtual void copy(void* memory) const {
  new (memory) Model<F, ReturnType, Xs...>(f);
}
```

SmallFun's copy constructor then accepts any SmallFun whose buffer fits into its own:

```cpp
template<unsigned rhsSize, std::enable_if_t<(rhsSize <= SIZE), bool> = 0>
SmallFun(SmallFun<ReturnType(Xs...), rhsSize> const& rhs) {
  rhs.copy(memory);
}
```

Further Remarks
- We can verify at compile-time if a lambda will fit in memory. If it does not, we could provide a fallback to heap allocation.
- A more generic implementation would take a generic allocator.
- Using type-traits, we could check if the underlying type is POD and then copy bitwise.
Related
- SmallFunction — Fast std::function Alternative — The open-source SmallFun library.
- value_ptr — The Missing C++ Smart-Pointer — More C++ type-erasure and memory management patterns.
- Either vs Error-Codes — Compiler optimizations that eliminate abstraction overhead in C++.