I don't have the opportunity to code in Rust professionally (C++ is still king
when you're at the intersection of graphics and scientific computing), but I've
been following its developpment for quite some time now (and, shameless plug,
also evaluating it for writing a
sparse matrix library).
Recently, a benchmark
made it to the top of
/r/programming,
featuring Rust among other languages, and I was a bit surprised to see that the
idiomatic Rust program was not competitive
with the best-tuned C++ solution. The benchmark implements a binary tree, and
the C++ solution leverages raw pointers while Rust would use an
Option<Box<Node>>
to represent its tree. Since Option
knows that Box
is
non-nullable, it should compile down to a raw pointer. Quickly inspecting the
Rust and C++ versions would not let me find where the performance difference
came from.
I was to take the train that evening for quite a long trip, so I decided I
would investigate. I cloned the benchmark repo on my laptop, and quickly
ran perf
and flamegraph
the Rust version to get an idea of where the problem
was. As a sanity check I also ran the C++ version under valgrind to check for
memory issue and be sure the comparison would be fair. The real investigation
would begin while on the train.
On the train, I realized the investigation would not be as easy as I had hoped:
I had not installed all my usual developpment tools on my laptop, so I could
not use perf
or valgrind
. I remembered from the earlier flamegraph that
most of the time was spent inside the split_binary
function.
Here's its implementation:
fn split_binary(orig: NodeCell, value: i32) -> (NodeCell, NodeCell) {
if let Some(mut orig_node) = orig {
if orig_node.x < value {
let split_pair = split_binary(orig_node.right.take(), value);
orig_node.right = split_pair.0;
(Some(orig_node), split_pair.1)
} else {
let split_pair = split_binary(orig_node.left.take(), value);
orig_node.left = split_pair.1;
(split_pair.0, Some(orig_node))
}
} else {
(None, None)
}
}
and the C++ one:
void Tree::split(NodePtr orig, NodePtr& lower, NodePtr& greaterOrEqual, int val)
{
if(!orig)
{
lower = greaterOrEqual = nullptr;
return;
}
if(orig->x < val)
{
lower = orig;
split(lower->right, lower->right, greaterOrEqual, val);
}
else
{
greaterOrEqual = orig;
split(greaterOrEqual->left, lower, greaterOrEqual->left, val);
}
}
Using objdump -d --demangle
, I was able to compare the assembly for the
Rust and C++ versions. In Rust:
0000000000006f10 <rust::idiomatic::split_binary>:
6f10: 41 57 push %r15
6f12: 41 56 push %r14
6f14: 41 54 push %r12
6f16: 53 push %rbx
6f17: 50 push %rax
6f18: 48 89 fb mov %rdi,%rbx
6f1b: 48 89 1c 24 mov %rbx,(%rsp)
6f1f: 48 85 db test %rbx,%rbx
6f22: 74 34 je 6f58 <rust::idiomatic::split_binary+0x48>
6f24: 39 73 10 cmp %esi,0x10(%rbx)
6f27: 7d 3e jge 6f67 <rust::idiomatic::split_binary+0x57>
6f29: 4c 8d 73 08 lea 0x8(%rbx),%r14
6f2d: 48 8b 7b 08 mov 0x8(%rbx),%rdi
6f31: 48 c7 43 08 00 00 00 movq $0x0,0x8(%rbx)
6f38: 00
6f39: e8 d2 ff ff ff callq 6f10 <rust::idiomatic::split_binary>
6f3e: 49 89 c7 mov %rax,%r15
6f41: 49 89 d4 mov %rdx,%r12
6f44: 4c 89 f7 mov %r14,%rdi
6f47: e8 a4 fe ff ff callq 6df0 <core::ptr::drop_in_place>
6f4c: 4c 89 7b 08 mov %r15,0x8(%rbx)
6f50: 49 89 de mov %rbx,%r14
6f53: 4c 89 e3 mov %r12,%rbx
6f56: eb 2f jmp 6f87 <rust::idiomatic::split_binary+0x77>
6f58: 48 89 e7 mov %rsp,%rdi
6f5b: e8 90 fe ff ff callq 6df0 <core::ptr::drop_in_place>
6f60: 45 31 f6 xor %r14d,%r14d
6f63: 31 db xor %ebx,%ebx
6f65: eb 20 jmp 6f87 <rust::idiomatic::split_binary+0x77>
6f67: 48 8b 3b mov (%rbx),%rdi
6f6a: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
6f71: e8 9a ff ff ff callq 6f10 <rust::idiomatic::split_binary>
6f76: 49 89 c6 mov %rax,%r14
6f79: 49 89 d7 mov %rdx,%r15
6f7c: 48 89 df mov %rbx,%rdi
6f7f: e8 6c fe ff ff callq 6df0 <core::ptr::drop_in_place>
6f84: 4c 89 3b mov %r15,(%rbx)
6f87: 4c 89 f0 mov %r14,%rax
6f8a: 48 89 da mov %rbx,%rdx
6f8d: 48 83 c4 08 add $0x8,%rsp
6f91: 5b pop %rbx
6f92: 41 5c pop %r12
6f94: 41 5e pop %r14
6f96: 41 5f pop %r15
6f98: c3 retq
6f99: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
And in C++:
00000000000061b0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)>:
61b0: 48 85 ff test %rdi,%rdi
61b3: 74 17 je 61cc <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x1c>
61b5: 0f 1f 00 nopl (%rax)
61b8: 39 0f cmp %ecx,(%rdi)
61ba: 7d 24 jge 61e0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x30>
61bc: 48 89 3e mov %rdi,(%rsi)
61bf: 48 8d 77 10 lea 0x10(%rdi),%rsi
61c3: 48 8b 7f 10 mov 0x10(%rdi),%rdi
61c7: 48 85 ff test %rdi,%rdi
61ca: 75 ec jne 61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
61cc: 48 c7 02 00 00 00 00 movq $0x0,(%rdx)
61d3: 48 c7 06 00 00 00 00 movq $0x0,(%rsi)
61da: c3 retq
61db: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
61e0: 48 89 3a mov %rdi,(%rdx)
61e3: 48 8d 57 08 lea 0x8(%rdi),%rdx
61e7: 48 8b 7f 08 mov 0x8(%rdi),%rdi
61eb: 48 85 ff test %rdi,%rdi
61ee: 75 c8 jne 61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
61f0: 48 c7 02 00 00 00 00 movq $0x0,(%rdx)
61f7: 48 c7 06 00 00 00 00 movq $0x0,(%rsi)
61fe: c3 retq
61ff: 90 nop
Now I'm far from an expert at reading assembly, but here it looked like the main
difference was coming from the drop_in_place
calls. While the C++ version
was not freeing any pointer, the Rust version was calling destructors. How
come? I remembered that the valgrind run did not report any memory leak, so
I could not blame the C++ code for forgetting to free nodes. The conclusion was
clear: something in the Rust code was being dropped needlessly.
The problem lied in these lines:
let split_pair = split_binary(orig_node.right.take(), value);
orig_node.right = split_pair.0;
Option::take
moved the right child of the node out of its Option
, replacing
it by None
, and later the same right child would be overwritten by
split_pair.0
. But in that assignment, the right child has to be dropped, even
though it is None
. I remembered that std::mem::forget
was safe and designed
for these kind of cases. I therefore replaced the assignment by a swap, and
asked Rust to forget the remaining None
:
let mut split_pair = split_binary(orig_node.left.take(), value);
::std::mem::swap(&mut orig_node.left, &mut split_pair.1);
::std::mem::forget(split_pair.1);
With this optimization repeated everywhere I could find the issue, the
execution time of the Rust variant went from 0.49s to 0.32s, closer to
the C++ version which takes 0.26s. Rust should fare better in this benchmark
once the PR
is merged, but it's not the end of the story, C++ is still a bit faster. It looks
like gcc did a huge amount of inlining. Maybe when I take the train back home
I'll find out where the difference comes from. And this time, with perf
installed...
Comments on /r/rust.