I don't have the opportunity to code in Rust professionally (C++ is still king when you're at the intersection of graphics and scientific computing), but I've been following its developpment for quite some time now (and, shameless plug, also evaluating it for writing a sparse matrix library).
Recently, a benchmark
made it to the top of
/r/programming,
featuring Rust among other languages, and I was a bit surprised to see that the
idiomatic Rust program was not competitive
with the best-tuned C++ solution. The benchmark implements a binary tree, and
the C++ solution leverages raw pointers while Rust would use an
Option<Box<Node>> to represent its tree. Since Option knows that Box is
non-nullable, it should compile down to a raw pointer. Quickly inspecting the
Rust and C++ versions would not let me find where the performance difference
came from.
I was to take the train that evening for quite a long trip, so I decided I
would investigate. I cloned the benchmark repo on my laptop, and quickly
ran perf and flamegraph the Rust version to get an idea of where the problem
was. As a sanity check I also ran the C++ version under valgrind to check for
memory issue and be sure the comparison would be fair. The real investigation
would begin while on the train.
On the train, I realized the investigation would not be as easy as I had hoped:
I had not installed all my usual developpment tools on my laptop, so I could
not use perf or valgrind. I remembered from the earlier flamegraph that
most of the time was spent inside the split_binary function.
Here's its implementation:
fn split_binary(orig: NodeCell, value: i32) -> (NodeCell, NodeCell) {
    if let Some(mut orig_node) = orig {
        if orig_node.x < value {
            let split_pair = split_binary(orig_node.right.take(), value);
            orig_node.right = split_pair.0;
            (Some(orig_node), split_pair.1)
        } else {
            let split_pair = split_binary(orig_node.left.take(), value);
            orig_node.left = split_pair.1;
            (split_pair.0, Some(orig_node))
        }
    } else {
        (None, None)
    }
}
and the C++ one:
void Tree::split(NodePtr orig, NodePtr& lower, NodePtr& greaterOrEqual, int val)
{
    if(!orig)
    {
        lower = greaterOrEqual = nullptr;
        return;
    }
    if(orig->x < val)
    {
        lower = orig;
        split(lower->right, lower->right, greaterOrEqual, val);
    }
    else
    {
        greaterOrEqual = orig;
        split(greaterOrEqual->left, lower, greaterOrEqual->left, val);
    }
}
Using objdump -d --demangle, I was able to compare the assembly for the
Rust and C++ versions. In Rust:
0000000000006f10 <rust::idiomatic::split_binary>:
    6f10:   41 57                   push   %r15
    6f12:   41 56                   push   %r14
    6f14:   41 54                   push   %r12
    6f16:   53                      push   %rbx
    6f17:   50                      push   %rax
    6f18:   48 89 fb                mov    %rdi,%rbx
    6f1b:   48 89 1c 24             mov    %rbx,(%rsp)
    6f1f:   48 85 db                test   %rbx,%rbx
    6f22:   74 34                   je     6f58 <rust::idiomatic::split_binary+0x48>
    6f24:   39 73 10                cmp    %esi,0x10(%rbx)
    6f27:   7d 3e                   jge    6f67 <rust::idiomatic::split_binary+0x57>
    6f29:   4c 8d 73 08             lea    0x8(%rbx),%r14
    6f2d:   48 8b 7b 08             mov    0x8(%rbx),%rdi
    6f31:   48 c7 43 08 00 00 00    movq   $0x0,0x8(%rbx)
    6f38:   00
    6f39:   e8 d2 ff ff ff          callq  6f10 <rust::idiomatic::split_binary>
    6f3e:   49 89 c7                mov    %rax,%r15
    6f41:   49 89 d4                mov    %rdx,%r12
    6f44:   4c 89 f7                mov    %r14,%rdi
    6f47:   e8 a4 fe ff ff          callq  6df0 <core::ptr::drop_in_place>
    6f4c:   4c 89 7b 08             mov    %r15,0x8(%rbx)
    6f50:   49 89 de                mov    %rbx,%r14
    6f53:   4c 89 e3                mov    %r12,%rbx
    6f56:   eb 2f                   jmp    6f87 <rust::idiomatic::split_binary+0x77>
    6f58:   48 89 e7                mov    %rsp,%rdi
    6f5b:   e8 90 fe ff ff          callq  6df0 <core::ptr::drop_in_place>
    6f60:   45 31 f6                xor    %r14d,%r14d
    6f63:   31 db                   xor    %ebx,%ebx
    6f65:   eb 20                   jmp    6f87 <rust::idiomatic::split_binary+0x77>
    6f67:   48 8b 3b                mov    (%rbx),%rdi
    6f6a:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
    6f71:   e8 9a ff ff ff          callq  6f10 <rust::idiomatic::split_binary>
    6f76:   49 89 c6                mov    %rax,%r14
    6f79:   49 89 d7                mov    %rdx,%r15
    6f7c:   48 89 df                mov    %rbx,%rdi
    6f7f:   e8 6c fe ff ff          callq  6df0 <core::ptr::drop_in_place>
    6f84:   4c 89 3b                mov    %r15,(%rbx)
    6f87:   4c 89 f0                mov    %r14,%rax
    6f8a:   48 89 da                mov    %rbx,%rdx
    6f8d:   48 83 c4 08             add    $0x8,%rsp
    6f91:   5b                      pop    %rbx
    6f92:   41 5c                   pop    %r12
    6f94:   41 5e                   pop    %r14
    6f96:   41 5f                   pop    %r15
    6f98:   c3                      retq
    6f99:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
And in C++:
00000000000061b0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)>:
    61b0:   48 85 ff                test   %rdi,%rdi
    61b3:   74 17                   je     61cc <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x1c>
    61b5:   0f 1f 00                nopl   (%rax)
    61b8:   39 0f                   cmp    %ecx,(%rdi)
    61ba:   7d 24                   jge    61e0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x30>
    61bc:   48 89 3e                mov    %rdi,(%rsi)
    61bf:   48 8d 77 10             lea    0x10(%rdi),%rsi
    61c3:   48 8b 7f 10             mov    0x10(%rdi),%rdi
    61c7:   48 85 ff                test   %rdi,%rdi
    61ca:   75 ec                   jne    61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
    61cc:   48 c7 02 00 00 00 00    movq   $0x0,(%rdx)
    61d3:   48 c7 06 00 00 00 00    movq   $0x0,(%rsi)
    61da:   c3                      retq
    61db:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    61e0:   48 89 3a                mov    %rdi,(%rdx)
    61e3:   48 8d 57 08             lea    0x8(%rdi),%rdx
    61e7:   48 8b 7f 08             mov    0x8(%rdi),%rdi
    61eb:   48 85 ff                test   %rdi,%rdi
    61ee:   75 c8                   jne    61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
    61f0:   48 c7 02 00 00 00 00    movq   $0x0,(%rdx)
    61f7:   48 c7 06 00 00 00 00    movq   $0x0,(%rsi)
    61fe:   c3                      retq
    61ff:   90                      nop
Now I'm far from an expert at reading assembly, but here it looked like the main
difference was coming from the drop_in_place calls. While the C++ version
was not freeing any pointer, the Rust version was calling destructors. How
come? I remembered that the valgrind run did not report any memory leak, so
I could not blame the C++ code for forgetting to free nodes. The conclusion was
clear: something in the Rust code was being dropped needlessly.
The problem lied in these lines:
            let split_pair = split_binary(orig_node.right.take(), value);
            orig_node.right = split_pair.0;
Option::take moved the right child of the node out of its Option, replacing
it by None, and later the same right child would be overwritten by
split_pair.0. But in that assignment, the right child has to be dropped, even
though it is None. I remembered that std::mem::forget was safe and designed
for these kind of cases. I therefore replaced the assignment by a swap, and
asked Rust to forget the remaining None:
            let mut split_pair = split_binary(orig_node.left.take(), value);
            ::std::mem::swap(&mut orig_node.left, &mut split_pair.1);
            ::std::mem::forget(split_pair.1);
With this optimization repeated everywhere I could find the issue, the
execution time of the Rust variant went from 0.49s to 0.32s, closer to
the C++ version which takes 0.26s. Rust should fare better in this benchmark
once the PR
is merged, but it's not the end of the story, C++ is still a bit faster. It looks
like gcc did a huge amount of inlining. Maybe when I take the train back home
I'll find out where the difference comes from. And this time, with perf
installed...
Comments on /r/rust.