Dropping drops

Thu 17 May 2018

I don't have the opportunity to code in Rust professionally (C++ is still king when you're at the intersection of graphics and scientific computing), but I've been following its developpment for quite some time now (and, shameless plug, also evaluating it for writing a sparse matrix library).

Recently, a benchmark made it to the top of /r/programming, featuring Rust among other languages, and I was a bit surprised to see that the idiomatic Rust program was not competitive with the best-tuned C++ solution. The benchmark implements a binary tree, and the C++ solution leverages raw pointers while Rust would use an Option<Box<Node>> to represent its tree. Since Option knows that Box is non-nullable, it should compile down to a raw pointer. Quickly inspecting the Rust and C++ versions would not let me find where the performance difference came from.

I was to take the train that evening for quite a long trip, so I decided I would investigate. I cloned the benchmark repo on my laptop, and quickly ran perf and flamegraph the Rust version to get an idea of where the problem was. As a sanity check I also ran the C++ version under valgrind to check for memory issue and be sure the comparison would be fair. The real investigation would begin while on the train.

On the train, I realized the investigation would not be as easy as I had hoped: I had not installed all my usual developpment tools on my laptop, so I could not use perf or valgrind. I remembered from the earlier flamegraph that most of the time was spent inside the split_binary function.

Here's its implementation:

fn split_binary(orig: NodeCell, value: i32) -> (NodeCell, NodeCell) {
    if let Some(mut orig_node) = orig {
        if orig_node.x < value {
            let split_pair = split_binary(orig_node.right.take(), value);
            orig_node.right = split_pair.0;
            (Some(orig_node), split_pair.1)
        } else {
            let split_pair = split_binary(orig_node.left.take(), value);
            orig_node.left = split_pair.1;
            (split_pair.0, Some(orig_node))
    } else {
        (None, None)

and the C++ one:

void Tree::split(NodePtr orig, NodePtr& lower, NodePtr& greaterOrEqual, int val)
        lower = greaterOrEqual = nullptr;

    if(orig->x < val)
        lower = orig;
        split(lower->right, lower->right, greaterOrEqual, val);
        greaterOrEqual = orig;
        split(greaterOrEqual->left, lower, greaterOrEqual->left, val);

Using objdump -d --demangle, I was able to compare the assembly for the Rust and C++ versions. In Rust:

0000000000006f10 <rust::idiomatic::split_binary>:
    6f10:   41 57                   push   %r15
    6f12:   41 56                   push   %r14
    6f14:   41 54                   push   %r12
    6f16:   53                      push   %rbx
    6f17:   50                      push   %rax
    6f18:   48 89 fb                mov    %rdi,%rbx
    6f1b:   48 89 1c 24             mov    %rbx,(%rsp)
    6f1f:   48 85 db                test   %rbx,%rbx
    6f22:   74 34                   je     6f58 <rust::idiomatic::split_binary+0x48>
    6f24:   39 73 10                cmp    %esi,0x10(%rbx)
    6f27:   7d 3e                   jge    6f67 <rust::idiomatic::split_binary+0x57>
    6f29:   4c 8d 73 08             lea    0x8(%rbx),%r14
    6f2d:   48 8b 7b 08             mov    0x8(%rbx),%rdi
    6f31:   48 c7 43 08 00 00 00    movq   $0x0,0x8(%rbx)
    6f38:   00
    6f39:   e8 d2 ff ff ff          callq  6f10 <rust::idiomatic::split_binary>
    6f3e:   49 89 c7                mov    %rax,%r15
    6f41:   49 89 d4                mov    %rdx,%r12
    6f44:   4c 89 f7                mov    %r14,%rdi
    6f47:   e8 a4 fe ff ff          callq  6df0 <core::ptr::drop_in_place>
    6f4c:   4c 89 7b 08             mov    %r15,0x8(%rbx)
    6f50:   49 89 de                mov    %rbx,%r14
    6f53:   4c 89 e3                mov    %r12,%rbx
    6f56:   eb 2f                   jmp    6f87 <rust::idiomatic::split_binary+0x77>
    6f58:   48 89 e7                mov    %rsp,%rdi
    6f5b:   e8 90 fe ff ff          callq  6df0 <core::ptr::drop_in_place>
    6f60:   45 31 f6                xor    %r14d,%r14d
    6f63:   31 db                   xor    %ebx,%ebx
    6f65:   eb 20                   jmp    6f87 <rust::idiomatic::split_binary+0x77>
    6f67:   48 8b 3b                mov    (%rbx),%rdi
    6f6a:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
    6f71:   e8 9a ff ff ff          callq  6f10 <rust::idiomatic::split_binary>
    6f76:   49 89 c6                mov    %rax,%r14
    6f79:   49 89 d7                mov    %rdx,%r15
    6f7c:   48 89 df                mov    %rbx,%rdi
    6f7f:   e8 6c fe ff ff          callq  6df0 <core::ptr::drop_in_place>
    6f84:   4c 89 3b                mov    %r15,(%rbx)
    6f87:   4c 89 f0                mov    %r14,%rax
    6f8a:   48 89 da                mov    %rbx,%rdx
    6f8d:   48 83 c4 08             add    $0x8,%rsp
    6f91:   5b                      pop    %rbx
    6f92:   41 5c                   pop    %r12
    6f94:   41 5e                   pop    %r14
    6f96:   41 5f                   pop    %r15
    6f98:   c3                      retq
    6f99:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

And in C++:

00000000000061b0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)>:
    61b0:   48 85 ff                test   %rdi,%rdi
    61b3:   74 17                   je     61cc <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x1c>
    61b5:   0f 1f 00                nopl   (%rax)
    61b8:   39 0f                   cmp    %ecx,(%rdi)
    61ba:   7d 24                   jge    61e0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x30>
    61bc:   48 89 3e                mov    %rdi,(%rsi)
    61bf:   48 8d 77 10             lea    0x10(%rdi),%rsi
    61c3:   48 8b 7f 10             mov    0x10(%rdi),%rdi
    61c7:   48 85 ff                test   %rdi,%rdi
    61ca:   75 ec                   jne    61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
    61cc:   48 c7 02 00 00 00 00    movq   $0x0,(%rdx)
    61d3:   48 c7 06 00 00 00 00    movq   $0x0,(%rsi)
    61da:   c3                      retq
    61db:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    61e0:   48 89 3a                mov    %rdi,(%rdx)
    61e3:   48 8d 57 08             lea    0x8(%rdi),%rdx
    61e7:   48 8b 7f 08             mov    0x8(%rdi),%rdi
    61eb:   48 85 ff                test   %rdi,%rdi
    61ee:   75 c8                   jne    61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
    61f0:   48 c7 02 00 00 00 00    movq   $0x0,(%rdx)
    61f7:   48 c7 06 00 00 00 00    movq   $0x0,(%rsi)
    61fe:   c3                      retq
    61ff:   90                      nop

Now I'm far from an expert at reading assembly, but here it looked like the main difference was coming from the drop_in_place calls. While the C++ version was not freeing any pointer, the Rust version was calling destructors. How come? I remembered that the valgrind run did not report any memory leak, so I could not blame the C++ code for forgetting to free nodes. The conclusion was clear: something in the Rust code was being dropped needlessly.

The problem lied in these lines:

            let split_pair = split_binary(orig_node.right.take(), value);
            orig_node.right = split_pair.0;

Option::take moved the right child of the node out of its Option, replacing it by None, and later the same right child would be overwritten by split_pair.0. But in that assignment, the right child has to be dropped, even though it is None. I remembered that std::mem::forget was safe and designed for these kind of cases. I therefore replaced the assignment by a swap, and asked Rust to forget the remaining None:

            let mut split_pair = split_binary(orig_node.left.take(), value);
            ::std::mem::swap(&mut orig_node.left, &mut split_pair.1);

With this optimization repeated everywhere I could find the issue, the execution time of the Rust variant went from 0.49s to 0.32s, closer to the C++ version which takes 0.26s. Rust should fare better in this benchmark once the PR is merged, but it's not the end of the story, C++ is still a bit faster. It looks like gcc did a huge amount of inlining. Maybe when I take the train back home I'll find out where the difference comes from. And this time, with perf installed...

Comments on /r/rust.