Sunday, 22 July 2012

More fun with Intel TBB and Boost.Proto (and Boost.Fusion)

In my last post, we developed an EDSL for specifying TBB Flow Graph structure with the help of Boost.Proto. Instead of writing a series of tbb::flow::make_edge calls, we can use a little language to help us be more declarative. In the process, however, we lost some of the runtime efficiency. Instead of just invoking make_edge() a number of times, our solution uses loops and temporary data structures (e.g. std::set) which introduce runtime overhead. In this post, we will look at how to modify what we did last time to create a meta-program that will enable us to remove the unnecessary runtime overhead.

Boost.Fusion

Fusion library sits somewhere between Boost.MPL and STL. Like MPL, it provides heterogeneous data structures but unlike MPL it is able to operate at runtime. It also is a great tool for certain meta-programming tasks. For example, suppose we wanted to generate code that would invoke foo() multiple times, each with a different value (and maybe type). With Fusion, it is as easy as putting those values into a fusion::vector and then calling fusion::for_each() with the vector and a polymorphic functor:
void foo(int);
void foo(double);

struct call_foo {
    template <typename T>
    void operator()(T x) const {
        foo(x);
    }
};

int main() {
    boost::fusion::vector<int, double, int> v(1, 2.3, 4);
    boost::fusion::for_each(v, call_foo());
    return 0;
}

Once the program is compiled with optimizations enabled, all that remains are just foo() invocations (I ran the output generated via -s flag through c++filt to demangle the names):
main:                          # @main
    .cfi_startproc
# BB#0:
    pushq %rax
.Ltmp1:
    .cfi_def_cfa_offset 16
    movl $1, %edi
    callq foo(int)
    movsd .LCPI0_0(%rip), %xmm0
    callq foo(double)
    movl $4, %edi
    callq foo(int)
    xorl %eax, %eax
    popq %rdx
    ret

Since what we are trying to do is generate a bunch of make_edge() calls, this seems like exactly what we need.

Putting it all together

The basic idea is very simple. Replace the use of std::set with boost::fusion::vector when keeping track of senders and receivers in a compound node. We begin by re-defining compound_node to bundle up two fusion vectors. I replaced the use of std::tuple with a struct to make the code more readable and also added a utility function to make the construction easier:
template <typename Senders, typename Receivers>
struct compound_node {
    typedef Senders senders_t;
    typedef Receivers receivers_t;

    senders_t senders;
    receivers_t receivers;
};

template <typename S, typename R>
compound_node<S, R> mk_compound_node(S s, R r) {
    return compound_node<S, R>{s, r};
}

Now it's time to modify the make_compound_node transform that is invoked by Proto everytime it meets a terminal:
struct make_compound_node : proto::callable {
    typedef compound_node<
        fusion::vector<flow::sender<flow::continue_msg> *>,
        fusion::vector<flow::receiver<flow::continue_msg> *>
    > result_type;

    result_type operator()(flow::sender<flow::continue_msg>& s, flow::receiver<flow::continue_msg>& r) {
        return mk_compound_node(
            fusion::vector<flow::sender<flow::continue_msg> *>(&s),
            fusion::vector<flow::receiver<flow::continue_msg> *>(&r)
        );
    }
};

Next up is the join transform that is applied to process the + operator. Recall that all it has to do is create a new compound node by joining the senders and receivers of the left and right hand sides. To join two Fusion vectors, we use fusion::join function. Unfortunately, it returns not a new vector but a view which contains references to the original sequences it was asked to join. Because we pass everything by value and have tons of temporaries, this is a recipe for disaster. So we force Fusion to create a new vector by passing the result of join to fusion::as_vector. The messy part turns out to be the return type. Remember, Proto transforms have to comply with (TR1/C++11) result_of protocol. While previously we got away by using the simple method of typedef'ing a result_type, this time our return type depends on the types of arguments. We have to define a meta-function by specializing the result<Sig> struct. Luckily, Fusion library also supports a uniform way of computing result types of its functions (albeit slightly different from the standard approach). All of its functions have a meta-function with the same name in the fusion::result_of namespace.
struct join : proto::callable {
    template <typename Sig>
    struct result;

    template <typename This, typename LeftNode, typename RightNode>
    struct result<This(LeftNode, RightNode)> {
        typedef typename LeftNode::senders_t left_senders_t;
        typedef typename LeftNode::receivers_t left_receivers_t;
        typedef typename RightNode::senders_t right_senders_t;
        typedef typename RightNode::receivers_t right_receivers_t;

        typedef compound_node<
            typename fusion::result_of::as_vector<
                typename fusion::result_of::join<left_senders_t const, right_senders_t const>::type
            >::type,
            typename fusion::result_of::as_vector<
                typename fusion::result_of::join<left_receivers_t const, right_receivers_t const>::type
            >::type
        > type;
    };

    template <typename LeftNode, typename RightNode>
    typename result<join(LeftNode, RightNode)>::type operator()(LeftNode left, RightNode right) const {
        return mk_compound_node(
                fusion::as_vector(fusion::join(left.senders, right.senders)),
                fusion::as_vector(fusion::join(left.receivers, right.receivers))
            );
    };
};

Finally we get to the meat of the problem -- implementing splice transform where actual make_edge calls are made. Just like in the join transform above, we need to adhere to result_of protocol using struct specialization. We also need to execute the double for loop: the outer loop iterates over the left-hand's senders and the inner one over the right-hand receivers. But of course we'll use fusion::for_each() twice to do the looping. We are lucky that the fusion::vectors contain homogeneous types -- they are all either flow::sender<continue_msg> or flow::receiver<continue_msg>. This makes it possible to use C++11 lambdas and keep everything in one place.
struct splice : proto::callable {
    template <typename Sig>
    struct result;

    template <typename This, typename LeftNode, typename RightNode>
    struct result<This(LeftNode, RightNode)> {
        typedef compound_node<
            typename RightNode::senders_t,
            typename LeftNode::receivers_t
        > type;
    };

    template <typename LeftNode, typename RightNode>
    typename result<splice(LeftNode, RightNode)>::type operator()(LeftNode left, RightNode right) const {
        fusion::for_each(left.senders, [&](flow::sender<flow::continue_msg>* s) {
            fusion::for_each(right.receivers, [=](flow::receiver<flow::continue_msg>* r) {
                flow::make_edge(*s, *r);
            });
        });

        return mk_compound_node(right.senders, left.receivers);
    }
};

Note that no changes were necessary to the grammar nor, of course, main(). To verify that we reached our goal, let's check out the disassembly. Since flow::make_edge is inline function, I temporarily declared a make_edge() in the global namespace with the matching signature and used that instead. And voilĂ :
    leaq    560(%rsp), %rsi
    leaq    344(%rsp), %rdi
.LEHB53:
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    816(%rsp), %rsi
    leaq    600(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    1072(%rsp), %rsi
    leaq    88(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    560(%rsp), %rsi
    leaq    88(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    48(%rsp), %rsi
    leaq    1336(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)
    leaq    304(%rsp), %rsi
    leaq    1336(%rsp), %rdi
    call    make_edge(tbb::flow::interface6::sender<tbb::flow::interface6::continue_msg>&, tbb::flow::interface6::receiver<tbb::flow::interface6::continue_msg>&)