Introduce tree batches: TCs that share the same tree are now computed together, e.g. [(1,0,0), (1,0,1), (1,), (2,)] is computed as the batches [(1,0,0), (1,0,1)], [(1,)], [(2,)]. This leads to a significant performance boost.
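A minimal sketch of the grouping, not the code from this change. The helper name `tree_batches` and the grouping key are assumptions: indices of the same order are batched when they agree on everything except the last entry, and each first-order index forms its own batch. That rule reproduces the example above, but the actual rule may differ.

```python
from collections import defaultdict


def tree_batches(indices):
    """Group derivative index tuples into batches (hypothetical helper).

    ASSUMPTION: tuples of the same length that share all entries except
    the last one belong to the same tree; first-order tuples stay alone.
    """
    batches = defaultdict(list)
    for idx in indices:
        prefix = idx[:-1] if len(idx) > 1 else idx
        # Include the order in the key so (1,) and (1, 0) never collide.
        batches[(len(idx), prefix)].append(idx)
    # dicts keep insertion order, so batches appear in first-seen order.
    return list(batches.values())


tcs = [(1, 0, 0), (1, 0, 1), (1,), (2,)]
batches = tree_batches(tcs)
print(batches)  # [[(1, 0, 0), (1, 0, 1)], [(1,)], [(2,)]]
```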
Fix thread safety: the code was not thread safe (core dumps when using >1 threads). pred is now computed outside the multithreaded part. This requires retain_graph=True in the grad() calls.
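The pattern can be sketched as follows, assuming PyTorch and a thread pool. The toy model, the input `x` and the helper `first_derivative` are hypothetical and only illustrate the idea: run the forward pass once in the main thread, then let each worker call `torch.autograd.grad` on the shared graph with `retain_graph=True` so that one worker does not free the graph another still needs.

```python
from concurrent.futures import ThreadPoolExecutor

import torch

# Hypothetical toy model and input, for illustration only.
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(3, 4), torch.nn.Tanh(), torch.nn.Linear(4, 2)
)
x = torch.randn(5, 3, requires_grad=True)

# Forward pass once, in the main thread, before any worker starts.
pred = model(x)


def first_derivative(node):
    # retain_graph=True keeps the shared graph alive for the other workers.
    (grad,) = torch.autograd.grad(pred[:, node].sum(), x, retain_graph=True)
    return grad


# One task per output node; all tasks differentiate the same pred.
with ThreadPoolExecutor(max_workers=2) as pool:
    grads = list(pool.map(first_derivative, range(pred.shape[1])))
```

Without `retain_graph=True`, the first `grad()` call to finish would free the graph buffers and the remaining calls would fail.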
Examples:
Drop the multiprocessing example. Since we use multithreading, there are no restrictions and no explanation is needed.
DNN: switch to a binary example with two output nodes, add multithreading.
DNN & GNN: more interpretable results for training.