TreeDQN: Sample-efficient off-policy reinforcement learning for combinatorial optimization
A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method.
Its branching heuristic can be learned in order to solve a large set of similar tasks. Promising results have been
achieved by a recently proposed on-policy reinforcement learning (RL) method based on the tree Markov Decision
Process (tree MDP). To overcome its main disadvantages, namely very long training time and unstable training, we
propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the
geometric mean of expected return. To theoretically support the training procedure for our method, we prove
the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to
10 times less training data and performs faster than known on-policy methods on synthetic tasks. Moreover,
TreeDQN significantly outperforms the state-of-the-art techniques on a challenging practical task from the
ML4CO competition.
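
To make the phrase "optimizing the geometric mean of expected return" concrete, the following minimal sketch illustrates one general fact: a constant predictor fitted with squared error in log space converges to the geometric mean of its targets, whereas plain squared error converges to the arithmetic mean. This is an illustration under stated assumptions, not the training code of TreeDQN. The function names and the sample tree sizes are hypothetical, and the use of positive tree sizes as targets is an assumption made only for the example.

```python
import math
import statistics


def log_space_minimizer(targets):
    """Constant c minimizing sum((log c - log y)^2) over positive targets y.

    Setting the derivative to zero gives log c = mean(log y),
    so c is the geometric mean of the targets.
    """
    return math.exp(sum(math.log(y) for y in targets) / len(targets))


def squared_error_minimizer(targets):
    """Constant c minimizing sum((c - y)^2): the arithmetic mean."""
    return sum(targets) / len(targets)


# Hypothetical Branch-and-Bound tree sizes: three easy instances, one hard one.
tree_sizes = [10, 10, 10, 1000]

geo = log_space_minimizer(tree_sizes)
ari = squared_error_minimizer(tree_sizes)

# The arithmetic mean is dominated by the single large tree,
# while the geometric mean stays closer to the typical instance.
print(f"geometric mean:  {geo:.2f}")
print(f"arithmetic mean: {ari:.2f}")
```

On heavy-tailed targets such as these, the log-space fit is much less sensitive to a single very large tree, which is one plausible reason to prefer a geometric-mean objective when tree sizes vary over orders of magnitude.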