Distributed Classification of Text Streams: Limitations, Challenges, and Solutions
Text stream classification is an important problem that is difficult to solve at scale. Batch processing systems, widely adopted for text classification tasks, cannot provide for low latency. Distributed stream processing systems can offer low latency, but do not support the same level of fault tolerance and determinism as the batch systems. In this work, we demonstrate how the distributed stream processing features can affect the results of a typical text classification data flow. Our analysis shows emerged trade-offs between fault tolerance and reproducibility on the one side, and performance on the other side. We outline potential ways to solve the revealed issues and to handle streaming features.