Effective post-training quantization of neural networks for inference on a low-power neural accelerator
Deploying neural networks to a target environment is a challenging task, largely because of the heavy burden that DNN models place on computational capabilities and power consumption. For low-power edge devices, such as the GNA neural co-processor, quantization becomes the only way to make deployment possible. This paper draws attention to post-training quantization for low-power devices and demonstrates that this approach is effective in practice. We propose a novel quantization algorithm capable of reducing DNN precision to 16-bit or 8-bit integers with a negligible drop in accuracy (less than 0.1 percent). The elaborated approach is demonstrated on a set of speech recognition networks trained in the Kaldi framework, with the OpenVINO framework as an inference backend that supports quantization and GNA as the target. The influence of quantization on the original topologies was rigorously measured and analyzed.
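For intuition only, the sketch below illustrates generic symmetric per-tensor linear quantization of a float32 weight tensor to int8, followed by dequantization to estimate the introduced error. The scale computation and function names are illustrative assumptions and do not represent the paper's actual algorithm or the OpenVINO/GNA implementation.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor linear quantization to int8 (illustrative only)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # map [-max_abs, max_abs] -> [-127, 127]
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor for accuracy comparison."""
    return q.astype(np.float32) * scale

# Example: measure the quantization error on a random weight matrix.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
```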