Optimization of Direct Convolution Algorithms on ARM Processors for Deep Learning Inference
In deep learning, convolutional layers typically bear the majority of the computational workload and are often the primary contributors to performance bottlenecks. The widely used convolution algorithm is based on the IM2COL transform to take advantage of the highly optimized GEMM (General Matrix Multiplication) kernel accel