Fast algorithms for the optimum-path forest-based classifier
Universidad Católica San Pablo
Pattern Recognition applications deal with ever increasing datasets, both in size and complexity. In this work, we propose and analyze efficient algorithms for the Optimum-Path Forest (OPF) supervised classifier. This classifier has proven to provide results comparable to most well-know pattern recognition techniques, but with a much faster training phase. However, there is still room for improvement. The contribution of this work is the introduction of spatial indexing and parallel algorithms on the training and classification phases of the OPF supervised classifier. First, we propose a simple parallelization approach for the training phase. Following the traditional sequential training for the OPF, it maintains a priority queue to compute best samples at each iteration. Later on, we replace this priority queue by an array and a linear search, in the aim of using a more parallel-friendly data structure. We show that this approach leads to more temporal and spatial locality than the former, providing better speedups. Additionally, we show how the use of vectorization on distance calculations affects the overall speedup and we provide directions on when to use it. For the classification phase, we first aim to reduce the number of distance calculations against the classifier samples and, then, we also introduce parallelization. For this purpose, we elaborate a novel theory to index the OPF classifier in a metric space. Then, we use it to build an efficient data structure that allows us to reduce the number of comparison with classifier samples. Finally, we propose its parallelization, leading to a very fast classification for new samples.