Photometric selection of quasars in large astronomical data sets with a fast and accurate machine learning algorithm

Abstract

Future astronomical surveys will produce data on ~10^8 objects per night. In order to characterize and classify these sources, we will require algorithms that scale linearly with the size of the data, that can be easily parallelized and where the speedup of the parallel algorithm will be linear in the number of processing cores. In this paper, we present such an algorithm and apply it to the question of colour selection of quasars. We use non-parametric Bayesian classification and a binning algorithm implemented with hash tables (BASH tables). We show that this algorithm’s run time scales linearly with the number of test set objects and is independent of the number of training set objects. We also show that it has the same classification accuracy as other algorithms. For current data set sizes, it is up to three orders of magnitude faster than commonly used naive kernel-density-estimation techniques and it is estimated to be about eight times faster than the current fastest algorithm using dual kd-trees for kernel density estimation. The BASH table algorithm scales linearly with the size of the test set data only, and so for future larger data sets, it will be even faster compared to other algorithms which all depend on the size of the test set and the size of the training set. Since it uses linear data structures, it is easier to parallelize compared to tree-based algorithms and its speedup is linear in the number of cores unlike tree-based algorithms whose speedup plateaus after a certain number of cores. Moreover, due to the use of hash tables to implement the binning, the memory usage is very small. While our analysis is for the specific problem of selection of quasars, the ideas are general and the BASH table algorithm can be applied to any density-estimation problem involving sparse high-dimensional data sets.
Since sparse high-dimensional data sets are a common type of scientific data set, this method has the potential to be useful in a broad range of machine-learning applications in astrophysics.
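
The general idea of binned Bayesian classification with hash tables can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`bin_key`, `build_table`, `classify`), the fixed cubic bin width, and the rule of returning `None` for bins with no training data are all assumptions made for illustration. Each class gets a hash table that maps occupied bins to training counts, so classifying a test object costs one lookup per class regardless of how many training objects there are.

```python
import math
from collections import Counter


def bin_key(point, bin_width):
    """Map a point to the integer index tuple of the bin that contains it."""
    return tuple(math.floor(x / bin_width) for x in point)


def build_table(points, bin_width):
    """Count training points per occupied bin; empty bins are never stored."""
    return Counter(bin_key(p, bin_width) for p in points)


def classify(point, tables, totals, priors, bin_width):
    """Return the class with the largest prior * binned density at `point`.

    All classes share the same bin volume, so it cancels in the comparison
    and the per-class density is taken as (count in bin) / (class total).
    """
    key = bin_key(point, bin_width)
    best_label, best_score = None, 0.0
    for label, table in tables.items():
        # One hash lookup per class: independent of training-set size.
        density = table.get(key, 0) / totals[label]
        score = priors[label] * density
        if score > best_score:
            best_label, best_score = label, score
    # None means no class has any training points in this bin (assumed rule).
    return best_label
```

Because only occupied bins are stored, memory grows with the number of non-empty bins rather than with the full grid, which is what makes the approach workable for sparse high-dimensional data. Test objects are classified independently of one another, so the loop over the test set can be split across cores without any shared state.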

Authors

Gupta, Pramod; Connolly, Andrew J.; Gardner, Jeffrey P.

Journal

Monthly Notices of the Royal Astronomical Society

Paper Type

Astrostatistics