© 2015 The Authors Published by Oxford University Press on behalf of the Royal Astronomical Society. We present analyses of data augmentation for machine learning redshift estimation. Data augmentation makes a training sample more closely resemble a test sample, if the two base samples differ, in order to improve measured statistics of the test sample. We perform two sets of analyses by selecting 800 000 (1.7 million) Sloan Digital Sky Survey Data Release 8 (Data Release 10) galaxies with spectroscopic redshifts. We construct a base training set by imposing an artificial r-band apparent magnitude cut to select only bright galaxies and then augment this base training set by using simulations and by applying the k-correct package to artificially place training set galaxies at a higher redshift. We obtain redshift estimates for the remaining faint galaxy sample, which are not used during training. We find that data augmentation reduces the error on the recovered redshifts by 40 per cent in both sets of analyses, when compared to the difference in error between the ideal case and the non-augmented case. The outlier fraction is also reduced by at least 10 per cent and up to 80 per cent using data augmentation. We finally quantify how the recovered redshifts degrade as one probes to deeper magnitudes past the artificial magnitude limit of the bright training sample. We find that at all apparent magnitudes explored, the use of data augmentation with tree-based methods provide an estimate of the galaxy redshift with a low value of bias, although the error on the recovered redshifts increases as we probe to deeper magnitudes. These results have applications for surveys which have a spectroscopic training set which forms a biased sample of all photometric galaxies, for example if the spectroscopic detection magnitude limit is shallower than the photometric limit.
- Galaxies: distances and redshifts