Astronomical Image Processing with Hadoop

Wiley, Keith

In the coming decade astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. With a requirement that these images be analyzed in real time to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. In the commercial world, new techniques that utilize cloud computing have been developed to handle massive data streams. In this talk we describe how cloud computing, and in particular the map-reduce paradigm, can be used in astronomical data processing. We will focus on our experience implementing a scalable image-processing pipeline for the SDSS database using Hadoop (http://hadoop.apache.org/). This multi-terabyte imaging dataset approximates future surveys such as those which will be conducted with the LSST. Our pipeline performs image coaddition in which multiple partially overlapping images are registered, integrated and stitched into a single overarching image. We will first present our initial implementation, then describe several critical optimizations that have enabled us to achieve high performance, and finally describe how we are incorporating a large in-house existing image processing library into our Hadoop system. The optimizations involve prefiltering of the input to remove irrelevant images from consideration, grouping individual FITS files into larger, more efficient indexed files, and a hybrid system in which a relational database is used to determine the input images relevant to the task. The incorporation of an existing image processing library, written in C++, presented difficult challenges since Hadoop is programmed primarily in Java. We will describe how we achieved this integration and the sophisticated image processing routines that were made feasible as a result. We will end by briefly describing the longer term goals of our work, namely detection and classification of transient objects and automated object classification.

Return to oral presentation list