Massive cosmological data generation and distribution

Student thesis: Doctoral thesis

Abstract

In recent decades, physicists and astronomers have significantly transformed their methodology for investigating the universe's content and evolution. Advanced computing techniques have emerged as indispensable tools to manage the substantial data amassed by contemporary automated telescopes and highly sensitive instruments. Extracting scientific insights from this vast pool of information requires interdisciplinary collaboration among mechanical and electronic engineers, physicists, astronomers, computer scientists, and software engineers. This PhD thesis explores the interface of Computer Science and Cosmology within the Port d'Informació Científica (PIC), a High Throughput Computing (HTC) data center. The work encompasses two core domains: comprehensive data management and the development of complex algorithms for cosmological simulations. In the realm of data management, conventional tools such as relational databases are usually employed. In this work, a pioneering stance is taken towards them, exemplified by their central role in the Physics of the Accelerating Universe Survey (PAUS). The design of a comprehensive data management infrastructure within the tight constraints of PAUS is the first contribution of this thesis. Moreover, given the limitations of relational databases in handling extensive data and evolving usage patterns, this study also delves into alternatives. The challenges in distributing cosmological catalogs within the PAUS collaboration led to the adoption of the Apache Hadoop ecosystem. This investigation culminated in the creation of CosmoHub, an application that leverages Apache Hive (an unprecedented endeavor within astronomy and cosmology) and promotes Open Science principles. Concurrently, in the domain of algorithm development for cosmological simulations, this thesis describes the effort of developing, optimizing and calibrating an algorithm for the simulation of observed galaxy electromagnetic fluxes. This algorithm, integrated into a much larger set of Python modules within a Spark-driven pipeline operating on a Hadoop cluster, is crucial to the creation of the most extensive and comprehensive virtual galaxy catalogs, serving the European Space Agency's Euclid project.
Date of Award: 12 Apr 2024
Original language: English
Supervisor: Nadia Tonello (Director), Jorge Carretero Palacios (Director) & Eduardo Cesar Galobardes (Director)
