Big Data in basic research: from elementary particles to black holes

Javier Coronado Blazquez    23 May, 2022
The Event Horizon Telescope (EHT) collaboration, which produced the first ever image of a black hole, released in 2019, has today a new view of the massive object at the centre of the Messier 87 (M87) galaxy: how it looks in polarised light. This is the first time astronomers have been able to measure polarisation, a signature of magnetic fields, this close to the edge of a black hole. This image shows the polarised view of the black hole in M87. The lines mark the orientation of polarisation, which is related to the magnetic field around the shadow of the black hole. Image: EHT Collaboration

The Big Data paradigm has penetrated every layer of our society, changing the way we interact with each other and the way technological projects are carried out. Basic research, and physics in particular, has not been immune to this change over the last two decades, and has adapted to bring this new model into the exploitation of data from leading experiments. Here we look at the impact of Big Data on three major milestones in modern physics.

Large Hadron Collider: the precursor of Big Data

One of the buzzwords of 2012 was the “Higgs boson”, that mysterious particle that we were told was responsible for the mass of all other known particles (more or less) and that had been discovered that same year. But in terms of media hype, the focus was on the instrument that enabled the discovery, the Large Hadron Collider, or LHC, at the European Organization for Nuclear Research (CERN).

The LHC is a particle accelerator and is probably the most complex machine ever built by humans, costing some €7.5 billion. A 27 km long ring buried at an average depth of 100 metres under the border between Switzerland and France, it uses superconducting electromagnets to accelerate protons to 99.9999991% of the speed of light (i.e., in one second they go around the ring more than 11,000 times). By colliding protons at these delirious speeds, we can create new particles and study their properties. One such particle was the Higgs boson.
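The "more than 11,000 laps per second" figure is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the rounded 27 km ring length and the speed fraction quoted above, and the function name is made up for illustration.

```python
# Rough check of the "more than 11,000 laps per second" figure.
SPEED_OF_LIGHT_KM_S = 299_792.458  # exact, by definition of the metre
RING_LENGTH_KM = 27.0              # rounded circumference quoted in the text
BETA = 0.999999991                 # proton speed as a fraction of c (from the text)


def laps_per_second(ring_km: float = RING_LENGTH_KM, beta: float = BETA) -> float:
    """Number of times a proton circles the ring in one second."""
    return beta * SPEED_OF_LIGHT_KM_S / ring_km


if __name__ == "__main__":
    print(f"{laps_per_second():,.0f} laps per second")
```

With these rounded inputs the result comes out at a little over 11,100 laps per second, in line with the figure in the text.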

To make sure that the protons, which are subatomic particles, actually collide with each other, they are not launched one by one but in large bunches, resulting in about 1 billion collisions per second. Each of these collisions is recorded as a single event. A single collision can produce thousands of individual particles, which the detectors characterise in real time (well below a millisecond), collecting information such as trajectory, energy and momentum.

Massive amounts of data

As we can imagine, this produces an enormous amount of data: some 50,000-70,000 TB of raw data per year. And that’s just from the main detectors, as there are other secondary experiments at the LHC. Because it doesn’t operate every day of the year, this works out to an average of 200-300 TB per operating day; a complicated, but feasible, volume to handle today. The problem is that the LHC came into operation in 2008, when Big Data was a very new concept, so a lot of technology had to be developed ad hoc. It was not the first time CERN had done so: the World Wide Web itself was born there.
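To see how the yearly and the 200-300 TB figures fit together, we can work out how many operating days they imply. The sketch below reads the 200-300 TB figure as a daily average, which is an assumption, and the function name is purely illustrative.

```python
# How many operating days per year do the quoted volumes imply?
# Assumption: the 200-300 TB figure is an average per operating day.
YEARLY_TB_RANGE = (50_000, 70_000)  # raw data per year, from the text
DAILY_TB_RANGE = (200, 300)         # average per operating day (assumed reading)


def implied_operating_days(yearly_tb: float, daily_tb: float) -> float:
    """Days of operation needed to reach yearly_tb at daily_tb per day."""
    return yearly_tb / daily_tb


if __name__ == "__main__":
    mid_year = sum(YEARLY_TB_RANGE) / 2  # midpoint of the yearly range
    mid_day = sum(DAILY_TB_RANGE) / 2    # midpoint of the daily range
    days = implied_operating_days(mid_year, mid_day)
    print(f"About {days:.0f} operating days per year at the midpoints")
```

At the midpoints of both ranges this gives about 240 operating days per year, which is consistent with a machine that does not run every day.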

The Worldwide LHC Computing Grid (WLCG), a network of 170 computing centres in 42 countries, was established in 2003. It offers a total of 250,000 cores, allowing more than 1 billion hours of computing per year.
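To put the "1 billion hours" figure in context, we can compare it with what 250,000 cores could deliver if they ran around the clock. This sketch assumes the figure refers to core-hours; the helper names are hypothetical.

```python
# Compare the quoted computing hours with the theoretical maximum.
# Assumption: "hours of computing" means core-hours.
CORES = 250_000                # available cores, from the text
HOURS_PER_YEAR = 365 * 24      # a non-leap year
QUOTED_CORE_HOURS = 1_000_000_000


def max_core_hours(cores: int) -> int:
    """Core-hours available if every core ran all year."""
    return cores * HOURS_PER_YEAR


def utilisation(used_core_hours: float, cores: int) -> float:
    """Fraction of the theoretical maximum that was actually used."""
    return used_core_hours / max_core_hours(cores)


if __name__ == "__main__":
    print(f"Maximum: {max_core_hours(CORES):,} core-hours per year")
    print(f"Quoted figure is {utilisation(QUOTED_CORE_HOURS, CORES):.0%} of that")
```

Under that reading, 1 billion hours is a bit under half of the roughly 2.2 billion core-hours theoretically available, so "more than 1 billion" is a plausible lower bound.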

Depending on its technical characteristics, each node in this network can be dedicated to data storage, processing or analysis. To ensure good coordination between them, a three-tier hierarchical system was chosen: Tier 0 at CERN, Tier 1 at several regional sites, and Tier 2 at centres with very good connectivity between them.

Spain hosts several of these computing centres, both Tier 1 and Tier 2, located in Barcelona, Cantabria, Madrid, Santiago de Compostela and Valencia. One of the things this large volume of data has fostered is the application of machine learning and artificial intelligence algorithms to search for physics beyond what is known, but that is a story for another day…

LHC control room / Brice, Maximilien, CERN

James Webb Space Telescope: the present and future of astrophysics

The LHC explores the basic building blocks of our Universe: the elementary particles. Now we are going to travel to the opposite extreme, studying stars and entire galaxies. Except for the remarkable advances in neutrino and gravitational-wave astronomy in recent years, if we want to observe the Universe, we will do so with a telescope.

Due to the Earth’s rotation, a “traditional” telescope will only be able to observe at night. In addition, atmospheric turbulence degrades image quality, which matters when we are trying to resolve very small or faint sources. Wouldn’t it be wonderful to have a telescope in space, where these factors disappear?

That was the idea behind the Hubble Space Telescope, which NASA launched in 1990 and which has produced (and continues to produce) the most spectacular images of the cosmos. A couple of decades ago NASA considered what the next step should be and began designing its successor, the James Webb Space Telescope (JWST), launched on 25 December 2021 and currently undergoing calibration.

JWST incorporates a large number of technical innovations and patents, and it was decided to place it at the L2 Lagrange point, about 4 times further away from us than the Moon. At such a distance, it is completely unfeasible to send a crewed mission to make repairs, as was done with Hubble, which orbits “only” 559 km above the Earth’s surface.

NASA’s James Webb Telescope main mirror. Image Credit: NASA/MSFC/David Higginbotham

One of the biggest design challenges was data storage and transmission. JWST carries shields to thermally insulate the telescope, but it sits far outside the Earth’s magnetosphere, so the drive that records the data must be heavily protected against solar radiation and cosmic rays, since it has to operate continuously for at least 10 years. It is a solid-state drive (SSD), chosen to ensure transmission speed.

This compromises the capacity of the drive, which is a modest 60 GB. Given the large volume of data collected during observations, this capacity can be reached after about 3 hours of measurements.

JWST is expected to perform two data downloads per day, at a transmission rate of about 30 Mbit/s, in addition to receiving pointing instructions and sending sensor readings from its various components.
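The storage and downlink figures above can be combined in a short calculation. This is a rough sketch using only the rounded numbers quoted in the text, with decimal units (1 GB = 8,000 Mbit) as an assumption; the function names are illustrative and it ignores protocol overhead and contact scheduling.

```python
# Back-of-the-envelope look at JWST's on-board storage and downlink.
# Assumption: decimal units, i.e. 1 GB = 8,000 Mbit.
RECORDER_CAPACITY_GB = 60  # on-board storage, from the text
FILL_TIME_HOURS = 3        # time to fill it while observing, from the text
DOWNLINK_MBIT_S = 30       # transmission rate, from the text


def fill_rate_mbit_s(capacity_gb: float, hours: float) -> float:
    """Average data rate that fills the recorder in the given time."""
    return capacity_gb * 8_000 / (hours * 3_600)


def downlink_hours(capacity_gb: float, rate_mbit_s: float) -> float:
    """Hours of contact needed to send a full recorder to the ground."""
    return capacity_gb * 8_000 / rate_mbit_s / 3_600


if __name__ == "__main__":
    fill = fill_rate_mbit_s(RECORDER_CAPACITY_GB, FILL_TIME_HOURS)
    drain = downlink_hours(RECORDER_CAPACITY_GB, DOWNLINK_MBIT_S)
    print(f"Peak fill rate: {fill:.1f} Mbit/s")
    print(f"Time to downlink a full recorder: {drain:.1f} h")
```

With the quoted figures, the recorder can fill at about 44 Mbit/s, faster than the 30 Mbit/s link can drain it, and emptying a full recorder takes roughly four and a half hours of contact time.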

Compared to the LHC’s figures this may seem insignificant, but we must not forget that JWST orbits 1.5 million kilometres from Earth, in a tremendously hostile environment, with temperatures of about 30°C on the Sun-facing side and -220°C on the shadow side. It is an unparalleled technical prodigy that will produce more than 20 TB of raw data per year and keep the astrophysical community busy for years to come, with robust and sophisticated machine learning algorithms already in place to exploit all this data.
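To make the comparison with the LHC concrete, the sketch below converts JWST's yearly volume into a daily average and asks how long the LHC would need to produce the same amount. It assumes decimal units and takes 250 TB per day as an illustrative midpoint of the LHC's 200-300 TB figure, read as a daily average.

```python
# Compare JWST's yearly data volume with the LHC's output.
# Assumptions: 1 TB = 1,000 GB; the LHC figure of 250 TB/day is an
# illustrative midpoint of the 200-300 TB range, read as a daily average.
JWST_TB_PER_YEAR = 20
LHC_TB_PER_DAY = 250


def daily_average_gb(tb_per_year: float, days: int = 365) -> float:
    """Average volume per day, in GB."""
    return tb_per_year * 1_000 / days


def lhc_hours_to_match(jwst_tb_per_year: float, lhc_tb_per_day: float) -> float:
    """Hours of LHC running that equal one year of JWST data."""
    return jwst_tb_per_year / lhc_tb_per_day * 24


if __name__ == "__main__":
    print(f"JWST daily average: {daily_average_gb(JWST_TB_PER_YEAR):.1f} GB")
    hours = lhc_hours_to_match(JWST_TB_PER_YEAR, LHC_TB_PER_DAY)
    print(f"The LHC produces the same volume in about {hours:.1f} h")
```

Under these assumptions JWST averages around 55 GB per day, and the LHC generates a full JWST year of data in roughly two hours of operation.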

Event Horizon Telescope: old-school Big Data

Both the LHC and JWST are characterised by fast and efficient data transmission for processing. However, it is not always so easy to get “five WiFi bars”. How many times have we been frustrated when a YouTube video freezes to buffer because of a poor connection? Now imagine that, instead of a simple video, we need to download 5 PB of data.

This is the problem encountered by the Event Horizon Telescope (EHT), which in 2019 published the first picture of a black hole. This instrument is actually a network of seven radio telescopes around the world (one of them in Spain), which joined forces to observe the supermassive black hole at the centre of the galaxy M87 simultaneously for 4 days in 2017. Over the course of the observations, each telescope generated about 700 TB of data, resulting in a total of about 5 PB scattered over three continents. The challenge was to bring all this information together in one place for analysis, and it was decided to centralise it in Germany.

In contrast to the LHC, the infrastructure for data transfer at this scale did not exist, nor was it worth developing for a one-off use case. It was therefore decided to physically transport the hard disks by air, sea and land. In fact, one of the radio telescopes was located in Antarctica, and the team had to wait until the summer for the partial thaw to allow physical access to its hard disks.
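A quick estimate shows why shipping disks was the pragmatic choice. The sketch below takes the telescope count and per-telescope volume from the text, but the link speeds are hypothetical values picked for illustration, and it assumes a single sustained link into the analysis centre with no overhead.

```python
# How long would it take to move the EHT data over a network?
# Assumptions: decimal units (1 PB = 1e15 bytes), one sustained link
# into the analysis centre, no protocol overhead. Link speeds are
# hypothetical, not figures from the EHT collaboration.
TELESCOPES = 7          # from the text
TB_PER_TELESCOPE = 700  # from the text


def total_petabytes(n_telescopes: int, tb_each: float) -> float:
    """Total data volume across all telescopes, in PB."""
    return n_telescopes * tb_each / 1_000


def transfer_days(petabytes: float, link_gbit_s: float) -> float:
    """Days needed to send the data over a link of the given speed."""
    bits = petabytes * 1e15 * 8
    return bits / (link_gbit_s * 1e9) / 86_400


if __name__ == "__main__":
    total = total_petabytes(TELESCOPES, TB_PER_TELESCOPE)
    print(f"Total volume: {total:.1f} PB")
    for speed in (0.1, 1.0, 10.0):  # hypothetical link speeds in Gbit/s
        print(f"{speed:>5} Gbit/s -> {transfer_days(total, speed):,.0f} days")
```

Even on a sustained 1 Gbit/s link, the roughly 4.9 PB would take well over a year to arrive, while a pallet of hard disks can cross the world in days.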

Researcher Katie Bouman (MIT), who led the development of the algorithm to obtain the black hole photo with the EHT, proudly poses with the project’s hard disks.

In total, half a tonne of storage media was transported, processed and analysed to generate the familiar sub-1 MB image. Explaining the technique required to achieve this would take several individual posts.

The lesson here is that sometimes it is more important to be pragmatic than hyper-technological. Although our world has changed radically in so many ways thanks to Big Data, sometimes it is worth giving our project a vintage touch and imitating those observatories of a century ago, which transported huge photographic plates from telescopes to universities to be properly studied and analysed.

