Hello Major TOM: ESA Φ-lab releases largest ML-ready Sentinel-2 dataset ever published
 
ESA Φ-lab has launched Major TOM (Terrestrial Observation Metaset), a community-oriented project that allows researchers to share, use and combine large Earth observation (EO) datasets. The Major TOM framework will help unlock the huge potential of satellite imagery by offering users the largest ever quality-controlled and globally distributed sample of data, with future expansions to multiple satellites and modalities planned from both Φ-lab and the wider community.
Recent years have seen a marked trend towards larger, more general EO and geospatial models, known as foundation models, which require massive volumes of high-quality training data. These large models present unique opportunities in that they have the potential to help solve many pressing scientific and societal problems.
But there are also challenges, including the risk of deepening the reproducibility crisis seen in AI research, whereby published models are often difficult to recreate due to closed data sources and opaque technical details. Bias is another issue, since all models are skewed by the data they learn from, and this may lead to biases being embedded into the systems that foundation models form part of.
ESA Φ-lab believes that these issues can be alleviated through the creation of high-quality, globally distributed and collaborative ML-ready datasets, and has begun to integrate them under the moniker Major TOM. Such datasets can steer the development of large models in a positive direction, democratising them and helping to build systems that are more reproducible and, by virtue of the datasets’ global sampling, less biased. To achieve this, Φ-lab has partnered with Hugging Face to host and freely distribute Major TOM on the Hugging Face Hub. With its open and community-driven platform for datasets and models, Hugging Face is a leading light for the democratisation of machine learning technology.
The creation of such a large dataset presented the team with several technical hurdles. “Satellite data is often held and delivered in very large products – over 100 km across – which many people find difficult to work with for machine learning applications, especially when trying to combine different satellites whose products overlap to differing extents,” explains Φ-lab research fellow Alistair Francis. “By contrast, Major TOM uses a fixed, 10 km grid across the entire globe, meaning that data from one Major TOM dataset will fit neatly on top of another.”
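To illustrate why a fixed grid makes datasets stackable, the following is a minimal sketch of a global grid with roughly 10 km cells. It is not the Major TOM grid definition: the spherical Earth radius, the row and column numbering, and the function names `lat_step_deg`, `columns_in_row` and `grid_cell` are all assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation (assumption)
CELL_KM = 10.0            # nominal cell size quoted in the article


def lat_step_deg(cell_km: float = CELL_KM) -> float:
    """Latitude span, in degrees, of one row of cells."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    return cell_km / km_per_degree


def columns_in_row(row: int, cell_km: float = CELL_KM) -> int:
    """Number of cells in a row, chosen so each is roughly cell_km wide."""
    centre_lat = -90.0 + (row + 0.5) * lat_step_deg(cell_km)
    # Circumference of the parallel passing through the row centre.
    cos_lat = max(0.0, math.cos(math.radians(centre_lat)))
    circumference = 2.0 * math.pi * EARTH_RADIUS_KM * cos_lat
    return max(1, round(circumference / cell_km))


def grid_cell(lat: float, lon: float, cell_km: float = CELL_KM) -> tuple[int, int]:
    """Map a latitude/longitude in degrees to a (row, col) cell index."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")
    step = lat_step_deg(cell_km)
    n_rows = math.ceil(180.0 / step)
    row = min(int((lat + 90.0) / step), n_rows - 1)
    n_cols = columns_in_row(row, cell_km)
    # Wrap longitude into [0, 360) before dividing it into columns.
    col = min(int(((lon + 180.0) % 360.0) / (360.0 / n_cols)), n_cols - 1)
    return row, col


# Two hypothetical datasets indexed this way use the same cell keys,
# so samples covering the same place can be matched by key.
optical = {grid_cell(41.90, 12.50): "optical sample"}
radar = {grid_cell(41.91, 12.51): "radar sample"}
shared = optical.keys() & radar.keys()
```

Because the cell index depends only on location and not on any one satellite's product footprint, samples from different sources that fall in the same cell can be joined with a simple key lookup.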
Whilst the sheer volume of data processing involved was a challenge, ensuring the quality of the data was equally difficult. For example, optical satellite imagery often contains clouds that hide the surface below. Although not eliminated from the Major TOM dataset entirely, cloudy imagery was minimised by using Φ-lab’s state-of-the-art AI cloud mask, soon to be released publicly.

Major TOM’s inaugural core dataset has now been released on Hugging Face. It constitutes the largest ML-ready collection of Copernicus Sentinel-2 images ever published. Covering over 50% of the Earth’s surface (including almost all dry land) with nearly 50 TB of data and 2.5 trillion pixels, Major TOM Core is a game-changer for those seeking to train large models with satellite data. It is expected that future expansions from the broader EO community, spearheaded by ESA Φ-lab, will spawn a diverse ecosystem of combinable datasets that will be invaluable in creating the next generation of large deep learning models from satellite data.
Giuseppe Borghi is the Head of ESA Φ-lab: “We want to build an open community of contributors and end users who can create a data landscape that ensures EO derives the largest benefits from the AI revolution. If we want to make sure that EO models are reliable, reproducible, traceable and in turn, trustworthy, then it stands to reason that we need to start with high-quality trustable data.”
An interview on Major TOM with Alistair Francis and fellow Φ-lab researcher Mikolaj Czerkawski can be found here.
To learn more: Φ-lab, Hugging Face, Major TOM paper preprint
Header image contains modified Copernicus Sentinel data (2022), processed by ESA