Building on the success of the inaugural dataset release of Major TOM (Terrestrial Observation Metaset) last year, ESA Φ-lab has launched the first global embedding dataset for Earth Observation (EO), in collaboration with CloudFerro. These embeddings deliver an efficient representation of vast volumes of satellite data, enabling more precise and scalable analysis.
March 2024 saw the release of Major TOM, a collaborative project designed to enable researchers to share, access, and integrate extensive EO datasets. Major TOM’s inaugural core dataset constitutes the largest machine learning (ML)-based collection of Copernicus Sentinel-2 images to date.
Less than one year later, ESA Φ-lab and CloudFerro revealed Major TOM’s new embedding expansions, which will improve the processing of complex information and drive advancements in ML, natural language processing, and computer vision.
With the massive and continuously increasing volumes of EO data in programmes like Copernicus, efficient vector representations are more necessary than ever. By encoding complex data into high-dimensional vectors, embeddings capture relationships and meaning, transforming natural language, images and other data types into a compact form that can be readily integrated in diverse AI pipelines.
This process enables machines to uncover patterns, similarities and connections with precision and accuracy in a manner agnostic to the downstream task. With embeddings, users can efficiently interpret key features of interest from satellite imagery, sensor data and geographic information systems, simplifying the analysis of spatial relationships and optimising time and resources.
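To make the idea of comparing embeddings concrete, the following is a minimal sketch of similarity search using cosine similarity, one common way to compare vectors. It is not taken from the Major TOM release: the function names, the tile names and the four-dimensional vectors are all made up for illustration, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, index, top_k=2):
    """Rank entries of `index` (name -> vector) by similarity to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors standing in for real image embeddings.
toy_index = {
    "forest_tile": [0.9, 0.1, 0.0, 0.2],
    "urban_tile": [0.1, 0.8, 0.3, 0.0],
    "coast_tile": [0.2, 0.1, 0.9, 0.1],
}

# Hypothetical query vector, e.g. the embedding of a reference image.
query = [0.8, 0.2, 0.1, 0.1]

ranking = most_similar(query, toy_index)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the comparison works on vectors alone, the same routine could in principle serve any downstream task, whichever model produced the embeddings.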
This latest release consists of more than 169 million embeddings, the result of processing over 62 TB of raw data. Major TOM’s Embedding Expansions, now available for free on HuggingFace, include the Sentinel-2 Multispectral SSL4EO Model (Core-S2L1C-SSL4EO), the Sentinel-1 RTC SSL4EO Model (Core-S1RTC-SSL4EO), the Sentinel-2 RGB DINOv2 Model (Core-S2RGB-DINOv2) and the Sentinel-2 RGB SigLIP Model (Core-S2RGB-SigLIP).
Mikolaj Czerkawski, Internal Research Fellow at Φ-lab, was the leading researcher in this project: “Once applied at the full scale of Sentinel data archives, embeddings will fundamentally change the way users engage with Earth Observation data. This collaboration between Φ-lab and CloudFerro enabled a rapid delivery of an open-source prototype of this technology, showing how open data programmes like Copernicus can deliver further benefits to the global community beyond what was originally foreseen.”
Future work will focus on evaluating how Major TOM embeddings perform across a range of EO tasks, including pattern detection and predictive modelling, and on investigating other foundation models, such as MMEarth and DeCUR, to understand the differences in how various models interpret EO data. The Major TOM dataset, now enriched with embeddings, will also be made available on the CREODIAS repository, offering open access to researchers and promoting collaborations within the EO community.
To know more: ESA Φ-lab, CloudFerro
Photo courtesy of ESA/Mikolaj Czerkawski