Written by Dr Alessandro Maritati, Data Scientist

This post is the result of a collaboration between Datarock and Lunarlab, a lunar-focused research initiative within the Frontier Development Lab (FDL). Lunarlab is developing Lunar-FM, the first AI foundation model for lunar exploration and resource prospecting, using multi-modal orbital datasets. Datarock’s contribution was to bring geophysical and geological domain expertise, working with the model’s embeddings to produce spatial outputs that are interpretable and actionable. We’re grateful for the opportunity to collaborate on a problem at the intersection of advanced machine learning and applied geoscience.

TL;DR: Self-supervised learning is a machine learning technique commonly used in Computer Vision that is revolutionising how we analyse geophysical and satellite imagery. By transforming multiple datasets into compact numerical representations (embeddings), it allows us to detect subtle geological variation across large areas without the need for labelled training data (for more information, read Thomas Schaap’s blog on Harnessing the power of Computer Vision in geophysics).

But generating embeddings is only half the challenge. The other half is organising them into spatially coherent geological domains that can be interpreted and acted on. Many clustering approaches are used to do this. While they may work well in feature space, they can often produce fragmented spatial outputs that are difficult to interpret geologically.

This post demonstrates a spatially aware clustering approach, using embeddings from Lunarlab’s lunar foundation model to define sub-units of the lunar geological map within the basaltic lava flows of the Mare Tranquillitatis (Sea of Tranquillity) unit at the lunar equator. Analysis of this nature is critical in supporting upcoming lunar missions and, ultimately, a more permanent presence on the Moon.


A Foundation Model Built for Lunar Exploration and Resource Prospecting

Our collaborators at LunarLab, part of the Frontier Development Lab (FDL), are building Lunar-FM, the first AI foundation model dedicated to lunar exploration, resource prospecting, and supporting humankind’s return to the Moon (https://lunarlab.ai/a-lunar-fm). Lunarlab is a partnership that includes the Luxembourg Space Agency, the European Space Resources Innovation Centre (ESRIC), and Trillium Technologies.

The model uses a Masked Autoencoder (MAE) architecture, a deep learning approach that Datarock routinely applies to its geophysical processing workflows, to jointly learn from multiple orbital datasets, including gravity, optical imagery, elevation, rock abundance, and regolith temperature. Satellite data are tiled at 0.5 × 0.5-degree resolution, and for each tile the model produces a global embedding vector (Figure 1).
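As a rough illustration of the tiling step, the sketch below maps a longitude/latitude pair to the index of the 0.5-degree tile that contains it. The function names and the row/column convention (row 0 at the south pole, column 0 at 180°W) are assumptions made for this example, not a description of Lunarlab's actual tiling scheme.

```python
import math

TILE_DEG = 0.5  # tile edge length in degrees, as stated in the text


def tile_index(lon_deg, lat_deg, tile_deg=TILE_DEG):
    """Return the (row, col) of the tile containing a point.

    Assumed convention (hypothetical): longitude in [-180, 180),
    latitude in [-90, 90), row 0 at the south edge, col 0 at the west edge.
    """
    if not (-180.0 <= lon_deg < 180.0 and -90.0 <= lat_deg < 90.0):
        raise ValueError("coordinates out of range")
    col = int(math.floor((lon_deg + 180.0) / tile_deg))
    row = int(math.floor((lat_deg + 90.0) / tile_deg))
    return row, col


def tile_bounds(row, col, tile_deg=TILE_DEG):
    """Return (lon_min, lat_min, lon_max, lat_max) of a tile in degrees."""
    lon_min = -180.0 + col * tile_deg
    lat_min = -90.0 + row * tile_deg
    return lon_min, lat_min, lon_min + tile_deg, lat_min + tile_deg


# A point at roughly 23.47°E, 0.67°N (approximately the Apollo 11 site)
row, col = tile_index(23.47, 0.67)
print(row, col)               # 181 406
print(tile_bounds(row, col))  # (23.0, 0.5, 23.5, 1.0)
```

Keeping an integer (row, col) index for every tile is convenient later on, because geographic adjacency between tiles reduces to comparing integer offsets.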

The key point is that these embeddings are learned entirely through self-supervision across different data modalities, without the need for geological labels. The model learns to encode structure from the data itself.

Figure 1: Satellite data ingested by the multi-modal lunar foundation model. A masked autoencoder learns joint representations of gravity, optical, elevation and thermophysical datasets, producing global and patch-level embeddings (from https://lunarlab.ai/a-lunar-fm).

Sea of Tranquillity: Uncovering Variability in a Uniform Lunar Plain

We pulled 2,000 tiles from a study area at the lunar equator and interrogated their embedding vectors and their relationship to the known lunar geology. First, we examined the geological map for the study area (Figure 2). Just like on Earth, lunar maps are created through expert interpretation of orbital imagery, surface morphology, crater density, and spectral data, with units defined primarily by age and surface expression. At regional scale, these units can appear uniform, even though they may contain internal variability that is not explicitly resolved.

Figure 2: 1:5M Unified Geological map of the Moon overlain on the LROC Wide Angle Camera (WAC) mosaic. The large red unit in the centre of the area of interest (AOI) corresponds to the Sea of Tranquillity.

The Sea of Tranquillity is one such large unit. Hosting the Apollo 11 landing site, it is one of the Moon’s major volcanic plains formed from ancient basaltic lava flows and is of prospecting interest due to enrichment in ilmenite, an iron–titanium mineral relevant to future in-situ resource utilisation. Although mapped as a single unit, it records multiple eruptive phases and compositional variation.

When we apply dimensionality reduction to the MAE embeddings within this unit, we can see distinct groupings emerge in feature space (Figure 3). The foundation model has captured subtle heterogeneity across modalities that is not visible in the mapped product. The question then becomes how to organise that variation into interpretable spatial domains.

Understanding the Limitations of Standard Clustering for Geological Domains

Leiden clustering is a natural first step. It is a graph-based community detection algorithm well suited to embedding analysis. The standard workflow builds a k-nearest neighbour graph in embedding space, applies the Leiden algorithm, and then maps labels back to geographic space (for more information, read Katie Silversides’ blog on Automated geological mapping with unsupervised community detection models).
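The first step of that workflow, building a k-nearest neighbour graph in embedding space, can be sketched in plain Python as below. This is a minimal illustration under our own assumptions: cosine similarity is used as the similarity measure, the function names are hypothetical, and the brute-force search is only sensible for a few thousand tiles. The Leiden step itself is left to a dedicated community detection library and is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_graph(embeddings, k):
    """Return undirected edges {(i, j): similarity} with i < j.

    Each tile is linked to its k most similar tiles; an edge is kept
    if either endpoint selects the other (a symmetrised kNN graph).
    """
    n = len(embeddings)
    edges = {}
    for i in range(n):
        sims = [
            (cosine_similarity(embeddings[i], embeddings[j]), j)
            for j in range(n)
            if j != i
        ]
        sims.sort(reverse=True)  # most similar first
        for sim, j in sims[:k]:
            edges[(min(i, j), max(i, j))] = sim
    return edges


# Four toy 2-D "embeddings": tiles 0 and 1 are alike, as are tiles 2 and 3
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(sorted(knn_graph(toy, k=1)))  # [(0, 1), (2, 3)]
```

Note that nothing in this graph refers to where a tile sits on the surface: two tiles on opposite sides of the study area are linked as readily as two adjacent ones, provided their embeddings are similar.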

Applied to the embeddings within the Sea of Tranquillity, Leiden identifies meaningful clusters in feature space (Figure 3). But when mapped to the surface, the output is fragmented, with isolated patches scattered across the unit rather than contiguous geological domains (Figure 5, left panel).

This is not a flaw in the algorithm. It reflects the fact that feature-only clustering has no knowledge of geographic position. While fine for statistical analysis, this can be a significant limitation for geological interpretation and exploration targeting.
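One simple way to put a number on this fragmentation is to count, for each cluster, how many spatially disconnected patches it forms on the tile grid. The sketch below does this with a breadth-first search; the function name, the (row, col) tile indexing, and the choice of 8-neighbour connectivity are assumptions made for illustration rather than part of the workflow described here.

```python
from collections import deque

# Offsets to the eight surrounding tiles on a regular grid
NEIGHBOUR_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1),
    (0, -1),           (0, 1),
    (1, -1),  (1, 0),  (1, 1),
]


def patches_per_cluster(labels):
    """Count spatially connected patches for each cluster.

    labels: dict mapping (row, col) tile index -> cluster id.
    Returns a dict mapping cluster id -> number of separate patches.
    """
    seen = set()
    counts = {}
    for start, cluster in labels.items():
        if start in seen:
            continue
        # A new, unvisited tile starts a new patch of its cluster
        counts[cluster] = counts.get(cluster, 0) + 1
        seen.add(start)
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for dr, dc in NEIGHBOUR_OFFSETS:
                nb = (r + dr, c + dc)
                if nb not in seen and labels.get(nb) == cluster:
                    seen.add(nb)
                    queue.append(nb)
    return counts


# A single row of five tiles, labelled two different ways
fragmented = {(0, c): lab for c, lab in enumerate([0, 1, 0, 1, 0])}
coherent = {(0, c): lab for c, lab in enumerate([0, 0, 0, 1, 1])}
print(patches_per_cluster(fragmented))  # {0: 3, 1: 2}
print(patches_per_cluster(coherent))    # {0: 1, 1: 1}
```

A low patch count per cluster is not a goal in itself, but comparing counts between two clustering outputs gives a quick, map-independent check on how spatially coherent each one is.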

Figure 3: UMAP dimensionality reduction of embeddings colour-coded by standard Leiden cluster membership. Overall, clusters are well separated in feature space.

SpatialLeiden: Clustering that Respects both Data and Geography

A solution comes from an unlikely field: spatial transcriptomics, the analysis of gene expression patterns directly within intact tissue sections while preserving the spatial and cellular context. Researchers working on that problem faced the same issue: meaningful clusters in feature space, spatial chaos on the microscope slide. Their solution, SpatialLeiden (Müller-Bötticher et al., 2025), translates remarkably well to geoscience (Figure 4). SpatialLeiden builds a two-layer graph that combines:

  • A feature graph connecting tiles by cosine similarity in embedding space
  • A spatial graph connecting geographically adjacent tiles on the ground

Community detection is then optimised jointly across both layers. A spatial weight parameter controls how much the algorithm prioritises geographic continuity over feature similarity, giving a tuneable trade-off between “spatially coherent” and “feature faithful.” The critical difference from post-processing smoothing is that spatial structure is not imposed on top of a finished clustering result; it shapes the clustering from the start.
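To make the idea of a joint objective concrete, the sketch below scores a candidate partition on both layers and blends the two scores with a spatial weight. It is an illustration under stated assumptions, not the SpatialLeiden implementation: we use standard Newman modularity per layer, and we assume the blend is a simple convex combination in which a weight of 0.5 treats both layers equally. The function names are hypothetical, and the code only evaluates a given partition; searching for the best partition is the job of the Leiden algorithm.

```python
def modularity(edges, labels):
    """Newman modularity of a partition on a weighted undirected graph.

    edges:  dict mapping (i, j) node pairs -> edge weight (no self-loops).
    labels: sequence or dict mapping node -> community id.
    """
    m = sum(edges.values())  # total edge weight
    if m == 0:
        return 0.0
    intra = {}   # edge weight inside each community
    degree = {}  # summed node degree of each community
    for (i, j), w in edges.items():
        ci, cj = labels[i], labels[j]
        degree[ci] = degree.get(ci, 0.0) + w
        degree[cj] = degree.get(cj, 0.0) + w
        if ci == cj:
            intra[ci] = intra.get(ci, 0.0) + w
    return sum(
        intra.get(c, 0.0) / m - (d / (2.0 * m)) ** 2
        for c, d in degree.items()
    )


def combined_quality(feature_edges, spatial_edges, labels, spatial_weight):
    """Blend of feature-layer and spatial-layer modularity.

    Assumed form: (1 - w) * Q_feature + w * Q_spatial, with w in [0, 1].
    """
    q_feature = modularity(feature_edges, labels)
    q_spatial = modularity(spatial_edges, labels)
    return (1.0 - spatial_weight) * q_feature + spatial_weight * q_spatial


# Four tiles in a row: 0-1 and 2-3 are similar in feature space,
# and neighbouring tiles are adjacent on the ground.
feature_edges = {(0, 1): 1.0, (2, 3): 1.0}
spatial_edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}

coherent = [0, 0, 1, 1]
fragmented = [0, 1, 0, 1]
print(combined_quality(feature_edges, spatial_edges, coherent, 0.5))
print(combined_quality(feature_edges, spatial_edges, fragmented, 0.5))  # -0.5
```

In this toy case the contiguous partition scores about 0.33 and the alternating one scores -0.5. Moving the weight towards 1 rewards partitions that follow ground adjacency, while moving it towards 0 recovers feature-only clustering.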

Figure 4: SpatialLeiden for tissue samples. In this example, gene expression features measured within intact tissue sections are embedded and clustered using a two-layer graph that combines molecular similarity with physical cell adjacency. The resulting clustering produces spatially coherent tissue domains that remain faithful to underlying gene expression patterns.

In our SpatialLeiden implementation, we used 8-neighbour spatial connectivity and set the spatial weight to 0.5 to give equal weight to the feature and spatial graphs. Applied to the Sea of Tranquillity embeddings, it produces a clear improvement over standard Leiden clustering (Figure 5, right panel). For some clusters (such as Cluster 2), good separation in embedding space translates directly into a spatially coherent map pattern. For others, the picture is more nuanced. The light blue (Cluster 8) and dark blue (Cluster 0) clusters, for example, are well separated in feature space yet appear fragmented in the standard Leiden output. SpatialLeiden resolves this: both clusters emerge as spatially contiguous domains without sacrificing their distinctiveness in feature space.

Importantly, the internal heterogeneity of the mare unit is preserved but expressed in a way that a geologist can readily interpret and communicate. This is particularly evident along boundaries, where ambiguity in cluster membership is reduced through spatial coherence, resulting in cleaner and more geologically meaningful transitions between domains.
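For readers who want to reproduce the spatial layer, the 8-neighbour connectivity mentioned earlier can be sketched as follows. The function name and the assumption that each tile carries an integer (row, col) grid index are ours; tiles missing from the list, for example those outside the mare unit, simply contribute no edges.

```python
def spatial_graph(tiles):
    """Build 8-neighbour adjacency between tiles on a regular grid.

    tiles: list of (row, col) integer grid indices, one per tile.
    Returns a set of undirected edges (i, j) with i < j, where i and j
    are positions in the input list.
    """
    index = {tile: i for i, tile in enumerate(tiles)}
    edges = set()
    for (r, c), i in index.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue  # skip the tile itself
                j = index.get((r + dr, c + dc))
                if j is not None and i < j:
                    edges.add((i, j))
    return edges


# A 3 x 3 block of tiles, listed row by row
tiles = [(r, c) for r in range(3) for c in range(3)]
edges = spatial_graph(tiles)
print(len(edges))  # 20

# The centre tile (position 4) touches all eight neighbours
print(sum(1 for e in edges if 4 in e))  # 8
```

Including the diagonal neighbours means that domains can extend obliquely across the grid, which matters when geological boundaries do not follow the tile axes.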

Figure 5: Comparison of Leiden (left) and SpatialLeiden (right) clustering results on tile embeddings within the Sea of Tranquillity, overlain on the WAC mosaic. Compared to Leiden, SpatialLeiden produces more spatially contiguous and interpretable domains while preserving embedding-driven heterogeneity.

From the Moon to the Mine: Applications for Greenfield Exploration and Geological Map Refinement

The lunar example serves as a proof of concept. On Earth, the primary use case is geological map refinement and early-stage exploration, particularly in regions where regional maps group extensive areas into single undifferentiated units such as regolith packages, sedimentary sequences, or poorly exposed basement.

By integrating gravity, magnetics, radiometrics, DEM derivatives, and hyperspectral satellite imagery within a shared self-supervised embedding space, previously hidden heterogeneity becomes detectable. Spatially aware clustering then organises this complexity into coherent geological domains that reflect meaningful spatial patterns and produce outputs that exploration geologists can interrogate, communicate clearly, and act on with confidence.

In practice, this approach can be used to refine regolith cover domains and broad mapping units, distinguish lithological variants beneath cover, and support greenfields domain segmentation by resolving internal subdomains within these regionally mapped units.


Spatially Aware Clustering: Bridging Machine Learning and Geological Interpretation

Standard clustering of geoscientific embeddings tends to produce outputs that are statistically meaningful but often geographically fragmented. SpatialLeiden addresses this by jointly optimising feature similarity and spatial adjacency, producing domain-scale outputs that align with how geologists think about the subsurface.

Applied to Lunarlab’s multi-modal embeddings over the Sea of Tranquillity, the approach demonstrates that apparently uniform geological units contain meaningful internal heterogeneity, and that spatially aware clustering is an effective tool for making that heterogeneity interpretable (Figure 6).

Figure 6: Comparison between lunar geology map and SpatialLeiden clustering output on the Sea of Tranquillity.

As self-supervised learning models become more central to geoscience workflows, spatially aware analysis will be essential for bridging the gap between machine learning representations and geological interpretation. At Datarock, we are continually experimenting with and developing novel approaches to extract deeper insights from both new and existing datasets. As computer vision technologies evolve, we will continue integrating them into geophysical exploration workflows to support smarter, more efficient decision making.

References

LunarLab. Lunar Foundation Model. https://lunarlab.ai/a-lunar-fm

Müller-Bötticher, N., Sahay, S., Eils, R., & Ishaq, N. (2025). SpatialLeiden: Spatially aware Leiden clustering.

Schaap, T. Harnessing the power of computer vision in geophysics. https://datarock.com.au/harnessing-the-power-of-computer-vision-in-geophysics

Silversides, K. Automated geological mapping with unsupervised community detection models. https://datarock.com.au/automated-geological-mapping-with-unsupervised-community-detection-models