
Created by Rian Dutch

Machine learning is an amazing tool to help automate repetitive tasks, remove some of the subjectivity in data collection and, more importantly, provide a new layer of information to aid a geologist's interpretation. Using ML to help automate core logging is not new; there are a number of examples out there (in fact, Datarock's first blog post in this series was one of them). Most of these approaches use acquired datasets, such as geochemistry or petrophysics, to predict a parameter such as lithology. But as a geologist, when I log core I'm not looking at numbers, I'm looking at the rock itself. Geologists use texture, patterns of layering or mineralogical banding, structural features and even colour to recognise and classify the rock, and then use the associated data to help refine and define those classified units.

Machines, in the form of computer vision algorithms and convolutional neural networks, are also very good at finding patterns and relationships in visual data such as photos. Datarock has spent many years now developing models and workflows to do just this: using computers to find, segment and classify important geotechnical and geological features in core images. But we can go a step further and start to integrate computer vision outputs with additional datasets to quantitatively describe rock texture alongside other continuous data, and start to let computers see rocks more like geologists do.

At Datarock, we call this type of project a Fusion job: integrating the power of the Datarock platform with other datasets and our skills as geologists. In this blog, we present a simple example using open-file core photography and downhole XRF geochemistry from the Minalyze core scanner. In this case, we decided to make it a bit more challenging. Instead of predicting something like lithology, which is generally fairly "easy" to predict using data like mineralogy or chemistry, we decided to see if we could improve the prediction of stratigraphy by including the core imagery. Stratigraphy is an interesting challenge, as a single stratigraphic unit can encompass various lithologies, alteration and structural overprints.

The data set

Data for this blog comes from the Geological Survey of South Australia's Mineral Systems Drilling Project (MSDP) in the Gawler Range Volcanics (GRV). The MSDP was a collaborative initiative managed by the Geological Survey of South Australia (GSSA) to refine geological models and identify mineralization controls in a challenging terrain where exploration models had not been established. The program aimed to extend understanding of the 1590 Ma magmatic-hydrothermal event along the southern margin of the Gawler Ranges and improve understanding of the lithological variability, thickness, and structural controls on the Gawler Range Volcanics. The program collected core samples, downhole geochemistry and geophysical data from 14 drill holes across three project areas (Figure 1). Additionally, a number of core scanning technologies were used, including HyLogger hyperspectral and Minalyze XRF scanning. In total, eight of the 14 drill holes had core photography and Minalyze XRF geochemistry, and those were the holes used for this study. All of this data is freely available from the GSSA's SARIG web portal.

Figure 1. Location of the MSDP drilling program and drill holes. Holes used in this study are highlighted red. Figure modified from Fabris 2017.

The geology

The aim of this exercise is to see if we can predict stratigraphy from our eight drill holes using a combination of images and chemical information. These drill holes intersect a number of different units from the GRV and surrounding packages.

The GRV is a thick volcanic package that was formed approximately 1590 million years ago. The stratigraphy of the GRV is divided into three main units: the lower, middle and upper sequences.

The lower sequence is characterised by subaerial basaltic lava flows and shallow marine sediments with intercalated tuffs, volcaniclastic rocks and cherts. The middle sequence consists mainly of submarine basaltic pillow lava flows and associated volcaniclastic rocks, hyaloclastite and tuffs. The upper sequence consists of subaerial rhyolitic to dacitic lava flows and associated volcaniclastic rocks, tuffs and sedimentary rocks. Related to the GRV are the Hiltaba Suite granites. These are typically coarse-grained to porphyritic and range in composition from granodiorite to monzogranite.

To the south of the GRV are a series of other units including the Sleaford Complex, Hutchison Group, and Peter Pan Supersuite. The Sleaford Complex consists of highly metamorphosed and deformed granites and gneisses, while the Hutchison Group is a deformed and metamorphosed metasedimentary sequence consisting of quartzites, pelites and dolomitic units. The Peter Pan Supersuite is a series of deformed granitic plutons that pre-date the GRV.

Workflow

Dealing with a data set like this can be challenging. There are multiple data types to handle, including imagery, continuous tabular data and interpreted labels such as lithology and stratigraphy, as well as the usual issues of missing data, data at various scales and joining different data sets together. A significant amount of work was required to get to the point of being able to create a model. This workflow is summarised in the figure below. In this blog we will describe the image processing workflow and leave the tabular data processing for another day.

Figure 2. Schematic workflow of the process from data gathering to modelling.

Image processing and feature extraction

In order to clean and pre-process the core photography, we utilised some of the processing capability of the Datarock platform. The Datarock platform takes RGB images of core boxes and uses advanced computer vision models to extract geological and geotechnical information from the core. While it is possible for us to use more advanced outputs from the platform, in this example we’re only using some of the standard pre-processing models, including advanced depth registration and a segmentation model to identify coherent core.

Figure 3. Example core box photo from MSDP 12 (top) and below, example core segmentation model from the Datarock Platform highlighting areas of coherent core, empty tray and core blocks. 

Using the depth-registered images, we can then identify only images of coherent core. This is an important data cleaning step, as it ensures we are only modelling on core and not including broken core, core blocks or empty tray in the model and potentially adding another bias. We then 'chop' the core rows up into ~5cm square image sprites which we will use to learn features from. This is a fairly arbitrary size, and we could modify it depending on the scale of features we are looking for in the core.
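As a rough sketch of this tiling step, the chopping can be done with a simple sliding crop over each depth-registered core row. This is not the Datarock platform's implementation; the function name and pixels-per-centimetre input are illustrative assumptions:

```python
import numpy as np

def chop_core_row(row_img: np.ndarray, px_per_cm: float, tile_cm: float = 5.0):
    """Chop a core-row image (H x W x 3) into ~square tiles tile_cm wide.

    Assumes the row is already depth-registered and masked to coherent
    core. Any partial tile at the end of the row is discarded.
    """
    tile_px = int(round(tile_cm * px_per_cm))
    h, w = row_img.shape[:2]
    tiles = []
    for x in range(0, w - tile_px + 1, tile_px):
        # Crop a tile_px-wide window, and crop the height to tile_px
        # as well so the sprite is approximately square.
        tiles.append(row_img[:tile_px, x:x + tile_px])
    return tiles

# Example: a synthetic 100 px tall, 1000 px wide "core row" at 20 px/cm,
# so each 5 cm tile is 100 x 100 px and the row yields 10 tiles.
row = np.zeros((100, 1000, 3), dtype=np.uint8)
tiles = chop_core_row(row, px_per_cm=20)
```

Because the tile size is arbitrary, `tile_cm` is left as a parameter so the sprite scale can be matched to the texture scale of interest.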

After extracting the coherent core image sections and cutting them into tiles, we are ready to generate feature embeddings. We can do this in a number of ways using convolutional neural networks (CNNs). One option is transfer learning; another is self-supervised learning (SSL). SSL is potentially the better approach, as it lets a CNN learn the important features of a dataset itself, so it can extract geological information specific to the image data. In this case, to keep the example simple, we decided to use transfer learning.

Transfer learning uses a pre-trained CNN, leveraging features learned from a large dataset; in this case, an EfficientNetV2 CNN pre-trained on the ImageNet dataset. Here we extract a set of 512-length feature vectors from the final fully connected layer. Feature vectors, or embeddings, are simply numerical representations of abstract textural information in the image learnt by the neural network. Because we are operating at various scales with this data, we need to upscale our embeddings to be able to model them alongside our geochemistry. To do this we used mean pooling, simply averaging the embeddings of the image sprites falling within each 1m XRF chemistry composite.
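The mean-pooling step above can be sketched with pandas. The per-tile embeddings here are random stand-ins (extracting real ones requires the pre-trained network), and the column names and depth binning are assumptions for illustration, not the actual pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical per-tile embeddings: one row per ~5 cm tile, with the
# tile's downhole depth (m) and a 512-length feature vector.
rng = np.random.default_rng(0)
n_tiles, dim = 40, 512
feat_cols = [f"e{i}" for i in range(dim)]
emb = pd.DataFrame(rng.normal(size=(n_tiles, dim)), columns=feat_cols)
emb["depth_m"] = np.linspace(0.0, 1.95, n_tiles)  # 0-2 m of core

# Mean-pool the tile embeddings into 1 m depth bins so each pooled
# vector lines up with a 1 m XRF chemistry composite.
emb["bin"] = np.floor(emb["depth_m"]).astype(int)
pooled = emb.groupby("bin")[feat_cols].mean()
```

The pooled table has one 512-length row per metre of core, ready to be joined to the composited chemistry on the depth bin.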

Figure 4. Example process of using a CNN to create feature embeddings for image tiles.

Unsupervised analysis

Now that we have our image embeddings and our chemistry data, we can begin to explore the relationships in the data. Here we decided to do some unsupervised analysis of the datasets using UMAP. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique which, unlike PCA, can find non-linear relationships and reveal hidden structure in high-dimensional data. What this means is that features which are similar should group close to each other in UMAP space.

First we look just at the image embeddings themselves. In the 3D plot below, we have reduced the 512-dimension image embeddings down to 3 dimensions and coloured them by stratigraphy (represented by the map symbol; the point shape corresponds to lithology). We can already begin to see broad groupings in the data, with the Mh granites (Hiltaba Suite) and Ma rhyodacites (undifferentiated GRV) pulling away from the main cluster. Within the main cluster there are clearly also some groupings, with the Maup (Paney Rhyolite Member), Mauy (Yannabie Rhyolite Member), Mab (Bittali Rhyolite) and Mau (Eucarro Rhyolite) forming good clusters. Mayp (Pondanna Dacite Member), on the other hand, seems to spread across two or three areas near the Maup and Mau units, perhaps suggesting there is something different going on within that unit, or perhaps we need to go back and check whether those sections of core are labelled correctly.

Beneath this cluster of volcanic units we can see a broad group of the other units. It looks like the Lz (Peter Pan Supersuite) orthogneisses group fairly well, and are distinct from the ALs (Sleaford Complex) undifferentiated gneisses. Between these units we see a generally ordered spread of L-r (Corunna Conglomerate) units and undifferentiated Paleo- to Mesoproterozoic units (LM), which are mostly skarns.

There is also one very distinct cluster of Mau samples a long way from the main group. These turned out to be a group of bad, pixelated core box photos, and they could be removed from the analysis based on their distinct embeddings.

The next step was to undertake the same analysis, but this time including the 19-element XRF data along with the image embeddings. By adding in the chemical data, we can see that the overall groupings have tightened up and some units, such as L-r, have now pulled clearly out of the main cluster, meaning adding the chemistry has improved the potential predictability of that unit.

Modelling stratigraphy

Now that we've reviewed our data, we can see if we can build a model to predict stratigraphy. Supervised modelling was undertaken using the XGBoost algorithm, first using only the XRF data and then using both XRF and image features. The model was trained to predict stratigraphic unit (MAP_SYMBOL). As seen in the plot below, there is a significant class imbalance in the stratigraphic units in the dataset. This poses some interesting problems for creating a generalisable model. There are a number of ways to deal with this, but for this example we simply trained the model and used a stratified random 30% hold-out dataset for testing. Unfortunately, because of the distribution of labels amongst the holes, such as MSDP 13 containing only one stratigraphic unit (L-r), we are not able to effectively control for spatial autocorrelation when building these models. This means that while the models may look good based on the training and test data available, they are not likely to generalise well. For example, the unit L-r can only be trained and tested on hole MSDP 13, so there is the potential to overfit a model to this data. Having said all that, the goal here is to see if adding imagery to the model improves the predictive power, not to create a general stratigraphy predictor for the Gawler Ranges.

Figure 7. Count plot showing the total number of each stratigraphic unit in the data set. On the right is a hole by hole count, showing some holes are characterised by a single stratigraphic unit.

The two models were trained using the default XGBoost hyperparameters and the multi:softprob objective with 11 classes. Because of the large number of classes and the relatively small sample size, particularly for some units, model performance was assessed using a confusion matrix to visualise the true and false prediction rate for each unit. In the figure below we see the first model (left), trained only on the XRF chemistry data. This clearly shows that using only chemistry, the model struggles to predict the stratigraphic units, particularly getting confused between units such as ALs and L-r, ALs and Lz, Mau and Mayp, Mauy and Maup, and Lz and Mab. When we add in the image embeddings (right), we see a dramatic improvement in the model, with a significant reduction in misclassifications for all units.
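Reading a confusion matrix like Figure 8 follows a simple convention, sketched here with scikit-learn on a handful of made-up labels (the label values mirror the unit symbols from the post, but the counts are invented for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted stratigraphic labels.
y_true = np.array(["Mau", "Mau", "Mayp", "ALs", "L-r", "ALs"])
y_pred = np.array(["Mau", "Mayp", "Mayp", "ALs", "ALs", "ALs"])

labels = ["ALs", "L-r", "Mau", "Mayp"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Rows are the true units, columns the predicted units: diagonal
# cells count correct predictions, off-diagonal cells count the
# confusions (e.g. a true L-r predicted as ALs).
```

A model that confuses two units, as the chemistry-only model does with ALs and L-r, shows up as large off-diagonal counts in those units' rows.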

Figure 8. Model confusion matrix. Left, stratigraphy model trained only on XRF chemistry data and right, stratigraphy model trained on XRF chemistry and image embeddings.

Summary

Modelling a property like stratigraphy is a challenging exercise because of the general complexity of what defines a stratigraphic unit. Using data sets like chemistry or mineralogy alone probably won't cut it, given the multiple lithological units, structural and deformation overprints, metamorphic grades or alteration that can affect a single stratigraphic unit. But by incorporating core photography into the mix, by helping models look at geology the way geologists do, we can potentially begin to model complex features in core. The added benefit is that this approach is a data integration process, using the power of computers to find patterns between different data modalities, which can help discern subtle differences or new insights that geologists might miss on their own. By leveraging the power of the Datarock platform, it's possible to build complex models and explore old data for new insights by fusing image analysis with downhole continuous or point data sets.