Created by Eleanor Mare.
Geochemists routinely collect large amounts of data from exploration and mining projects. These geochemical datasets can contain many thousands of samples, each analysed for multiple elements. For a long time, geoscientists analysed these data using scatter plots to investigate the relationships between a handful of elements at a time. However, in the last decade or so, machine learning methods have become more accessible, and these tools are ideally suited to identifying subtle and complex patterns within data that would be difficult for a human to discover.
However, there are properties of geochemical data that present a challenge for some standard machine learning techniques. Specifically, geochemical data does not always strictly meet the definition of “continuous” data, because the analytical methods have limits of detection and sensitivities that differ across various elements. This nuance has some interesting consequences for machine learning methods, and in this post, I will cover one example that I’ve come across during my work at Datarock.
The nature of geochemical data
You may have heard of the concepts of “discrete” vs “continuous” data. Continuous data can take any value to an infinite precision, like height or weight, whereas discrete data can only take specific values, like the number of people in a room.
Geochemical data occupies an interesting middle ground. Strictly, the number of atoms of an element in a sample is discrete (you can’t have 1.5 atoms of something). But short of atom probe tomography, our instruments are nowhere near sensitive enough to count atoms in samples. Instead, our instruments often count things like the number of electric pulses triggered by an x-ray, or an ionised atom, hitting a detector. These types of discrete “count” data are then converted into concentrations (weight percent, atomic percent) via various steps of data processing. In the end, the result you get probably looks like continuous data.
But depending on the analytical method, and who provides you with the data, you may find that the results have been reported to different numbers of decimal places for different elements. Maybe gold is reported to ppb levels but germanium only to ppm. You may also find that the data doesn't contain any zeros; instead, there will be a detection limit for each element. Results that come back as 'below the detection limit' are represented either explicitly, with a string like 'b.d.l.', or encoded as a numerical value, such as a negative number or a value corresponding to half the detection limit.
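As a concrete (and entirely hypothetical) illustration, here's how those encodings might be tidied up in Python before any modelling. The column names, sentinel value and detection limits below are assumptions, and replacing censored results with half the detection limit is just one common convention:

```python
import pandas as pd

# Hypothetical assay table: Sc has a 10 ppm detection limit, and below-detection
# results arrive either as the string 'b.d.l.' or as a tiny sentinel value.
assays = pd.DataFrame({
    "Sc_ppm": ["b.d.l.", 15, 22, 0.000001, 31],
    "Sn_ppm": [2, 3, 1, 4, 2],
})
detection_limits = {"Sc_ppm": 10.0, "Sn_ppm": 1.0}  # assumed values

for col, dl in detection_limits.items():
    values = pd.to_numeric(assays[col], errors="coerce")  # 'b.d.l.' becomes NaN
    censored = values.isna() | (values < dl)
    # One common convention: replace censored results with half the detection limit.
    assays[col] = values.where(~censored, dl / 2)

print(assays)
```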
In short, geochemical data is fundamentally discrete at the scale of atoms, but at the scale of anything we can routinely analyse it’s effectively continuous. Yet our analytical techniques generate data with ‘discrete’-ish properties. As we’ll see, it’s important to keep this in mind when applying machine learning techniques that are designed for continuous data.
Generating fake data that looks like the real thing
One example of how the ‘discrete-ish’ properties of geochemical data can cause problems relates to generating synthetic data. Synthetic data is sometimes used when building machine learning models on lopsided datasets, where an interesting group of data makes up only a small proportion of the total.
For the rest of this post I will take you through a toy example based on XRF data from several drill holes, provided by the Geological Survey of South Australia. You can read more about this dataset in a previous blog post. The example I’ll show is necessarily contrived, but I have encountered the same issues in a real project.
Imagine you want to make a model that can predict whether a rock is rhyolite or not, based on its chemistry. Perhaps the rocks in this area are weathered and it’s sometimes hard to distinguish the different lithologies. Maybe this rhyolite is associated with mineralisation, so it’s really important to know when it’s there and when it’s some other felsic unit.
The only problem is, the rhyolites are only a very small part of your dataset. Maybe you have around 50 examples of rhyolites, but hundreds or thousands of examples of other rock types. This is an example of an imbalanced dataset.
The risk with imbalanced datasets is that a model might tend to predict most things as the majority class (non-rhyolite). After all, most of the data the model has seen is not rhyolite, so the model could learn that “most things aren’t rhyolite”, which is accurate, but not helpful.
To counteract this issue, you may wish to balance your dataset – either by removing examples of the “non-rhyolites” (undersampling), or adding more examples of the “rhyolites” (oversampling).
The simplest way to approach this is randomly – for example, random undersampling would mean randomly removing non-rhyolites from your data. The downside of this is that it involves throwing out perfectly good data. Random oversampling – where you randomly duplicate rhyolite samples – carries the risk of exaggerating anomalies in your data.
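Both options are available in the imbalanced-learn library; here is a minimal sketch with placeholder data standing in for the rhyolite problem:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Placeholder imbalanced dataset: 50 'rhyolites' (1) vs 950 'non-rhyolites' (0).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 50 + [0] * 950)

# Random undersampling: discard non-rhyolites until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

# Random oversampling: duplicate existing rhyolites until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)

print(np.bincount(y_under), np.bincount(y_over))  # [50 50] and [950 950]
```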
To get around the drawbacks of random over- or undersampling, there are a variety of clever ways to create synthetic data to help balance your classes. The most fundamental of these is called SMOTE.
What is SMOTE?
SMOTE stands for synthetic minority oversampling technique. It works by selecting minority-class samples that are close together in feature space and creating new synthetic samples in between the real ones. The idea is to balance the classes without the drawbacks of random under- or oversampling.
In brief, this method works by randomly choosing a sample of the minority class, finding its nearest neighbours, and then placing a synthetic sample somewhere along a straight line joining the original sample and one of its neighbours. This process is repeated until enough synthetic samples have been created to balance the classes.
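In Python, this is essentially a one-liner with the imbalanced-learn library; here's a minimal sketch with placeholder data:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Placeholder data standing in for the geochemical features and rhyolite labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 50 + [0] * 950)  # 1 = rhyolite (minority class)

# For each synthetic sample, SMOTE picks a minority sample, one of its k nearest
# minority-class neighbours, and a random point on the line segment between them.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_res))  # imbalanced before, balanced after
```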
The problem with SMOTE on geochemical data
However, SMOTE is designed for continuous data, and as we’ve seen, geochemical data is not always strictly continuous. Let’s try applying SMOTE to our ‘rhyolite/non-rhyolite’ problem, and make some plots to understand what the synthetic data looks like.
At first glance, it looks reasonable enough. When looking at Ti vs Zr, we can see how the algorithm has done some ‘joining the dots’ to generate synthetic ‘rhyolite’ samples.
However, looking at data for Sc below, we notice something a bit strange. In the original data (left-hand panel), there’s a gap between 0 and 10 ppm Sc. In fact, the values at 0 are actually 0.000001; presumably the detection limit for Sc in this data is 10 ppm, and samples below the detection limit are represented by this very small positive number. However, the SMOTE algorithm has no concept of detection limits, and creates synthetic data throughout the range of 0-10 ppm (right-hand panel).
The other issue that you might notice in the plot above is the horizontal “striping” in the Sn data, reflecting that Sn is only reported to ppm levels in this particular dataset. However, the SMOTE algorithm has created data in a continuous fashion, resulting in data between ppm values.
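Before resampling, it can be worth running a quick check for these ‘discrete-ish’ artefacts. Here's a minimal sketch of such a check; the helper and its heuristics are my own illustration, not a standard routine:

```python
import numpy as np
import pandas as pd

def discreteness_report(series: pd.Series) -> dict:
    """Flag detection-limit sentinels and coarse reporting precision in one column."""
    values = pd.to_numeric(series, errors="coerce").dropna()
    positive = values[values > 0]
    return {
        "min_positive": float(positive.min()),  # suspiciously tiny => likely a b.d.l. sentinel
        "whole_numbers_only": bool(np.allclose(values, values.round())),  # 1 ppm 'striping'
        "n_unique_values": int(values.nunique()),
    }

# Hypothetical Sc and Sn columns from the assay table
print(discreteness_report(pd.Series([0.000001, 12.0, 15.5, 0.000001, 22.3])))
print(discreteness_report(pd.Series([2, 3, 4, 2, 3, 5, 2])))
```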
But wait, what about CLR?
Of course, in all this talk about the “nature of geochemical data”, we haven’t yet mentioned the fact that it is compositional data. In principle, the element concentrations add up to 100% for each sample. Astute readers might be wondering why I haven’t applied a CLR (centred log-ratio) transform to the data.
Compositional data and CLR transforms are a topic for another time, but here’s a high-level summary. In many types of machine learning algorithms it’s important to scale or normalise your data. For example, in geochemistry, a difference of 1 ppm Si between two samples would be almost meaningless, but a difference of 1 ppm Au would be very interesting. We need to scale the data to account for the differences in magnitude of major and trace elements. However, the usual methods of scaling or normalisation don’t work well when used on compositional data, so instead we use a specific method called the centred log-ratio, or CLR.
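For reference, the transform itself is only a couple of lines; here's a minimal sketch (libraries such as scikit-bio and pyrolite also provide implementations):

```python
import numpy as np

def clr(compositions: np.ndarray) -> np.ndarray:
    """Centred log-ratio: log of each part relative to the geometric mean of its sample.
    Assumes strictly positive values, so zeros / below-detection results need to be
    replaced first (e.g. with half the detection limit)."""
    log_x = np.log(compositions)
    return log_x - log_x.mean(axis=1, keepdims=True)

# Toy compositions: rows are samples, columns are parts of a whole (summing to 1).
X = np.array([
    [0.70, 0.25, 0.05],
    [0.60, 0.30, 0.10],
])
print(clr(X))
```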
Is it necessary to apply a CLR before SMOTE? It’s probably a good idea, given that SMOTE is using the high-dimensional geochemical space to find nearest neighbours. But it doesn’t solve the problem we’ve been discussing.
In fact, the problem is arguably more stark. Here’s the result of applying the CLR and then SMOTE to the same data as before, this time plotting V vs P. Again, there’s synthetic data created where no real data would ever be, between the ‘below detection limit’ value and the detection limit.
The consequences for machine learning
Whether these unrealistic samples matter depends entirely on the context, but it’s possible that a model can end up learning the wrong thing. I’ve seen this happen in a project where I was using a tree-based model that relied strongly on many elements that were near or below the detection limit.
To illustrate how a tree-based model can be affected by bad SMOTE data, I’ve created classification models using XGBoost on the rhyolite/non-rhyolite data. To make it easier to interpret the results, I’m doing this on the raw data (without CLR applied), but as shown above, I’d expect the same issues to arise if CLR had been used.
For this experiment I made two models (a rough code sketch of the setup is shown after the list):
- A “basic” model, that was trained on the original data
- A model trained after applying SMOTE to the original data
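The sketch below shows how this setup might look in code; X, y and the hyperparameters are placeholders rather than the real dataset or tuned values:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Placeholder feature matrix and rhyolite labels standing in for the raw assay data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = np.array([1] * 50 + [0] * 950)

# "Basic" model: trained directly on the original, imbalanced data.
basic_model = XGBClassifier(n_estimators=200, max_depth=4)
basic_model.fit(X, y)

# SMOTE model: resample first, then train on the balanced data. In a real workflow
# the resampling should only be applied to the training split, never the test set.
X_smote, y_smote = SMOTE(random_state=1).fit_resample(X, y)
smote_model = XGBClassifier(n_estimators=200, max_depth=4)
smote_model.fit(X_smote, y_smote)
```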
I’m not going to present the model results, because it’s a contrived example, but I will show you some SHAP dependence plots. If you’re not familiar with these, below is an example to get your eye in.
We can use SHAP dependence plots to understand the contributions of different feature values to the model prediction. In the example above, we can see that high values of Pb are associated with the negative class (non-rhyolite) because they always have SHAP values below zero. In contrast, very low values of Pb often (but not always) predict the positive class, rhyolite.
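For context, here's roughly how a dependence plot like this is produced with the shap library; the data and column names below are placeholders rather than the real dataset:

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

# Placeholder data with named features so the plot has sensible axis labels.
rng = np.random.default_rng(2)
X = pd.DataFrame({
    "Pb_ppm": rng.lognormal(mean=2.0, sigma=1.0, size=500),
    "Zr_ppm": rng.lognormal(mean=4.0, sigma=0.5, size=500),
})
y = rng.integers(0, 2, size=500)  # stand-in rhyolite / non-rhyolite labels

model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

# SHAP values quantify how much each feature value pushed each individual
# prediction towards 'rhyolite' (positive) or 'non-rhyolite' (negative).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Dependence plot: feature value on the x-axis, its SHAP value on the y-axis.
shap.dependence_plot("Pb_ppm", shap_values, X)
```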
Keep in mind that we are working with tree-based models, so there’s no guarantee that the relationship will be as simple as “high X = rhyolite, low X = not rhyolite”. Sometimes, an intermediate value of an element might be quite predictive.
Now take a look at the plot below. In the left-hand panel we show the basic model (trained on the original data), zoomed in to show only the first 20 ppm of Sc values. You can see that Sc below the detection limit (represented by 0.000001) tends to be predictive of rhyolite, whereas higher values of Sc tend to predict non-rhyolite.
However, when we look at the results for a model trained on data with SMOTE applied (right-hand panel), we see a conspicuous group of synthetic data points between 0 and 10 ppm. These points all have positive SHAP values, so the model appears to have learned that any points within this range are likely to be rhyolite. This might be a reasonable thing to learn, given that lower values of Sc were typically associated with rhyolite, but could also indicate overfitting, given that the only points in the 0-10 range were synthetic rhyolite data.
A second example is shown below for Sn. Here, it’s quite conspicuous that data for Sn is reported in 1 ppm intervals. The basic model has positive SHAP values in the range of 2-4 ppm. This is also seen in the model with SMOTE (centre panel), but strikingly, the synthetic data that fall between the 1-ppm levels all have high SHAP values. The model has learned that if a sample has a Sn value with a decimal place, it’s likely to be rhyolite, which is of course nonsense!
What we’re seeing is signs of overfitting in the model with SMOTE. Overfitting means that the model has learned so much detail from the training data that it cannot generalise to new data. As a comparison, it’s interesting to check what the original model would predict for the synthetic data: that is, what happens if we use the model trained without SMOTE to predict ‘rhyolite’ or ‘not-rhyolite’ on data that has had SMOTE applied? Would it also think that a Sn value with a decimal place was indicative of rhyolite? We would not expect so, because the original model was trained without any of the synthetic data. Indeed, looking at the right-hand panel below, we can see that for Sn values of ~2.5 ppm, the SHAP values sit roughly midway between those for Sn values of 2 and 3 ppm. This makes much more sense than the model with SMOTE (centre panel).
If we produce an overfit model like the one in the middle panel above, there are two main risks:
- That implementing SMOTE won’t actually improve your model performance (annoying but not terrible)
- That the model won’t generalise well in future, if the ‘discrete-ish’ properties of new data are different to that used in training
For the latter point, imagine that in three years’ time, the lab tells you they can measure Sn to a precision of 0.1 ppm, and Sc with a detection limit of 1 ppm. You put this new data into your model and suddenly have lots of rhyolite predictions on unexpected samples. This may not be because you found lots of new rhyolite, but simply because the model has learned the wrong thing from the synthetic data you trained it on. Admittedly, the results are not likely to be as extreme as that, but it’s worth bearing in mind how this could go wrong.
So what can we do?
Am I saying we should avoid SMOTE on geochemical data? Probably, but it depends on the context. In some cases, it may work fine, and the problem illustrated here will not matter because the model isn’t strongly affected by values near the detection limit.
However, it’s worth pointing out that there’s another good argument for not using SMOTE: according to a recent study, it doesn’t usually help anyway. That paper (and see here, here and here for some further discussion) found that state-of-the-art classifiers like XGBoost are quite robust to imbalanced data, and that you can get good results by tweaking hyperparameters instead. In the cases where balancing the classes did help, they found that random oversampling was just as good as SMOTE.
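With XGBoost, for example, one of those hyperparameters is scale_pos_weight, which up-weights the minority class without touching the data at all. A minimal sketch with placeholder data:

```python
import numpy as np
from xgboost import XGBClassifier

# Placeholder imbalanced data: 50 rhyolites vs 950 other samples.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
y = np.array([1] * 50 + [0] * 950)

# Instead of resampling, up-weight the minority class. A common starting point is
# the ratio of negative to positive samples, then tune from there.
ratio = (y == 0).sum() / (y == 1).sum()
model = XGBClassifier(n_estimators=200, max_depth=4, scale_pos_weight=ratio)
model.fit(X, y)
```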
If you’re still keen to generate synthetic data after all that, there are other methods that might be interesting to explore. One is a recently published method that uses mathematical structures called “copulas” to create synthetic data that takes into account the entire distribution of the minority class (rather than SMOTE’s approach of interpolating between local neighbours). The paper demonstrates the method using a fun dinosaur.
When I tried this, it gave sensible-looking results when applied to CLR-transformed data, as shown below (right-hand panel), although I have not done extensive tests to understand how the synthetic data would behave when put into a model.
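To give a flavour of the idea, here's a sketch using the open-source copulas Python package (not necessarily the exact method from the paper): a copula model is fitted to the whole minority class and then sampled for new rows. All data and column names below are placeholders.

```python
import numpy as np
import pandas as pd
from copulas.multivariate import GaussianMultivariate

# Placeholder CLR-transformed rhyolite samples (column names are hypothetical).
rng = np.random.default_rng(4)
rhyolites_clr = pd.DataFrame(
    rng.multivariate_normal(
        mean=[0.0, 1.0, -1.0],
        cov=[[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]],
        size=50,
    ),
    columns=["clr_Si", "clr_Ti", "clr_Zr"],
)

# Fit a copula model to the whole minority-class distribution, then draw as many
# synthetic rhyolites as are needed to balance the classes.
model = GaussianMultivariate()
model.fit(rhyolites_clr)
synthetic_rhyolites = model.sample(500)
print(synthetic_rhyolites.describe())
```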
Another idea is to try a variant of SMOTE that uses generative adversarial networks (GANs). GANs are best known from computer vision, where they’re used to create synthetic images. They work by pitting two neural networks against each other: one produces synthetic images, and the other tries to tell whether an image is real or synthetic. Eventually, the first network learns to produce images that fool the second. A recent study showed how the GAN concept could be combined with SMOTE and applied to tabular data instead of images. This sounds like a promising way to create fake geochemistry data that preserves the not-quite-continuous nature of the original data (although I haven’t tried it).
Summary
- Geochemical data has specific characteristics that mean it doesn’t exactly fit the definition of ‘continuous data’
- These characteristics can cause unexpected consequences for machine learning techniques, such as SMOTE
- At worst, applying SMOTE could lead to an overfit model that performs poorly on future data (although the severity of the issue will depend on the context)
- SMOTE is probably best avoided on geochemical datasets, partly because of the problem described here, and partly because it may not help anyway
- But there are some novel methods for generating synthetic data that might be worth exploring