
Created by Tom Schaap 

I didn’t make it to AEGC in Brisbane this year, although Datarock did have a significant presence with a workshop and two oral presentations. One of the great things about the conference is the extended abstract format for every talk. The extended abstracts from the 2023 conference are published on the conference website, and those from previous conferences are published by the ASEG.

Hoping to read up on some of the great science I missed seeing presented live, I downloaded this year’s abstracts, but I was immediately presented with a problem. There were 210 PDF files with codified names that gave no insight into their contents. Of course, the talks in the conference program are numbered, so it is simple enough to look at the program and find the relevant abstract from there. However, it got me thinking about ways I might characterise the content of each abstract using a data-scientific approach. After all, here was a set of data with no labels that I could not reasonably characterise on my own; this is a classic scenario for some form of unsupervised classification. The only issue was that I am far more adept at dealing with numeric geophysical and geochemical information. I am not by any means a natural language processing expert!

What I was presented with after downloading all the abstract files

Happily, in this remarkable age of highly accessible open-source software, I happened upon a pretty incredible resource. BERTopic is an open-source Python library which can digest passages of text and offers a range of options for classifying them. In the simplest case, with only a few lines of code, it can generate a human-readable string label describing the topic a passage of text relates to. If you require more granular detail, it can also be used to generate raw feature embeddings from text, as well as dimensionality-reduced features and clustering information.
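To give a feel for that simplest case, here is a minimal sketch (not from the original analysis; `docs` is a hypothetical placeholder for any list of text strings):

```python
# fit a topic model on a list of strings called `docs` and print the
# human-readable topic summary (one row per discovered topic)
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```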

In my case, I read the complete text from all 210 papers into a list of strings, and used a BERTopic model to generate a series of topic labels and classify the papers based on those labels. I did this a couple of times with alterations to the raw text. It is standard to remove ‘stop words’ (‘the’, ‘a’ etc.) in these analyses, but I also found that other terms in this context might be considered generic, such as ‘Australia’ (it’s an Australian conference, so lots of studies are based there) and ‘et al’ (all of the abstracts contained citations). These words would sometimes pollute the output topic labels, and I found the most useful results would come from excluding them. There were also a few hyperparameters which I played around with, such as specifying a low minimum count for a cluster and a low nearest-neighbour count for the UMAP stage (this was not exactly a big data problem so I was happy to have quite granular clusters).

```python
# libraries used for the analysis (excluding some plotting libraries)
import PyPDF2
import glob
from tqdm import tqdm
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
```
```python
# Text data acquisition from a series of pdfs.
path = 'data/*.pdf'
files = glob.glob(path)

# create a list containing a string of all the content in each pdf
texts = []
for file in tqdm(files):

    # read every page of the pdf and concatenate the text into one string
    with open(file, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ''
        for page in pdf_reader.pages:
            text += page.extract_text()

    # lowercase the text and remove some generic 'deadwords'
    text = text.lower()
    for deadword in ['et al', 'australia', 'figure', 'data']:
        text = text.replace(deadword, '')
    texts.append(text)
```
```python
# BERTopic model setup and fitting

# we add this CountVectorizer to remove stopwords from the topic labels
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
# the UMAP step is an integral part of the BERTopic model architecture and can be modified
umap_model = UMAP(n_neighbors=15, n_components=2, min_dist=0, metric='cosine')

model = BERTopic(
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    language='english',
    calculate_probabilities=True,
    min_topic_size=3,
    verbose=True,
)
topics, probs = model.fit_transform(texts)
```

The model I arrived at grouped the papers into 16 topics, including one ‘noise’ topic, named ‘seismic_basin_new_id’, containing the papers the model could not pin to any of the other topics. This appears to be a fairly generic label for a conference that heavily featured basin geophysics, so I assume these papers featured seismic and basin methods but did not specifically mention the content of any of the other topics. That does not mean those papers were ‘generic’, but rather that their specific content could not be grouped with the other topics. Perhaps removing a few more generic terms would improve this.

List of topics determined by the final model

BERTopic models come with a bunch of inbuilt functions to facilitate rapid visualisation of results. For instance, simply calling `model.visualize_hierarchy()` generates a dendrogram representing the relationships between each topic label as determined by the backend HDBSCAN clusterer. This intuitively tells me that the main split between the topics was on the presence or absence of seismic methodologies. This probably pulls apart most of the fossil fuel-related talks from the minerals-related talks. The next split on the non-seismic side appears to be along the lines of drilling and machine learning as opposed to more traditional geoscience and geophysics.
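For completeness, the dendrogram (and the other built-in summary plots) only take a couple of lines to produce; a minimal sketch, assuming the fitted `model` from the code above (the output file names are illustrative):

```python
# each visualize_* method returns a Plotly figure, which can be shown in a
# notebook or written out to a standalone HTML file
model.visualize_hierarchy().write_html('topic_hierarchy.html')

# per-topic bar charts of the most representative terms
model.visualize_barchart().write_html('topic_barchart.html')
```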

At this point I had already achieved more than I anticipated. I had reached my goal of classifying the papers, and I could now easily pick a topic and pull out any papers on that topic. For instance, when I looked into the ‘core_drilling_ucs_rock’ category, I happened to dig up the two Datarock-presented abstracts (Sam Johnson et al., paper 213, ‘Extracting consistent geotechnical data from drill core imagery using computer vision’, and Nathanael Pittaway et al., paper 226, ‘Using computer vision and drill core photography to automate geological logging workflows and improve orebody knowledge’). Makes sense for a company whose ML platform deals primarily with drillcore!
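A short sketch of how papers on a chosen topic can be pulled out, assuming the `files`, `topics` and fitted `model` objects from the code above (the pandas table and the ‘core_drilling’ string match are illustrative, not part of the original workflow):

```python
import pandas as pd

# map each topic number to its generated name, e.g. '..._core_drilling_ucs_rock'
topic_info = model.get_topic_info()
topic_names = dict(zip(topic_info['Topic'], topic_info['Name']))

# table of pdf file names alongside their assigned topic labels
papers = pd.DataFrame({
    'file': files,
    'topic': topics,
    'label': [topic_names[t] for t in topics],
})

# pull out every paper assigned to the drilling-related topic
drilling_papers = papers[papers['label'].str.contains('core_drilling')]
print(drilling_papers['file'].tolist())
```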

Clustered outcomes such as this do not always do justice to the internal structures of the input data. With relative ease, I was able to produce raw feature embeddings using the related SentenceTransformer library, and reduce those to a 2-component UMAP for visualisation. As expected, the topics assigned by BERTopic clustered quite convincingly in this visualisation, but it also provides a means of picking out the ‘most similar’ papers of my own accord. According to this analysis the two Datarock papers are closely matched, and lie somewhat close to papers in the ‘learning_model_training_validation’ topic, which also makes sense. A close neighbour that was grouped in the ‘noise’ class was a paper on integrated downhole geophysics analysis (paper 190, Thrane et al.). I could also use methods such as cosine similarity on the feature embeddings to find the abstract that most closely matches the one I am reading.
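A minimal sketch of that workflow, assuming the `texts` and `files` lists from earlier; the ‘all-MiniLM-L6-v2’ checkpoint and the query index are illustrative choices, not necessarily what was used here:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from umap import UMAP

# raw feature embeddings, one vector per abstract
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(texts, show_progress_bar=True)

# 2-component UMAP of the embeddings for a scatter plot (colour by topic)
xy = UMAP(n_components=2, metric='cosine').fit_transform(embeddings)

# cosine similarity against one abstract to find its nearest neighbours
query = 0  # index of the abstract currently being read
sims = cosine_similarity(embeddings[query:query + 1], embeddings)[0]
nearest = np.argsort(sims)[::-1][1:6]  # top five matches, excluding itself
for i in nearest:
    print(f'{files[i]}  (similarity {sims[i]:.3f})')
```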

I achieved everything here in less than 100 lines of code, with a total runtime of under two minutes (most of that time was absorbed by parsing the raw text from the PDFs with PyPDF2). I am not suggesting that the labels generated here are perfect, and they probably don’t do justice to the broad range of topics presented at AEGC, but this has given me a rapid and relatively simple tool to more surgically choose papers I might be interested in reading. This kind of technique could have applications in all sorts of geological problems, such as characterising field and core logging notes, or summarising exploration reports as demonstrated here.