
Created by Pouya Emami and Eleanor Mare

If you’ve ever been involved with data science projects, you’ll know that they rarely go exactly as planned. As the team at Datarock has grown and taken on a wide range of projects, we’ve had to rapidly level up our project management game. A recent paper, ‘Data Science Methodologies: Current Challenges and Future Approaches’ by Martinez et al. (2021), highlighted the very issues we grapple with daily. It confirmed our feeling that navigating the ‘messiness’ and uncertainty of a data science project is all part of the process, and can even be where the real value gets uncovered.

Challenges 

Here’s a common challenge: your project begins with an objective like “build a model that can classify these three rock types based on their chemistry”. As you go through the process, you might discover that for one of the rock types, you don’t have enough examples to build a model. For another rock type, you find that it has two chemically distinct populations that could not be visually distinguished during logging. The third rock type is a thin vein with its chemical signature diluted by the 1-m interval of geochemical sampling, so it’s hard to distinguish with the data available. As you work through these issues, your understanding of the problem evolves. Halfway through the project, you find that you need to solve a different problem to the one you started with, in order to actually address the business needs.

There’s nothing inherently wrong with this. It is part of the process of developing your understanding of the geology, but it poses challenges from a project management perspective. The task you said you would complete next week is no longer relevant, nor is the time estimate you assigned to it. Your finance team wants to know when the project is going to be finished and how much it’s going to cost.

This is one of the biggest challenges we face in our projects, and it happens constantly, so it was refreshing to see it expressed directly by Martinez et al. In their paper, it falls under the category of ‘Project Management’ challenges:

Reproduced from Table 1 of Martinez et al.

We found this breakdown and description of challenges useful, even if only to realise that these challenges are common and that, although project management methods can mitigate them, they’re unlikely to disappear entirely. Expecting and accepting these challenges as part of the process makes it less stressful when things do feel like they’re going awry.

Methodologies

The paper continues by evaluating 19 project management methodologies against the three groups of challenges (project, team, and data). These methodologies all had common themes. Most emphasised the critical importance of understanding business needs at the start of the project, with some kind of iterative process to account for the evolving understanding of the problem.

Reproduced from Figure 21 of Martinez et al.

The authors conclude that none of the 19 methods perfectly addresses all the criteria they set out, but they imply that their criteria could be used to develop the ultimate project management methodology in future. However, we firmly believe that one size does not fit all.

While some methodologies (as highlighted in the paper) promote rigid frameworks, our experience shows that a more tailored approach is essential, particularly for the dynamic nature of data science projects. To address this, we use a hybrid project management methodology that combines the structured predictability of Waterfall with the flexibility and adaptability of Agile. This ensures our workflows align with the unique scope, objectives, and complexities of each project, driving successful outcomes.

Our hybrid approach starts with a strong focus on forecasting, carefully planning budget, time, and resource allocation during the initial phases using Waterfall principles. This ensures all stakeholders are aligned on timelines, milestones, and deliverables. As the project moves into more dynamic phases, such as data exploration or model development, we shift to Agile practices. Agile enables us to manage uncertainty, adapt to evolving requirements, and maintain steady progress through iterative sprints. We work through issues in close collaboration with the client, which allows us to continue delivering value, even if the solution evolves beyond what was initially expected. By maintaining open communication and iterating based on real data insights, we ensure that the project remains aligned with business needs and delivers meaningful outcomes.

Continuous improvement

Importantly, we capture lessons learned at every stage of our projects, refining our processes to drive continuous improvement throughout the project lifecycle.

Although our projects usually involve only a few people, we ensure that the whole team learns from each other on all projects. To facilitate this, near the beginning of each new project, the whole team will discuss the project and potential challenges. This is really valuable for the project team – allowing them to gain ideas from the wider group. Along with pure machine learning engineers, we have geo-data scientists with backgrounds ranging from blast engineering to geochronology, not to mention experience in industry-focussed geology, geophysics, and geochemistry. With this wide range of experiences, everyone has a slightly different perspective on a problem. These sessions are also extremely valuable for the wider group – allowing us all to have a high-level sense of the types of problems we have solved, which can facilitate knowledge sharing in future projects.

Then, at the end of each project, we conduct a project retrospective. The project team, the project manager and often one or two members of the wider team come together and discuss what went well, what challenges we faced, and what we could do better next time. We’ve now been doing this systematically for over a year, and have accumulated a rich record of ‘lessons learned’ from dozens of projects. Some examples of our ‘lessons learned’ include:

  • Early in the project, set expectations with stakeholders about what is possible given the available data. For example, if the desired outcome is a regression model, but the data turns out to be so sparse that a classification model will be a better approach, communicate that early.
  • For vectoring projects, figuring out how to define ‘distance to orebody’ is a crucial step, but it often takes time and discussion with subject matter experts (SMEs). Avoid letting this drag on too long because it can really delay the project.
  • When scoping or initiating a project, don’t assume that examples of all classes actually exist in the dataset, or that class labels are correct. Ask SMEs how certain they are of the data in relation to its ability to solve a given task. Is the data reliable and comprehensive? If not, or if that’s unknown, what can be a backup plan?
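
The last lesson above can be turned into a quick check at scoping time. The snippet below is a minimal sketch, not a description of our actual tooling: the function name `audit_class_coverage`, the rock type labels, and the `min_examples` threshold of 30 are all hypothetical and chosen only for illustration. It counts how many examples exist for each class the project expects, and flags classes that are missing, under-represented, or present in the data but not expected.

```python
from collections import Counter


def audit_class_coverage(labels, expected_classes, min_examples=30):
    """Flag expected classes that are missing or have too few examples.

    min_examples is an illustrative threshold, not a recommended value.
    """
    counts = Counter(labels)
    report = {}
    for cls in expected_classes:
        n = counts.get(cls, 0)
        if n == 0:
            report[cls] = "missing"
        elif n < min_examples:
            report[cls] = f"only {n} examples"
    # Labels found in the data that nobody listed as an expected class
    unexpected = sorted(set(counts) - set(expected_classes))
    return report, unexpected


# Hypothetical logged rock types, for illustration only
labels = ["basalt"] * 120 + ["granite"] * 8 + ["unknown"] * 3
report, unexpected = audit_class_coverage(
    labels, ["basalt", "granite", "quartz vein"]
)
print(report)
print(unexpected)
```

A check like this only covers whether examples exist. Whether the labels themselves are correct is still a question for the SMEs.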

At a high level, a common lesson that comes out of data science projects is something that was also described by Martinez et al:

the main outcome of the project may not be the machine learning model or the predicted quantity of interest, but an intangible such as the project process itself or the generated knowledge along its development

We have seen this time and time again. When deep in the weeds of a confusing and difficult project, it’s helpful to remember that the messiness is all part of the process. Sometimes it’s by working through that messiness that we actually find the real business value, which may look different to what was initially envisioned.

Our hybrid approach supports the messy and iterative nature of these projects, allowing us to adapt as the project evolves and uncover value in unexpected ways. With a culture of continuous improvement, we learn from every project – both technically and operationally.

Our project manager, Pouya, recalls his former lecturer explaining:

“Managing projects is like going to a tailor for a suit, it’s not about handing you something off the shelf. A good tailor takes your measurements, understands your needs, and crafts a suit that fits perfectly. And if you come back later, they don’t rely on the old measurements, they reassess and create something new. In the same way, each project, even if it seems similar, needs to be tailored to the specific situation and evolving requirements.”

This philosophy continues to guide how we approach our projects today. We customize both our project management and technical approaches to fit the specific needs of each geo-data science project. We’re always eager to learn from others and hear about the biggest challenges you’ve faced in data science, as well as the strategies that helped you overcome them. Let’s connect on LinkedIn and share ideas – collaboration is at the heart of solving complex problems!