Streamlining a fruit salad data process with Shiny and AWS

Created by Eleanor Mare.

What do you do if you’ve developed a great data analysis method, but your usual tools aren’t cutting it? We often see clients come up against this problem. Sometimes the best solution involves building a bespoke application, often connected to cloud-hosted data, and one of our favourite tools for this is a web development framework for R called Shiny.

Here in the Datarock Applied Science team, we have built dozens of Shiny apps for our clients over the years. This month, we completed one of our most sophisticated and impactful apps in a project that ran for over two years.

I recently had the opportunity to present some learnings from this project at a virtual conference about all things Shiny, called ShinyConf. In this blog post, I’ll summarise my talk and expand on the technical learnings from this project. Although we can’t go into the specifics of the subject matter, our client has kindly given us permission to describe the context using an analogy.

The problem

Our client for this project was a team of fruit-salad experts* in a large multinational company. This team had developed a method for ranking fruit salads in terms of their potential for revenue. They would receive photos of fruit salads from other parts of the business, and they had developed a systematic ‘Fruit Salad Evaluation Method’* that allowed them to provide their stakeholders with consistent assessments.

The problem was that their data process was tedious, manual and vulnerable to human error. It involved storing the fruit salad photos on a computer, extracting fruit statistics from the photos using several pieces of software, and copying and pasting the data into a huge Excel spreadsheet.

The original data process our client was using to implement the Fruit Salad Evaluation Method

This process worked, but it couldn’t scale. As demand for the team’s fruit-salad evaluations kept increasing, they needed to find a way to streamline their data process.

This is where my team came in. The Applied Science team comprises data scientists and machine learning engineers; many of us have a background in geoscience, and fortunately we also have a good working knowledge of fruit salads*.

The requirements

Our client explained that they needed:

1. A way to have the fruit statistics extracted automatically from the photos

2. A web application that could display the fruit statistics and relevant photos to streamline their Fruit Salad Evaluation Method workflow

Our client put together a very impressive, well-thought-out and detailed set of requirements for what they needed the app to do, and produced beautiful mockups of how the app might be laid out.

The real app mockup was orders of magnitude more impressive than this one

Proving the concept

It was clear from the beginning that we needed to help the client develop a robust and scalable cloud-hosted data process, so we got to work on a proof of concept. With the assistance of data engineers in our client’s company, we set up a data pipeline in their AWS tenancy. This involved storing the fruit salad photos in AWS S3 (cloud storage), using AWS Lambda functions to extract the fruit statistics from the photos, and storing the derived statistics as tabular data in S3 as .csv files.
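To make the shape of that pipeline concrete, here is a minimal sketch of what one of the Lambda functions could look like. It is an illustration under assumptions, not our client's code: `extract_fruit_statistics` is a hypothetical stand-in for the real extraction logic, and the `statistics/` output prefix and the CSV columns are invented for the example. The S3 calls (`get_object`, `put_object`) are standard boto3 client methods.

```python
import csv
import io
from urllib.parse import unquote_plus


def extract_fruit_statistics(image_bytes):
    """Hypothetical stand-in for the real extraction code, which is not shown in this post."""
    return {"n_bytes": len(image_bytes)}


def statistics_to_csv(photo_key, statistics):
    """Serialise one photo's statistics as a small CSV document (header plus one row)."""
    buffer = io.StringIO()
    fieldnames = ["photo_key"] + sorted(statistics)
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({"photo_key": photo_key, **statistics})
    return buffer.getvalue()


def handler(event, context, s3_client=None):
    """Lambda entry point, triggered by an S3 'object created' notification."""
    if s3_client is None:
        import boto3  # included in the AWS Lambda Python runtime

        s3_client = boto3.client("s3")

    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])

        image_bytes = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        statistics = extract_fruit_statistics(image_bytes)

        # Assumed layout: statistics are written under a separate "statistics/" prefix.
        output_key = f"statistics/{key}.csv"
        s3_client.put_object(
            Bucket=bucket,
            Key=output_key,
            Body=statistics_to_csv(key, statistics).encode("utf-8"),
        )
        written.append(output_key)
    return {"written": written}
```

Accepting the S3 client as an optional argument is a design choice for the sketch: it lets the handler be exercised locally with a fake client, without touching AWS.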

The new data process

We chose to create the app using R Shiny because it is a robust and mature web development framework with a large ecosystem of extensions that provide a huge amount of flexibility. Also, within the Applied Science team we have a number of people with expertise in Shiny.

However, we also chose to use Python alongside R in this project. We’d created some Python code in the early stages of the project to extract fruit statistics from photos, and so we wanted to avoid re-writing these functions in R and duplicating the effort. Although using Python together with R was not without challenges (caching and debugging were more difficult), it turned out that Python was useful to have in the project for a number of reasons as we went along, and we’ll touch on those a little later.

Our client already had a Posit Connect server, so their data engineers set things up for deployment. A challenge throughout was that we were unable to access the deployed application for security reasons, which made it difficult to test the app's performance in its live environment and to fully understand the software and hardware constraints of the system.

Indeed, after we deployed our first proof-of-concept app for use by the client, it wasn’t long before they told us that the app wasn’t performing well. When they tried to view fruit images, the images took an extremely long time to load. Worse, when they tried to retrieve statistics for a large batch of fruit salads, the app would crash.

Making it fast and stable

Our next task was to identify and address inefficiencies in our new data process. The primary issue was that we were trying to retrieve large volumes of data from AWS to the server where the app was running. Simply transferring that volume of data took a long time, and when it did arrive, it was sometimes too much to hold in memory, which caused the app to crash.

There were two main things we did to address these issues:

1. We did more pre-processing of the images in the pipeline and stored the images in a compressed format (NumPy compressed arrays, which was another reason it was handy to have Python available in the app)

2. We made better use of SQL to reduce the volume of data being retrieved from Athena, the AWS service we used to query the tabular data stored in S3
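As an illustration of the first point, the sketch below round-trips an image through NumPy's compressed `.npz` format using `np.savez_compressed` and `np.load`. The synthetic image and the array name `pixels` are assumptions made for the example; the pre-processing applied in the real pipeline is not shown here.

```python
import io

import numpy as np


def compress_image(pixels):
    """Pack an image array into the bytes of a compressed .npz archive."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, pixels=pixels)
    return buffer.getvalue()


def decompress_image(payload):
    """Recover the image array from bytes produced by compress_image."""
    with np.load(io.BytesIO(payload)) as archive:
        return archive["pixels"]


# A synthetic 512 x 512 RGB image with large flat regions, which compress well.
image = np.zeros((512, 512, 3), dtype=np.uint8)
image[100:300, 150:400] = (220, 40, 40)

payload = compress_image(image)
restored = decompress_image(payload)

print(f"raw: {image.nbytes} bytes, compressed: {len(payload)} bytes")
```

How much space is saved depends heavily on the image content, so the ratio for real photos will differ from this synthetic case.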

The second point was challenging. We’d written the functions to handle tabular data using R’s lovely data wrangling paradigm (tidyverse). Initially, trying to re-write those functions in SQL seemed horrendously complicated. The queries were extremely verbose and took a long time to run.

So we decided a better approach was to make sure the Athena database and queries were optimised (here’s a good resource), and we implemented four tweaks:

1. Normalise the database, by pivoting one table from “wide” to “long”. This had the added benefit of meaning that the schema didn’t need to change whenever a new fruit was added to the database.

2. Partition the Athena database, which reduces the amount of data that needs to be scanned by each query. This involved adjusting the nested structure of how files were stored in S3.

3. Store the tabular data in .parquet format instead of .csv. This reduced the size of the database, making the queries slightly faster.

4. Get Athena to run multiple queries in parallel. We split up the complicated queries into more manageable chunks, and guided by this excellent article, we were able to use Python’s boto3 library to send multiple queries to Athena at once (note: the ‘paws’ package could be a good R-native alternative to boto3).
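The fourth tweak relies on the fact that submitting a query to Athena returns straight away, so several queries can be in flight at once. Below is a minimal sketch of that pattern, not the code from the project: the function name, the polling interval and the shape of the `queries` argument are assumptions. The client methods `start_query_execution` and `get_query_execution` are part of boto3's Athena client.

```python
import time

FINISHED_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_queries_in_parallel(athena, queries, database, output_location, poll_seconds=1.0):
    """Submit every query up front, then poll until all of them have finished.

    Returns a dict mapping each query name to its final Athena state.
    """
    # start_query_execution returns immediately, so Athena can work on
    # all of the queries at the same time.
    execution_ids = {}
    for name, sql in queries.items():
        response = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )
        execution_ids[name] = response["QueryExecutionId"]

    final_states = {}
    while len(final_states) < len(execution_ids):
        for name, execution_id in execution_ids.items():
            if name in final_states:
                continue
            status = athena.get_query_execution(QueryExecutionId=execution_id)
            state = status["QueryExecution"]["Status"]["State"]
            if state in FINISHED_STATES:
                final_states[name] = state
        if len(final_states) < len(execution_ids):
            time.sleep(poll_seconds)
    return final_states
```

In use, `athena` would be created with `boto3.client("athena")`, and the rows of each successful query could then be fetched with `get_query_results`. A production version would also want a timeout and some handling for failed queries.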

These changes resolved the speed and stability issues and we could move on to focusing on improving the actual functionality of the application.
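For readers less familiar with the first two tweaks, the sketch below shows the idea behind each in plain Python. The column names (`salad_id`, `fruit`, `value`), the `batch` partition key and the `statistics/` prefix are all hypothetical. The `key=value` folder convention is the Hive-style layout that Athena can use for partitions.

```python
def wide_to_long(wide_rows, id_column="salad_id"):
    """Turn one row per salad (one column per fruit) into one row per salad and fruit."""
    long_rows = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({id_column: row[id_column], "fruit": column, "value": value})
    return long_rows


def partitioned_key(batch, salad_id):
    """Build a Hive-style S3 key so a query filtered on batch can skip other partitions."""
    return f"statistics/batch={batch}/{salad_id}.parquet"


wide = [{"salad_id": "s1", "apple": 3, "kiwi": 1}]
print(wide_to_long(wide))
print(partitioned_key("2023-05", "s1"))
```

In the long layout, adding a new fruit means adding rows rather than a column, which is why the schema no longer needs to change.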

Custom images

The final challenge to overcome was displaying the images of fruit to meet the needs of our client. Their requirements were:

A grid of images
Scale bar
Zoom in and out
Tooltips that show the colour of the fruit when you hover over the image

Plotly seemed like a natural choice, because it has all of that functionality built-in. Subplots can be used to create a grid, zooming is handled by default, the axis tick marks could act as a scale bar, and tooltips can be created and customised easily.

It all sounded so simple. But when we actually tried it, we ran into unexpected problems. The main challenge was that the images were rendering slowly. It turns out that each pixel of the image is treated a bit like a point in a scatter plot – so if the images are large, you’re trying to render a lot of points! We were able to somewhat get around this by encoding the image as a base64 string before giving it to Plotly, but this didn’t help with the tooltips, which were passed in as another layer of information the same size as the image. It was a lot of data to send from the server to the user’s browser, which could take time.
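The base64 step can be sketched in a few lines. This is an illustration rather than the project's code: the function name and the default MIME type are assumptions, and how the resulting string was passed to Plotly in the app is not shown in this post.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI that a browser can render directly."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The browser then receives one encoded image file rather than a value for every pixel. The tooltip layer gets no such saving, because it still has to carry a value for each pixel.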

So to address these issues, we ended up implementing a custom solution for displaying the images in this application. To do this, we created a custom HTML template with a CSS grid, and rendered the images on HTML Canvas tags. The Canvas tags exposed the RGB values that could be accessed by JavaScript, so we could use JavaScript to create the tooltips. We also used JavaScript to create the scale bars and implement a simple zooming functionality by redrawing the images at a different size.
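The tooltip logic in the app was written in JavaScript, but the arithmetic behind it is simple enough to mirror in Python. The sketch below assumes a uniform zoom factor and uses hypothetical function names. The flat, row-major RGBA layout with four values per pixel is how Canvas exposes pixel data through `ImageData`.

```python
def display_to_source(mouse_x, mouse_y, zoom):
    """Map a mouse position on the zoomed canvas back to source-image pixel coordinates."""
    return int(mouse_x // zoom), int(mouse_y // zoom)


def rgb_at(rgba, width, x, y):
    """Read one pixel from a flat, row-major RGBA buffer (four values per pixel)."""
    offset = (y * width + x) * 4
    r, g, b = rgba[offset:offset + 3]
    return r, g, b


# A 2 x 2 image: red, green on the top row; blue, dark grey-blue on the bottom row.
pixels = bytes([
    255, 0, 0, 255,    0, 255, 0, 255,
    0, 0, 255, 255,    10, 20, 30, 255,
])

# The image is drawn at twice its size and the mouse hovers near the bottom-right corner.
x, y = display_to_source(3.5, 2.2, zoom=2.0)
print(rgb_at(pixels, width=2, x=x, y=y))
```

Because the lookup happens in the browser against pixels that are already there, no separate tooltip layer has to be sent from the server.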

The custom fruit image grid and tooltip showing RGB values

Key technical learnings

From the technical perspective, the key things I personally learned were:

1. That combining Python and R in an R-Shiny app is powerful

2. That it’s critical to optimise how you store and retrieve cloud data in order to ensure your application’s performance

3. That Shiny is extremely flexible, especially if you know a little bit of HTML, CSS and JavaScript and how these interact with each other and with Shiny

Impact

This was a long-running, challenging, but ultimately very fruitful project. Our client now has a new cloud-hosted data process that is fast, automated, robust and scalable, and a web app that helps them get from raw photos to fruit-salad evaluations in hours rather than days to weeks. This means the team can provide results to their stakeholders quickly and efficiently, helping their stakeholders make timely decisions about which fruit salads have the most potential.

Further, implementing this data process sets the foundation for implementing more sophisticated computer vision modelling on their data. This may help the team make more consistent decisions with less bias. It might even lead to some data-driven insights that could change how they think about fruit salads.

Even though we couldn’t present the full results and implementation of this client project, we hope this is a small example of how the end-to-end solutions we develop at Datarock can add significant value to our clients’ operations. We deliver a ‘full-stack’ solution, from developing data-driven machine learning solutions to creating scalable cloud-native data and machine learning pipelines and web applications that deliver ongoing benefits to our clients. And while it is possible to develop these solutions to operate within a client’s tenancy, it is often easier, and provides more long-term benefits to clients, to offer this as a hosted service through our optimised cloud environment.

Stay tuned for our next blog. Although we could not show you the live version of our Fruit Salad Evaluator Tool for confidentiality reasons, next time we will showcase another long-running web application that is available for everyone!

* References to fruit salads should be taken with a pinch of salt. Or a scoop of ice cream.

Reviewing common core photography setups

Unearth possibility

Info@datarock.com.au

F.A.Q.

Privacy Policy
