Practical Machine Learning

Using Machine Learning to predict Beauty & Health services

How to use your SQL Database with XGBoost to solve Multiclass classification problems with AWS Sagemaker

Niclas Gustafsson · Published in The Startup
11 min read · Jun 18, 2020


Source: Illustration 117031967 © Ylivdesign — Dreamstime.com

TL;DR

In this article I share some learnings from a multiclass classification setup I recently deployed for a client. I hope that this real-life example, born from cross-breeding the two schools of good enough™ and time-to-market™️, will add an extra dimension compared to all the theoretical posts out there analysing data sets of irises and digits with no time or budget constraints 😁.

Source: Illustration 163377292 © Chipus — Dreamstime.com

I also try to keep this post at an intermediate level, not too advanced and not too basic, as there’s plenty of both out there. What does this mean? Well, I won’t go into detail about the inner workings of XGBoost, but neither will I treat ML as an obscure black box. I will talk a bit about hyperparameter tuning and feature processing. So for the advanced data scientists out there eating ML models for breakfast, this might still be an interesting read, more for the actual domain of the problem solved (and I’d appreciate your feedback 🙏). And if you are just starting out, this might give you some pointers on what to do (and what not to do) as well.

As I wrote this piece I realised it was becoming quite long, so I split it up into three logical parts:

  1. This first part focuses on the background and the thinking before kicking off an ML project. I’ll touch upon gathering and preparing data and what to expect from a machine learning model, and finally I’ll share some thoughts that might be helpful when starting out.
  2. In the second part I’ll dive into the infrastructure, the actual AWS components I used for this setup. I’ll explain a little about why I chose what I did and took the steps I did, so you don’t have to find out the hard way, after way too much coffee, that the brilliant idea you started the day with just doesn’t work.
  3. Lastly, I’ll go through how I used the actual output from the ML model in a real web application and give you some tips on what to think about and what not to do with the output.
Photo by Jazmin Quaynor on Unsplash

Background

The client for which I set up this solution operates a SaaS platform that enables their clients to make their calendars available for online bookings. A key part of the SaaS service is to collect data relevant to the services offered by their clients: price, service duration, rules regarding booking margins, shared equipment utilised during treatments, descriptions of the services provided, and so on.

Some of these key metadata items for a bookable service were, for various legacy reasons, missing for many services. This was by no means a critical problem, thanks to the design of the system as a whole (good search indexes and so on). Greater coverage of this specific data would no doubt be valuable, though, both for analytics and for giving these services greater visibility, which would probably yield more bookings.

Before we start

I like to think of machine learning classification problems the same way I would if we used the grey matter between our ears to categorise things. For example, getting back home from the shop and unpacking your groceries, you probably want to put your things in the right place, right? We need to recognise each item (interpret its features) and figure out where it’s going (predict its class): freezer, refrigerator or cupboard. ML classification problems are quite similar, except on steroids.

Photo by Curology on Unsplash

As with your brain, some things still hold for an ML model: you need sufficient data to be able to guess/predict the outcome. If all my items from the grocery store were identical boxes, I would have a hard time getting them to the right place. But if I could see condensation forming on half of them (a “feature”), I’d probably predict that those should go into the freezer or refrigerator.

So the first thing to do when you start thinking about using a machine learning model to solve a classification problem is to map out what information could possibly help you guess the outcome.

Or maybe: What information could be useful to progress the decision/ prediction.

Figuring out what these variables (features) are, and how to extract and clean them from whatever data store they reside in, is half the job of training an ML model.

So back to my particular model. The missing data, and what I was trying to predict, was the actual type of service referenced by the customer. We had about 300 different classes to choose from.

I chose some key metrics to extract and use as feature data:

  • duration of the service (in minutes)
  • cost of the service (divided into buckets, e.g. 0–9 USD, 10–19 USD and so on)
  • free-text description of the service
  • description of the business venue
  • classification of the business venue

People not familiar with how a machine learning model works, even at a high level, might consider such technology an arcane skillset that developers and data scientists acquired only god knows how.

…I should know; I got the message below from a colleague on the customer support team after we deployed the first version.

Swedish for: This is purely MAAAAAGICAL!

… Maybe it would be nice if it stayed this way? Enjoying all the privileges of wizards and witches… Wait, didn’t they incidentally get burnt alive in the 16th century? Hm, no, maybe we should explain that this is not magic, simply a computer recognising patterns that we put in front of it.

Let’s recap what a (multiclass) classification problem is. Going back to my title image: remember the Silicon Valley episode “Not Hotdog”? In case you have not seen it, a short key bit is included below. And of course you’ll want to watch it again if you have seen it. I’ll wait here.

Source: HBO / Youtube

That’s a (binary) classification problem being solved. The input is the photo and the output (prediction by the ML model) is one of two options: Hotdog or Not Hotdog.

The difficult crowd above clearly had higher expectations of poor Jian Yang. They were probably expecting a multiclass classification problem to be solved instead of a binary one. The difference is that for a multiclass classification problem we predict from a range of outcomes (classes): Hotdog, Pizza, etc.

Usually the output is not as cut and dried as above; predictions are usually expressed as probabilities ranging from 0 to 1 (where 1 equals 100%) for each class. So for this, uh, tasty image below, we would maybe see something like:

Photo by mahyar motebassem on Unsplash
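As a sketch (with made-up scores and class names), this per-class probability output is what you get from a softmax over the model’s raw scores, which is how XGBoost’s multi:softprob objective reports its predictions:

```python
import numpy as np

# Made-up raw scores for three made-up classes.
classes = ["hotdog", "pizza", "burger"]
scores = np.array([2.0, 0.5, 0.1])

# Softmax: each class gets a probability between 0 and 1, and they sum to 1.
probs = np.exp(scores) / np.exp(scores).sum()

for name, p in zip(classes, probs):
    print(f"{name}: {p:.2f}")
```

The highest-scoring class ("hotdog" here) gets the largest probability, but the other classes still carry non-zero mass, which is exactly what makes top-N evaluation possible later on.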

So, back to my project, predicting Service Types for the different services created by the SaaS users.

I had about 300 different classes that I was going to build an ML model to predict from. I won’t lie: initially I did think this would be a bit of a challenge. But as it turns out, the model performed extremely well.

Evaluated by a metric that made sense for my business problem (“Do the top-3 predictions include the correct value?”), the model included the correct prediction for 92% of the test cases.
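That top-3 metric is simple to compute yourself; a minimal sketch with made-up probabilities over four classes (newer scikit-learn versions also ship a `top_k_accuracy_score` helper for this):

```python
import numpy as np

def top_k_accuracy(prob_matrix, true_labels, k=3):
    """Fraction of samples whose true class is among the k most probable predictions."""
    top_k = np.argsort(prob_matrix, axis=1)[:, -k:]  # indices of the k largest probabilities per row
    return np.mean([label in row for label, row in zip(true_labels, top_k)])

# Made-up predictions for three samples over four classes.
probs = np.array([
    [0.10, 0.60, 0.20, 0.10],   # true class 1: in the top 3 -> hit
    [0.05, 0.05, 0.05, 0.85],   # true class 0: lowest-ranked -> miss
    [0.25, 0.25, 0.30, 0.20],   # true class 2: in the top 3 -> hit
])
print(top_k_accuracy(probs, [1, 0, 2]))   # 2 of 3 correct
```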

Thinking about it for a second, this may not be that surprising. One of the features I used was the description of the service to be predicted, which is probably the best case of feature selection: we want to choose features that best point us towards the class we want to predict, and by definition a description of the service might be just that.

So for your specific project, make sure you map out what data could influence an ML algorithm and how to extract it. I’ll look more into ways of evaluating an ML model (precision, recall, F1-scores) in the next part.

Feature Engineering

OK, now we have our data extracted from the depths of our data vaults, nice! The next step is to refine the data, a process often referred to as feature engineering. Its purpose is to get the data into a format that is consumable by the ML model… in an efficient way.

So we want to go from something like: “20USD 40min Beard Trim Trim and shape beard, razor finish if desired.” to something like this:

💡 When doing a POC you probably want an early heads-up on whether your idea will hold up before you commit too much time and resources. Maybe it’s worth starting with a manual export and doing some hands-on manipulation of the data. Just remember to document the steps you take. There’s nothing quite like the feeling of reaching a great result and realising you have no way of recreating it.

Photo by Andrea Niosi on Unsplash

I chose to pour all my data into a single container, in no particular order. I mixed the different data sources and just made sure that numeric values with different meanings did not get mixed up, by suffixing them. Only two numeric values mattered: the price and the duration, which I suffixed with the currency and with “min” for minutes. So 99SEK and 99min, preventing the ML model from confusing them with numbers occurring in other parts of the data set and keeping them as two separate features. There are of course more bulletproof ways to do this, but this level of isolation worked fine for this small project.
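That pooling step can be sketched like this (the function and field names are mine, not the production code):

```python
def build_feature_string(price, currency, duration_min, *text_fields):
    """Pool every text source into one bag, suffixing the two numeric values
    (price and duration) so they remain distinct features, e.g. 99SEK vs 99min."""
    parts = [f"{price}{currency}", f"{duration_min}min"]
    parts.extend(t for t in text_fields if t)
    return " ".join(parts)

row = build_feature_string(
    20, "USD", 40,
    "Beard Trim",
    "Trim and shape beard, razor finish if desired.",
)
print(row)
# → "20USD 40min Beard Trim Trim and shape beard, razor finish if desired."
```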

Extract features

In the simplified example below, which could represent the training data extracted from my PostgreSQL RDS, we have two items to train on.

I created two SQL views to extract the information from the RDS database: the first extracts the data we know the classes for (the training data), and a second, almost identical view extracts the unknown data (the prediction data). The difference is that the green field below, which represents the class, is included in the first and not in the second.
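The idea behind the two views, sketched with SQLite standing in for PostgreSQL (the table, view and column names here are made up for illustration):

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PostgreSQL RDS instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE services (id INTEGER, description TEXT, price INTEGER,
                           duration INTEGER, class_id INTEGER);
    INSERT INTO services VALUES
        (1, 'Beard Trim', 20, 40, 151),
        (2, 'Mystery cut', 60, 30, NULL);
    -- Training view: only rows where the class is known, class column included.
    CREATE VIEW training_view AS
        SELECT description, price, duration, class_id
        FROM services WHERE class_id IS NOT NULL;
    -- Prediction view: identical columns minus the class, for the unknown rows.
    CREATE VIEW prediction_view AS
        SELECT description, price, duration
        FROM services WHERE class_id IS NULL;
""")
print(conn.execute("SELECT * FROM training_view").fetchall())
print(conn.execute("SELECT * FROM prediction_view").fetchall())
```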

Next we need to process this data a bit. So, from the obviously-fake-but-representative example above:

“20USD 40min Beard Trim Trim and shape beard, razor finish if desired”
“60USD 30min Skin fades and tapers. Combover/Gentlemen cut. Undercut. Texture for smooth transition. Shampoo or rinse, blow dry and styled if desired. This is the most booked service for your best look.”

We will process this in three steps:

Cleaning & Removing Stop Words

First we clean the data by removing stop words. This is a language-specific operation, so we need a list of stop words to check against; I used the popular NLTK library. In this step I also cleaned out special characters that might otherwise introduce additional unwanted features.

The actual code I used is not pretty (but Good Enough™️) and there are more correct ways of cleaning out special characters; I just removed the ones that caught my attention during some inspection:

whole_set['strings'].replace(r"([\.\-:;\t\n\r,/()_•+*%&\'\!\#]|\d+sek)", ' ', regex=True, inplace=True)

The result after this operation would look something like this:

“20USD 40min Beard Trim Trim shape beard razor finish desired”
“60USD 30min Skin fades tapers Combover Gentlemen cut Undercut Texture smooth transition Shampoo rinse blow dry styled desired This booked service best”
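Put together, the cleaning step can be sketched like this (using a tiny inline stop-word list to stay self-contained; in practice you would use `nltk.corpus.stopwords.words("english")`):

```python
import re

# Tiny illustrative stop-word list; swap in NLTK's full English list in practice.
STOP_WORDS = {"and", "if", "or", "the", "is", "this", "for", "your", "a"}

def clean(text):
    # Strip special characters first, then drop the stop words.
    text = re.sub(r"[\.\-:;,/()_•+*%&'!#]", " ", text)
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)

print(clean("20USD 40min Beard Trim. Trim and shape beard, razor finish if desired."))
# → "20USD 40min Beard Trim Trim shape beard razor finish desired"
```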

Stemming

Next we reduce the words to their stems, as we are more interested in the meaning of a word than in its different forms. This reduces the number of features and makes sure the model interprets “hair cutting” and “hair cut” the same way.

20usd 40min beard trim trim shape beard razor finish desir
60usd 30min skin fade taper combov gentlemen cut undercut textur smooth transit shampoo rins blow dri style desir this book servic best
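A minimal sketch of that step using NLTK’s Porter stemmer (which, unlike the stop-word lists, needs no corpus downloads):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    # Stem each whitespace-separated token; NLTK lowercases tokens by default.
    return " ".join(stemmer.stem(word) for word in text.split())

print(stem_text("Beard trim trim shape beard razor finish desired"))
# → "beard trim trim shape beard razor finish desir"
```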

Context with n-grams

Since we are going to break the text down into indexed features, the ML model might be tricked into believing that the two different inputs “Don’t be cheap, get an expensive haircut.” and “We have cheap haircuts” are quite similar.

One thing we can do is make the ML model a bit aware of the surrounding context. For instance, instead of keeping track of occurrences of individual words in our data, we pair each word with its neighbour and keep track of that combination (which we call a feature) going forward:

Not using n-grams (above) would create two identical features (coloured), versus no overlap in features when using an n-gram size of 2.

💡 Whether to use n-grams, and what size to select, is one of the things you can experiment with during model training. Just keep in mind that using n-grams of higher order increases the number of features quite fast.

Likewise, for our example above: “20usd 40min beard trim trim shape beard razor finish desir”, the n-grams would be:
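Those bigrams can be generated with scikit-learn’s CountVectorizer (a sketch; the article doesn’t specify the exact tooling used for this step):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) emits only word pairs: each bigram becomes one feature,
# so a repeated word like "trim" only collides when the whole pair repeats.
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(["20usd 40min beard trim trim shape beard razor finish desir"])
print(sorted(bigrams.vocabulary_))
# ['20usd 40min', '40min beard', 'beard razor', 'beard trim', 'finish desir',
#  'razor finish', 'shape beard', 'trim shape', 'trim trim']
```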

So far, so good. We are still able to see what’s going on with the data quite easily. Now we will transform the data into something more machine readable.

First we’ll create an index of all the features, putting a number on each feature as shown in the images above. Then we’ll count the frequency of each feature in each sample, while also taking the total occurrence of the feature in the whole data set into consideration.

We’ll use TF-IDF, which stands for term frequency times inverse document frequency. This technique is useful when some features occur much more frequently than others and we want to prevent those features from taking too much of the ML algorithm’s attention.

After applying the TF-IDF logic to the samples, we now have a representation of the form below for our “20usd 40min beard trim trim shape beard razor finish desir” sample:

(For my training data I ended up with roughly 150,000 features using n-grams of size 2.)
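The indexing and weighting can be sketched with scikit-learn’s TfidfVectorizer over the two stemmed samples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "20usd 40min beard trim trim shape beard razor finish desir",
    "60usd 30min skin fade taper combov gentlemen cut undercut textur "
    "smooth transit shampoo rins blow dri style desir this book servic best",
]

# Build the bigram index and weight each feature by term frequency times
# inverse document frequency, down-weighting terms common to every document.
tfidf = TfidfVectorizer(ngram_range=(2, 2))
matrix = tfidf.fit_transform(docs)
print(matrix.shape)   # (2 samples, one column per distinct bigram)
```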

See this link for more information about the inner workings of TF-IDF.

Next we need to export this information into a format understood by XGBoost, which we will use to train the model. We use the libsvm format, “<label> <index1>:<value1> <index2>:<value2>…”, which looks something like:

Class 151 to be trained with the features 3981, 4126, 4264 etc.
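scikit-learn can write this format directly; a sketch with made-up documents and class ids:

```python
from sklearn.datasets import dump_svmlight_file
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["beard trim razor finish", "skin fade shampoo blow dri"]
labels = [151, 42]   # made-up class ids

X = TfidfVectorizer().fit_transform(docs)
# Writes one "<label> <index>:<value> ..." row per sample, which XGBoost reads natively.
dump_svmlight_file(X, labels, "train.libsvm")

with open("train.libsvm") as f:
    print(f.read())
```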

💡 Worth noting: to use the model after training, you’ll need to produce the features in exactly the same way as when training. So saving the features and their corresponding indexes is crucial for reusing the ML model with new data. I used scikit-learn’s joblib to dump the feature data to disk and stored it together with the other artifacts on S3 (more on this in the next post).
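The save/restore round trip can be sketched with the standalone joblib package and a TfidfVectorizer standing in for the fitted feature pipeline:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit once at training time...
tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit(["beard trim razor finish"])
joblib.dump(tfidf, "vectorizer.joblib")   # this artifact goes to S3 with the model

# ...and load the identical vectorizer at prediction time, so new text
# maps to the same feature indexes the model was trained on.
restored = joblib.load("vectorizer.joblib")
print(restored.vocabulary_ == tfidf.vocabulary_)   # True
```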

That was a somewhat high-level look at getting the data ready: figuring out and extracting features from your store (in my case a SQL RDS database), then cleaning and processing them. Next I’ll dive into the details, with code, for the different AWS building blocks I used to get the end-to-end pipeline running: AWS Glue, AWS SageMaker, AWS S3, AWS Aurora and more.

Niclas Gustafsson
The Startup

Entrepreneur by heart. IT by profession. Photography by passion. Founder of https://bytesafe.dev/