Machine Learning Basics with Ludwig
Written by Rudolf Olah
We’re going to walk through the basics of machine learning and applying it to the problem of email spam prediction. We are going to do this by loading training data sets, building a few machine learning models and then checking their performance on test data sets. The data sets we are using are small enough to be understandable and the models only have a few features. This makes it easier to puzzle out how the machine learning algorithm and neural networks are interpreting the input and producing the output.
Machine learning and deep learning techniques have become popular topics over the past five years. Machine learning is being applied to many problems and we are seeing some interesting and effective results. Machine learning is a way to process and analyze big data and make predictions about future data. For instance, you can analyze millions of emails to identify which ones are spam and which ones are not, and you can analyze this based on hundreds or thousands of features or signals.
To get you started with machine learning, you could start with TensorFlow, a machine learning library open-sourced by Google in 2015. Alternatively, you could start with scikit-learn, or perhaps another library.
However, if you want to get started quickly, you should consider trying out Ludwig.
Ludwig is a free/open source machine learning library created by the software engineers at Uber. It combines TensorFlow, scikit-learn and other Python packages into an easy-to-use package where you can load data sets from CSV files, train your machine learning model, and then apply it to new data. It can visualize the performance of the machine learning models you build and show you comparisons. It can save a snapshot of your model so that you do not have to re-train it from scratch and just need to feed it additional data. It is by far the simplest way to get started in applying machine learning.
Assumed Knowledge
For this article, we assume that you know how to write code in Python and how to run Python programs. We are using Python 3.8 in this article.
We also assume that you have a small amount of knowledge about machine learning, such as what a neural network is.
Installing Python and Ludwig
Let’s begin by installing virtual env and creating the environment:
sudo pip install virtualenv
virtualenv --python=python3.8 env
Then let’s install the dependencies:
source env/bin/activate
pip install -r requirements.txt
Install English-specific encodings for spacy language processing:
python -m spacy download en
Installing Ludwig
The dependencies list looks like this:
ludwig[text,viz]==0.3.1
This will install Ludwig, TensorFlow, scikit-learn, pandas, and NumPy. It will also install spaCy for text and natural language processing. Additionally, it will install visualization tools like matplotlib and seaborn.
Starting Code
In the Python files that we create, you should start with this chunk of code:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # silence TensorFlow's C++ logging

import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

import pandas as pd
from ludwig.api import LudwigModel
from ludwig import visualize  # used later for compare_performance
The first part disables some of the logging output from TensorFlow, which can be a bit noisy. The environment variable TF_CPP_MIN_LOG_LEVEL controls the level of logging. Its value can range from 0 to 3, where the lowest value 0 shows all logs (info, warnings, errors, fatals) and the highest value 3 filters out everything except fatal errors.
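For reference, the mapping looks like this; note that the variable must be set before TensorFlow is imported:

# TF_CPP_MIN_LOG_LEVEL must be set before importing tensorflow:
#   '0' -> show all messages (info, warnings, errors, fatals)
#   '1' -> filter out info messages
#   '2' -> filter out info and warning messages
#   '3' -> show only fatal errors
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'  # verbose; useful while debugging
import tensorflow as tf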
The other half of that code imports the modules necessary to use Ludwig.
And with that we’re ready to write our first machine learning program.
Basic Prediction with Ludwig
We are going to build a machine learning model and train it to identify which emails are spam and which are not. Let’s take a look at the data set we are going to use to train the model.
The Dataset: Email Spam
Our data set is a CSV (comma-separated values) file with three columns:
- subject
- content
- spam
All three columns are strings of text. However, as we will see in the next section, they are processed differently when it comes to defining our machine learning model.
Here is a sample of the data (you can check out the code repo on GitHub to see all of the data):
"cruise winner!","Hello! You just won a cruise!","spam"
"BANK WARNING","This is your bank, warning you that someone has tried to access your account","spam"
"RE: AI meeting","The agenda for our company meeting is AI and machine learning. Please be on time. Thanks!","not_spam"
As you can see from this sample, the last column in some rows of data is “spam” and in other rows it is “not_spam”.
Important Notes about the Data Set
One important thing to note here is that I, the author of this article, have classified the data rows as either spam or not spam.
It is up to you to double-check that the data set you are using for training is correct.
It is up to you to analyze the data set, see if there are biases, and make sure to correct for them.
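For example, a quick way to inspect the label balance is with pandas (a minimal sketch; it assumes the training CSV includes a header row naming the three columns):

# A minimal sketch: inspect the label balance of the training data.
# Assumes the CSV has a header row with the columns subject, content, spam.
import pandas as pd

emails = pd.read_csv('./datasets/spam_train.csv')
print(emails['spam'].value_counts())  # counts of 'spam' vs 'not_spam' rows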
Another important thing: this sample of data is not enough to properly train the machine learning model; there are not enough data points. The small data set in the code repo has some more data points and it is just barely enough to train the model. It's a good starting point, and we are using a small data set so we can understand roughly what is happening within our model. Remember that companies such as Google, Facebook and Uber train their machine learning models on data sets with millions of data points. To train this model further you can use "online" training, which is provided by the Ludwig train_online method: https://ludwig-ai.github.io/ludwig-docs/api/LudwigModel/#train_online
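As a rough sketch of what online training looks like (this assumes base_model is the trained LudwigModel we build later in this article, and that the new rows use the same three columns):

# A minimal sketch of online training: feed a few newly labelled rows to an
# already-trained model instead of re-training it from scratch.
import pandas as pd

new_rows = pd.DataFrame([
    {'subject': 'You won again!',
     'content': 'Claim your cruise prize now',
     'spam': 'spam'},
])
base_model.train_online(dataset=new_rows)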
Ludwig Configuration: Input and Output
So we have a data set with three columns. Our machine learning model needs input features to “learn” from, to be trained on, and it needs output features that we would like to predict or calculate. The “features” of a data set describe the shape of the data.
Input Features
The input features for the email data set are going to be the “subject” and the “content” columns.
With Ludwig, you can start with this very simple definition of model input features:
input_features = [
    {
        'name': 'subject',
        'type': 'text',
    },
    {
        'name': 'content',
        'type': 'text',
    },
]
Ludwig provides default values for the encoding of the columns that are used as input features and it provides default configurations and parameters for the underlying machine learning neural network to use.
You can override the default configuration of the input features and specify your own configuration:
input_features = [
    {
        'name': 'subject',
        'type': 'text',
        'preprocessing': {
            'lowercase': True,
        },
    },
    {
        'name': 'content',
        'type': 'text',
        'preprocessing': {
            'lowercase': True,
        },
        'encoder': 'stacked_parallel_cnn',
        'reduce_output': 'sum',
        'activation': 'relu',
    },
]
Preprocessing Data
The preprocessing step we are using for the subject and content fields is “lowercase”. This means an email with the subject “Hello WORLD” will look the same as one with the subject “hello world”, which can affect your training results and the performance of the machine learning model.
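In plain Python terms, the effect of this option on each string is roughly:

# The 'lowercase' option normalizes every subject and content string,
# roughly equivalent to calling lower() on each value:
print('Hello WORLD'.lower())  # prints: hello world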
Other preprocessing options for text features include controlling how characters are mapped, how words and characters are tokenized, and the number of common words to consider. You can read more about them here: https://ludwig-ai.github.io/ludwig-docs/user_guide/#text-features-preprocessing
Encoding Data into a Neural Net
The “text” type of input feature uses the spaCy natural language processing library to tokenize the text. For example, “hello world” in an email subject will be transformed into a list of two word tokens, “hello” and “world”. The machine learning model will keep track of these tokens and how often they are used across all the data points.
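To get a feel for what tokenization produces, here is a rough standalone illustration with spaCy (the same idea, not Ludwig's exact internal pipeline; it assumes the English model installed earlier with python -m spacy download en):

# A rough illustration of word tokenization with spaCy (not Ludwig's exact
# internals). Assumes the English model was installed earlier.
import spacy

nlp = spacy.load('en')
doc = nlp('hello world')
print([token.text for token in doc])  # ['hello', 'world']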
For the “subject” input feature we’re using the defaults that Ludwig supplies for the “text” feature type.
However, for the “content” input feature, we are overriding the default encoder, how the output is reduced in the encoder, and how the activation of neural net nodes functions. The encoder specified is stacked_parallel_cnn. This is a convolutional neural network with multiple stacked layers, and within each stacked layer there are parallel layers. As inputs are processed they go through these layers, which combine and activate in a particular way.
You can read more about it in the Ludwig User Guide: https://ludwig-ai.github.io/ludwig-docs/user_guide/#stacked-parallel-cnn-encoder
Remember, Ludwig gives you easier access to creating machine learning models and creating a pipeline for data and model training.
It is up to you to understand how and why to use specific encoders and specific options for the encoders. With respect to the spam emails you may get better results with a different encoder entirely than the one used here.
Output Features
After defining the input features, we need to define the output features:
output_features = [
    {
        'name': 'spam',
        'type': 'category',
    }
]
We want to know which category the given input features amount to. The category is either spam or not_spam.
After we train the model on the dataset, we will be making a prediction for another dataset that only includes the two input features, and we expect the prediction of the output feature category to match our expectation.
Training and Testing a Ludwig Machine Learning Model
Now that we have the input and output features of the model, we can define the model configuration:
base_model_definition = {
    'input_features': input_features,
    'output_features': output_features,
}
We need a function that will load an existing model or create a new model using that model configuration:
def load_or_create_model(model_dir, model_config, **model_kwargs):
    if os.path.exists(model_dir):
        print('Loading the model: {}'.format(model_dir))
        return (LudwigModel.load(model_dir, **model_kwargs), True)
    else:
        print('Defining the model')
        return (LudwigModel(model_config, **model_kwargs), False)
First we check if the model has already been trained and saved to a directory. If that's the case, we simply use LudwigModel.load to load up the pre-trained model. However, if that is not the case, we need to create a new LudwigModel using the model configuration.
Now let’s use this function to try to create/load the model:
base_model, base_model_loaded = load_or_create_model(
    'trained/basic', base_model_definition, gpus=[0], gpu_memory_limit=2000
)

if not base_model_loaded:
    print('Training the model')
    base_model.train(
        training_set='./datasets/spam_train.csv',
        test_set='./datasets/spam_test.csv',
        skip_save_processed_input=True,
    )
    base_model.save('trained/basic')
Notice that we supplied two keyword arguments, gpus and gpu_memory_limit; if you do not have a GPU, you can remove those. If you have one or more GPUs or more GPU memory available, you can update the list of GPU IDs and increase the memory limit. When running on a CPU with no GPU, it can take much longer to train a model. We want a faster feedback loop if we are adjusting multiple parts of the model and trying to compare a group of machine learning models.
Back to the code! If the model is newly created, we're going to use the train method with a training set and a test set. The training set is the data that is processed and used to train the model. The test set is a separate set of data, held out from training, that is used to check the machine learning model for accuracy. If the model is trained well, it will have a higher accuracy rate on the test data set.
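If you only have a single CSV of labelled emails, one way to carve out a held-out test set is with pandas (a minimal sketch; spam_all.csv is a hypothetical combined file, not part of the repo):

# A minimal sketch: split one labelled CSV into training and test sets.
# 'spam_all.csv' is a hypothetical combined file, not part of the repo.
import pandas as pd

emails = pd.read_csv('./datasets/spam_all.csv')
train_df = emails.sample(frac=0.8, random_state=42)  # 80% for training
test_df = emails.drop(train_df.index)                # held-out 20% for testing
train_df.to_csv('./datasets/spam_train.csv', index=False)
test_df.to_csv('./datasets/spam_test.csv', index=False)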
Additionally, we skip saving any processed input; we want the input to be processed every time. (It's just one less thing to delete if you're changing the model and re-training a few times.)
Then we save the trained model into a directory.
Now we’re going to evaluate the model against the test dataset:
stats = base_model.evaluate(
    dataset='./datasets/spam_test.csv'
)[0]
print(stats)
When we print out the stats, we will see the accuracy of the model on the test data set. Running the code gives you a baseline performance number; to improve on it, you can delete the trained model from its directory, update the model configuration, and re-run the program.
Creating a Second Model
Now let’s create a second model that is adjusted and compare its performance to the first model.
We begin by copying the first model’s configuration:
print('Creating a 2nd model classifier with some adjustments')
other_model_definition = base_model_definition.copy()
Then we override the input features:
other_model_definition['input_features'] = [
    {
        'name': 'subject',
        'type': 'text',
        'preprocessing': {
            'lowercase': True,
        },
    },
    {
        'name': 'content',
        'type': 'text',
        'preprocessing': {
            'lowercase': True,
        },
        'encoder': 'bert',  # See: https://ludwig-ai.github.io/ludwig-docs/user_guide/#bert-encoder
        'reduce_output': 'avg',
        'activation': 'tanh',  # Activation used for *all* layers.
        'num_filters': 64,
        'stacked_layers': [
            [
                {'filter_size': 2},
                {'filter_size': 3},
                {'filter_size': 4},
                {'filter_size': 5},
            ],
            [
                {'filter_size': 2},
                {'filter_size': 3},
                {'filter_size': 4},
                {'filter_size': 5},
            ],
            [
                {'filter_size': 2},
                {'filter_size': 3},
                {'filter_size': 4},
                {'filter_size': 5},
            ],
        ],
    },
]
There are a lot of changes to unpack here.
Differences Between the First Model and the Second Model
Encoders
The first difference in our models is that this second model uses the BERT encoder when encoding the content input feature. This means that the text is processed and encoded using BERT (Bidirectional Encoder Representations from Transformers). This encoder is pre-trained to work on English text. This is important because if you want to classify spam in other languages, you will either need to use a BERT model that's pre-trained on your target language or you will need to train it yourself.
If you want to change the language, add pretrained_model_name_or_path and set it to one of the values here: https://huggingface.co/transformers/pretrained_models.html
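For example, a hypothetical configuration for German emails might look like this (bert-base-german-cased is one of the pre-trained models on that list):

# A hypothetical 'content' feature for German emails; the model name comes
# from the Hugging Face pre-trained models list linked above.
{
    'name': 'content',
    'type': 'text',
    'encoder': 'bert',
    'pretrained_model_name_or_path': 'bert-base-german-cased',
}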
Later on you can change the encoder from BERT to GPT or T5 or ELECTRA. There are many text encoders available with Ludwig.
Reduction of the Neural Net Output
The next difference is that we are changing the reduction of the neural network output from 'sum' to 'avg'. This reduces the output tensor using the average value instead of the sum; for example, if we're seeing 0.5 and 0.5, we'd get 1.0 in the first model and 0.5 in the second model.
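Here is a toy illustration of the difference (plain NumPy, not Ludwig's actual tensors):

# Toy illustration of 'sum' vs 'avg' reduction (plain NumPy, not Ludwig).
import numpy as np

outputs = np.array([0.5, 0.5])
print(outputs.sum())   # 1.0 -- what the 'sum' reduction produces
print(outputs.mean())  # 0.5 -- what the 'avg' reduction produces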
Other ways to reduce the output include 'min', 'max', 'concat', 'last' and 'null'. Again, you will need to read further about each of these to understand which is the best to use.
This is one of the important parts of machine learning. You have to understand the problem you’re applying machine learning to and then you have to understand what tools are available in the toolbox and dig deep into the details to see if you can produce a more accurate machine learning model.
Neural Net Activation Method
The activation method is a function applied to the output. The first model uses the default, which is the relu function. The second model uses tanh instead. “Relu” means “rectified linear unit” while “tanh” means “hyperbolic tangent”.
[Figure: plots of the relu and tanh activation functions]
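If you want to reproduce these charts yourself, here is a quick matplotlib sketch:

# Quick sketch: plot the relu and tanh activation functions side by side.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.maximum(0, x))
ax1.set_title('relu')
ax2.plot(x, np.tanh(x))
ax2.set_title('tanh')
plt.show()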
You can find more activation functions in the TensorFlow (or Keras) documentation: https://keras.io/api/layers/activations/#layer-activation-functions
Number of Filters
The number of filters is also different between the two models. In the first model, we’re using the default which is 256, while in the second model we’re using far fewer filters, 64. The number of filters controls how many output channels there are.
If you want to learn more about filters and how they work in a neural network as part of machine learning, check out this article: https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/
Filter Size in Each Neural Network Layer
The filter size controls the width of the convolutional filter. The filter size is set within each layer of the neural network with different filter sizes in different layers.
Training the Second Model
Now let’s load or create the model and train it:
other_model, other_model_loaded = load_or_create_model(
    'trained/basic_other', other_model_definition, gpus=[0], gpu_memory_limit=2000
)

if not other_model_loaded:
    print('Training the model')
    other_model.train(
        training_set='./datasets/spam_train.csv',
        test_set='./datasets/spam_test.csv',
        skip_save_processed_input=True,
    )
    other_model.save('trained/basic_other')

other_stats = other_model.evaluate(
    dataset='./datasets/spam_test.csv'
)[0]
Comparing the Performance
Ludwig provides multiple visualization functions to show you graphs and charts of the performance and accuracy of the machine learning models. Remember that Ludwig is made so you can create many models and compare their performance easily to select the most promising model and adjust its parameters further.
Let’s see how well our first model for email spam detection works in comparison to our second model:
visualize.compare_performance(
    test_stats_per_model=[stats, other_stats],
    output_feature_name='spam',
    model_names=['Base Model', 'Other Model'],
)
We're using the compare_performance function from Ludwig; you can read more about it here: https://ludwig-ai.github.io/ludwig-docs/user_guide/#compare_performance
Applying the Models and Using Them to Predict Spam Emails
We have two machine learning models that are trained and tested and we have compared their performance. Let’s use those models and predict some more spam emails.
Here’s a convenient way to run the predictions and print out the result:
def print_predictions(unpredicted_emails, model):
    prediction_result, output_directory = model.predict(dataset=unpredicted_emails)
    emails = unpredicted_emails.join(prediction_result)
    for index, row in emails.iterrows():
        print('{} ({:.6f}): {} / {}'.format(
            row.get('spam_predictions'),
            row.get('spam_probability'),
            row.get('subject')[0:30],
            row.get('content')[0:30],
        ))
We're using the predict method on the model, and this is what you would use to see if a new email is spam. If the model is incorrect, you can train it further by correcting the data and re-training the model (or by using online training).
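For instance, here is a minimal sketch of classifying a single new email with the base model (the email text here is made up):

# A minimal sketch: classify one made-up email with the trained base model.
one_email = pd.DataFrame([{
    'subject': 'Limited time offer',
    'content': 'Click here to claim your reward',
}])
result, _ = base_model.predict(dataset=one_email)
print(result['spam_predictions'].iloc[0])  # 'spam' or 'not_spam'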
And now let’s run the two models and predict spam emails in the “unpredicted” CSV file:
unpredicted_emails = pd.read_csv('./datasets/spam_unpredicted.csv')
print('Prediction Results for Base Model')
print_predictions(unpredicted_emails, base_model)
print('Prediction Results for Other Model')
print_predictions(unpredicted_emails, other_model)
Here’s how to run the code:
$ source env/bin/activate
$ python basic.py
Then you will see the comparison between the models and then the prediction for each email and whether or not the model considers it to be spam.
Where To Go From Here
With this base of knowledge, you can do the following:
- Find new datasets and create new models for them: you could try recognizing digits on a larger dataset, or train a classifier for something other than spam emails.
- Tweak the models: iterate on the models and tune them in different ways to see which will get you the best performance.
- Learn TensorFlow directly: skip Ludwig and learn how to create a neural network using TensorFlow and learn how all the layers and configuration parameters work.
- Get the example running on a cloud-based GPU: use Lambda Labs or Google Colab to create a compute instance that includes a GPU for you to train and test the models on.