Machine Learning with TypeScript and TensorFlow: Training your first model

February 3, 2025

Artificial Intelligence (AI) is a broad and complex field, and getting started can be overwhelming. In this article, we learn how Machine Learning is related to AI and train our first model to determine whether a given piece of text is a spam message.

Artificial Intelligence and Machine Learning

Artificial Intelligence focuses on developing systems that can perform tasks that typically require human intelligence. For years, we’ve used AI to solve problems such as finding the shortest route to a destination or filtering emails based on keywords. However, not all AI systems learn from data. Instead, many rely on predefined rules or search algorithms.

Machine Learning is a branch of AI in which we create models that learn from data instead of being explicitly programmed with rules. Instead of following fixed instructions, models learn to recognize patterns. Based on this, they can make predictions to assist in decision-making.

For example, ChatGPT was trained on a large dataset containing diverse texts. It does not think on its own. Instead, it generates responses based on patterns in its training data. Thanks to this, it can predict the most appropriate response to a given question.

In Machine Learning, a model is a system trained to recognize patterns and make predictions based on what it has learned. For example, we train spam detection models with examples of both spam and non-spam messages and state which is which. One of the most popular tools for building such models is TensorFlow, developed by Google.

Preparing the data

To train our model, we provide the training data in JSON.

dataset.json

[

{ "text": "Congratulations! You won a free iPhone. Click here to claim.", "label": 1 },

{ "text": "Are we still meeting for coffee tomorrow?", "label": 0 },

{ "text": "Get 90% off on all products! Limited time offer.", "label": 1 },

{ "text": "Can you send me the report by tonight?", "label": 0 },

...

]

Each item in the dataset consists of the text and the label indicating whether it’s spam.

DatasetItem.ts

export enum DatasetLabel {

NotSpam = 0,

Spam = 1,

}

export interface DatasetItem {

text: string;

label: DatasetLabel;

}

We often use numbers for labels to make it easier for the model to process the data.

Processing each word

To train the model, we need to transform our text into numbers. The first step is to split the sentences into words. A reliable way of doing that is using the natural library. It allows us to split a sentence, such as Hello, world! into an array of words like ['Hello', 'world'].

Next, we need to transform each word into a number. A straightforward way to do that is to generate the md5 hash for each word using the createHash function built into Node.js

tokenizeText.ts

import { WordTokenizer } from 'natural';

import { createHash } from 'crypto';

export function tokenizeText(text: string) {

const tokenizer = new WordTokenizer();

const tokenizedText = tokenizer.tokenize(text);

return tokenizedText.map((word) => {

const hash = createHash('md5').update(word).digest('hex');

const reducedHash = hash.slice(0, 8);

return parseInt(reducedHash, 16);

});

}

Above, we take only first 8 characters of the hash into account since the probability of them not being unique is low.

Now, we can run our function on every element from our dataset.

import { DatasetLabel } from './DatasetItem';

export interface ProcessedDatasetItem {

input: number[];

output: DatasetLabel;

}

const rawDatasetData = await readFile('./dataset.json', 'utf8');

const dataset: DatasetItem[] = JSON.parse(rawDatasetData);

const processedData: ProcessedDatasetItem[] = dataset.map((entry) => {

return {

input: tokenizeText(entry.text),

output: entry.label,

};

});

The above solution is not perfect since words such as “win” and “winning” will produce completely different hashes while having a similar meaning. We could improve this by implementing stemming or lemmatization.

Adding padding

In our training data, each sentence can have a different length, resulting in hashes with different lengths as well. However, TensorFlow requires us to ensure that each input has the same length. The most straightforward way to achieve that is to add padding with zeros to each input.

First, we need a way to figure out the longest input to know how many zeros we need to append.

function getLongestTokenLength(processedDatasetItems: ProcessedDatasetItem[]) {

const allInputLengths = processedDatasetItems.map((entry) => entry.input.length);

return Math.max(...allInputLengths);

}

Then, we can use this knowledge to use zeros for each input.

export function addPaddingToInput(input: number[], longestTokenLength: number) {

const padding = Array(longestTokenLength - input.length).fill(0);

return [...input, ...padding];

}

Creating tensors

TensorFlow uses tensors, which are multi-dimensional arrays structured and optimized for machine learning.

We need to convert our tokenized texts into a two-dimensional tensor. In our case, each row is a sentence, and each column is a word in the sentence.

We also need to convert our labels into a one-dimensional tensor. We also need to apply one-hot encoding.

processTextData.ts

import { readFile } from 'fs/promises';

import { tensor2d, oneHot, tensor1d } from '@tensorflow/tfjs-node';

import { DatasetItem } from './DatasetItem';

import { ProcessedDatasetItem } from './ProcessedDatasetItem';

import { tokenizeText } from './tokenizeText';

export async function processTextData() {

const rawDatasetData = await readFile('./dataset.json', 'utf8');

const dataset: DatasetItem[] = JSON.parse(rawDatasetData);

const processedData: ProcessedDatasetItem[] = dataset.map((entry) => {

return {

input: tokenizeText(entry.text),

output: entry.label,

};

});

// Get the longest tokenized text length for padding

const maxTokenLength = getLongestTokenLength(processedData);

const paddedTokenizedText = processedData.map((entry) => {

return addPaddingToInput(entry.input, maxTokenLength);

});

// Convert arrays to tensors

const inputTensor = tensor2d(paddedTokenizedText);

const allOutputs = processedData.map((entry) => entry.output);

const labelTensor = oneHot(tensor1d(allOutputs, 'int32'), 2);

return { inputTensor, labelTensor, maxTokenLength };

}

Preparing the model

In machine learning, a model is a structure that learns from data and makes predictions. A neural network is a type of model that’s inspired by a human brain. It consists of layers of artificial neurons that process information. Every layer transforms the input into something more useful. Multiple layers can work together to learn complex patterns.

Creating the model

In our case, we will create a sequential model where layers are stacked in a straight line.

import { sequential } from '@tensorflow/tfjs-node';

const spamModel = sequential();

First, we need to add a dense layer. It receives the data we’ve prepared and learns how to interpret it into meaningful patterns. It will learn which words are common in spam messages.

const { inputTensor, labelTensor, maxTokenLength } = await processTextData();

spamModel.add(

layers.dense({

inputShape: [maxTokenLength],

units: 8,

activation: 'relu',

}),

);

Above, we’re defining the shape of our input data by specifying the maximum input length. By writing units: 8, we specify that we want to use eight neurons in this layer. It’s a good balance between complexity and speed.

The activation function decides whether a neuron should activate (send its output forward) or remain inactive. Above, we’re choosing the ReLU (Rectified Linear Unit) activation function, which is very fast and works well in most cases.

We’re also adding another dense layer as the final output layer in our model. Its job is to make the final decision: is the message spam or not spam?

1	spamModel.add(layers.dense({ units: 2, activation: 'softmax' }));

The softmax activation function ensures that the output consists of the probabilities that sum to 100%. For example, a message might have an 80% probability of being spam and a 20% probability of not being spam.

After defining the model’s layers, we need to compile it before training.

spamModel.compile({

optimizer: 'adam',

loss: 'categoricalCrossentropy',

metrics: ['accuracy'],

});

Above, we tell TensorFlow how the model should learn by specifying:

How to optimize the weights. In a neural network, weights are numbers that control how much influence each input has on the final decision.
The optimizer updates the model’s weights to improve the predictions during training. The Adam optimizer works well for text classification problems.
The loss function that measures how far the model’s predictions are from the correct answer. Categorical cross entropy tells the model how bad its guess was. If it makes a wrong guess, it gets a penalty, and the model learns by making the penalty smaller over time
The metric we want to use to track the model’s performance during training. Accuracy measures how often the model predicts correctly and is the most intuitive metric for classification problems.

Training the model

Finally, we can train our neural network with the processed data using the fit function and save it on our drive.

trainSpamClassifier.ts

import { sequential, layers } from '@tensorflow/tfjs-node';

import { processTextData } from './processTextData';

async function trainSpamClassifier() {

const { inputTensor, labelTensor, maxTokenLength } = await processTextData();

const spamModel = sequential();

spamModel.add(

layers.dense({

inputShape: [maxTokenLength],

units: 8,

activation: 'relu',

}),

);

spamModel.add(layers.dense({ units: 2, activation: 'softmax' }));

spamModel.compile({

optimizer: 'adam',

loss: 'categoricalCrossentropy',

metrics: ['accuracy'],

});

await spamModel.fit(inputTensor, labelTensor, {

epochs: 10,

batchSize: 2,

});

await spamModel.save('file://./model');

}

trainSpamClassifier().then(() => {

console.log('Training complete! Model saved.');

});

By specifying epochs: 10, we state that we want to train for ten full cycles over the same dataset to improve accuracy. We set the batchSize to two to train in small groups of 2 messages at a time to make learning stable.

Making predictions

Using the save('file://./model') method stores the model on our hard drive in the model directory. We can now use it to make predictions.

First, we need to load our model and determine the size of the inputs.

predictSpam.ts

import { LayersModel, loadLayersModel } from '@tensorflow/tfjs-node';

import { DatasetLabel } from './DatasetItem';

async function predictSpam(textMessage: string) {

const spamModel: LayersModel = await loadLayersModel(

'file://./model/model.json',

);

// Extract shape correctly from the first input layer

const modelInputShape = spamModel.inputs[0].shape;

// Get expected input size

const maxTokenLength = modelInputShape[1];

// ...

}

We want to determine if a given text message is spam or not. To do that, we need to parse it the same way we parsed the data from our dataset.

predictSpam.ts

import { LayersModel, loadLayersModel, tensor2d } from '@tensorflow/tfjs-node';

import { DatasetLabel } from './DatasetItem';

import { addPaddingToInput } from './processTextData';

import { tokenizeText } from './tokenizeText';

async function predictSpam(textMessage: string) {

const spamModel: LayersModel = await loadLayersModel(

'file://./model/model.json',

);

// Extract shape correctly from the first input layer

const modelInputShape = spamModel.inputs[0].shape;

// Get expected input size

const maxTokenLength = modelInputShape[1];

// Tokenize the input text message

const tokenizedInput = tokenizeText(textMessage);

const paddedTokenizedInput = addPaddingToInput(

tokenizedInput,

maxTokenLength,

);

// Convert input data into a tensor

const inputTensor = tensor2d([paddedTokenizedInput]);

// ...

}

Finally, we can make a prediction and dispose of the tensors when they are no longer necessary to free up the memory.

predictSpam.ts

import { LayersModel, loadLayersModel, tensor2d } from '@tensorflow/tfjs-node';

import { DatasetLabel } from './DatasetItem';

import { addPaddingToInput } from './processTextData';

import { tokenizeText } from './tokenizeText';

async function predictSpam(textMessage: string) {

const spamModel: LayersModel = await loadLayersModel(

'file://./model/model.json',

);

// Extract shape correctly from the first input layer

const modelInputShape = spamModel.inputs[0].shape;

// Get expected input size

const maxTokenLength = modelInputShape[1];

// Tokenize the input text message

const tokenizedInput = tokenizeText(textMessage);

const paddedTokenizedInput = addPaddingToInput(

tokenizedInput,

maxTokenLength,

);

// Convert input data into a tensor

const inputTensor = tensor2d([paddedTokenizedInput]);

const predictionTensor = spamModel.predict(inputTensor);

if (Array.isArray(predictionTensor)) {

throw new Error('Expected a single Tensor, but got an array.');

}

const predictionWithHighestChance = predictionTensor.argMax(1);

const prediction = predictionWithHighestChance.dataSync()[0];

inputTensor.dispose();

predictionTensor.dispose();

predictionWithHighestChance.dispose();

return prediction === DatasetLabel.Spam;

}

Our prediction tensor returns both the chance of the message being spam and the chance of it not being spam. The following code checks which chance is higher:

1 2	const predictionWithHighestChance = predictionTensor.argMax(1); const prediction = predictionWithHighestChance.dataSync()[0];

Our predictSpam function returns a promise that resolves to a boolean. If it’s true, it means that a particular text is spam.

predictSpam(

'Get 90% off on all products! Limited time offer.',

).then((isSpam) => {

if (isSpam) {

console.log('Spam');

} else {

console.log('Not spam');

}

});

Summary

In this article, we explored Artificial Intelligence (AI) and Machine Learning (ML), understanding how Machine Learning allows models to learn from data instead of following predefined rules.

To apply this knowledge, we built a spam detection model using TensorFlow. First, we prepared a dataset of messages labeled as spam or not. Next, we processed the text data, converting words into hashed numbers. We then built a neural network model, compiled it, and trained it over multiple cycles to improve its accuracy. Finally, we saved the trained model and used it to make predictions about new messages.

The code from this article explains the basics, but there are various ways in which our solution can be improved. Feel free to experiment with different techniques to make the model even better.