An introduction to machine learning – part three

Nearly everyone who keeps up to date with technological developments has heard the term ‘machine learning’. It’s a nice buzzword that has been used more and more in recent years. But what does machine learning actually entail? What is the technology that lies behind it?

To coincide with the launch of Clocktimizer’s machine learning engine, we will be releasing three blogs written by Susan Brommer. Susan is Clocktimizer’s resident AI genius and the person responsible for building much of our own machine learning engine. In the blog series we will explain what machine learning is, how it came to be, and what we can expect from it in the future. We will even go into some detail of the inner mechanisms of machine learning. In the previous blogs, we discussed why machine learning is important, and laid out the history of machine learning. In this third and final blog we are going to take a look at how a machine can learn.

The types of machine learning

There is no straight answer to the question “how does a machine learn?” There are many different ways in which a machine can learn. Broadly speaking, there are two types of machine learning: supervised learning and unsupervised learning.

In supervised learning, the machine is given data for which we already know what the result should look like. Say we want to predict house prices based on the size of the house. We might have data on the size of houses and their price on the real estate market. In this case we have data on the result (the price). The machine can now try to learn the relation between the size and the price, so that it can estimate the price of houses where we only have data on their size.

In unsupervised learning, the machine is given data for which we do not know what the result should look like. You can use this type of ‘learning’ when you want to discover how the data is structured. For example, if you have a collection of songs, you can ask the machine to categorise the songs based on the beats per minute, the occurrence of certain frequencies, and the length of the song. The machine will look for structure and categorise the songs. It can then recommend new songs that are similar to songs you like to listen to, based on the structure that it has learned.

Preparing the data

Although there are many different ways in which a machine can learn, the underlying process is always the same. First, you need to prepare the data and choose an algorithm. Second, you need to train the machine. Finally, you need to test the machine.

Preparing the data consists of two steps: cleaning the data, and transforming the data so that the machine can work with it. Raw data is often noisy, inconsistent, and incomplete. Although machine learning algorithms can normally handle a certain amount of unclean data, it is always better to use clean data if you are able to.
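To make the cleaning step a little more concrete, here is a minimal sketch in Python. It assumes a made-up list of house records with a size and a price, and the function name clean_rows and the rules it applies (drop incomplete, unreadable, or impossible values) are purely illustrative. Real cleaning rules depend entirely on your own data.

```python
def clean_rows(rows):
    """Keep only rows with a usable size and price, converted to numbers."""
    cleaned = []
    for row in rows:
        size, price = row.get("size"), row.get("price")
        if size is None or price is None:
            continue  # incomplete: a value is missing
        try:
            size, price = float(size), float(price)
        except (TypeError, ValueError):
            continue  # inconsistent: the value is not a number
        if size <= 0 or price <= 0:
            continue  # noisy: an impossible value
        cleaned.append({"size": size, "price": price})
    return cleaned


# Hypothetical raw data, as it might arrive from a spreadsheet export.
raw = [
    {"size": "120", "price": "250000"},
    {"size": None, "price": "180000"},
    {"size": "95", "price": "n/a"},
    {"size": "-5", "price": "100000"},
    {"size": 80, "price": 150000},
]
cleaned = clean_rows(raw)
```

Only the first and last records survive here. The others are dropped rather than repaired, which is the simplest possible policy and not always the best one.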


After cleaning the data, it should be transformed so that the machine can work with it. A machine usually works with the relevant properties of the data, rather than the data itself. These properties are called features. For example, in spam detection algorithms, the features would be whether the email header is written in all-caps, whether certain spam words (“BUY NOW!!!”) occur, or the grammatical correctness of the text.
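As a rough illustration of turning raw data into features, the sketch below converts an email into a handful of numbers and flags. The phrase list, the function name extract_features, and the choice to look at the subject line are all assumptions made for this example. Grammatical correctness is left out because it is much harder to measure.

```python
# Hypothetical list of phrases that often show up in spam.
SPAM_PHRASES = ("buy now", "free money", "act now")


def extract_features(subject, body):
    """Describe an email by a few properties instead of its full text."""
    text = f"{subject} {body}".lower()
    letters = [c for c in subject if c.isalpha()]
    return {
        # True when the subject has letters and every one is upper case.
        "subject_all_caps": bool(letters) and all(c.isupper() for c in letters),
        # How many times any of the spam phrases occurs in the email.
        "spam_phrase_count": sum(text.count(p) for p in SPAM_PHRASES),
        # Spam tends to be enthusiastic about punctuation.
        "exclamation_marks": text.count("!"),
    }


features = extract_features("BUY NOW!!!", "Act now to get free money")
```

The machine never sees the email itself, only this small dictionary of features.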

Often, your choice of features will depend on the problem you are trying to solve. If you want to recognise characters in an image, the features may be the number of black pixels along horizontal and vertical directions. However, if you want the machine to recognise objects in an image, the features may be the edges, lines, or circles in an image. Coming up with the right features is a task that almost always involves a human. The machine might help to figure out whether a certain feature is useful, but inventing the feature in the first place is usually left to a human mind.

Choosing the algorithm

Alongside cleaning the data, you will need to choose which algorithm to use. This depends on many factors: What kind of data do you have? What do you want to know? How accurate should the machine be? How much time and computing resources do you have? What is the goal of the machine’s learning? Three of the most commonly occurring goals of machine learning are clustering, classification, and regression.

Clustering

Clustering involves grouping data points, where data points in the same group should have similar features, and data points in different groups should have dissimilar features. For example, you might want to cluster your customers based on their age and their income, so that the marketing department can target new customers more efficiently.

A clustering algorithm could work in the following way. The machine starts with a few clusters, each with a random age and a random income. Every customer is assigned to the cluster that they are most similar to. Each cluster then adjusts its age and income to be the average age and income of all customers belonging to it. The last two steps are repeated until the clusters no longer change. The next time you give the machine the age and income of a customer, it will tell you which cluster that customer belongs to, and you can target this customer more efficiently.
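The procedure described above is commonly known as k-means clustering. Below is a minimal sketch of it in Python, using invented customer data (age, and income in thousands). One assumption differs slightly from the description: instead of a completely random age and income, each cluster starts at the position of a randomly picked customer, which is a common and convenient way to choose a starting point.

```python
import random


def distance_sq(a, b):
    """Squared straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the cluster centre that is most similar to the point."""
    return min(range(len(centroids)), key=lambda i: distance_sq(point, centroids[i]))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start at k randomly picked customers
    for _ in range(max_iterations):
        # Step 1: assign every customer to the most similar cluster.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Step 2: move each cluster to the average of its customers.
        # An empty cluster simply stays where it was.
        new_centroids = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break  # nothing changed, so we are done
        centroids = new_centroids
    return centroids


# Hypothetical customers: (age, income in thousands).
customers = [(22, 20), (25, 24), (27, 22), (58, 80), (61, 85), (64, 78)]
centroids = k_means(customers, k=2)
new_customer_cluster = nearest((60, 82), centroids)
```

Note that age and income happen to be on a similar scale here. With income in whole dollars, income would dominate the distance, so in practice the features are usually rescaled first.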

Classification

Classification involves learning the boundaries between different classes. Say you want a machine to learn the difference between apples and oranges. You may have data on the colour and size of pieces of fruit, and for each piece you know whether it is an apple or an orange. You feed this data to the machine, and the machine will then try to create a boundary that separates apples from oranges. The next time you ask your machine to identify a piece of fruit based on size and colour, it will look at which side of the boundary the piece of fruit is located, and predict that type of fruit.
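There are many ways to draw such a boundary. One of the simplest is a nearest-centroid classifier: compute the average apple and the average orange, and assign a new piece of fruit to whichever average it is closest to. The boundary is then the line halfway between the two averages. The sketch below assumes made-up measurements, with colour expressed as a ‘redness’ score from 0 to 10 and size as a diameter in centimetres.

```python
def mean_point(points):
    """Average of a list of points, feature by feature."""
    return tuple(sum(v) / len(points) for v in zip(*points))


def train(examples):
    """Learn one average point per class from labelled examples."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: mean_point(points) for label, points in by_label.items()}


def predict(model, features):
    """Pick the class whose average is closest to the given features."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(features, model[label])),
    )


# Hypothetical labelled fruit: ((redness 0-10, diameter in cm), label).
fruit = [
    ((9.0, 7.0), "apple"), ((8.0, 7.5), "apple"), ((8.5, 6.5), "apple"),
    ((3.0, 8.0), "orange"), ((2.0, 8.5), "orange"), ((2.5, 7.5), "orange"),
]
model = train(fruit)
guess = predict(model, (8.0, 7.0))
```

More sophisticated algorithms learn curved or more carefully placed boundaries, but the idea of ‘which side are you on’ stays the same.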

Regression

Regression involves learning the relationship between properties of the data. For example, you want a machine to predict house prices based on the size of the house. You feed data on house prices and sizes to the machine, and it draws the line that best fits the data. The next time you give your machine the size of a house, it will tell you the price that corresponds to it according to the line.
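A common way to find the line that fits best is ordinary least squares, which picks the line with the smallest total squared distance to the data points. Here is a minimal sketch with invented house sizes (in square metres) and prices. The function names fit_line and predict_price are only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(size, slope, intercept):
    """Read the price for a given size off the fitted line."""
    return slope * size + intercept


# Hypothetical training data: house sizes and their sale prices.
sizes = [50, 80, 100, 120, 150]
prices = [155000, 205000, 255000, 285000, 350000]

slope, intercept = fit_line(sizes, prices)
estimate = predict_price(110, slope, intercept)
```

The slope tells you roughly how much the price rises for every extra square metre, and the intercept is where the line crosses a size of zero.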

Training & testing the machine

After having prepared the data and having chosen the algorithm, it’s time to let the machine learn from the data. This can take a long, long time. If it takes too long, you might want to reconsider your earlier choices about the data. Using less data or fewer features can greatly decrease the time it takes your machine to learn. However, with less data and fewer features, your machine is also likely to be less accurate. You will have to find a balance between the time your machine takes and the accuracy it will achieve.

The final step will be to test your machine. This is much like students taking an exam: you give the machine data that it has not seen before, and then measure how well it performs. You will almost certainly find that it underperforms. This means you will have to tweak the algorithm, and train and test again. And again.
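The ‘exam’ can be sketched as follows: set aside part of the labelled data before training, and afterwards count how often the machine gets those unseen examples right. The fruit data, the 25 percent test share, and the very simple nearest-neighbour ‘machine’ (which answers with the label of the most similar training example) are all assumptions for this illustration.

```python
import random


def split(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and set a fraction aside for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def predict_nearest(train_set, features):
    """Answer with the label of the most similar training example."""
    closest = min(
        train_set,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], features)),
    )
    return closest[1]


def accuracy(train_set, test_set):
    """Fraction of unseen examples that are labelled correctly."""
    correct = sum(predict_nearest(train_set, f) == label for f, label in test_set)
    return correct / len(test_set)


# Hypothetical labelled fruit: ((redness 0-10, diameter in cm), label).
examples = [
    ((9.0, 7.0), "apple"), ((8.0, 7.5), "apple"),
    ((8.5, 6.5), "apple"), ((9.0, 6.0), "apple"),
    ((3.0, 8.0), "orange"), ((2.0, 8.5), "orange"),
    ((2.5, 7.5), "orange"), ((3.0, 7.0), "orange"),
]
train_set, test_set = split(examples)
test_accuracy = accuracy(train_set, test_set)
```

This toy data is so neatly separated that the machine passes its exam with ease. Real data is rarely that kind, which is why the first test score is usually a disappointment.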

Every algorithm works with values called training parameters. These parameters are like the settings of the algorithm. The goal of the training and testing phase is to find the optimal parameters. What the optimal parameters are depends greatly on the goal of the machine learning and the data that is fed to the machine. So you will need to repeat training and testing until you have found parameters that give you results you are satisfied with.
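As an illustration of such a setting, take a k-nearest-neighbours classifier: it looks at the k most similar training examples and lets them vote. The value of k is a parameter you have to choose yourself. The sketch below simply tries a few values and keeps the one that scores best on data held back from training. The data is invented and deliberately contains one mislabelled fruit.

```python
from collections import Counter


def predict_knn(train_set, features, k):
    """Let the k most similar training examples vote on the label."""
    neighbours = sorted(
        train_set,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], features)),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train_set, held_out, k):
    correct = sum(predict_knn(train_set, f, k) == label for f, label in held_out)
    return correct / len(held_out)


def best_k(train_set, held_out, candidates=(1, 3, 5)):
    """Try each candidate setting and return the best one with all scores."""
    scores = {k: accuracy(train_set, held_out, k) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical labelled fruit: ((redness 0-10, diameter in cm), label).
train_set = [
    ((8.0, 7.0), "apple"), ((8.5, 7.0), "apple"),
    ((9.0, 7.0), "apple"), ((7.5, 7.0), "apple"),
    ((2.0, 8.0), "orange"), ((2.5, 8.0), "orange"),
    ((3.0, 8.0), "orange"), ((3.5, 8.0), "orange"),
    ((8.2, 7.0), "orange"),  # a mislabelled example (noise)
]
held_out = [
    ((8.25, 7.0), "apple"), ((2.2, 8.0), "orange"),
    ((8.8, 7.0), "apple"), ((3.2, 8.0), "orange"),
]
chosen_k, scores = best_k(train_set, held_out)
```

With k set to 1 the machine is fooled by the single mislabelled fruit, while a larger k lets the surrounding examples outvote it. Which setting works best will differ for every data set, which is exactly why the train-and-test loop has to be repeated.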

Using the machine

After all this hard work, it is now time to actually use the machine. There are still some important things to point out. First of all, you should remember that the goal of machine learning is never to make perfect predictions, but rather to make guesses that are useful enough. This means that your machine will never be a hundred percent certain about its answers. It might give you information on how certain it is, like: “I am 45 percent sure that this is an apple, but it might just as well be an orange.” Or: “I think it is most likely that this house is worth 200,000 dollars, but it might just as well be a bit more or less. However, I am 95 percent sure that the price is between 180,000 and 220,000 dollars.”
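One simple way a machine can report how sure it is, sketched here under the assumption of a nearest-neighbours approach, is to look at the most similar known examples and report what share of them belongs to each class. The data and the function name class_shares are invented for this example, and the shares are only a rough indication of certainty, not a calibrated probability.

```python
from collections import Counter


def class_shares(train_set, features, k=3):
    """Share of each label among the k most similar training examples."""
    neighbours = sorted(
        train_set,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], features)),
    )[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: n / k for label, n in counts.items()}


# Hypothetical labelled fruit: ((redness 0-10, diameter in cm), label).
train_set = [
    ((8.0, 7.0), "apple"), ((7.0, 7.0), "apple"), ((6.0, 7.2), "apple"),
    ((3.0, 8.0), "orange"), ((4.0, 8.2), "orange"), ((5.2, 7.8), "orange"),
]
# A borderline piece of fruit, somewhere between the two groups.
shares = class_shares(train_set, (5.5, 7.5))
```

Here two of the three most similar fruits are apples and one is an orange, so the machine leans towards ‘apple’ without being certain.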

The second important thing to realise is that you are the one making decisions. Are you going to eat the apple-but-maybe-an-orange? Should you bid 180,000, 200,000, or 220,000 for a house? The machine helps you to get insight into the data that you have, but in the end it is you who stays in control.

Clocktimizer is a leading business intelligence solution that helps law firms understand who does what, when, where, and at what costs.