##### By Katrina Wakefield, Marketing, SAS UK

### A guide to machine learning algorithms and their applications

The term ‘machine learning’ is often, incorrectly, used interchangeably with artificial intelligence (AI), but machine learning is actually a subfield of AI. Machine learning is also often referred to as predictive analytics or predictive modelling.

Coined by American computer scientist Arthur Samuel in 1959, the term ‘machine learning’ is defined as a “computer’s ability to learn without being explicitly programmed”.

At its most basic, machine learning uses programmed algorithms that receive and analyse input data to predict output values within an acceptable range. As new data is fed to these algorithms, they learn and optimise their operations to improve performance, developing ‘intelligence’ over time.

There are four types of machine learning algorithms: supervised, semi-supervised, unsupervised and reinforcement.

### Supervised learning

In supervised learning, the machine is taught by example. The operator provides the machine learning algorithm with a known dataset that includes inputs and their desired outputs, and the algorithm must find a method to determine how to arrive at those outputs from the inputs. While the operator knows the correct answers to the problem, the algorithm identifies patterns in data, learns from observations and makes predictions. The algorithm makes predictions and is corrected by the operator, and this process continues until the algorithm achieves a high level of accuracy/performance.

Under the umbrella of supervised learning fall: Classification, Regression and Forecasting.

**Classification**: In classification tasks, the machine learning program must draw a conclusion from observed values and determine to what category new observations belong. For example, when filtering emails as ‘spam’ or ‘not spam’, the program must look at existing observational data and filter the emails accordingly.

**Regression**: In regression tasks, the machine learning program must estimate, and understand, the relationships among variables. Regression analysis focuses on one dependent variable and a series of other changing variables, making it particularly useful for prediction and forecasting.

**Forecasting**: Forecasting is the process of making predictions about the future based on past and present data, and is commonly used to analyse trends.

### Semi-supervised learning

Semi-supervised learning is similar to supervised learning, but instead uses both labelled and unlabelled data. Labelled data is essentially information that has meaningful tags so that the algorithm can understand the data, whilst unlabelled data lacks that information. By using this combination, machine learning algorithms can learn to label unlabelled data.
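One simple way to picture this is self-training, where labels spread outwards from the few labelled examples to their nearest unlabelled neighbours. The sketch below is illustrative only: the function name `self_train`, the one-dimensional data and the nearest-neighbour rule are assumptions made for this example, not a method described in this guide.

```python
def self_train(labelled, unlabelled):
    """Label each unlabelled value by copying the label of its nearest labelled value.

    labelled: dict mapping a numeric value to its label.
    unlabelled: list of numeric values without labels.
    """
    labelled = dict(labelled)      # work on copies so the inputs are untouched
    remaining = list(unlabelled)
    while remaining:
        # Pick the unlabelled value that sits closest to any labelled value;
        # it is the one we can label with the most confidence.
        value, neighbour = min(
            ((u, l) for u in remaining for l in labelled),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        labelled[value] = labelled[neighbour]
        remaining.remove(value)
    return labelled


# Hypothetical data: two labelled readings and four unlabelled ones.
known = {1.0: "A", 10.0: "B"}
unknown = [2.0, 3.0, 8.0, 9.0]
result = self_train(known, unknown)
print(result)
```

Each newly labelled value immediately becomes part of the labelled set, so later decisions can lean on it.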

### Unsupervised learning

Here, the machine learning algorithm studies data to identify patterns. There is no answer key or human operator to provide instruction. Instead, the machine determines the correlations and relationships by analysing available data. In an unsupervised learning process, the machine learning algorithm is left to interpret large data sets and address that data accordingly. The algorithm tries to organise that data in some way to describe its structure. This might mean grouping the data into clusters or arranging it in a way that looks more organised.

As it assesses more data, its ability to make decisions on that data gradually improves and becomes more refined.

Under the umbrella of unsupervised learning fall:

**Clustering**: Clustering involves grouping sets of similar data (based on defined criteria). It’s useful for segmenting data into several groups and performing analysis on each data set to find patterns.

**Dimension reduction**: Dimension reduction reduces the number of variables being considered to find the exact information required.

### Reinforcement learning

Reinforcement learning focuses on regimented learning processes, where a machine learning algorithm is provided with a set of actions, parameters and end values. Once the rules are defined, the machine learning algorithm explores different options and possibilities, monitoring and evaluating each result to determine which one is optimal. Reinforcement learning teaches the machine through trial and error. It learns from past experiences and begins to adapt its approach in response to the situation to achieve the best possible result.
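The trial-and-error loop can be illustrated with an ‘epsilon-greedy’ strategy, which is one common approach among many and is not named in this guide. In this minimal sketch the function name, the three actions and their fixed payoffs are all hypothetical, and the payoffs are kept noise-free so the behaviour is easy to follow.

```python
import random


def epsilon_greedy(payoffs, steps=1000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays best.

    payoffs: hypothetical fixed reward for each action (unknown to the learner).
    epsilon: fraction of steps spent exploring a random action.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(payoffs)   # running average reward per action
    counts = [0] * len(payoffs)        # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(payoffs))        # explore
        else:
            action = estimates.index(max(estimates))    # exploit best so far
        reward = payoffs[action]
        counts[action] += 1
        # Incremental update of the average reward seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


payoffs = [0.2, 0.5, 0.8]
estimates, counts = epsilon_greedy(payoffs)
best_action = estimates.index(max(estimates))
print(best_action, counts)
```

The learner starts with no knowledge, occasionally tries something new, and shifts towards whichever action has rewarded it most so far.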

### What machine learning algorithms can you use?

Choosing the right machine learning algorithm depends on several factors, including, but not limited to: data size, quality and diversity, as well as what answers businesses want to derive from that data. Additional considerations include accuracy, training time, parameters, data points and much more. Therefore, choosing the right algorithm comes down to a combination of business need, specification, experimentation and time available. Even the most experienced data scientists cannot tell you which algorithm will perform the best before experimenting with others. We have, however, compiled a machine learning algorithm ‘cheat sheet’ which will help you find the most appropriate one for your specific challenges.

### What are the most common and popular machine learning algorithms?

**Naïve Bayes Classifier Algorithm (Supervised Learning - Classification)**

The Naïve Bayes classifier is based on Bayes’ theorem and treats every feature as independent of every other feature, given the class. It allows us to predict a class/category, based on a given set of features, using probability. Despite its simplicity, the classifier does surprisingly well and is often used because it can hold its own against, and sometimes outperform, more sophisticated classification methods.
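As a rough illustration, the sketch below applies the idea to the spam example used earlier. It assumes a word-count model with add-one smoothing; the function names and the four training emails are invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Count how often each class and each word-within-class occurs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(samples, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, words):
    """Return the class with the highest (log) probability for these words."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # log P(class) + sum of log P(word | class); each word is treated
        # as independent, which is the 'naive' assumption.
        score = math.log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training emails, already split into words.
emails = [
    ["win", "money", "now"],
    ["cheap", "money", "offer"],
    ["meeting", "at", "noon"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "not spam", "not spam"]
model = train_naive_bayes(emails, labels)
print(predict_naive_bayes(model, ["money", "offer", "now"]))
print(predict_naive_bayes(model, ["project", "meeting"]))
```

Working with log probabilities avoids multiplying many small numbers together, which is a common practical choice rather than a requirement of the method.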

**K Means Clustering Algorithm (Unsupervised Learning - Clustering)**

The K Means Clustering algorithm is a type of unsupervised learning, which is used to categorise unlabelled data, i.e. data without defined categories or groups. The algorithm works by finding groups within the data, with the number of groups represented by the variable K. It then works iteratively to assign each data point to one of K groups based on the features provided.
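The iterative assignment described above can be sketched in a few lines. This is a minimal version under stated assumptions: the starting centroids are simply the first K points (real implementations usually pick them more carefully), and the function name and sample points are hypothetical.

```python
def k_means(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and update steps."""
    # Assumption: use the first k points as starting centroids for repeatability.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, point in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids, assignments


# Hypothetical unlabelled data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)
```

Note that the algorithm never sees a label; the groups emerge purely from how close the points are to one another.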

**Support Vector Machine Algorithm (Supervised Learning - Classification)**

Support Vector Machine algorithms are supervised learning models that analyse data for classification and regression. They essentially sort data into categories, which is achieved by providing a set of training examples, each marked as belonging to one or the other of two categories. The algorithm then works to build a model that assigns new values to one category or the other.
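To make the two-category idea concrete, here is a minimal sketch of a linear Support Vector Machine trained by sub-gradient descent on the hinge loss. This is only one of several ways to fit such a model, and the function names, hyperparameter values and sample points are assumptions chosen for illustration.

```python
def train_linear_svm(points, labels, learning_rate=0.01, regularisation=0.01, epochs=1000):
    """Fit weights w and bias b so that sign(w.x + b) separates the two classes.

    labels must be -1 or +1.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or too close to the boundary: push it away.
                w = [wi + learning_rate * (y * xi - regularisation * wi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Point is safely classified: only shrink the weights slightly.
                w = [wi - learning_rate * regularisation * wi for wi in w]
    return w, b


def svm_predict(w, b, x):
    """Assign a new value to one category (-1) or the other (+1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical training examples, each marked as one of two categories.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
print(svm_predict(w, b, (1.0, 1.0)), svm_predict(w, b, (6.0, 6.0)))
```

The `margin < 1` test is what distinguishes this from a plain perceptron: the model is penalised not only for wrong answers but also for correct answers that sit too close to the dividing line.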

**Linear Regression (Supervised Learning/Regression)**

Linear regression is the most basic type of regression. Simple linear regression allows us to understand the relationships between two continuous variables.
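For two continuous variables, the best-fitting straight line can be computed directly with the standard least-squares formulas. The sketch below assumes invented advertising-spend and sales figures; the function name is likewise hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend versus sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_simple_linear_regression(spend, sales)
print(slope, intercept)
```

Here the slope says how much sales change for each extra unit of spend, and the intercept is the predicted sales at zero spend.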

**Logistic Regression (Supervised learning – Classification)**

Logistic regression focuses on estimating the probability of an event occurring based on the previous data provided. It is used to model a binary dependent variable, that is, one where only two values, 0 and 1, represent outcomes.
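A minimal sketch of the idea, assuming a single input variable and plain gradient descent (production libraries typically use faster solvers). The function names, learning rate and the hours-studied data are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b so that sigmoid(w * x + b) estimates P(y = 1)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict_probability(x, w, b):
    return sigmoid(w * x + b)


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic_regression(hours, passed)
print(predict_probability(1.0, w, b), predict_probability(6.0, w, b))
```

The output is a probability rather than a hard label; a threshold such as 0.5 is then typically used to turn it into a 0 or 1 outcome.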

**Artificial Neural Networks (Supervised, Unsupervised or Reinforcement Learning)**

An artificial neural network (ANN) comprises ‘units’ arranged in a series of layers, each of which connects to layers on either side. ANNs are inspired by biological systems, such as the brain, and how they process information. ANNs are essentially a large number of interconnected processing elements, working in unison to solve specific problems. ANNs also learn by example and through experience, and they are extremely useful for modelling non-linear relationships in high-dimensional data or where the relationship amongst the input variables is difficult to understand.
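The layered structure can be shown with a tiny network that computes XOR, a classic non-linear relationship that no single straight line can separate. To keep the sketch short, the weights below are hand-picked rather than learned, which a real ANN would do from examples; all names and values are assumptions for illustration.

```python
def step(z):
    """Simple activation: the unit 'fires' (1) when its input is positive."""
    return 1 if z > 0 else 0


def layer(inputs, weights, biases):
    """Compute the output of every unit in one layer."""
    return [
        step(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


# Hand-picked (not learned) weights, chosen for illustration only.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]   # unit 1 acts like OR, unit 2 like AND
HIDDEN_BIASES = [-0.5, -1.5]
OUTPUT_WEIGHTS = [[1.0, -1.0]]              # fires for OR but not AND
OUTPUT_BIASES = [-0.5]


def xor_network(x1, x2):
    """Pass two inputs through the hidden layer, then the output layer."""
    hidden = layer([x1, x2], HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

Each unit on its own is very simple; the non-linear behaviour comes from stacking them so that the output layer works on what the hidden layer has already detected.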

**Decision Trees (Supervised Learning – Classification/Regression)**

A decision tree is a flow-chart-like tree structure that uses a branching method to illustrate every possible outcome of a decision. Each node within the tree represents a test on a specific variable – and each branch is the outcome of that test.
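The node-and-branch structure maps naturally onto nested data. In the sketch below the tree is written by hand to show how a sample travels from the root to a decision; in practice the tests and thresholds would be learned from data. The loan scenario, feature names and thresholds are all hypothetical.

```python
# Each internal node tests one variable; each branch is an outcome of that test.
tree = {
    "feature": "income", "threshold": 30000,
    "below": {"label": "decline"},
    "above": {
        "feature": "debt", "threshold": 10000,
        "below": {"label": "approve"},
        "above": {"label": "review"},
    },
}


def classify(node, sample):
    """Follow the branches from the root until a leaf with a label is reached."""
    while "label" not in node:
        branch = "below" if sample[node["feature"]] < node["threshold"] else "above"
        node = node[branch]
    return node["label"]


print(classify(tree, {"income": 25000, "debt": 0}))
print(classify(tree, {"income": 50000, "debt": 5000}))
print(classify(tree, {"income": 50000, "debt": 20000}))
```

Because every decision is an explicit test on a named variable, the path that led to an outcome is easy to read back and explain.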

**Random Forests (Supervised Learning – Classification/Regression)**

Random forests or ‘random decision forests’ is an ensemble learning method, combining many decision trees to generate better results for classification, regression and other tasks. Each individual tree is a relatively weak predictor, but when combined with others, can produce excellent results. Each tree in the forest is a ‘decision tree’ (a tree-like graph or model of decisions): an input is entered at the top and travels down the tree, with data being segmented into smaller and smaller sets, based on specific variables. The forest then combines the answers of all its trees, for example by majority vote.
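A heavily simplified sketch of the ensemble idea follows. It assumes one-split trees (‘stumps’), each trained on a random resample of the data and one randomly chosen feature, with a majority vote at the end. Full random forests grow deeper trees and sample features at every split, so treat this as a stand-in; all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return forest


def predict_forest(forest, row):
    """Let every stump vote and return the most common answer."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical data with two features and two classes.
rows = [(1, 2), (2, 1), (3, 3), (2, 2), (7, 8), (8, 7), (9, 9), (8, 8)]
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0.5, 0.5)), predict_forest(forest, (9.5, 9.5)))
```

Because every stump sees a slightly different view of the data, their individual mistakes tend not to line up, which is why the combined vote is usually more reliable than any single tree.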

**Nearest Neighbours (Supervised Learning)**

The K-Nearest-Neighbour algorithm estimates how likely a data point is to be a member of one group or another. It essentially looks at the data points around a single data point to determine what group it is actually in. For example, if one point is on a grid and the algorithm is trying to determine what group that data point is in (Group A or Group B, for example) it would look at the data points near it to see what group the majority of the points are in.

Clearly, there are a lot of things to consider when it comes to choosing the right machine learning algorithms for your business’ analytics. However, you don’t need to be a data scientist or expert statistician to use these models for your business. At SAS, our products and solutions utilise a comprehensive selection of machine learning algorithms, helping you to develop a process that can continuously deliver value from your data.
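Returning to the K-Nearest-Neighbour description above, the Group A or Group B vote can be sketched as follows. The function name, the choice of K = 3 and the grid points are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k=3):
    """Return the majority label among the k points closest to the query."""
    nearest = sorted(
        zip(points, labels),
        key=lambda pair: math.dist(pair[0], query),   # straight-line distance
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical points on a grid, each already assigned to Group A or Group B.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

There is no training step here at all: the stored data points are the model, and all the work happens when a new point needs a group.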