We have already seen in previous posts that Machine Learning techniques basically consist of automating, through specific algorithms, the identification of patterns or trends that “hide” in the data. It is therefore very important not only to choose the most suitable algorithm (and then tune its parameters for each particular problem), but also to have a large volume of data of sufficient quality.
Selecting the algorithm is not easy. If we search the internet, we can find ourselves buried under an avalanche of very detailed articles which, at times, confuse more than they help. We are therefore going to give some basic guidelines to get started. There are two fundamental questions we must ask ourselves. The first is:
What is it that we want to do?
To answer this question, it may be helpful to reread two posts that we published earlier on our LUCA blog, “The 9 tasks on which to base Machine Learning” and “The 5 questions which you can answer with Data Science”. The crux of the matter is to define the objective clearly. To solve our problem, we then consider what kind of task we will have to undertake. This may be, for example, a classification problem, such as spam detection; or a clustering problem, such as recommending a book to a customer based on their previous purchases (Amazon’s recommendation system). We can also try to figure out, for example, how much a customer will use a particular service. In this case, we would be facing a regression problem (estimating a value).
If we consider the classic customer retention problem, we see that we can approach it in different ways. We want to segment customers, yes, but which strategy is best? Is it better to treat it as a classification, clustering or even regression problem? The key clue comes from asking ourselves the second question.
What information do I have to achieve my objective?
If I ask myself, “Do my customers naturally group together in any way?”, I have not defined any target for the grouping. However, if I ask the question this other way, “Can we identify groups of customers with a high probability of cancelling the service as soon as their contract ends?”, we have a perfectly defined target: whether the customer will cancel, and we want to take action based on the answer we get. In the first case, we are facing an example of unsupervised learning, while the second is supervised learning.
In the early stages of the Data Science process, it is very important to decide whether the “attack strategy” will be supervised or unsupervised and, in the supervised case, to define precisely what the target variable will be. Depending on what we decide, we will work with one family of algorithms or another.
In supervised learning, algorithms work with “labelled” data, trying to find a function that, given the input variables, assigns the appropriate output label. The algorithm is trained on “historical” data and thus “learns” to assign the appropriate output label to a new value, that is, it predicts the output value.
For example, a spam detector analyses the message history to find a function that maps the defined input parameters (the sender, whether the recipient is an individual or part of a list, whether the subject contains certain terms, etc.) to the “spam” or “not spam” label. Once this function is defined, when a new unlabelled message arrives, the algorithm is able to assign it the appropriate label.
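As a rough illustration of the spam example above, the following sketch trains a tiny Naïve Bayes classifier on labelled subject lines and then labels new ones. The training messages, the function names `train` and `predict`, and the choice of using only the words of the subject as input are assumptions made for this example, not part of any real spam filter.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled "historical" data: (subject line, label).
TRAINING_MESSAGES = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("project report attached", "not spam"),
    ("lunch on monday", "not spam"),
]


def train(messages):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for subject, label in messages:
        label_counts[label] += 1
        word_counts[label].update(subject.split())
    return word_counts, label_counts


def predict(subject, word_counts, label_counts):
    """Return the label with the highest log-probability for the subject."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in subject.split():
            # Laplace smoothing (+1) so unseen words do not zero out the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(TRAINING_MESSAGES)
print(predict("free prize inside", word_counts, label_counts))
print(predict("monday project meeting", word_counts, label_counts))
```

The “function” the text refers to is, in this sketch, simply the pair of count tables learned from the labelled history; a production detector would use far more data and richer inputs.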
Supervised learning is often used in classification problems, such as digit identification, diagnostics, or identity fraud detection. It is also used in regression problems, such as predicting the weather, life expectancy, growth, etc. These two main types of supervised learning, classification and regression, are distinguished by the type of the target variable. In classification it is categorical, while in regression the target variable is numeric.
We discuss different algorithms in more detail in other posts, but here are some of the most common:
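To make the regression case concrete, here is a minimal sketch of a least squares fit of a straight line, where the target is a number rather than a category. The data (months as a customer versus hours of service used) and the function name `fit_least_squares` are invented for illustration.

```python
def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: months as a customer vs. hours of service used.
months = [1, 2, 3, 4, 5]
hours_used = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_least_squares(months, hours_used)
predicted_hours = slope * 6 + intercept  # numeric prediction for month 6
print(slope, intercept, predicted_hours)  # 2.0 1.0 13.0 for this exactly linear toy data
```

Unlike the “spam” or “not spam” label, the output here is a value on a continuous scale, which is what places the problem in the regression family.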
1. Decision trees
2. Naïve Bayes classification
3. Least squares regression
4. Logistic Regression
5. Support Vector Machines (SVM)
6. “Ensemble” methods (sets of classifiers)
Unsupervised learning applies when no “labelled” data is available for training. We only know the input data; there is no output data corresponding to a given input. We can therefore only describe the structure of the data and try to find some kind of organization that simplifies the analysis. These techniques are thus exploratory in nature.
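One way to picture “finding some kind of organization” without labels is a clustering algorithm such as k-means. The sketch below is a simplified version written for this post: the points are made up, the initial centroids are simply the first `k` points, and the number of iterations is fixed, all of which are assumptions rather than recommended practice.

```python
import math


def k_means(points, k, iterations=10):
    """Group points around k centroids by repeatedly assigning and averaging."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [math.dist(point, centroid) for centroid in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabelled data: two features per customer, no target variable.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Note that the algorithm never sees a label: it only reports which points lie close together, and it is up to the analyst to decide whether those groups mean anything.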
For example, clustering tasks look for groupings based on similarities, but there is no guarantee that these will have any meaning or usefulness. Sometimes, when exploring data without a defined goal, you can find curious but useless spurious correlations. For example, in the graph below, published on Tyler Vigen’s Spurious Correlations website, we can see a strong correlation between per capita chicken consumption in the United States and its oil imports.
Unsupervised learning is often used in clustering, co-occurrence grouping, and profiling problems. However, problems that involve similarity matching, link prediction, or data reduction can be either supervised or unsupervised.
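A correlation of this kind is usually measured with the Pearson coefficient, which is easy to compute by hand. The sketch below uses two made-up series that both happen to rise over time; they are not the chicken or oil figures, and the function name `pearson_correlation` is chosen only for this example.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient between two series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up yearly figures for two unrelated quantities that both trend upwards.
series_a = [10, 12, 13, 15, 18]
series_b = [200, 210, 225, 240, 260]

correlation = pearson_correlation(series_a, series_b)
print(round(correlation, 3))
```

A coefficient close to 1 only says that the two series move together in this sample; it says nothing about one causing the other.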
The most common types of algorithms in unsupervised learning are:
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
4. Independent Component Analysis
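To give a feel for the data-reduction idea behind principal component analysis, the following sketch finds the direction of greatest variance in a small data set using power iteration and then describes each row with a single number along that direction. It is a simplified illustration under stated assumptions: the data is invented, only the first component is computed, and the starting vector of ones is assumed not to be orthogonal to that component.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the direction of greatest variance with power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(column) / n for column in zip(*rows)]
    centered = [[value - mean for value, mean in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centred data.
    covariance = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    direction = [1.0] * d  # assumed starting vector
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        direction = [sum(covariance[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(value * value for value in direction))
        direction = [value / norm for value in direction]
    return means, direction


def project(rows, means, direction):
    """Reduce each row to one coordinate along the given direction."""
    return [
        sum((value - mean) * weight for value, mean, weight in zip(row, means, direction))
        for row in rows
    ]


# Hypothetical data in which the second feature is twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, direction = first_principal_component(rows)
reduced = project(rows, means, direction)
print(direction)
print(reduced)
```

Because the two features here are perfectly related, one coordinate per row is enough to describe the data; with real data some information is lost in the reduction.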
Which algorithm to choose?
Once we are clear whether we are dealing with a supervised or unsupervised learning case, we can use one of the well-known algorithm “cheat sheets” (what in Spanish we would call a “chuleta”) to help us choose which one to start working with. As an example, we leave one of the best known, the scikit-learn cheat sheet. But there are many more, such as the Microsoft Azure Machine Learning Algorithm cheat sheet.
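Before reaching for a cheat sheet, the two questions discussed in this post can be written down as a very small decision rule. The function below is only a sketch of that reasoning; its name and parameters are hypothetical and it is not taken from scikit-learn or Azure.

```python
def suggest_learning_family(has_target, target_is_numeric=False):
    """Map the two questions from this post to a family of algorithms."""
    if not has_target:
        # No defined target variable: we can only explore the data's structure.
        return "unsupervised learning (for example, clustering)"
    if target_is_numeric:
        # A numeric target means we are estimating a value.
        return "supervised learning: regression"
    # A categorical target means we are assigning a label.
    return "supervised learning: classification"


# "Do my customers group together naturally?"
print(suggest_learning_family(has_target=False))
# "Will this customer cancel when the contract ends?"
print(suggest_learning_family(has_target=True, target_is_numeric=False))
# "How much will this customer use the service?"
print(suggest_learning_family(has_target=True, target_is_numeric=True))
```

A cheat sheet then refines the choice within each family, typically based on the amount of data and the kind of variables available.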
So, what is reinforcement learning?
Not all ML algorithms can be classified as supervised or unsupervised learning algorithms. There is a “no man’s land” where reinforcement learning techniques fit. This type of learning is based on improving the response of the model through a feedback process. These techniques draw on studies of how to encourage learning in humans and rats through rewards and punishments. The algorithm learns by observing the world around it. Its input information is the feedback it gets from the outside world in response to its actions. The system therefore learns by trial and error.
It is not a type of supervised learning, because it is not strictly based on a set of labelled data, but on monitoring the response to the actions taken. Nor is it unsupervised learning, since when we model our “learner” we know in advance what the expected reward is.
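The trial-and-error loop described above can be sketched with a very simple “epsilon-greedy” strategy: the learner mostly repeats the action that has paid best so far, but occasionally tries another one. Everything in this example is an assumption made for illustration, including the environment `noisy_environment`, its reward values, and the parameter choices; real reinforcement learning problems also involve states, which this sketch leaves out.

```python
import random


def run_trial_and_error(reward_fn, n_actions, steps=500, epsilon=0.1, seed=0):
    """Learn which action pays best purely from the rewards received."""
    rng = random.Random(seed)
    estimates = [0.0] * n_actions  # running average reward per action
    counts = [0] * n_actions
    for step in range(steps):
        if step < n_actions:
            action = step  # try every action once to begin with
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: pick a random action
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        reward = reward_fn(action, rng)  # feedback from the outside world
        counts[action] += 1
        # Incremental update of the average reward seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


def noisy_environment(action, rng):
    """Hypothetical environment: a base reward per action plus small noise."""
    base_rewards = [0.2, 0.5, 0.9]
    return base_rewards[action] + rng.uniform(-0.05, 0.05)


estimates, counts = run_trial_and_error(noisy_environment, n_actions=3)
best_action = max(range(3), key=lambda a: estimates[a])
print(best_action, counts)
```

No labelled examples are given to the learner; it only sees the reward that follows each action, which is what separates this setting from the supervised one.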