In recent years, Artificial Intelligence algorithms have made considerable progress in solving complex problems. These algorithms need a large volume of data to discover, and continuously learn, the hidden patterns it contains. However, this is not how the human mind learns. A person does not need millions of data points and many iterations to solve a particular problem; a few examples are usually enough. In this context, techniques such as semi-supervised learning are playing an important role nowadays.
Within Machine Learning, we can find several well-differentiated approaches (see Figure 1). Supervised algorithms work with labeled data sets, and their objective is to build predictive models for either classification (estimating a class) or regression (estimating a numerical value). These models are trained on labeled data and subsequently make predictions about unlabeled data. Unsupervised algorithms, in contrast, use unlabeled data, and one of their objectives is to group the data into a set of clusters according to the similarity of their characteristics. Unlike these two more traditional approaches, semi-supervised algorithms use a small amount of labeled data together with a large amount of unlabeled data as their training set. They try to exploit the structural information contained in the unlabeled data in order to generate predictive models that work better than those that use labeled data alone.
Semi-supervised learning models are increasingly used today. A classic example of the value these models provide is the analysis of conversations recorded in a call center. To automatically infer characteristics of the speakers (gender, age, geography, …), their moods (happy, angry, surprised, …), or the reasons for the call (error in the invoice, level of service, quality problems, …), among others, a high volume of already labeled cases is needed from which to learn the patterns of each type of call. Labeling these cases is an arduous task, since labeling audio files generally requires time and a lot of human intervention. In situations where labeled cases are scarce, whether because labeling is expensive, takes a long time to collect, requires a lot of human intervention, or the labels are simply unknown, semi-supervised learning algorithms are very useful thanks to the way they operate. However, not every problem can be addressed directly with these techniques, since a problem must have certain essential characteristics before this type of algorithm can solve it effectively.
Probably the first approach to using unlabeled data to build a classification model is the Self-Learning method, also known as self-training, self-labeling, or decision-directed learning. Self-learning is a very simple wrapper method and one of the most widely used in practice. In the first phase, a classifier is trained on the few labeled data points available. Next, the classifier is used to predict labels for the unlabeled data, and its most confident predictions are added to the training set. Finally, the classifier is retrained on the new training set. This process (see Figure 2) is repeated until no new data can be added to the training set.
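The self-learning loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used by the authors: the base classifier is a nearest-centroid rule, the confidence measure (one minus the ratio between the distances to the nearest and second-nearest centroid) and the `threshold` value are arbitrary choices made for the example, and the toy data and all function names are hypothetical.

```python
import math

def centroids(points, labels):
    """Mean point of each class in the current training set."""
    sums = {}
    for (x, y), lab in zip(points, labels):
        sx, sy, n = sums.get(lab, (0.0, 0.0, 0))
        sums[lab] = (sx + x, sy + y, n + 1)
    return {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}

def predict_with_confidence(cents, point):
    """Return (label, confidence) using the nearest centroid.
    Assumed confidence: 1 - d_nearest / d_second_nearest (needs >= 2 classes)."""
    dists = sorted((math.dist(point, c), lab) for lab, c in cents.items())
    (d1, lab), (d2, _) = dists[0], dists[1]
    conf = 0.0 if d2 == 0 else 1.0 - d1 / d2
    return lab, conf

def self_train(labeled, unlabeled, threshold=0.6, max_iter=20):
    """Self-training: repeatedly add confident predictions to the training set."""
    points = [p for p, _ in labeled]
    labels = [lab for _, lab in labeled]
    pool = list(unlabeled)
    for _ in range(max_iter):
        cents = centroids(points, labels)          # (re)train the classifier
        confident, remaining = [], []
        for p in pool:                             # predict the unlabeled data
            lab, conf = predict_with_confidence(cents, p)
            (confident if conf >= threshold else remaining).append((p, lab))
        if not confident:                          # nothing new can be added
            break
        for p, lab in confident:                   # grow the training set
            points.append(p)
            labels.append(lab)
        pool = [p for p, _ in remaining]
    return centroids(points, labels), pool

# Toy data (hypothetical): two labeled points and seven unlabeled ones.
labeled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(1.0, 1.0), (2.0, 0.0), (0.0, 2.0),
             (9.0, 9.0), (8.0, 10.0), (10.0, 8.0), (5.0, 5.0)]
model, leftover = self_train(labeled, unlabeled)
```

In this toy run the six points near the two labeled examples are absorbed into the training set, while the point halfway between both classes never reaches the confidence threshold and stays unlabeled. In practice the threshold matters: if it is too low, wrong predictions are added and reinforced in later iterations.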
The semi-supervised approach assumes a certain structure in the underlying distribution of the data: points that are close to each other are assumed to share the same label. Figure 3 shows how semi-supervised algorithms adjust the decision boundary between the labels, iteration after iteration. The boundary learned from the labeled data alone is very different from the one learned when the underlying structure of all the unlabeled data is also taken into account.
Another situation in which partially labeled data is useful is anomaly detection, a typical problem in which it is difficult to obtain a large amount of labeled data. This type of problem can be tackled with an unsupervised approach, whose objective is to identify, based on the characteristics of the data, the cases that differ greatly from the usual pattern of behavior. In this context, the small labeled subset can be used to evaluate the different iterations of the algorithm and thus guide the search for its optimal parameters.
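As a rough sketch of this idea, the example below scores every observation with an unsupervised detector and then uses a handful of labeled cases only to choose the detector's threshold. Everything in it is an assumption made for illustration: the detector is a plain absolute z-score, the quality measure is the F1 score on the labeled subset, and the data, candidate thresholds, and function names are hypothetical.

```python
from statistics import mean, pstdev

def anomaly_scores(values):
    """Unsupervised score: absolute z-score of each value.
    Assumes the values are not all identical (non-zero deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) / sigma for v in values]

def f1(predicted, actual):
    """F1 score for boolean predictions against boolean ground truth."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def pick_threshold(scores, labeled, candidates):
    """Return (threshold, F1) of the candidate that performs best on the
    small labeled subset; `labeled` maps index -> True if anomalous."""
    best_t, best_f1 = None, -1.0
    actual = [labeled[i] for i in labeled]
    for t in candidates:
        predicted = [scores[i] > t for i in labeled]
        score = f1(predicted, actual)
        if score > best_f1:            # ties keep the first (lowest) candidate
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical measurements: most are around 10, two are clearly unusual.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30, 10, 11, 28, 9, 10]
# Only five of the fifteen observations have been labeled by hand.
labeled = {0: False, 2: False, 4: False, 9: True, 12: True}

scores = anomaly_scores(values)
best_t, best_f1 = pick_threshold(scores, labeled, [0.5, 1.0, 2.0, 2.5, 3.0])
flagged = [i for i, s in enumerate(scores) if s > best_t]
```

The detector itself never sees the labels; they are used only to compare candidate settings, which is why a few labeled cases can be enough. With so few labels the selected threshold can easily overfit them, so in a real setting the result should be checked on additional cases.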
In short, the examples above illustrate that using unlabeled data together with a small amount of labeled data can considerably improve the accuracy of both supervised and unsupervised models.
Written by Alfonso Ibañez and Rubén Granados