As we did in our experiment on the Titanic dataset in Azure Machine Learning Studio, we will continue with the “Learning by doing” strategy because we believe that the best way to learn is to carry out small projects, from start to finish.
A Machine Learning project may not be linear, but it has a series of well-defined stages:
1. Define the problem
2. Prepare the data
3. Evaluate different algorithms
4. Refine the results
5. Present them
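As a rough illustration of these five stages, here is a minimal sketch that uses only the Python standard library. The toy dataset, the two rules (majority_rule and threshold_rule) and the threshold value are all hypothetical, made up purely for illustration; they are not the data or the models used later in this series.

```python
# 1. Define the problem: predict a yes/no label (1 or 0) from a single number.

# 2. Prepare the data: a tiny, made-up list of (feature, label) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1),
        (7.0, 1), (8.0, 1), (9.0, 1), (10.0, 1)]


def majority_rule(feature):
    """Hypothetical baseline: always predict the most common label (1 here)."""
    return 1


def threshold_rule(feature, threshold=5.0):
    """Hypothetical rule: predict 1 when the feature exceeds the threshold."""
    return 1 if feature > threshold else 0


def accuracy(rule, rows):
    """Fraction of rows where the rule's prediction matches the label."""
    hits = sum(1 for feature, label in rows if rule(feature) == label)
    return hits / len(rows)


# 3. Evaluate different algorithms on the same data.
scores = {
    "majority_rule": accuracy(majority_rule, data),
    "threshold_rule": accuracy(threshold_rule, data),
}

# 4. Refine the results: here, simply keep the best-scoring rule.
best_name = max(scores, key=scores.get)

# 5. Present them.
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("Best:", best_name)
```

In a real project the evaluation step would use data that the model has not seen before, which is where techniques such as cross-validation come in.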
Likewise, the best way to get to know a new platform or tool is to work with it. That is precisely what we are going to do in this tutorial: get to know Python, both as a language and as a platform.
What is NOT necessary to follow this tutorial?
The objective of this experiment is to show how a simple Machine Learning experiment can be carried out in Python. People with very different profiles work with ML models. A Social Sciences researcher, a financial expert, an insurance broker or a marketing agent, for example, all want to apply a model (and understand how it works). A developer who already knows other languages or programming environments may want to start learning Python, as may a Data Scientist who develops new algorithms in R. So, instead of listing the prerequisites for following the tutorial, we will detail what is not needed:
- You do not have to understand everything at first. The goal is to follow the example from start to finish and get a real result. You can take note of the questions that arise and use Python's built-in help function, for example help(len), to learn about the functions that we are using.
- You do not need to know exactly how the algorithms work. It is useful to know their limitations and how to configure them, but you can learn that little by little. The objective of this experiment is to overcome any fear of the platform and to keep learning with other experiments!
- You do not have to be a programmer. The Python language has quite an intuitive syntax. As a first clue for understanding it, look at function calls (e.g. function()) and at variable assignments (e.g. a = "b"). The important thing now is to get started; you can learn all the details little by little.
- You do not have to be an expert in Machine Learning. You can learn gradually about the advantages and limitations of different algorithms, how to improve in the different stages of the process, or the importance of evaluating accuracy through cross-validation.
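To make the two syntax hints above concrete, here is a small sketch showing a variable assignment, a function call, and the built-in help function. The variable names (a, numbers, length, ordered) are arbitrary examples, not part of the experiment itself.

```python
# Assignment: the name on the left refers to the value on the right.
a = "b"
numbers = [3, 1, 2]

# Function call: the function name followed by parentheses with its arguments.
length = len(numbers)       # number of items in the list
ordered = sorted(numbers)   # a new list in ascending order

print(a, length, ordered)

# Built-in documentation: pass the function itself to help().
help(len)
```

Running help on any function prints its documentation, which is a convenient way to explore the functions used in the rest of the tutorial.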
As this is our first project in Python, let’s focus on the basic steps. In other tutorials we can work on other tasks, such as preparing data with Pandas or improving the results with PyBrain.
What is Python?
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its syntax emphasizes code readability, which makes debugging easier and, therefore, promotes productivity. It offers much of the power and flexibility of compiled languages with a gentle learning curve. Although Python was created as a general-purpose programming language, it has a series of libraries and development environments for each of the phases of the Data Science process. This, added to its power, its open source nature and its ease of learning, has led it to take the lead over other languages used for data analytics and Machine Learning, such as SAS (the leading commercial software so far) and R (also open source, but more typical of academic or research environments).
In addition to libraries for scientific and numerical computing, data analysis and data structures, visualization, and Machine Learning, such as NumPy, SciPy, Matplotlib, Pandas or PyBrain, which will be discussed in more detail in other posts of this tutorial, Python offers interactive programming environments oriented towards Data Science. Among them we find:
1. The Shell or Python interpreter, which can be launched from the Windows menu. It is interactive (it executes commands as you type them) and is useful for simple tests and calculations, but not for development.
2. IPython: an extended version of the interpreter that adds colour highlighting of lines and errors, additional shell syntax, and autocompletion with the Tab key.
3. IDEs (Integrated Development Environments) such as Ninja IDE, Spyder, or the one we will work with, Jupyter. Jupyter is a web application that allows you to create and share documents containing executable code, equations, visualizations, and explanatory text. Besides Python, it is compatible with more than 40 programming languages, including R, Julia, and Scala, and it integrates very well with Big Data tools such as Apache Spark.
What steps are we going to take in this tutorial?
To keep each post from getting too long, we are going to divide the work into several posts:
- Introduction: An experiment for all
- Python for all (1): Installation of the Anaconda environment.
- Python for all (2): What are Jupyter Notebooks? Create a Notebook and practice easy commands.
- Python for all (3): SciPy, NumPy, Pandas… What libraries do we need?
- Python for all (4): We start the experiment properly. Data loading, exploratory analysis (dimensions of the dataset, statistics, visualization, etc.)
- Python for all (5) Final: Creation of the models and estimation of their accuracy