Machine learning is like a garage sale
Machine learning is a field of computer science in which mathematical systems “learn” from large amounts of data, adjusting their internal models using linear algebra and statistics. It is not a new discipline: machine learning has been around for decades, but only recently has it received the exposure that made it popular.
There are many courses and resources for learning machine learning, which I found overwhelming. It is very difficult to find a course with both the necessary theory and a good amount of practice. I tried a few courses, and each time I had to go back to the basics, which was frustrating. There are many ways to learn and think about machine learning, but the way I see it, applying machine learning can be as simple as organizing a garage sale.
This is the mental model I use when I think about machine learning:
- Gathering: You gather the things you want to sell.
- Cleaning: You clean them.
- Hypothesis: You form a hypothesis about how much each item will sell for.
- Training: You try to sell each item, adjusting your price with each offer or refusal.
- Validation: You sell your remaining items.
Machine learning is part of a broader discipline called data science, and there can be no data science without data, so the first thing to do is gather some.
Just as you wouldn’t hold a garage sale with only 3 items, you can’t do machine learning without enough data. But how much data is enough? The amount needed depends heavily on your goals: the accuracy you require (how well the model matches reality) and the model you want to apply.
Data gathering can be done through a number of techniques.
- Collecting data generated by sensors in a system you are observing.
- Using historical records: data collected over the years in old Excel sheets or stored in a database.
- Using public dataset sources like Kaggle, Data.gov, AWS public datasets, Google public datasets, Wikipedia, and TensorFlow models, to name a few.
To get the best price at a garage sale, you usually want to wipe the dust off or fix loose parts before putting items up for sale.
Likewise, data gathered for machine learning requires cleaning. Raw data can suffer from several problems.
- The dataset might have missing values, e.g. a sensor got disconnected momentarily.
- Data might need to be unified, e.g. some readings in Celsius, some in Fahrenheit.
- Data might be on very different scales, e.g. the same dataset containing a feature for the number of inhabitants of a country and a feature for its number of official languages.
- Data might need some other transformation, e.g. categorical ratings such as “excellent”, “good”, “fair”, and “average” might have to be mapped to numerical values from 1 to 10.
Cleaning can be done manually or programmatically.
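As a minimal sketch of what programmatic cleaning looks like, here are the three fixes above applied to a handful of invented sensor readings (the data and the rating scale are made up for illustration):

```python
# Invented raw readings: mixed units, a gap, and categorical ratings.
raw = [
    {"temp": 20.0, "unit": "C", "rating": "good"},
    {"temp": 68.0, "unit": "F", "rating": "excellent"},
    {"temp": None, "unit": "C", "rating": "fair"},  # sensor disconnected
]

# 1. Unify units: convert Fahrenheit readings to Celsius.
for row in raw:
    if row["unit"] == "F" and row["temp"] is not None:
        row["temp"] = (row["temp"] - 32) * 5 / 9
        row["unit"] = "C"

# 2. Fill missing values with the mean of the known readings.
known = [r["temp"] for r in raw if r["temp"] is not None]
mean_temp = sum(known) / len(known)
for row in raw:
    if row["temp"] is None:
        row["temp"] = mean_temp

# 3. Map categorical ratings onto a numeric scale (hypothetical mapping).
rating_scale = {"excellent": 10, "good": 7, "fair": 5, "average": 3}
for row in raw:
    row["rating"] = rating_scale[row["rating"]]
```

Real projects typically lean on a library such as pandas for this, but the operations are the same: unify, impute, encode.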
Your goal is to develop a hypothesis that articulates what you are attempting to prove or predict through machine learning.
In the garage sale example, this is where the initial prices are decided. You may know that an item is sought after and decide to ask for a high starting price. In doing so, you applied what is called domain knowledge to create an initial model that more accurately represents reality.
Domain knowledge introduces a bias because you expect some features to have more influence than others on the desired predictions. Additionally, if your goal is to group data into coherent classes, domain knowledge helps determine factors such as the expected number of classes.
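An initial pricing hypothesis built from domain knowledge might be sketched like this; the premium and wear penalty are assumed numbers, not learned values:

```python
# Hypothetical initial pricing model encoding domain knowledge.
# The 1.5 premium and the 0.4 wear penalty are assumptions, not learned.
def initial_price(base_value, is_sought_after, wear):
    """Estimate an asking price; wear runs from 0 (mint) to 1 (worn out)."""
    price = base_value
    if is_sought_after:
        price *= 1.5            # belief: popular items command a premium
    price *= 1 - 0.4 * wear     # belief: worn items should be discounted
    return round(price, 2)
```

This is the bias in action: the model starts with our expectations about which features matter and by how much, and training will later correct those guesses against reality.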
During the training phase, the model’s parameters get adjusted and this is where the learning actually happens.
Coming back to our garage sale analogy, you realize that a lot of the items you sold went quickly while others, although sparking interest, did not sell. What you decide to do next is gradually adjust your prices: you increase the price of items similar to those that got sold quickly and you decrease the price of the ones that did not sell. Potential buyers who are willing to negotiate may help you find the price that would generate the most gains without having them walk away. This will go on iteratively until you manage to sell all your items (or decide to call it a day).
For machine learning, you train your model using data pre-selected for training. The training phase adjusts the model’s parameter values. Training data is data similar to the data the model will be predicting on; it can be labeled data, i.e. data that contains both the features and the expected outcome. In the garage sale example, the training data is each item you sold and the price it sold for.
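The price-adjustment loop described above is essentially what training does. Here is a toy sketch; the items, offers, and learning rate are all invented for illustration:

```python
# Toy training loop: nudge each asking price toward what buyers offer.
asking = {"lamp": 10.0, "chair": 25.0}        # initial hypothesis (our prices)
offers = [("lamp", 6.0), ("chair", 30.0),
          ("lamp", 7.0), ("chair", 28.0)]     # observed "training data"

learning_rate = 0.5                           # how strongly each offer moves us
for item, offer in offers:
    error = offer - asking[item]              # how far off was our price?
    asking[item] += learning_rate * error     # adjust toward the offer
```

This is the same idea behind gradient-style updates: each data point moves the parameters a little in the direction that reduces the error, and repeating this over many examples is the "learning".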
The second phase of training involves validating your trained model.
During your garage sale, you adjusted your prices, and at this point you know roughly what price the remaining items will sell for, provided you have sold similar items. Having sold only clothes at the garage sale, it is difficult to estimate other types of items because you have no reference. Let’s imagine the model for estimating clothing prices uses brand, fabric, style, and wear or discoloration. The problem is that these features might not be useful for estimating prices of other items, like kitchen appliances or furniture!
Indeed, it is important to set aside additional validation data, because your model might only work on your training data, which would mean it could fail on new data, i.e. when you actually try to predict something from outside your training set. You want to make sure the model predicts the expected outcome on both your training and validation data.
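A sketch of the idea: fit a one-feature line on invented wear-vs-price data, hold some items out, and compare the error on both sets (the data-generating line and noise are made up):

```python
import random

# Invented data: price falls with wear (0 to ~1), plus a little noise.
random.seed(0)
data = [(w / 20, 20 - 15 * (w / 20) + random.uniform(-1, 1))
        for w in range(20)]
random.shuffle(data)
train, valid = data[:15], data[15:]   # hold out 5 items for validation

def fit_line(points):
    """Ordinary least squares for a single feature."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

def mean_abs_error(points, slope, intercept):
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

slope, intercept = fit_line(train)
train_err = mean_abs_error(train, slope, intercept)
valid_err = mean_abs_error(valid, slope, intercept)
# If valid_err is much larger than train_err, the model has overfit:
# it memorized the training items instead of learning a general rule.
```

Here both errors should be similar because the held-out items come from the same distribution; clothes-only training data facing kitchen appliances would not be so lucky.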
Machine learning can be summarized as the process of obtaining a trained model which will be used to make inferences or predictions.
The variety of models and the math behind machine learning might be overwhelming, but you usually don’t need to go very deep to start discovering interesting patterns in your data.
I suggest reading this Microsoft document, which introduces the main concepts of machine learning (summarized in this cheat sheet). After familiarizing yourself with the main concepts and vocabulary, you can use a number of pre-built models (or explore a simple one) by following the garage sale method I described. Examples of pre-built models include TensorFlow for Poets, which will introduce you to image classification, and this Kaggle competition kernel, which walks you through making predictions on the Titanic dataset.
I hope this article was helpful to understand the methodology used for machine learning!