Machine Learning 101: in less than 10 mins
ML as a subject is extremely vast, and the skills required to be a practitioner in any of its sub-fields are truly intimidating. But for most of us, a 101-level common knowledge of ML is essential to navigate and understand this field. If you have 10 minutes to spare, you will walk away with a basic understanding of the AI/ML space today in 2022–23.
Machine intelligence requires that an algorithm be able to:
- Acquire knowledge from data
- Apply that knowledge to make inferences
What are the common types of ML algorithms?
- Supervised
  - Regression problem
  - Classification problem
- Unsupervised
  - Clustering problem
  - Dimensionality Reduction problem
- Semi-Supervised
  - When you have a small dataset with labels, you can train a model on that smaller dataset and then use the trained model to generate labels for a larger dataset whose labels are missing, enabling supervised learning. Often used in deep learning use cases.
- Reinforcement
  - Learning an optimal response by trial and error, guided by feedback
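To make the first two categories concrete, here is a minimal sketch, assuming scikit-learn and its bundled Iris dataset; the model choices are illustrative, not prescriptive:

```python
# A minimal sketch contrasting supervised and unsupervised learning.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: acquire knowledge from labeled data, then infer labels.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised predictions:", clf.predict(X[:3]))

# Unsupervised: find structure (clusters) without any labels.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster assignments:", km.labels_[:3])
```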
How to implement an ML solution?
Machine learning starts with a business problem that is addressed using data. When the entire process is broken down into stages (often iterative), you get the following:
Business Sponsorship
- Business Problem: a business problem is your motivation to start an effort with AI/ML.
- Frame the ML Problem: you should always define success metrics in terms of business outcomes that can be clearly measured.
Once there is such a well-defined business problem and sponsorship to embark on the ML journey, you look at getting the right set of data.
Data Engineering
- Identify valuable data sources
- Collect data and then integrate it into the ML pipeline
  - This is called Data Ingestion
- Data Preparation, which involves:
  - Data Analysis
  - Data Transformation
  - Data Cleaning, e.g.:
    (a) Handling missing values: imputation is a technique to replace a missing value with a value derived from the data set. Alternatively, you can throw away data points with missing values.
    (b) Ordered data: you may need to shuffle the data to remove ordering effects.
    (c) Handling outliers: usually removed.
  - Training / Validation / Test data: this usually involves splitting the available dataset for training, validation, and testing. This splitting technique is called cross-validation. Common approaches include the following (a sketch follows this list):
    - Hold-out validation: randomly split off part of the data for validation and train on the rest, so that a significant portion of the data is unseen by the model during training.
    - Leave one out (LOO): each time, train the model leaving out one record for validation; finally, select the model with the least error. This is computationally very expensive.
    - K-fold: split the data into k sets (folds), then train on (k-1) folds each time and record the validation error on the held-out fold.
  - Data visualization: e.g., scatter diagrams and histograms, for ease of understanding the nature of the data.
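As referenced above, here is a minimal sketch of these preparation steps, assuming scikit-learn; the toy data is made up purely for illustration:

```python
# A minimal sketch of imputation, hold-out splitting, and K-fold.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, train_test_split

# Toy data with missing values (np.nan), purely illustrative.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# (a) Missing values: impute each gap with the column mean.
X_clean = SimpleImputer(strategy="mean").fit_transform(X)

# (b) Ordered data: shuffle while splitting off a hold-out validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_clean, y, test_size=0.25, shuffle=True, random_state=42)

# K-fold: train on k-1 folds, validate on the held-out fold, k times.
for train_idx, val_idx in KFold(n_splits=2, shuffle=True,
                                random_state=42).split(X_clean):
    print("train on rows", train_idx, "validate on rows", val_idx)
```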
Who is responsible for the data engineering work for Machine Learning?
This part of the ML workflow involves data engineering teams. Once data is available in a useful form for exploration and ML modeling, the next step is for data scientists to experiment with building an appropriate ML model that can satisfy the business requirements.
Data Science Workflow
The data science workflow broadly involves the following stages:
- Feature Engineering: transform raw data into a higher-level representation of features.
  - Binning: categorizing continuous data into discrete classes
  - Combining two or more simple features into one
  - Transforms: taking the log or a polynomial power of feature values or target variables to make relationships more linear
  - Data encoding
  - Text feature engineering: (a) removing stop words and punctuation, and stemming (b) lowercasing (c) cutting off words in very high or very low frequency percentiles (d) Term-Frequency Inverse-Document-Frequency (TF-IDF)
  - Identify and correct data biases and imbalances (a sketch of these feature-engineering steps follows this list)
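A minimal sketch of the feature-engineering steps above, assuming a recent scikit-learn; the tiny inputs are invented for illustration:

```python
# A minimal sketch: binning, log transform, encoding, and TF-IDF.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

ages = np.array([[23.0], [35.0], [58.0], [71.0]])  # illustrative values

# Binning: categorize a continuous feature into discrete classes.
bins = KBinsDiscretizer(n_bins=2, encode="ordinal",
                        strategy="quantile").fit_transform(ages)

# Transform: log compresses skewed values toward linearity.
log_ages = np.log1p(ages)

# Data encoding: one-hot encode a categorical feature.
onehot = OneHotEncoder(sparse_output=False).fit_transform(
    [["red"], ["blue"], ["red"], ["green"]])

# Text features: TF-IDF with English stop words removed.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(
    ["the cat sat", "the dog ran", "cats and dogs play"])
print(bins.ravel(), log_ages.ravel(), onehot.shape, tfidf.shape)
```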
- Model Training
  - Parameter tuning
  - Loss Function: a formula to calculate the error of a model's predictions over a data set. (a) Squared loss: regression and classification (b) Hinge loss: classification only, robust to outliers (c) Logistic loss: classification only, better for skewed class distributions
  - Regularization: helps the model fit unseen data better by reducing overfitting.
  - Learning rate: the objective is to set an appropriate value so that the training algorithm converges to a proper fit. Otherwise you end up with a learning rate that is (a) decaying too aggressively, so the algorithm never finds the optimum fit, or (b) decaying too slowly, so the algorithm bounces around and does not converge to the optimum fit. (The sketch after this list ties these knobs together.)
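As a minimal sketch of how these knobs appear in practice, scikit-learn's SGDClassifier exposes a loss function, a regularization penalty, and a decaying learning rate; the parameter values here are illustrative, not recommendations:

```python
# Loss, regularization, and learning-rate schedule in one estimator.
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)

clf = SGDClassifier(
    loss="hinge",                # hinge loss: classification, robust to outliers
    penalty="l2", alpha=1e-4,    # L2 regularization to reduce overfitting
    learning_rate="invscaling",  # decaying learning-rate schedule ...
    eta0=0.1,                    # ... from an illustrative initial value
    max_iter=1000, random_state=42,
).fit(X, y)
print("Training accuracy:", clf.score(X, y))
```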
- Model Evaluation: required to take care of
  - Over-fitting and under-fitting
  - Bias-Variance trade-off: bias is the distance between actual and predicted values; variance is the variation in predictions across data samples. If all errors are in the same direction, the model is highly biased; if the errors are scattered in all directions, bias is low. Variance is high if predictions swing widely across data samples and low if they stay consistent. This is a good measure of fit, too.
  - Evaluation metrics: provide a numerical way to quantify the fit. A number of common metrics are:
    1. Root Mean Square Error (RMSE)
    2. Mean Absolute Percentage Error (MAPE)
    3. R² (R-squared): compares the model against the best constant predictor
    4. For classification problems, one can use:
       a. Confusion Matrix
       b. ROC curve
       c. Precision-Recall
- Data / Feature Augmentation: the ML workflow is iterative, and often more data or richer features are needed to get a better result.
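A minimal sketch of the evaluation metrics listed above, assuming a recent scikit-learn; the true and predicted values are made up:

```python
# RMSE, MAPE, R2, and classification metrics on toy values.
import numpy as np
from sklearn.metrics import (confusion_matrix, mean_absolute_percentage_error,
                             mean_squared_error, precision_score, r2_score,
                             recall_score)

# Regression metrics.
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
print("R2:  ", r2_score(y_true, y_pred))

# Classification metrics.
c_true, c_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 1]
print("Confusion matrix:\n", confusion_matrix(c_true, c_pred))
print("Precision:", precision_score(c_true, c_pred),
      "Recall:", recall_score(c_true, c_pred))
```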
Who is responsible for the data science work for Machine Learning?
All the above techniques require deep knowledge of the respective branch of Data Science. For example, dealing with textual natural language or with image recognition calls for very specialized skills.
ML Operations
The ML workflow does not end with finding a satisfactory model. It also involves the following:
- Tracking all ML artifacts, including features, model inputs such as hyperparameters, model versions, and evaluation metrics (see the tracking sketch after this list)
- Model Deployment, which usually also needs automation for:
  a. Re-training and continuous delivery
  b. A/B testing, blue-green deployment
  c. Optimal resource utilization & auto-scaling
- Model Monitoring
1. Business goal metrics
2. Technical monitoring, e.g., model drift over time
3. Non-functional monitoring, such as performance and security aspects
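As referenced in the tracking item above, here is a minimal sketch using the open-source MLflow library; the run name, parameters, and metric values are all illustrative:

```python
# A minimal sketch of tracking hyperparameters and metrics with MLflow.
import mlflow

with mlflow.start_run(run_name="baseline"):      # illustrative run name
    mlflow.log_param("learning_rate", 0.1)       # model inputs
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("rmse", 0.42)              # technical metric
    mlflow.log_metric("conversion_rate", 0.031)  # business goal metric
```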
Who is responsible for the operations work for Machine Learning?
ML Engineer, MLOps Engineer, or DevOps Engineer roles are involved in the deployment, monitoring, automation, and similar platform aspects of running Machine Learning workloads in production.
Also read: What Problems Do ML Ops Solve?
Roles involved in Machine Learning
As described in the respective sections above, you are primarily looking at the following roles:
- Business SME
- Data Engineers
- Data Scientists
- MLOps / ML Engineers
Apart from these roles, you may also need Product Managers, Documentation Engineers, Project Managers, Scrum Masters, etc.
Also read: How to set-up a ML SaaS/PaaS Product Team
How do I get started?
- Start with a specific role in ML projects, e.g., MLOps, Data Scientist, Data Engineer, or an SME who understands the business domain.
- Specialize in a sub-field of ML
As ML is a vast field, it is important to identify a sub-area of ML in which to start building knowledge and skills. But you still need to know the fundamental elements of Data Science.
A wide variety of common business use cases can be solved with AI and ML. Below is a simple list of common problems that are being widely implemented using AI/ML.
Sometimes multiple techniques must be applied together to address a real-life use case.
As a reference for the common use cases, I found an insightful view based on software revenue. It is a more practical picture of what is in demand.
Many of the above use cases involve unstructured data types such as text, images, audio, and video, which often require deep-learning capabilities and specialized hardware and software, such as programs embedded into cameras and GPUs, especially for image and video use cases.
Deep-learning and statistical machine-learning algorithms all require vector algebra, i.e., linear algebra. This is because:
- Computers understand, store, and process all data as 0s and 1s. For example, each pixel in a grayscale image has an intensity value ranging from 0 to 255, where 0 represents a black pixel and 255 a white pixel.
- Most ML and deep-learning algorithms deal with such vector representations of data.
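As a minimal illustration, assuming NumPy, a tiny grayscale image is just a matrix of pixel intensities that gets flattened into the vectors these algorithms consume:

```python
# A 2x2 grayscale "image" as a matrix of intensities in [0, 255].
import numpy as np

image = np.array([[0, 128],
                  [255, 64]], dtype=np.uint8)  # 0 = black, 255 = white

flat = image.reshape(-1)   # flatten to a feature vector
scaled = flat / 255.0      # scale intensities into [0, 1]
print(flat, scaled)
```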
So, in summary, you will need to embark on a learning journey involving:
a) Learning the ML algorithms in your chosen sub-field and understanding the mathematical basics behind them
b) Hands-on knowledge of the tools and frameworks for implementing your use cases.