The phrase ‘Democratizing Machine Learning’ is used by many companies, small and large, including by us at VerifAI in this article.
de·moc·ra·tize — democratizing:
— introduce a democratic system or democratic principles to.
“public institutions need to be democratized”
— make (something) accessible to everyone.
“mass production has not democratized fashion”
In the context of this article, the second definition is more fitting than the first.
Every company may have a different notion of what they mean by Democratizing Machine Learning and AI.
To make a figurative analogy: A democratic government may give its people the right to vote, but perhaps the voting ballots are hard to decipher.
Such a country would still be a democracy in principle, but in practice this would be a democracy that doesn’t serve all its constituents well.
By contrast, a country that teaches people how to use polling booths or makes its ballots very simple, in addition to providing easy access, would serve all its constituents well.
Democracy and Machine Learning?
Similar figurative analogies apply to democratizing Machine Learning.
While companies may provide access to ML tools and APIs, they may not provide a level of abstraction or ease of use that makes sense for a developer or a professional who is not a Machine Learning expert.
Truly democratizing Machine Learning would mean that a wide range of professionals (including, but not limited to, developers, engineers, scientists, marketeers, accountants, radiologists, cardiologists, law enforcement officers, judges, lawyers and other domain experts) could use Machine Learning to improve their productivity. To democratize Machine Learning, we need to make it not only accessible to everyone but also easily useable by everyone.
We can broadly segment Machine Learning techniques into three branches: Supervised Learning, Unsupervised Learning and Reinforcement Learning. Democratizing each of these branches presents its own challenges.
Supervised Learning is currently the most widely used form of Machine Learning across all industry segments, and one of its most common applications is Classification.
Let’s take a closer look at what it means to democratize Machine Learning in the context of building a Classification app.
Democratizing Classification — Supervised Learning
Classification is one of the most widely used Supervised Machine Learning techniques. It includes classifying images, text and other kinds of data. According to a McKinsey report, Classification techniques will have a global economic impact of $2T across multiple industry segments. This number may be hard to fathom, but what it says is that Classification of data (images, text, voice) is being used widely across many industry segments, embedded in many ‘killer apps’ and workflows.
In traditional software (Software 1.0), classification is done by writing an algorithm by hand, for instance to classify an image. These algorithms need to be continuously updated to catch ‘unseen’ or ‘different’ conditions. In contrast, Machine Learning uses example data to produce a ‘classification algorithm’ that can generalize to classify unseen conditions (e.g. new images) quite accurately.
Classification use case: An interesting classification problem in the banking/lending industry is to assess the current risk of an active loan, based on historical data and the consumer’s profile and behavior. Given a user profile and data about the loan terms, a Machine Learning model can classify a particular customer’s loan as safe, risky or unsafe.
A comprehensive data-set was published by LendingClub, a peer-to-peer lending company. The data-set contained about a million active loans, and each loan was described in terms of around 52 features (variables). These features include information about the loan, such as the term of the loan, interest rate, loan amount and loan purpose. The data also contained features about the customer, such as address, annual income, last payment, months since last delinquent payment and total late fees.
Democratizing Classification for the bank loan analyst would mean providing the analyst with software that produces the highest-accuracy Machine Learning model and predicts whether a customer’s loan is safe, risky or unsafe.
Choosing the right features to include in the Machine Learning model is critical to the accuracy of the Classifier.
A good effort towards democratizing Classification would be to automate the following steps: Feature Selection, Feature Mapping and Model Creation.
Automatic Feature Selection algorithms are a key part of democratizing Machine Learning. For the example LendingClub data, the original number of features was 52 (of which 35 were numerical and 18 were categorical).
The final number of features selected, which produced the highest-accuracy Classifier, was 14. The number of features excluded was 37 (of which 24 were numerical and 13 were categorical).
VerifAI’s algorithms exclude redundant features automatically to produce the highest-accuracy Classifier for the LendingClub data. The loan analyst (domain expert) should not have to know which features to exclude or include to achieve a high-accuracy model. Software algorithms need to do a good job of Automatic Feature Engineering.
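To make the idea of excluding redundant features concrete, here is a minimal sketch of one common filter: drop any numerical feature that is almost perfectly correlated with a feature already kept. This is not VerifAI’s algorithm, which the article does not describe; the column names, the toy values and the 0.95 threshold are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept.

    `columns` maps feature name -> list of numeric values.
    The 0.95 threshold is an illustrative assumption, not a recommendation.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy columns (not real LendingClub rows).
loans = {
    "loan_amount": [5000, 12000, 20000, 8000, 15000],
    "funded_amount": [5000, 12000, 19500, 8000, 15000],  # near-duplicate column
    "interest_rate": [13.5, 7.2, 9.9, 18.1, 6.5],
}
selected = drop_redundant(loans)
print(selected)  # ['loan_amount', 'interest_rate']
```

A correlation filter only catches linear redundancy between numerical columns. A production feature selector would also have to handle categorical features and measure the effect of each feature on the accuracy of the final Classifier.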
Automatic Feature Mapping is another key set of techniques that transforms the input user data into a form (numerical values) that the Machine Learning Algorithms can understand. Automatic Feature Mapping hides the complexity and details of data transformation required before feeding the data into a DNN (Deep Neural Network) or other machine learning models.
For instance, the feature column loan_term may have values such as “60 months”, “42 months” and “36 months”. These string values are automatically encoded into numerical values that a Machine Learning model can understand, using a feature hasher, a factorizer, one-hot encoding or other encoding algorithms. The model’s accuracy depends on how the inputs are encoded and interpreted by the Machine Learning algorithms, so it is important to map and encode the features accurately before feeding them into an ML model. ML engineers spend a lot of time mapping features into usable inputs for models. Automatic Feature Mapping (AutoMapper) is an important step towards democratizing ML.
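The encodings mentioned above can be sketched in a few lines of plain Python. This is an illustration only, not the AutoMapper implementation: the helper names are hypothetical, and the `purpose` values are made up (the loan_term strings come from the example above).

```python
def parse_term(value):
    """Turn a string such as '60 months' into the integer 60."""
    return int(value.split()[0])


def factorize(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


def one_hot(values):
    """Return one 0/1 vector per value, with one slot per distinct category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


loan_term = ["60 months", "42 months", "36 months", "60 months"]
purpose = ["car", "home", "car", "education"]  # hypothetical values

print([parse_term(t) for t in loan_term])
# [60, 42, 36, 60]
print(factorize(purpose))
# ([0, 1, 0, 2], {'car': 0, 'home': 1, 'education': 2})
print(one_hot(purpose))
# ([[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]], ['car', 'education', 'home'])
```

The choice between these encodings matters. Parsing loan_term into a number keeps the ordering between 36 and 60 months, while one-hot encoding treats each term as an unrelated category. An automatic mapper has to make that choice per column.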
Automatic Model Selection and Creation is the next important step towards democratizing Machine Learning. Model selection is a complex process that ML engineers learn over time with experience. This process consists of choosing the best-fit ML model for a given data-set, and then tuning the hyper-parameters of the chosen model to improve accuracy and reduce over-fitting.
For a Classification problem, there are many models we can use, for instance SVMs (Support Vector Machines), Decision Trees, Random Forests and DNNs.
A Classifier’s accuracy can vary significantly across different data-sets, which makes the choice of model highly data-dependent. To make Classifiers robust, we need model-selection algorithms that mitigate this data-dependent variance.
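One standard way to choose between candidate models is k-fold cross-validation: score each candidate on held-out folds and keep the best. The sketch below assumes two deliberately tiny stand-in models (a majority-class baseline and a 1-nearest-neighbor Classifier) and synthetic data, so it runs without any ML library. A real loop would compare models such as SVMs or Random Forests and would also tune hyper-parameters.

```python
import random
from collections import Counter


def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda row: label


def fit_nearest_neighbor(X, y):
    """1-nearest-neighbor using squared Euclidean distance."""
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for x in X]
        return y[dists.index(min(dists))]
    return predict


def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(model(X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / k


def select_model(candidates, X, y):
    """Return the best candidate name and every candidate's score."""
    scores = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores


# Synthetic, well-separated two-class data (an assumption for illustration).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
X += [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(30)]
y = ["safe"] * 30 + ["risky"] * 30

candidates = {"majority": fit_majority, "nearest_neighbor": fit_nearest_neighbor}
best, scores = select_model(candidates, X, y)
print(best, scores)
```

Because every candidate is scored on data it was not trained on, the comparison reflects how well each model generalizes rather than how well it memorizes the training set.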
A fully automatic feature engineering and model creation loop is an essential step towards democratizing ML for Classification problems.
Democratizing Reinforcement Learning
Reinforcement Learning (RL) is a disruptive technology with significant potential to solve many challenging engineering and business problems. In reinforcement learning, a software agent interacts with an environment by taking actions and getting back rewards for those actions, with the goal of maximizing future cumulative reward.
Sequential Decision Making: Any process where a sequence of decisions needs to be made to maximize (or minimize) a future reward can be modeled as a Reinforcement Learning problem. For example, one or more RL agents could make a sequence of decisions that (a) maximizes the click-through rate for a customer on your advertising platform, (b) reduces the cost of your home heating and cooling bill over time, or (c) minimizes attacks on your enterprise network.
Reinforcement Learning is based on a Markov Decision Process (MDP) model, where the action taken by the agent in an environment affects both the next state of the environment and the reward generated by the environment. The outcomes are partly random and partly controlled by the agent taking the actions.
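A tiny MDP makes the ‘partly random, partly controlled’ point concrete. The sketch below is loosely modeled on the home heating example; the states, actions, probabilities and rewards are all invented for illustration. The agent picks the action, chance picks the outcome, and a policy is judged by its discounted cumulative reward.

```python
import random

# TRANSITIONS[state][action] -> list of (probability, next_state, reward).
# All numbers are hypothetical. Rewards are negative costs (energy, discomfort).
TRANSITIONS = {
    "cold": {
        "heat": [(0.9, "comfortable", -1.0), (0.1, "cold", -3.0)],
        "idle": [(1.0, "cold", -2.0)],
    },
    "comfortable": {
        "heat": [(1.0, "comfortable", -1.0)],
        "idle": [(0.7, "comfortable", 0.0), (0.3, "cold", -2.0)],
    },
}


def step(state, action, rng):
    """Sample (next_state, reward) for the chosen action."""
    outcomes = TRANSITIONS[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def rollout(policy, start, steps, gamma, rng):
    """Discounted cumulative reward of following `policy` for `steps` steps."""
    state, total = start, 0.0
    for t in range(steps):
        state, reward = step(state, policy(state), rng)
        total += (gamma ** t) * reward
    return total


def thermostat(state):
    """Heat only when it is cold."""
    return "heat" if state == "cold" else "idle"


def always_idle(state):
    """Never heat."""
    return "idle"


rng = random.Random(0)
idle_return = rollout(always_idle, "cold", 10, 0.9, rng)
thermostat_mean = sum(rollout(thermostat, "cold", 10, 0.9, rng) for _ in range(200)) / 200
print(idle_return, thermostat_mean)
```

Here the two policies are written by hand and simply compared. An RL algorithm would instead learn a policy from the rewards it observes while interacting with the environment.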
The multi-armed bandit: RL uses the notion of exploration and exploitation to learn an optimal policy that maximizes future reward. A good example of the tradeoff RL makes at each iteration is the multi-armed bandit problem, in which each arm of the bandit (think of each arm as pulling a slot-machine lever) provides rewards according to some unknown probability distribution. In that case, exploration is trying many bandit arms, while exploitation is repeatedly operating the arm that has paid best so far.
By trading off exploration against exploitation, an RL agent is able to sample a very large state space and make decisions that can maximize future reward better than any ‘directed random’ technique.
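The exploration versus exploitation tradeoff can be sketched with an epsilon-greedy agent, one of the simplest bandit strategies: with a small probability the agent tries a random arm, otherwise it pulls the arm with the best estimated payout so far. The payout probabilities, the number of pulls and the value of epsilon below are assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_probs, pulls=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit and return (estimates, counts, total_reward).

    `true_probs` are the hidden payout probabilities of each arm. The agent
    never reads them directly; it only sees the sampled rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                          # explore
        else:
            arm = max(range(n_arms), key=estimates.__getitem__)  # exploit
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total_reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(counts, [round(e, 2) for e in estimates], total_reward)
```

With epsilon set to 0 the agent can lock onto the first arm that happens to pay out. With epsilon set to 1 it never uses what it has learned. The useful range is in between.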
To democratize reinforcement learning, we need to enable users to express their intent in a manner that can be mapped into an RL problem. There are many current efforts in industry and research towards democratizing RL. Some of these efforts include Facebook’s Horizon, AWS SageMaker-RL, OpenAI’s SpinningUp, Google’s Dopamine and others.
Capturing user intent: The key issue here is capturing the user’s intent and mapping it to an RL algorithm and a DNN architecture. While there has been a lot of effort on optimizing RL algorithms, there needs to be a significant effort to capture user intent and to raise the abstraction level of the RL environment in order to solve many real-world engineering, scientific and business problems.
Checks and Balances — Verifying Machine Learning Models
All good democracies have checks and balances that protect the long-term sustainability of a nation. Software development based on training neural networks has been termed Software 2.0, perhaps appropriately so. To sustain this new paradigm, we need to be able to verify Machine Learning models and maintain checks and balances. A large number of decisions are being made by Machine Learning models, and this trend is accelerating into every aspect of our lives.
Courts, law enforcement, the DMV, insurance companies, banks, hospitals, schools and many more institutions are using or will be using Machine Learning to make predictions about our daily lives.
Black Box Models: An incorrect prediction may affect individuals in an unintended and undesirable manner. If we project out into the future with the hypothesis that more decisions will be made by Machine Learning models than by humans, verifying Machine Learning models is an area that needs attention and innovation.
Explainable and Interpretable Machine Learning are active fields of research, where the algorithms try to explain the decisions made by black-box models such as a DNN or an SVM Classifier. These algorithms also estimate how much confidence a prediction deserves and attribute the prediction to individual features. Interpretable ML techniques provide insight into the ML models and into the bias in the input training data. Some of these algorithms can help us explain and correct mis-predictions and introspect a Machine Learning model. Interpretable and verifiable ML algorithms are a key part of the checks and balances required to democratize Machine Learning.
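One simple, model-agnostic way to attribute predictions to individual features is permutation importance: shuffle one feature column at a time and measure how much the model’s accuracy drops. The sketch below treats the model as an opaque function. The ‘black box’ here is a hypothetical hand-written rule standing in for a trained model, and the feature meanings and data are invented for illustration.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, seed=0):
    """Accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the data-set with only column j replaced by its shuffled copy.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(model, X_perm, y))
    return importances


def black_box(row):
    """Hypothetical stand-in model: only feature 0 (late payments) matters."""
    return "risky" if row[0] > 2 else "safe"


# Synthetic rows: [late_payments, annual_income_in_thousands].
rng = random.Random(1)
X = [[rng.randint(0, 5), rng.randint(20, 150)] for _ in range(200)]
y = [black_box(row) for row in X]

importances = permutation_importance(black_box, X, y)
print(importances)
```

The second feature gets an importance of exactly zero because the stand-in model never reads it. The same probe applied to a real model can reveal that it leans on a feature it should not be using.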
With the advent of new ML algorithms such as generative adversarial networks (GANs), it is possible to skew the inputs just enough to get an incorrect prediction from an ML model. There will perhaps be many more advances in Machine Learning like GANs that will make verification of ML models a significant challenge. On the flip side, technologies such as RL and GANs can be used to verify complex systems in a manner that is impossible with traditional methods.
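To show what ‘skewing the inputs just enough’ can look like, here is a minimal sketch using a gradient-sign perturbation on a small logistic model. Note that this is a simpler technique than a GAN, substituted here because it fits in a few lines. The weights, bias and input are hand-set assumptions, not a trained model.

```python
from math import exp


def predict_proba(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + exp(-z))


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def perturb(weights, x, epsilon, toward_positive=True):
    """Shift every input by at most epsilon in the direction that moves the score most.

    For a linear score, the gradient with respect to the input is the weight
    vector, so the sign of each weight gives the direction to push.
    """
    direction = 1 if toward_positive else -1
    return [v + direction * epsilon * sign(w) for v, w in zip(x, weights)]


# Hypothetical hand-set model and input.
weights, bias = [1.5, -2.0, 0.5], -0.2
x = [0.4, 0.3, 0.2]
x_adv = perturb(weights, x, epsilon=0.1)

print(round(predict_proba(weights, bias, x), 3))      # 0.475 -> negative class
print(round(predict_proba(weights, bias, x_adv), 3))  # 0.574 -> positive class
```

No input moved by more than 0.1, yet the predicted class flipped. A verification methodology has to account for inputs that sit this close to a decision boundary.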
Next Generation Verification: To democratize Machine Learning we need a verification methodology to verify Models and Algorithms produced by Software 2.0 and Software 1.0.
Impact of Democratizing Machine Learning
While it may be nontrivial to measure the impact of Machine Learning, many industry luminaries and academics have compared AI/ML to the ‘electricity’ that will enable the 4th industrial revolution. Even if we were to discount the hyperbole describing AI and Machine Learning, it is still a technology that will impact a majority of industries and many aspects of society. For instance, contrary to the hype, we may not have fully autonomous self-driving cars (level 5) on the near horizon, but we will still see human-assisted self-driving cars, autonomous package delivery vehicles and fully autonomous golf carts in the near future.
The economic impact of Machine Learning will depend on how accessible, verifiable, interpretable and useable we make these technologies for people to build upon. Machine Learning will have significant implications for many current jobs. Current tasks that require repetitive computer actions will be automated. Democratization of ML will lead to an understanding and acceptance of such technologies in our daily lives. This in turn may help alleviate concerns about job displacement and will lead to new jobs that don’t exist today.
An investor recently told me that Machine Learning is already democratized, since his 11-year-old son (a 6th grader) is building ML apps using Python and TensorFlow (this is very encouraging!). While his observation is accurate in multiple dimensions, I argue that we are at the very beginning stages of democratizing Machine Learning.
“mass production has not democratized fashion — yet!”
Democratizing Machine Learning to us means empowering people from all disciplines to solve problems that are currently deemed unsolvable.