This page is part of the Responsible AI series.
Index
- Introduction
- Interpretability, Explainability, Transparency
- Applications
- Further readings
Introduction
Interpretability, Explainability, Transparency
Definition of Transparency
Transparency is a vague term that is difficult to measure. How might one determine whether an algorithmic system is transparent enough? Would dumping large files of data, source code, and documentation onto the public make a system transparent? Perhaps, but this type of transparency is not meaningful to the general public, as the documentation may not be directly relevant or comprehensible. Data dumps can also lead to information overload, since one might not know where to start to understand how they are impacted by the system. We argue instead for meaningful transparency.
Meaningful transparency is driven by stakeholder information needs. It means delivering the information pertinent to each stakeholder group in the format that is best suited for their understanding.
From a societal viewpoint, there are three levels of meaningful algorithmic transparency:
- Level 0: Visibility Baseline. This might include basic information on the existence of the system, its scope and owner.
- Level 1: Process Visibility. This includes disclosures about the system’s design and the processes that govern it. This information is helpful to assess the system’s implementation of responsible use safeguards.
- Level 2: Outcome Visibility. This includes disclosures related to the outcomes that the system produces. This information should be assessed to understand the system’s compliance with Responsible Use principles: fairness, explainability, security, safety, robustness, and privacy.

Sources
Definition of Interpretability
There is no mathematical definition of interpretability. A (non-mathematical) definition by Miller (2017) is: "Interpretability is the degree to which a human can understand the cause of a decision". Another is: "Interpretability is the degree to which a human can consistently predict the model's result". The higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. One model is more interpretable than another if its decisions are easier for a human to comprehend.
Why interpretability is important
- Trust. Interpretability in machine learning is useful because it can aid in trust. As humans, we may be reluctant to rely on machine learning models for certain critical tasks, e.g. medical diagnosis, unless we know "how they work."
- Safety. There is almost always some shift in distribution between model training and deployment. Failures to generalize, or phenomena like Goodhart's law (e.g. specification gaming), are open problems that could lead to issues in the near future. Interpretability approaches that explain the model's representations, or which features are most relevant, could help diagnose these issues earlier and provide more opportunities to remedy the situation.
- Contestability. Perhaps most interestingly, as we delegate more decision-making to ML models, it becomes important for people to be able to appeal those decisions. Black-box models provide no such recourse because they do not decompose the decision into anything contestable.
When we don't need interpretability
- Interpretability is not required if the model has no significant impact, socially or financially.
- Interpretability is not required when the problem is well studied (e.g. optical character recognition).
- Interpretability might enable people or programs to manipulate the system. Problems with users deceiving a system arise from a mismatch between the goals of the model's creator and its users. Credit scoring is such a system: banks want to ensure that loans are only given to applicants who are likely to repay them, while applicants aim to get the loan even if the bank does not want to give them one. This mismatch of goals creates incentives for applicants to game the system to increase their chances of getting a loan. If an applicant knows that having more than two credit cards negatively affects his score, he simply returns his third credit card to improve his score and applies for a new card after the loan has been approved. While his score improved, the actual probability of repaying the loan remained unchanged. The system can only be gamed if the inputs are proxies for a causal feature but do not actually cause the outcome. Whenever possible, proxy features should be avoided, as they make models gameable.
How interpretability is achieved
Interpretability vs Explainability
A model is interpretable if it is capable of being understood by humans on its own. One can look at the model parameters or a model summary and understand exactly how a prediction was made. Another term for these models is intrinsically interpretable model. An example of an interpretable model is a decision tree: to understand how a prediction was made, one simply has to traverse down the nodes of the tree, as in the sketch below. Interpretable models are also referred to as white box models.
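A minimal sketch of this idea, assuming scikit-learn and using the Iris dataset purely for illustration: the learned tree can be printed as a set of human-readable rules, and a prediction is just a walk down those rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow decision tree: an intrinsically interpretable (white box) model.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The learned rules can be read directly; a prediction is a walk down the tree.
print(export_text(tree, feature_names=list(iris.feature_names)))
```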
Explainable models, on the other hand, are black-box models: functions whose mapping from inputs to outputs is too complicated for humans to understand on its own, and which therefore require an additional method or technique to understand how the model works.
A black box model is a system that does not reveal its internal mechanisms. In machine learning, “black box” describes models that cannot be understood by looking at their parameters (e.g. a neural network).
In general, there is no formal criterion for this classification based on human comprehension, and the goal of interpretation can be pursued independently of it, based only on the type of model and the questions we seek to answer.
Note that model-agnostic interpretability methods treat machine learning models as black boxes, even when they are not.
Transparency as interpretability
Lipton et al. define "transparency as interpretability" as the model properties that are useful to understand and that can be known before training begins. In this case, models are intrinsically interpretable.
- Simulatability: Can a human walk through the model's steps? This property addresses whether or not a human could go through each step of the algorithm and check if each step is reasonable to them. Linear models and decision trees are often cited as interpretable models using such justifications.
- Decomposability: Is the model interpretable at every step or with regards to its sub-components? This property addresses whether or not a human could understand what the model is doing at each step. For example, in a decision tree whose nodes correspond to easily identifiable factors like age or height, the model's prediction can be interpreted in terms of what decisions are taken at different nodes of the tree. In general, such a detailed analysis (of decisions taken by the model per-timestep) is difficult because the model's performance is very tightly coupled with the representations used (i.e. raw features).
- Algorithmic Transparency: Does the algorithm confer any guarantees? This question asks if our learning algorithm has any desirable properties which are easy to understand. For example, we might know that the algorithm only outputs sparse models (see the sketch after this list), or perhaps it always converges to a specific type of solution. In these cases, the resulting learned model can be more amenable to analysis. Deep learning models, by contrast, do not allow a direct analogy to the notion of some unique set of weights that perform well on the task at hand.
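As a minimal sketch of a sparsity guarantee, assuming scikit-learn and a synthetic dataset chosen purely for illustration: an L1-regularized linear model (Lasso) drives most coefficients to exactly zero, so the handful of non-zero weights can be inspected directly.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data where only a few features actually matter.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)

# L1 regularization guarantees sparse solutions: most coefficients end up exactly zero.
model = Lasso(alpha=1.0).fit(X, y)
print(model.coef_)
```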
Post-hoc interpretability
Lipton et al. define "post-hoc interpretability" as the things we can learn from the model after training has finished.
- Text Explanation: Can the model explain its decision in natural language, after the fact? Just as humans can provide post-hoc justifications for their actions, it could be informative to have models that explain themselves, for example as natural language statements.
- Visualization/Local Explanations: Can the model identify what is important to its decision-making? This question focuses on how changes in the inputs relate to changes in the outputs.
Saliency maps are a broad class of approaches that look at how a change in the input (or parts of the input) changes the output. A straightforward way to do this is to take the derivative of the loss function with respect to the input. Beyond this simplistic approach, many modifications involve averaging the gradient, perturbing the input, or local approximations (a minimal sketch follows this list).
- Explanation by Example: Can the model show what else in the training data it thinks is related to this input/output? This question asks which other training examples are similar to the current input. This is probably less important for understanding what the model is doing.
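A minimal sketch of the gradient-based saliency idea, assuming PyTorch; the tiny network and random input below are hypothetical placeholders, any differentiable model would work the same way.

```python
import torch
import torch.nn as nn

# A tiny stand-in model; any differentiable model would work the same way.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.randn(1, 10, requires_grad=True)  # the input we want to explain

# Take the derivative of the output with respect to the input.
model(x).sum().backward()

# The magnitude of the gradient per input dimension is a simple saliency score.
saliency = x.grad.abs()
print(saliency)
```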
Deliverables of interpretation methods
- Feature summary statistic: Many interpretation methods provide summary statistics for each feature. Some methods return a single number per feature, such as feature importance, or a more complex result, such as the pairwise feature interaction strengths, which consist of a number for each feature pair.
- Feature summary visualization: Most of the feature summary statistics can also be visualized. Some feature summaries are actually only meaningful if they are visualized, and a table would be a wrong choice. The partial dependence of a feature is such a case. Partial dependence plots are curves that show how the average predicted outcome changes as a feature changes. The best way to present partial dependences is to actually draw the curve instead of printing the coordinates.
- Model internals (e.g. learned weights): The interpretation of intrinsically interpretable models falls into this category. Examples are the weights in linear models or the learned tree structure (the features and thresholds used for the splits) of decision trees. The lines are blurred between model internals and feature summary statistic in, for example, linear models, because the weights are both model internals and summary statistics for the features at the same time. Another method that outputs model internals is the visualization of feature detectors learned in convolutional neural networks. Interpretability methods that output model internals are by definition model-specific (see next criterion).
- Data point: This category includes all methods that return data points (already existent or newly created) to make a model interpretable. One method is called counterfactual explanations. To explain the prediction of a data instance, the method finds a similar data point by changing some of the features for which the predicted outcome changes in a relevant way (e.g. a flip in the predicted class). Another example is the identification of prototypes of predicted classes. To be useful, interpretation methods that output new data points require that the data points themselves can be interpreted. This works well for images and texts, but is less useful for tabular data with hundreds of features.
- Intrinsically interpretable model: One solution to interpreting black box models is to approximate them (either globally or locally) with an interpretable model. The interpretable model itself is interpreted by looking at internal model parameters or feature summary statistics (see the sketch after this list).
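A minimal sketch of the surrogate-model idea, assuming scikit-learn; the black-box model and dataset are chosen purely for illustration. A shallow decision tree is fit to the predictions of a random forest and then read as an approximate, global explanation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# The black-box model we want to approximate.
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# Global surrogate: fit a shallow tree on the black box's predictions, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(data.data, black_box.predict(data.data))

# The surrogate's rules serve as an approximate explanation of the black box.
print(export_text(surrogate, feature_names=list(data.feature_names)))
```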
Other differences between interpretability methods
- Model-specific or model-agnostic? Model-specific interpretation tools are limited to specific model classes. The interpretation of regression weights in a linear model is a model-specific interpretation, since, by definition, the interpretation of intrinsically interpretable models is always model-specific. Tools that only work for the interpretation of, e.g., neural networks are model-specific. Model-agnostic tools can be used on any machine learning model and are applied after the model has been trained (post hoc). These agnostic methods usually work by analyzing feature input and output pairs (see the sketch after this list). By definition, these methods cannot have access to model internals such as weights or structural information.
- Local or global? Does the interpretation method explain an individual prediction or the entire model behavior? Or is the scope somewhere in between?
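A minimal sketch of a model-agnostic, post-hoc method, assuming scikit-learn (the model and dataset are purely illustrative): permutation importance only needs access to inputs, predictions, and a score, never to model internals.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in score: purely input/output based.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```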
Sources
Papers
Applications
Interpretability solutions
Interpretability solutions are covered on the dedicated pages of the specific algorithms. Here's the list:
Random Forest
Post-hoc Explainability solutions
Others
- Alibi is a Python library aimed at machine learning model inspection and interpretation. The focus of the library is to provide high-quality implementations of black-box, white-box, local and global explanation methods for classification and regression models.
- Shapash is a Python library designed to make machine learning interpretable and comprehensible for everyone. It offers various visualizations with clear and explicit labels that are easily understood by all.
Mixed solutions
InterpretML
Transparency for LLMs
Further readings
TO BE REVIEWED
In the context of machine learning and artificial intelligence, explainability and interpretability are often used interchangeably. They are very closely related, but there are some differences:
Interpretability is about the extent to which a cause and effect can be observed within a system, i.e. the extent to which you are able to predict what is going to happen, given a change in input or algorithmic parameters. It’s being able to look at an algorithm and go yep, I can see what’s happening here.
Explainability, meanwhile, is the extent to which the internal mechanics of a machine or deep learning system can be explained in human terms. It’s easy to miss the subtle difference with interpretability, but consider it like this: interpretability is about being able to discern the mechanics without necessarily knowing why. Explainability is being able to quite literally explain what is happening.
Sources
Explainability frameworks for each algorithm
Explainability for logistic regression
Explainability for Random Forest
Built-in function in
Sources
Explainability for XGBoost
XGBoost has built-in explainability methods, based on weight (the number of times a feature is used to split the trees), cover (the number of times a feature is used to split the data across all trees, weighted by the number of training data points that go through those splits), or gain (the average training loss reduction gained when using a feature for splitting).
Unfortunately, these methods are not always consistent with each other, making it difficult to compare feature importance across models.
Therefore, for XGBoost it is advisable to use SHAP values, since they do not suffer from these inconsistency issues (the sketch below shows how the built-in importance types can disagree).
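A minimal sketch of the built-in importance types, assuming the xgboost and scikit-learn packages (the dataset is chosen purely for illustration): the same fitted model can produce different feature rankings depending on the importance type requested.

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

booster = model.get_booster()
# The same model yields different feature rankings for each importance type.
for importance_type in ("weight", "cover", "gain"):
    print(importance_type, booster.get_score(importance_type=importance_type))
```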
Sources
Explainability for image recognition
- Model-aware techniques such as Grad-CAM extract information from the model itself, e.g. the activations of its inner layers, and produce heatmaps over the input image.
- Perturbation methods (e.g. LIME or occlusion) work on the data: they introduce small perturbations in the input and observe how the output changes (a minimal sketch follows this list).
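A minimal sketch of the perturbation idea as an occlusion map (this is plain occlusion sensitivity, not the LIME algorithm itself); `predict_fn` is a hypothetical stand-in for any image classifier that returns class probabilities.

```python
import numpy as np

def occlusion_map(image, predict_fn, target_class, patch=8):
    """Grey out one patch at a time and record the drop in the target-class score."""
    h, w = image.shape[:2]
    base_score = predict_fn(image[None])[0, target_class]
    heatmap = np.zeros((h // patch, w // patch))
    for i in range(0, h - h % patch, patch):
        for j in range(0, w - w % patch, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.5  # grey patch
            heatmap[i // patch, j // patch] = base_score - predict_fn(occluded[None])[0, target_class]
    return heatmap  # large values mark regions the prediction depends on
```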
Explainability for Graph Neural Networks
Sources
Explainability frameworks
SHAP (SHapley Additive exPlanations)
SHAP values are based on Shapley values, a concept coming from game theory: Shapley values quantify the contribution that each player brings to a game, and SHAP values quantify the contribution that each feature brings to the prediction made by the model. A "game" concerns a single observation. Indeed, SHAP is about local interpretability of a predictive model.
SHAP values are based on the idea that the outcome of each possible combination of features (= coalition of players) should be considered to determine the importance of a single feature (= player).

In math, this is called a “power set” and can be represented as a tree.
Each node represents a coalition of features. Each edge represents the inclusion of a feature not present in the previous coalition.
The cardinality of a power set is 2^n, where n is the number of elements of the original set. SHAP therefore requires training a distinct predictive model for each coalition in the power set, i.e. 2^F models, where F is the number of features. These models are identical in their hyperparameters and training data; the only thing that changes is the set of features included in the model.
As seen above, two nodes connected by an edge differ by just one feature: the lower node has exactly the same features as the upper one, plus one additional feature the upper one did not have. Therefore, the gap between the predictions of two connected nodes can be attributed to the effect of that additional feature. This is called the "marginal contribution" of a feature.
Therefore, each edge represents the marginal contribution brought by a feature to a model.
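Concretely, the Shapley value of feature i is the weighted average of its marginal contributions over all coalitions S that do not contain i (this is the standard Shapley formula; F is the set of features and f_S the model trained on coalition S):

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S\big(x_S\big) \right]
$$

In practice one rarely trains 2^F models; libraries approximate these values. A minimal usage sketch, assuming the shap package and a tree-based model (model and dataset are purely illustrative):

```python
import shap
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

# TreeExplainer computes (approximate) SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# One row of shap_values explains one prediction: a contribution per feature.
print(shap_values[0])
```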
Sources
LIME (Local Interpretable Model-Agnostic Explanations)
LIME is a perturbation method: it introduces small perturbations in the input and observes how they are reflected in the predictions. The perturbations are expressed in terms that make sense to humans (e.g. words or parts of an image), even if the model internally uses more complicated components, which makes the explanation easy to interpret.
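A minimal usage sketch for tabular data, assuming the lime package and scikit-learn (the model and dataset are purely illustrative): LIME perturbs the chosen instance and fits a local linear model to approximate the black box around it.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

# Explain a single prediction via perturbations around that instance.
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(explanation.as_list())
```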
Sources