SHAP & LIME
Most production AI models are black boxes: a set of inputs goes in, a prediction comes out, and the intermediate reasoning is not directly observable. Post-hoc explainability methods — applied after the model is trained — attempt to explain individual predictions without requiring access to or modification of the model architecture. SHAP and LIME are the two most widely used post-hoc methods and represent different philosophical approaches to the problem.
LIME — Local Interpretable Model-Agnostic Explanations
LIME (Ribeiro et al., 2016) generates an explanation for a single prediction by fitting a simple, interpretable model (typically linear regression) to locally sampled perturbations of the input. The intuition: even if the model is globally non-linear and complex, it can often be approximated by a linear function in the small neighbourhood around a specific input.
LIME algorithm
- Take the input instance x to be explained
- Generate a set of perturbed samples z₁, z₂, ..., zₙ in the neighbourhood of x (e.g., randomly mask features or tokens)
- Query the black-box model f for predictions on each perturbed sample: f(z₁), f(z₂), ..., f(zₙ)
- Weight each sample by its proximity to x (closer samples get higher weight)
- Fit a sparse linear model g to the weighted samples: g is the explanation
- The coefficients of g represent feature importances for this specific prediction
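The steps above can be sketched from scratch in a few lines (the `lime` library packages the same procedure with more care). The toy model, the perturbation scale, and the kernel width below are illustrative assumptions, not library defaults:

```python
# Minimal from-scratch sketch of the LIME procedure for a tabular model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy black-box model f: depends strongly on feature 0, weakly on feature 1
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)
f = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

x = X[0]                                          # instance to explain

# 1. Perturb: sample points in the neighbourhood of x
Z = x + rng.normal(scale=0.5, size=(1000, 4))
# 2. Query the black box on each perturbed sample
preds = f.predict(Z)
# 3. Weight samples by proximity to x (Gaussian kernel on distance)
dists = np.linalg.norm(Z - x, axis=1)
weights = np.exp(-dists**2 / 0.5)
# 4. Fit a regularised linear surrogate g to the weighted samples
g = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
# 5. Coefficients of g are the local feature importances
print(dict(enumerate(g.coef_.round(2))))
```

The surrogate's largest coefficient lands on feature 0, matching the toy model's true dominant feature; the irrelevant features 2 and 3 receive near-zero weight.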
LIME strengths
- Model-agnostic — works on any classifier or regressor
- Produces human-readable feature attribution (e.g., highlighted text spans, image superpixels)
- Fast for a single-instance explanation
- Easy to implement and widely supported in Python (lime library)
LIME limitations
- Instability: the random perturbation sampling can produce different explanations on repeated runs
- Local fidelity does not guarantee global consistency
- Neighbourhood definition (what counts as "close"?) is arbitrary and affects results
- No axiomatic guarantees: for example, attributions are not required to sum to the model's prediction
SHAP — SHapley Additive exPlanations
SHAP (Lundberg & Lee, 2017) is grounded in cooperative game theory. It assigns each feature a contribution value (Shapley value) that represents the average marginal contribution of that feature across all possible subsets of features. SHAP values satisfy a set of desirable mathematical properties that LIME does not guarantee.
SHAP axioms
- Efficiency: Feature contributions sum to the difference between the model output and the expected model output (f(x) − E[f(X)]). The full prediction is accounted for.
- Symmetry: Two features with identical marginal contributions to every feature subset receive equal SHAP values
- Dummy: A feature that has no influence on any prediction receives a SHAP value of 0
- Linearity / Additivity: For a model that is a sum of two sub-models, the SHAP values are the sum of the SHAP values for each sub-model
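To make the axioms concrete, here is a brute-force computation of exact Shapley values for a toy three-feature function (the function, instance, and baseline are invented for illustration). The efficiency and symmetry axioms can be verified directly on the result:

```python
# Exact Shapley values by enumerating all feature subsets (exponential cost).
from itertools import combinations
from math import factorial
import numpy as np

def f(x):                                  # toy black box: x0 + 2*x1*x2
    return x[0] + 2 * x[1] * x[2]

x = np.array([1.0, 2.0, 3.0])              # instance to explain
background = np.zeros(3)                   # baseline feature values

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for S in combinations(others, size):
            # Output with features in S (and then S ∪ {i}) taken from x,
            # everything else from the baseline
            z = background.copy()
            z[list(S)] = x[list(S)]
            without_i = f(z)
            z[i] = x[i]
            with_i = f(z)
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi[i] += weight * (with_i - without_i)

# Efficiency: contributions sum to f(x) - f(baseline)
print(phi, phi.sum(), f(x) - f(background))
```

Features 1 and 2 play symmetric roles in the interaction term, so they receive equal values (symmetry), and the three contributions sum exactly to f(x) − f(baseline) (efficiency).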
Because computing exact Shapley values requires summing over all 2ⁿ feature subsets (exponential in the number of features), SHAP provides efficient approximations for specific model types:
| SHAP variant | Model type | Speed |
|---|---|---|
| TreeSHAP | Tree ensembles (XGBoost, LightGBM, Random Forest) | O(TLD²) for T trees, L leaves, depth D — very fast |
| DeepSHAP | Deep neural networks | Fast — backpropagation-based (builds on DeepLIFT) |
| LinearSHAP | Linear models (logistic regression) | Exact, very fast |
| KernelSHAP | Any model (model-agnostic) | Slow — uses LIME-style sampling |
SHAP vs LIME: Practical Comparison
| Dimension | SHAP | LIME |
|---|---|---|
| Mathematical guarantees | Yes — efficiency, symmetry, dummy, additivity | No formal guarantees |
| Consistency | Deterministic for tree/linear models | Stochastic — can vary between runs |
| Model support | Fast for trees/linear; slow (KernelSHAP) for arbitrary models | Fully model-agnostic; similar speed everywhere |
| Global summaries | Yes — beeswarm, bar, and waterfall plots across full dataset | Primarily local; no native global view |
| Interaction effects | SHAP interaction values available for trees | Not supported |
| Recommended for | Production explainability; regulatory audits; model debugging | Quick local explanations; NLP/text classification; prototyping |
Explainability for Large Language Models
Applying SHAP and LIME to LLMs is significantly more complex than for tabular or image models. The "features" in text are tokens, but token-level attribution is often not meaningful to non-technical audiences. Key approaches:
- Attention visualisation: Plots which input tokens the model attended to. Easy to implement but critiqued as not reliably representing causal importance — high attention does not guarantee causal contribution.
- SHAP for text classifiers: Works well for classification heads on top of LLMs (e.g., sentiment, intent). Token masking + prediction comparison produces token attributions.
- Gradient-based attribution: Integrated Gradients (Sundararajan et al., 2017) accumulates gradients of the output with respect to the input embeddings along a straight-line path from a baseline to the input — theoretically grounded and typically faster than sampling-based methods.
- Logit lens / probing: Analyses intermediate layer representations to understand what information is encoded at each layer of the transformer — more of a model debugging tool than a user-facing explanation.
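The token-masking idea behind SHAP-style text attribution can be sketched without any LLM machinery, using a toy scikit-learn classifier (the corpus, labels, and function name here are invented for illustration; a real setup would pair `shap` with a transformers pipeline):

```python
# Token-masking attribution for a text classifier: score each token by how
# much the positive-class probability drops when that token is removed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie loved it", "terrible film hated it",
         "loved the acting", "hated the plot",
         "great acting", "terrible plot"]
labels = [1, 0, 1, 0, 1, 0]                # 1 = positive sentiment
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

def token_attributions(sentence):
    """Attribution per token: drop in P(positive) when the token is masked."""
    tokens = sentence.split()
    base = clf.predict_proba([sentence])[0, 1]
    attributions = {}
    for i, tok in enumerate(tokens):
        masked = " ".join(tokens[:i] + tokens[i + 1:])   # drop one token
        attributions[tok] = base - clf.predict_proba([masked])[0, 1]
    return attributions

scores = token_attributions("loved the great acting")
print(scores)
```

Sentiment-bearing tokens like "loved" receive larger attributions than filler tokens like "the", which is exactly the kind of output that can be rendered as highlighted text spans for end users.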
Visualising Explanations for Non-Technical Audiences
SHAP visualisations (from the shap library)
- Waterfall plot: Shows how each feature pushes the prediction up or down from the base value — best for a single-instance explanation
- Beeswarm plot: Shows distribution of SHAP values for all features across the full dataset — best for global feature importance
- Dependence plot: Shows how SHAP value for one feature varies with feature value — surfaces non-linear relationships
Communication to stakeholders
- For regulators: "The three most influential factors in this credit decision were..."
- For affected individuals: highlight which inputs led to a negative outcome and whether any were within their control
- For developers: waterfall + dependence plots to debug unexpected feature usage
- For executives: bar chart of global feature importances, not per-instance detail
Checklist: Do You Understand This?
- What is the core idea behind LIME — what kind of model does it fit and to what data?
- Why does SHAP use game-theoretic Shapley values rather than simple gradient-based attribution?
- Name the four SHAP axioms and explain what the efficiency axiom means in practical terms.
- When would you choose LIME over SHAP, and when would you choose SHAP over LIME?
- What is the main criticism of attention visualisation as an explanation method for LLMs?
- Which SHAP visualisation would you use to explain a single prediction to a regulator?