A hands-on walkthrough of Error Analysis, Model Overview, Data Analysis, Feature Importances, and Counterfactuals — applied to a real-world bank account fraud detection model.

What Is Responsible AI – And Why Does It Matter?

As machine learning models increasingly influence high-stakes decisions from credit approvals and fraud detection to hiring and medical diagnosis the question of how a model reaches its conclusions becomes just as important as whether it reaches correct ones. A model that achieves 84% accuracy on a fraud detection task may still be systematically failing a particular demographic, treating missing data as a red flag, or drawing on features that should have no business relevance.

Responsible AI (RAI) is the discipline of building, evaluating, and deploying machine learning systems that are transparent, fair, reliable, and accountable. Microsoft has embedded this philosophy directly into Azure Machine Learning through its RAI Dashboard a unified interface combining error analysis, model explainability, data exploration, and counterfactual what-if analysis into a single navigable view.

This post walks through each component of the RAI Dashboard using a real LightGBM classifier trained on the Bank Account Fraud Detection dataset. Rather than focusing on how to set up the dashboard, we focus on how to read and interpret what it shows you for both technical practitioners and business stakeholders.

Environment Compatibility Note

A common issue when using the RAI Dashboard is a warning about incompatible package versions. Ensure your environment pins raiwidgets, responsibleai, interpret-community, and erroranalysis to matching versions from the RAI requirements file. Mismatched packages can grey out components like the Error Heatmap.


The Dataset: Bank Account Fraud Detection (NeurIPS 2022)

The model in this walkthrough was trained on the Bank Account Fraud (BAF) Dataset, introduced at NeurIPS 2022. It is specifically designed to benchmark fraud detection under realistic conditions including class imbalance, temporal drift, and demographic fairness considerations. We use the Base variant of the dataset.

Each row represents a single bank account opening application. The target variable indicates whether the application was fraudulent (1 = Fraud, 0 = Not Fraud). The dataset is intentionally challenging fraud cases are a small minority of all applications, and several features exhibit distribution shifts that make generalisation difficult.

  • Binary Classification
  • Tabular Data
  • Fraud Detection
  • Class Imbalance
  • NeurIPS 2022
FeatureDescription
housing_statusApplicant’s housing arrangement (BA, BC, BE, BF, …)
current_address_months_countMonths lived at current address
prev_address_months_countMonths at previous address (-1 = missing)
has_other_cardsBinary — whether applicant holds other payment cards
device_osOperating system of the device used to apply
velocity_6hApplications submitted from same source in past 6 hours
phone_home_validWhether the provided home phone number is valid
incomeNormalised income score (0–1 scale)
credit_risk_scoreInternal credit risk score
name_email_similaritySimilarity score between applicant name and email address

Constructing the RAI Dashboard with Azure ML Components

The RAI Dashboard is built as a pipeline of registered components pulled from the Azure ML registry. Each component targets a specific analytical lens. It is critical to fetch all components at the same version mixing versions is a common source of dashboard rendering failures.

# Connect to the Azure ML registry that hosts RAI components
registry_name = "azureml"
ml_client_reg = MLClient(credential, subscription_id, resource_group,
                          registry_name=registry_name)

# Pull all analytical components at the same version
rai_ctor        = ml_client_reg.components.get(name="rai_tabular_insight_constructor", label="latest")
v               = rai_ctor.version
rai_explain     = ml_client_reg.components.get(name="rai_tabular_explanation",     version=v)
rai_counterfact = ml_client_reg.components.get(name="rai_tabular_counterfactual",  version=v)
rai_error       = ml_client_reg.components.get(name="rai_tabular_erroranalysis",   version=v)
rai_gather      = ml_client_reg.components.get(name="rai_tabular_insight_gather",  version=v)
rai_scorecard   = ml_client_reg.components.get(name="rai_tabular_score_card",      version=v)
print(f"RAI components version: {{v}}")

The rai_gather component is the final step — it assembles all individual insights into the unified dashboard. Pipeline flow:

rai_ctorrai_error + rai_explain + rai_counterfact rai_gather RAI Dashboard

Error Analysis — Where Does the Model Struggle?

The Error Analysis component answers a fundamental question: is model error randomly distributed, or concentrated in specific subgroups? A model with a 16% overall error rate could be failing uniformly, or catastrophically wrong for one particular cohort. The Error Tree reveals which scenario you are in.

Figure 1: Error Analysis — Decision tree showing error concentration across feature-based cohorts
Figure 1: Error Analysis — Decision tree showing error concentration across feature-based cohorts
Reading the Error Tree

Each node shows errors / total. The root node confirms the global picture — 780 out of 4,900 test instances were misclassified (~15.9% error rate). The first split on housing_status == BA immediately isolates a problematic subgroup. A second split on has_other_cards <= 0.00, then a third on current_address_months_count > 35.50, leads to the darkest, most error-saturated node in the tree.

🚨 Critical Finding

The worst-performing cohort — applicants with BA housing status, no other cards, and current address tenure exceeding 35 months — has an error rate of 340 out of 525 (64.8%). The model is getting nearly 2 out of 3 predictions wrong for this group. The right branch (non-BA applicants, 4,054 total) has an error rate of just 8.8%.

For a fraud model, misclassification is not symmetric. A false negative (missing actual fraud) carries direct financial loss. A false positive (flagging a legitimate customer) creates friction and operational cost. The error tree tells risk teams exactly which customer profiles should trigger additional human review, and tells data scientists exactly where to focus retraining.


Model Overview — Performance at a Glance

The Model Overview tab presents aggregate performance metrics, a confusion matrix, metric visualisations, and a probability distribution plot — a 360-degree view of how the model behaves across the full test set.

Aggregate Metrics
Figure 2: Aggregate performance metrics across the full test set
Figure 2: Aggregate performance metrics across the full test set

At first glance, 84.1% accuracy appears reasonable. However, for a fraud detection model, accuracy alone is misleading due to class imbalance. The metrics that matter more are the False Negative Rate (18.5%) — representing fraud cases the model missed — and the False Positive Rate (15.9%) — representing legitimate customers incorrectly flagged.

Confusion Matrix
Figure 3: Confusion Matrix — true vs. predicted labels for Fraud and Not Fraud
Figure 3: Confusion Matrix — true vs. predicted labels for Fraud and Not Fraud

Of the 54 actual fraud cases, the model correctly identified 44 but missed 10 (false negatives = undetected fraud). Of approximately 4,846 legitimate applications, 770 were incorrectly flagged as fraudulent — a significant false alarm rate that would create substantial operational burden if every flagged case required manual review.

Recall Score
Figure 4: Recall Score — Model achieves 81.5% recall on the Fraud class
Figure 4: Recall Score — Model achieves 81.5% recall on the Fraud class

The recall bar confirms a fraud class recall of 81.5% — the model catches just over 4 in 5 fraud cases. The remaining 18.5% miss rate is a tangible risk in a banking context. Depending on business risk appetite, the decision threshold may need to be lowered, or targeted retraining on the cohorts identified in Error Analysis may improve recall without inflating false positives.

Probability Distribution
Figure 5: Probability distribution — score separation between Fraud (blue) and Not Fraud (orange)
Figure 5: Probability distribution — score separation between Fraud (blue) and Not Fraud (orange)

The orange distribution (Not Fraud) is tightly clustered between 0 and 0.2, while the blue distribution (Fraud) is concentrated between 0.65 and 1.0. The wide gap between 0.2 and 0.65 indicates a clean decision boundary — the model is making confident, not marginal, predictions. This separation is a hallmark of a well-calibrated model.


Data Analysis — Understanding Your Data Distribution

The Data Analysis tab allows you to explore feature distributions across cohorts and inspect individual predictions. It is particularly useful for surfacing data quality issues, class imbalance, and patterns in misclassified cases.

Income Distribution
Figure 6: Income distribution across index-based data segments
Figure 6: Income distribution across index-based data segments

The income feature shows a remarkably consistent distribution across all five data segments. The median sits around 0.8 with an IQR of approximately 0.4–0.8. This uniformity is a positive signal — no significant temporal drift, and the model is not learning from a skewed income representation in any segment.

Velocity_6h Distribution
Figure 7: velocity_6h distribution showing consistent spread with outliers
Figure 7: velocity_6h distribution showing consistent spread with outliers

The velocity_6h feature captures applications submitted from the same source in the preceding 6 hours. The median sits around 5,000–7,500 with outliers reaching 15,000+. These extreme outliers are a classic fraud signal — representing coordinated or automated application bursts.

Incorrect Predictions — Table View
Figure 8: Table view of incorrect predictions with TrueY and PredictedY exposed
Figure 8: Table view of incorrect predictions with TrueY and PredictedY exposed

The table view filters to incorrect predictions (780 cases). Every visible row shows TrueY = Not Fraud, PredictedY = Fraud — these are false positives. More importantly, every visible row has prev_address_months_count = -1, encoding missing or unavailable previous address history.

💡 Data Quality Insight

The systematic presence of prev_address_months_count = -1 across false positives strongly suggests the model has learned to associate missing address history with fraud risk. While statistically valid, this could disproportionately penalise young applicants, recent immigrants, or people who have recently relocated — raising fairness concerns.


Feature Importances — What Is Driving Predictions?

The Feature Importances tab provides both global (aggregate) and local (individual) explanations powered by SHAP values computed by the interpret-community library.

Global Feature Importance
Figure 9: Aggregate Feature Importance — Top 4 model-wide influential features for the Fraud class
Figure 9: Aggregate Feature Importance — Top 4 model-wide influential features for the Fraud class

Globally, the model relies on four features: device_os (~0.43), housing_status (~0.40), phone_home_valid (~0.37), and prev_address_months_count (~0.36). The tight clustering of scores indicates the model is not over-reliant on any single predictor. The prominence of device_os is notable: in fraud detection, the OS is a proxy for whether an automated or spoofed device was used.

Individual Feature Importance — Absolute Values
Figure 10: Local Feature Importance (absolute) — 5 selected false positive instances
Figure 10: Local Feature Importance (absolute) — 5 selected false positive instances
Figure 11: Local Feature Importance — housing_status dominates for selected instances
Figure 11: Local Feature Importance — housing_status dominates for selected instances

Selecting five misclassified instances reveals a striking pattern: housing_status is the single most important feature for nearly all selected rows, with importance scores around 1.0–1.2. This is directly consistent with the Error Tree — BA housing status is the first and most powerful split, and local explanations confirm the model disproportionately penalises this housing category at the individual level.

Signed Feature Importance — Class: Fraud
Figure 12: Signed Local Feature Importance — negative values push toward Not Fraud, positive toward Fraud
Figure 12: Signed Local Feature Importance — negative values push toward Not Fraud, positive toward Fraud

The signed importance view is the most interpretively rich chart. Negative SHAP values push toward Not Fraud; positive values push toward Fraud. The tooltip reveals that for Rows 43, 50, 59, and 123, housing_status SHAP values are strongly negative (around –1.0 to –1.17) — meaning housing_status is actually pulling these predictions away from fraud — yet the model still predicted Fraud.

⚠ Interpretive Tension

For most selected false positives, housing_status pushes the model away from a fraud prediction in signed terms, yet these cases are still classified as Fraud. Other features not shown in the top-4 view are overriding the housing status signal. Expanding the slider to show more features will reveal which combination is tipping the balance.


Counterfactuals — What Would Change the Outcome?

Counterfactual analysis answers one of the most practically useful questions in ML: What is the minimum change to this applicant’s profile that would flip the model’s prediction? This is the What-If engine of the RAI Dashboard, powered by the DiCE (Diverse Counterfactual Explanations) library.

Probability Scatter — Index 2 (Current Class: Not Fraud)
Figure 13: Counterfactual scatter — probability distribution for Index 2 (Not Fraud)
Figure 13: Counterfactual scatter — probability distribution for Index 2 (Not Fraud)

The scatter plot shows the distribution of predicted probabilities for Index 2 — a case currently classified as Not Fraud. Each horizontal band represents a different feature level (income on y-axis). The goal of counterfactual analysis is to find the smallest perturbation that moves this data point into the Fraud prediction region.

Top Features to Perturb
Figure 14: Features ranked by frequency of appearance in counterfactual examples that flip prediction
Figure 14: Features ranked by frequency of appearance in counterfactual examples that flip prediction

device_os leads at approximately 70%, followed by housing_status and proposed_credit_limit at ~65%. Notably, several features including phone_home_valid, phone_mobile_valid, has_other_cards, source, and device_fraud_count have a bar height of zero — no counterfactual path required changing these features, confirming they provide no leverage at the decision boundary.

Counterfactual Examples Table
Figure 15: 10 counterfactual profiles that would flip Row 2 from Not Fraud to Fraud
Figure 15: 10 counterfactual profiles that would flip Row 2 from Not Fraud to Fraud

The reference datapoint (Row 2): device_os = linux, housing_status = BC, proposed_credit_limit = 200, name_email_similarity = 0.664, credit_risk_score = 113, velocity_6h = 9,406. All 10 counterfactual examples flip this to Fraud through consistent patterns: changing device_os to windows or other, shifting housing_status to BA, BE, or BF, dramatically increasing proposed_credit_limit to 848–1,897, lowering name_email_similarity (name/email mismatch), and reducing velocity_6h to 2,000–5,800.

FeatureReference (Row 2)Common Counterfactual Pattern
device_oslinuxwindows, other, x11
housing_statusBCBA, BE, BF
proposed_credit_limit200848–1,897 (much higher)
name_email_similarity0.6640.088–0.391 (lower)
credit_risk_score113100–348
velocity_6h9,4062,056–5,812 (lower)
✅ Business Interpretation

Counterfactuals give risk teams an actionable language for fraud signals. A legitimate applicant on a linux device applying for a modest credit limit with a matching name and email is low risk. The same applicant on an unrecognised device, applying for a much higher limit, with mismatched contact details and a BA housing status, crosses into the model’s fraud territory. These are concrete, auditable thresholds — not a black box.


Beyond This Dashboard: RAI Components Not Covered
Fairness Assessment

The rai_tabular_fairness component evaluates model performance across sensitive demographic groups — measuring demographic parity, equalised odds, and disparate impact. Given that prev_address_months_count = -1 is prevalent in false positives, a Fairness Assessment would quantify whether this disproportionately affects specific cohorts such as recent immigrants or young applicants.

Causal Analysis

The rai_tabular_causal component estimates the causal effect of feature changes on outcomes. While Feature Importances tells you which features are associated with fraud predictions, Causal Analysis answers whether changing a feature would actually cause a different outcome — a distinction critical for policy decisions.

Score Card

The rai_tabular_score_card generates a structured PDF report summarising model performance, feature importances, and fairness metrics suitable for stakeholder communication and regulatory documentation. In this walkthrough it did not render due to the environment compatibility issue.


Paving the Way: RAI Impact Assessment & Microsoft’s RAI Standard
Microsoft’s Responsible AI Impact Assessment

Before deploying any AI system, Microsoft’s RAI Impact Assessment guide recommends a structured evaluation of potential harms and benefits — identifying affected stakeholders, mapping failure modes, and assessing severity for vulnerable populations. The insights from the RAI Dashboard feed directly into this assessment. The 64.8% error rate for the BA cohort is a concrete, quantified starting point for that conversation.

Microsoft’s Responsible AI Standard

Microsoft’s Responsible AI Standard defines six principles — Fairness, Reliability & Safety, Privacy & Security, Inclusiveness, Transparency, and Accountability — and maps them to concrete engineering requirements. The RAI Dashboard directly operationalises several: Error Analysis supports Reliability, Explainability supports Transparency, Fairness Assessment supports Fairness, and Counterfactuals support Accountability.

📖 Coming Next

A future post in this series will walk through completing a full Responsible AI Impact Assessment for the bank account fraud detection model — using the findings from this dashboard as the evidence base. We will also explore how the RAI Standard’s requirements map to specific dashboard outputs and what remediation looks like in practice.

Responsible AI is not a checkbox — it is an ongoing practice of interrogating your model’s behaviour from multiple angles. For the bank account fraud detection model evaluated here, the dashboard has surfaced four concrete, actionable findings: a critical failure cohort (BA, no cards, long address tenure); a false alarm pattern linked to missing address history; housing status and device OS as dominant prediction drivers; and a clear counterfactual signature for what tips a profile from safe to suspicious. These are not abstract metrics — they are the starting point for a more reliable, more transparent, and more equitable model.

Leave a Reply

Your email address will not be published. Required fields are marked *