Authors:
(1) Sanchit Sinha, University of Virginia (sanchit@virginia.edu);
(2) Guangzhi Xiong, University of Virginia (hhu4zu@virginia.edu);
(3) Aidong Zhang, University of Virginia (aidong@virginia.edu).
Table of Links
3 Methodology and 3.1 Representative Concept Extraction
3.2 Self-supervised Contrastive Concept Learning
3.3 Prototype-based Concept Grounding
3.4 End-to-end Composite Training
4 Experiments and 4.1 Datasets and Networks
4.3 Evaluation Metrics and 4.4 Generalization Results
4.5 Concept Fidelity and 4.6 Qualitative Visualization
Abstract
With the wide proliferation of Deep Neural Networks in high-stakes applications, there is a growing demand for explainability behind their decision-making process. Concept learning models attempt to learn high-level 'concepts', abstract entities that align with human understanding, and thus provide interpretability to DNN architectures. However, in this paper, we demonstrate that present SOTA concept learning approaches suffer from two major problems: a lack of concept fidelity, wherein models fail to learn consistent concepts among similar classes, and limited concept interoperability, wherein models fail to generalize learned concepts to new domains for the same task. Keeping these in mind, we propose a novel self-explaining architecture for concept learning across domains which (i) incorporates a new concept saliency network for representative concept selection, (ii) utilizes contrastive learning to capture representative domain-invariant concepts, and (iii) uses a novel prototype-based concept grounding regularization to improve concept alignment across domains. We demonstrate the efficacy of our proposed approach over current SOTA concept learning approaches on four widely used real-world datasets. Empirical results show that our method improves both concept fidelity, measured through concept overlap, and concept interoperability, measured through domain adaptation performance.
1 Introduction
Deep Neural Networks (DNNs) have revolutionized a variety of human endeavors from vision to language domains. Increasingly complex architectures provide state-of-the-art performance which, in some cases, has surpassed even human-level performance. Even though these methods have incredible potential for saving valuable man-hours and minimizing inadvertent human mistakes, their adoption has been met with rightful skepticism and extreme circumspection in critical applications like medical diagnosis [Liu et al., 2021; Aggarwal et al., 2021], credit risk analysis [Szepannek and Lübke, 2021], etc.
With the recent surge in interest in Artificial General Intelligence (AGI) through DNNs, the broad discussion around the lack of rationale behind DNN predictions and their opaque decision-making process has made them notoriously black-box in nature [Rudin, 2019; Varoquaux and Cheplygina, 2022; D’Amour et al., 2020; Weller, 2019]. In extreme cases, this can lead to a lack of alignment between the designer’s intended behavior and the model’s actual behavior. For example, a model designed to analyze and predict creditworthiness might rely on features that should not play a role in the decision, such as race or gender [Bracke et al., 2019]. This, in turn, reduces the trustworthiness and reliability of model predictions (even when they are correct), which defeats the purpose of their use in critical applications [Hutchinson and Mitchell, 2019; Raji et al., 2020].
In an ideal world, DNNs would be inherently explainable through their inductive biases, since they should be designed with stakeholders' needs in mind. However, this expectation is gradually relaxed as data complexity grows, which in turn drives up the architectural complexity of the DNNs needed to fit the data. Several approaches to interpreting DNNs have been proposed. Some assign relative importance scores to input features, e.g., LIME [Ribeiro et al., 2016] and Integrated Gradients [Sundararajan et al., 2017]. Others rank training samples by their importance to a prediction, e.g., influence functions [Koh and Liang, 2017] and Data Shapley [Ghorbani and Zou, 2019].
However, the aforementioned methods only provide post-hoc explanations; to provide true interpretability, a more inherently interpretable approach is required. Recently, multiple concept-based models have been proposed that incorporate concepts during model training [Kim et al., 2018; Zhou et al., 2018]. It is believed that explaining model predictions through abstract, human-understandable “concepts” better aligns the model’s internal workings with the human thought process. Concepts can be thought of as abstract entities shared across multiple samples, providing a general understanding of the model. The general approach to training such models is to first map inputs to a concept space. Alignment with the concepts is then performed in that space, and a separate model is learned on the concept space to perform the downstream task, as sketched below.
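For concreteness, the following is a minimal PyTorch sketch of this generic input-to-concept-to-prediction pipeline. The module names, layer sizes, and number of concepts are illustrative assumptions, not the implementation of any specific model from the literature.

```python
import torch
import torch.nn as nn

class ConceptPipeline(nn.Module):
    """Generic concept-based predictor: input -> concepts -> task output."""
    def __init__(self, in_dim=784, n_concepts=5, n_classes=10):
        super().__init__()
        # Concept encoder: maps raw inputs to a low-dimensional concept space.
        self.concept_encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, n_concepts)
        )
        # Task head: operates only on the concept activations.
        self.task_head = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        concepts = self.concept_encoder(x)   # concept activations
        logits = self.task_head(concepts)    # downstream prediction
        return logits, concepts
```

In such a design, the task head never sees the raw input, so a prediction can be explained entirely in terms of the concept activations it consumed.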
The ideal way to extract concepts from a dataset would be to manually curate and define the concepts that best align with the requirements of stakeholders/end-users using extensive domain knowledge. This approach requires manual annotation of datasets and restricts models to extracting and encoding only the pre-defined concepts, as Concept Bottleneck Models [Koh et al., 2020; Zaeem and Komeili, 2021] do. However, with increasing dataset sizes, it becomes difficult to manually annotate each data sample, limiting the efficiency and practicality of such approaches [Yuksekgonul et al., 2022].
As a result, many approaches incorporate unsupervised concept discovery into concept-based prediction models. One such architecture is the Self-Explaining Neural Network (SENN) proposed in [Alvarez-Melis and Jaakkola, 2018]. Concepts are extracted using a bottleneck architecture, and relevance scores weighing each concept are computed in tandem by a standard feedforward network. The concepts and relevance scores are then combined by an aggregation network to perform the downstream task (e.g., classification). Even though such concept-based explanations offer a clearer view of neural decision-making, concept-based approaches are not without their faults. One critical problem we observed is that concepts learned across multiple domains using concept-based models are not consistent among samples from the same class, implying low concept fidelity. In addition, learned concepts fail to generalize to new domains, implying a lack of concept interoperability.
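To make the SENN structure described above concrete, here is a simplified sketch assuming the commonly used linear aggregation, where each class score is a relevance-weighted sum of concept activations. The full SENN also includes a reconstruction decoder and a robustness regularizer on the relevances, which are omitted here; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SENNSketch(nn.Module):
    """Simplified SENN-style model: class score = sum_i theta_i(x) * h_i(x)."""
    def __init__(self, in_dim=784, n_concepts=5, n_classes=10):
        super().__init__()
        # Bottleneck concept encoder h(x): extracts concept activations.
        self.h = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                               nn.Linear(128, n_concepts))
        # Relevance network theta(x): one relevance score per concept and class.
        self.theta = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                   nn.Linear(128, n_concepts * n_classes))
        self.n_concepts, self.n_classes = n_concepts, n_classes

    def forward(self, x):
        concepts = self.h(x)                                       # (B, C)
        relevances = self.theta(x).view(-1, self.n_classes,
                                        self.n_concepts)           # (B, K, C)
        # Aggregate: relevance-weighted sum of concepts per class.
        logits = torch.einsum('bkc,bc->bk', relevances, concepts)  # (B, K)
        return logits, concepts, relevances
```

Because the prediction is a weighted sum, each concept's contribution to a class score is simply its activation times its relevance, which is what makes the explanation self-contained.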
In this paper, we propose a concept-learning framework focused on generalizable concept learning, which improves concept interoperability across domains while maintaining high concept fidelity. First, we propose a salient concept selection network that enforces representative concept extraction. Second, our framework utilizes self-supervised contrastive learning to learn domain-invariant concepts for better interoperability. Lastly, we utilize a prototype-based concept grounding regularization to minimize concept shift across domains. Our novel methodology not only improves concept fidelity but also achieves superior concept interoperability, demonstrated through improved domain adaptation performance compared to SOTA self-explainable concept learning approaches. Our contributions are as follows: (1) We analyze current SOTA self-explainable approaches for concept interoperability and concept fidelity when trained across domains, problems that have not been studied in detail by recent works. (2) We propose a novel framework that utilizes a salient concept selection network to extract representative concepts and a self-supervised contrastive learning paradigm to enforce domain invariance among learned concepts. (3) We propose a prototype-based concept grounding regularizer to mitigate the problem of concept shift across domains. (4) Our evaluation methodology is the first to quantitatively evaluate the domain adaptation performance of self-explainable architectures and comprehensively compare existing SOTA self-explainable approaches.
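To illustrate the general flavor of a prototype-based grounding term (the paper's exact formulation appears in Section 3.3 and may differ), the hypothetical sketch below pulls each sample's concept vector toward a running per-class prototype. The function name, the EMA update, and the MSE penalty are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_grounding_loss(concepts, labels, prototypes, momentum=0.9):
    """Hypothetical grounding penalty: keep concept vectors close to a
    per-class prototype maintained as an exponential moving average.

    concepts:   (B, C) concept activations for the batch
    labels:     (B,)   class indices
    prototypes: (num_classes, C) tensor of stored prototypes, updated in place
    """
    loss = concepts.new_zeros(())
    classes = labels.unique()
    for c in classes:
        mask = labels == c
        class_mean = concepts[mask].mean(dim=0)
        # Update the stored prototype without backpropagating through it.
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * class_mean.detach()
        # Penalize deviation of this batch's concepts from the (fixed) prototype.
        loss = loss + F.mse_loss(concepts[mask],
                                 prototypes[c].detach().expand_as(concepts[mask]))
    return loss / len(classes)
```

A term of this kind would be added to the task loss (and, in our framework, to the contrastive objective) so that concept representations for the same class stay anchored as the input domain shifts.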
This paper is available on arxiv under CC BY 4.0 DEED license.