Skip to content

Introduction, selected papers and possible corresponding codes in our review paper "A Survey of Neural Trees"

License

Notifications You must be signed in to change notification settings

zju-vipa/awesome-neural-trees

Repository files navigation

Awesome Neural Tree Papers :art:

Selected papers and possible corresponding codes in our review paper "A Survey of Neural Trees" [arXiv Version]

If you find there is a missed paper or a possible mistake in our survey, please feel free to email me or pull a request here. I am more than glad to receive your advice. Thanks!

Introduction

Neural trees (NTs) refer to a school of methods that combine neural networks (NNs) and decision trees (DTs), for which we present a comprehensive review in this survey. Our keynote is to identify how these approaches enhance the model interpretability and suggest possible solutions to the remaining challenges. Besides, we provide a discussion about other considerations like conditional computation and promising directions towards this field, in hope to advance the practice of NTs.

News!

New work of NTs and possible changes in this survey will be notified here.

  • Add "ProGReST: Prototypical Graph Regression Soft Trees for Molecular Property Prediction" [Paper], an interesting work that combines prototype learning, soft decision trees and Graph Neural Networks for compound property prediction.
  • Add "Large-Scale Category Structure Aware Image Categorization" [Paper] & "Leveraging Class Hierarchies with Metric-Guided Prototype Learning" [Paper], two methods incorporating the class hierarchy into the regulariser of the general objective function in section 3. Although a regulariser is certainly feasible in utilize the class hierarchy, it is curiously rare in practice and these two methods are the only two examples we know of.
  • Add "Distillation Decision Tree" [Paper], which interprets the black-box model by distilling its knowledge into decision trees, belonging to section 2.2.

Citation

If you find this survey useful for your research, please consider citing

@article{li2022survey,
  title={A Survey of Neural Trees},
  author={Li, Haoling and Song, Jie and Xue, Mengqi and Zhang, Haofei and Ye, Jingwen and Cheng, Lechao and Song, Mingli},
  journal={arXiv preprint arXiv:2209.03415},
  year={2022}
}

Thanks!

A Taxonomy of Current Methods

In this survey, we present a thorough taxonomy for NTs that expresses the gradual integration and co-evolution of NNs and DTs.


1. Non-hybrid: NNs and DTs Cooperated Approaches.

These approaches are the very first to combine NNs and DTs. They employ DTs as auxiliaries for NNs as well as using NNs as tools to improve the design of DTs. In these approaches, "neural" and "tree" are separated, i.e., one of NN and DT is assigned to accomplish a specific task, while the other performs as its assistant or interpreter. It can date back to the early 1990s when DTs were supposed to provide structural priors for NNs or extract rules from a trained NN. They are perceived implicit combinations of NNs and DTs, because NNs and DTs still operate on their own paradigms and no hybrid model is produced.

1.1 DTs as Structural Priors of NNs.

Adopt DTs to approximate the target concept of NNs in terms of logical descriptions, i.e., use DTs to design NNs with structural priors.

  • Entropy net: the first practice
    • "Entropy nets: from decision trees to neural networks", Proceedings of the IEEE, 1990
    • "Layered neural net design through decision trees", ISCAS, 1990
    • "Comparison between entropy net and decision tree classifiers", IJCNN, 1990

An example for the original entropy net:


  • ID3-based algorithm that converts DTs into hidden layers:

    • "A machine learning method for generation of a neural network architecture: a continuous ID3 algorithm", IEEE Transactions on Neural Networks, 1992
  • Soft thresholding of entropy net:

    • "Neural implementation of tree classifiers", IEEE Transactions on Systems, Man, and Cybernetics, 1995
  • Oblique trees version of entropy net:

    • "A mapping from linear tree classifiers to neural net classifiers", ICNN, 1994
    • "Fast Training Algorithms for Multi-layer Neural Nets", IEEE Transactions on Neural Networks, 1991
  • A knowledge acquisition system that concerns about the knowledge form DTs and NNs prefer:

    • "Implementation and refinement of decision trees using neural networks for hybrid knowledge acquisition", Artificial Intelligence in Engineering, 1995
      • Katsuhiko Tsujino, Shogo Nishida
      • [Paper]
  • Create a disjunctive normal form formula for each class and reformulate the first layer of entropy net:

    • "Initializing neural networks using decision trees", published by Rutgers University, 1990
    • "Initialization of neural networks by means of decision trees", Knowledge-Based Systems, 1995
      • Irena Ivanova, Miroslav Kubat
      • [Paper]
    • "On mapping decision trees and neural networks", Knowledge-Based Systems, 1999

1.2 DTs for Approximating NNs.

Apply DTs to interpret trained NNs by approximating the network (mostly in terms of input-output mapping) and mimic the decision boundaries implicitly learned by the hidden layers.

1.2.1 Pedagogical Techniques.

Treat NNs as black-box approaches, and extract the input-output rule directly form NNs regardless of its intermediate layers.

A description of pedagogical techniques:


  • TREPAN: the first practice of DT-based interpretation

    • "Extracting Tree-Structured Representations of Trained Networks", NeurIPS, 1995
      • Mark Craven, Jude Shavlik
      • [Paper]
    • "Using Sampling and Queries to Extract Rules from Trained Neural Networks", Machine Learning Proceedings, 1994
      • Mark Craven, Jude Shavlik
      • [Paper]
  • Remove insignificant input neurons from the trained network before inducing the DT

    • "Decision Tree Extraction using Trained Neural Network", SMARTGREENS, 2020
  • Turn the 𝑚-of-𝑛 type splits into standard splitting tests such as C4.5

    • "Decision Tree Extraction from Trained Neural Networks", AAAI, 2004
      • Darren Dancey et al. (CART tree)
      • [Paper]
    • "Converting a trained neural network to a decision tree DecText - decision tree extractor", Dissertation, 2000
      • Olcay Boz (propose discretization algorithms to handle continuous features)
      • [Paper]
    • "Extracting decision trees from trained neural networks", Pattern Recognition, 2020
    • "NeC4.5: neural ensemble based C4.5", IEEE Transactions on Knowledge and Data Engineering, 2004
      • Zhi-Hua Zhou, Yuan Jiang (use NN ensembles)
      • [Paper]
    • "Extracting decision trees from trained neural networks", Pattern Recognition, 1999
      • R.Krishnan, et al. (adapt genetic algorithms to generate artificial examples)
      • [Paper]
  • generalize the TREPAN algorithm

    • "An investigation of TREPAN utilising a continuous oracle model", Data Analysis Techniques and Strategies, 2011
      • William A Young et al. (develop DTs derived from continuous-based models)
      • [Paper]
    • "Trepan-Plus: An Extension of a Decision Tree Extraction Algorithm Utilizing Artificial Neural Networks", Intelligent Engineering Systems Through Artificial Neural Networks, 2007
      • M Rangwala et al. (extend TREPAN to multi-class regression problems)
      • [Paper]
    • "Extracting fuzzy symbolic representation from artificial neural networks", NAFIPS, 1999
      • Maciej Faifer et al. (use fuzzy representation during the tree-induction process)
      • [Paper]
    • "DT+GNN: A Fully Explainable Graph Neural Network using Decision Trees", arXiv, 2022
      • Peter Müller et al. (incorporate DTs into graph neural networks)
      • [Paper]
  • NNs perform significance analysis to select attributes

    • "ANN-DT: an algorithm for extraction of decision trees from artificial neural networks", IEEE Transactions on Neural Networks, 1999
1.2.2 Decompositional Techniques.

Decompositional techniques concern units in the hidden layers. They extract rules from the trained NN at the level of individual neurons, thus gaining insight into their inner structures.

  • Extract axis-aligned trees

    • "Rule extraction from neural networks via decision tree induction", IJCNN, 2001
      • Makoto Sato, Hiroshi Tsukimoto (CRED algorithm that deals with NNs with one hidden layer)
      • [Paper]
    • "DeepRED – Rule Extraction from Deep Neural Networks", International Conference on Discovery Science, 2016
      • Jan Ruben Zilke, Eneldo Loza Mencía, Frederik Janssen (Extend CRED algorithm by deriving intermediate rules for every additional hidden layer)
      • [Paper]
    • "Eclectic rule extraction from Neural Networks using aggregated Decision Trees", ICECE, 2012
      • MD. Ridwan Al Iqbal (Simplify rules by logical simplification)
      • [Paper]
  • Extract oblique and multivariate trees

    • "NeuroLinear: From neural networks to oblique decision rules", Neurocomputing, 1997
      • Rudy Setiono, Huan Liu (oblique trees)
      • [Paper]
    • "Towards Interpretable ANNs: An Exact Transformation to Multi-Class Multivariate Decision Trees", 2020
      • Duy T Nguyen, Kathryn E Kasmarik, Hussein A Abbass (multivariate trees)
      • [Paper]
1.2.3 Eclectic Techniques.

Typically, eclectic techniques model the relationship between the last hidden layer and the outputs, then infer the input-output relationship from the magnitudes of the weights in a NN. However, few of them are based on DTs. Therefore, this survey comprises approaches that extract partial knowledge contained in the hidden layers and do not demand their approximation target is the input-output relationship

  • Extended C-Net algorithm

    • "Towards Interpretable ANNs: An Exact Transformation to Multi-Class Multivariate Decision Trees", arXiv, 2020
      • Duy T Nguyen, Kathryn E Kasmarik, Hussein A Abbass (a typical eclectic technique)
      • [Paper]
  • Generalized eclectic techniques

    • "Global Model Interpretation Via Recursive Partitioning", HPCC/SmartCity/DSS, 2018
      • Chengliang Yang, Anand Rangarajan, Sanjay Ranka (CART tree learned from the contribution matrix and applied to scene understanding tasks)
      • [Paper] [Code]
    • "Interpreting CNNs via Decision Trees", CVPR, 2019
      • Quanshi Zhang et al. (encodes all potential decision modes of the CNN in a coarse-to-fine manner)
      • [Paper]
1.2.4 Optional: DTs for Regularizing NNs.

These approaches train NNs that resemble compact DTs through crafted regularization terms, so that they can be well-approximated by small DTs.

  • Tree regularization

    • "Beyond Sparsity: Tree Regularization of Deep Models for Interpretability", AAAI, 2018
      • Mike Wu et al. (a complexity penalty function that aims to optimize the deep model for interpretability and human-simulatability)
      • [Paper] [Code]
    • "Regional Tree Regularization for Interpretability in Deep Neural Networks", AAAI, 2020
      • Mike Wu et al. (extended work of tree regularization that encourage a deep model to be approximated by several separate DTs specific to pre-defined regions of the input space)
      • [Paper]
  • L1-orthogonal regularization

    • "Enhancing Decision Tree Based Interpretation of Deep Neural Networks through L1-Orthogonal Regularization", ICMLA, 2019
      • Nina Schaaf, Marco Huber, Johannes Maucher (so that they do not need a surrogate network to estimate the tree loss)
      • [Paper]

1.3 NNs for Designing DTs.

NNs specialized for designing more reasonable DTs and do not benefit from it, which is rare in practice.

  • Induce DTs from more valid data filtered by the trained NNs
    • "An integrated approach of neural network and decision tree to classification", ICMLC, 2005
      • Xiao-Ye Wang, Xiu-Xia Liang, Ji-Zhou Sun
      • [Paper]

2. Semi-hybrid: NNs Leveraging Class Hierarchies.

DT algorithms have two key ideas: "decision" and "tree". We view that "decision" is decision branches implemented by routing functions, and "tree" is the model topology that identifies a class hierarchy. This survey tries to include NNs that draw on a part of ideas from DTs, either the class hierarchy or the decision branches, but not both. These approaches are considered to be half and implicit integration of NNs and DTs. They borrow some inherent ideas from DTs for NNs, instead of designing a NN that works like a DT. However, we only concern about the former in our taxonomy, i.e., NNs leveraging class hierarchies. These approaches can not implement the class hierarchy in the network structure due to the absence of decision branches, so they resort to incorporating the hierarchical relations into a NN directly. There are three lines of research: hierarchical architecture, hierarchical loss function and label embedding based methods, corresponding to different strategies of employing class hierarchies.

Overview of NNs exploiting class hierarchies:


Functions and notations in the figure are adopted from Bertinetto et al.:

2.1 Hierarchical Architecture.

These methods attempt to incorporate the class hierarchy into the architecture of the classifier, so that the networks are designed to be branched and each branch is tasked to identify the concept abstraction at one specific level of the class hierarchy.

  • Hierarchical multi-label classification using local neural networks

    • "An integrated approach of neural network and decision tree to classification", ICMLC, 2014
      • Ricardo Cerri, Rodrigo C Barros, André CPLF De Carvalho
      • [Paper]
    • "Reduction strategies for hierarchical multi-label classification in protein function prediction", BMC Bioinformatics, 2016
    • "HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition", ICCV, 2015
    • "Your "Flamingo" is My "Bird": Fine-Grained, or Not", CVPR, 2021
      • Dongliang Chang et al. (disentangle coarse and fine features)
      • [Paper] [Code]
  • Branch the network after a shared trunk

    • "Learning to Make Better Mistakes: Semantics-aware Visual Food Recognition", ACM international conference on Multimedia, 2016
      • Hui Wu et al. (a single network backbone shared by multiple fully-connected layers)
      • [Paper]
    • "Do Convolutional Neural Networks Learn Class Hierarchy?", TVCG, 2017
      • Alsallakh Bilal et al. (deep CNN with branches at intermediate layers to fit the coarser-grained labels)
      • [Paper]
    • "Fine-Grained Representation Learning and Recognition by Exploiting Hierarchical Semantic Embedding", ACM international conference on Multimedia, 2018
      • Tianshui Chen et al. (introduce an attention mechanism to incorporate the coarse-grained results for learning finer-grained features)
      • [Paper]
  • Interactions between different branches

    • "Label Hierarchy Transition: Modeling Class Hierarchies to Enhance Deep Classifiers", arXiv, 2021
      • Renzhen Wang, et al (propose label hierarchy transition matrices whose column vectors represent the conditional label distributions of classes between two adjacent hierarchies)
      • [Paper]
    • "Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification", CVPR, 2022
      • Jingzhou Chen et al. (propose the hierarchical residual network in which granularity-specific features from parent levels are added to features in children levels.)
      • [Paper] [Code]

2.2 Hierarchical Loss Function.

Incorporate the class hierarchy into loss functions, which exploits the underlying hierarchical relationships and produce predictions coherent with the pre-defined class hierarchy.

  • Predict at each grain separately, i.e., first take a model pre-trained on one level, then tune it using labels from other levels

    • "Coherent Hierarchical Multi-Label Classification Networks", NIPS, 2020
    • "Learning Hierarchical Visual Representations in Deep Neural Networks Using Hierarchical Linguistic Labels", arXiv, 2018
  • Utilize the hierarchical constraints directly

    • "Learning hierarchical similarity metrics", CVPR, 2012
      • Nakul Verma et al. (probabilistic nearest-neighbor classification based framework)
      • [Paper]
    • "Discriminative Transfer Learning with Tree-based Priors", NIPS, 2013
      • Nitish Srivastava, Russ R Salakhutdinov (DNN with hierarchical priors over the parameters of the classification layer)
      • [Paper]
    • "Large-Scale Object Classification Using Label Relation Graphs", ECCV, 2018
      • Jia Deng et al. (encodes semantic relations into a directed acyclic graph and compute a loss defined on it)
      • [Paper] [Code]
    • "Making Better Mistakes: Leveraging Class Hierarchies With Deep Networks", CVPR, 2020
      • Luca Bertinetto et al. (incorporate class hierarchy into the cross-entropy loss)
      • [Paper]

2.3 Label Embedding.

These approaches aim to encode the class hierarchy into embeddings whose relative locations or possible interactions represent the semantic relationships.

  • DeViSE method that utilizes unannotated Wikipedia text

    • "DeViSE: A Deep Visual-Semantic Embedding Model", NIPS, 2013
  • Embeddings whose pair-wise dot products correspond to semantic similarity between classes

    • "Hierarchy-Based Image Embeddings for Semantic Image Retrieval", WACV, 2019
  • Employ the entailment cones to learn order-preserving embeddings

    • "Hierarchical Image Classification Using Entailment Cone Embeddings", CVPR, 2020
  • Label embeddings in zero-shot classification

    • "Latent Embeddings for Zero-Shot Classification", CVPR, 2016
    • "Evaluation of Output Embeddings for Fine-Grained Image Classification", CVPR, 2015

3. Hybrid: Neural Decision Trees.

Neural Decision Trees (NDTs) are hybrid NN models that implement both the class hierarchy and the decision branches. Their core idea is to exploit NNs in the tree design by making the routing functions differentiable, thus allowing gradient descent-based methods to optimize.

3.1 Neural Decision Trees with Class Hierarchies.

If a NDT satisfies this property, it means each internal node is assigned a specific and intermediate classification task. This "divide-and-conquer" strategy and stepwise inference process make the model more interpretable, because each node in the tree is responsible and distinguishable from other nodes.

3.1.1 Data-Driven NDTs.

Data-driven methods employ data-dependent heuristics to perform local optimization. The resultant tree will lead to a recursive partitioning of the input space 𝑋 through a cascade of tests and define the output of each leaf in terms of examples falling within it.

A comparison between NDTs according to whether it implements a class hierarchy and whether it is data-driven:


  • Informativeness-related splitting functions with NN-based routers

    • "Neural trees-using neural nets in a tree classifier structure", ICASSP, 1991
      • J-E Stromberg, Jalel Zrida, Alf Isaksson (train a small NN at each internal node, which singles out one unique class that gains the most class purity)
      • [Paper]
    • "Evolutionary design of neural network tree-integration of decision tree, neural network and GA", CEC, 2001
      • Qiangfu Zhao (use genetic algorithms to maximize the information gain ratio at each internal node)
      • [Paper]
    • "Hybrid decision tree", Knowledge-Based Systems, 2002
      • Zhi-Hua Zhou, Zhao-Qian Chen (Divide the instance space into ordered attributes and unordered attributes)
      • [Paper]
    • "Classification trees with neural network feature extraction", CVPR, 1992
      • Heng Guo, Saul B Gelfand (find two clusters that minimizes a Gini impurity criterion, then find a good split through back-propagation)
      • [Paper]
  • Decrease the classification error at each node to be extended

    • "Growing and pruning neural tree networks", IEEE Transactions on Computers, 1993
      • A Sakar et al. (Divide the instance space into ordered attributes and unordered attributes)
      • [Paper]
  • Grows during competitive learning

    • "Competitive neural trees for pattern classification", IEEE Transactions on Neural Networks, 1998
      • Sven Behnke, Nicolaos B Karayiannis
      • [Paper]
  • Map examples into the label embedding space and predicts using a NDT

    • "Label Embedding Trees for Large Multi-Class Tasks", NIPS, 2010
      • Samy Bengio, Jason Weston, David Grangier
      • [Paper]
    • "Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition", NIPS, 2011
      • Jia Deng, Sanjeev Satheesh, Alexander Berg, Fei Li
      • [Paper]
  • Apply data-driven architectures to fuzzy NDTs

    • "Globally optimal fuzzy decision trees for classification and regression", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999
      • Alberto Suárez, James F Lutsko (superimpose a fuzzy structure over the skeleton of a CART tree and introduce a global optimization algorithm)
      • [Paper]
    • "Budding Trees", ICPR, 2014
      • Ozan Irsoy, Olcay Taner Yildiz, Ethem Alpaydin (a FDT model that can be dynamically adjusted)
      • [Paper] [Code]
    • "Convolutional Decision Trees for Feature Learning and Segmentation", GCPR, 2014
    • Dmitry Laptev, Joachim M Buhmann (applies fuzzy NDTs to image segmentation by extracting the most informative and interpretable features)
    • [Paper]
    • "Neural Decision Trees", arXiv, 2017
      • Randall Balestriero (map similar inputs to the same hash value)
      • [Paper]
    • "Neuro-fuzzy decision trees", Neural Systems, 2006
      • Rajen B Bhatt, M Gopal (employs fuzzy ID3 algorithm for DT-renovation)
      • [Paper]
  • Incremental learning

    • "Hybrid decision tree", Knowledge-Based Systems, 2022
      • Zhi-Hua Zhou, Zhao-Qian Chen (two example-incremental tasks, one hypothesis-driven constructive induction mechanism)
      • [Paper]
    • "Soft decision trees", ICPR, 2012
      • Ozan Irsoy, Olcay Taner Yıldız, Ethem Alpaydın (sigmoid-based FDT whose splits are made by checking if there is an improvement over the validation set)
      • [Paper]
    • "A Neural Tree with Partial Incremental Learning Capability", ICMLC, 2007
      • Mu-Chun Su, Hsu-Hsun Lo (choose a target class at each internal node and train a small NN to separate the positive patterns from negative ones)
      • [Paper]
3.1.2 Bigot NDTs.

A bigot NDT tends to have a pre-defined structure and determine leaves (pure classes or fixed class distributions) by priors or algorithms such that induce the class hierarchy.

  • Bigot NDTs with pure leaves

    • "Structure-driven induction of decision tree classifiers through neural learning", Pattern Recognition, 1997
      • Ishwar K Sethi et al. (randomly initialized leaf classes)
      • [Paper]
    • "NBDT: Neural-Backed Decision Trees", ICCV, 2020
      • Alvin Wan et al. (assign a specific concept to each terminal and intermediate node by WordNet)
      • [Paper] [Code]
  • Bigot NDTs with determined leaf distributions

    • "Distilling a Neural Network Into a Soft Decision Tree", AI*IA, 2017
      • Nicholas Frosst, Geoffrey Hinton (jointly optimize leaves and other parameters via back-propagation)
      • [Paper] [Code]
    • "Policy-gradient Methods for Decision Trees", ESANN, 2016
      • Aurélia Léon, Ludovic Denoyer (use Monte Carlo approximation to expect the gradient of the objective function)
      • [Paper]
    • "Deep Neural Decision Forests", ICCV, 2015
      • Peter Kontschieder et al. (propose a derivative-free strategy to solely optimize the leaf parameters, as a convex optimization problem)
      • [Paper] [Code]
    • "Deep Regression Forests for Age Estimation", CVPR, 2018
      • Wei Shen et al. (regression forests for age estimation)
      • [Paper] [Code]
    • "Neural Decision Forests for Semantic Image Labelling", CVPR, 2014
      • Samuel Rota Bulo, Peter Kontschieder (decision forests for semantic image labelling)
      • [Paper]
    • "Neural Prototype Trees for Interpretable Fine-Grained Image Recognition", CVPR, 2021
      • WMeike Nauta, Ron van Bree, Christin Seifert (prototype-based ante-hoc methods)
      • [Paper] [Code]
    • "Learn decision trees with deep visual primitives", SSRN, 2022
  • Bigot NDTs for knowledge distillation

    • "Tree-Like Decision Distillation", CVPR, 2021
      • Jie Song et al. (layer-wise dissect the decision process of a DNN)
      • [Paper]
    • "Tree-Like Branching Network for Multi-class Classification", LNNS, 2021
      • Mengqi Xue, Jie Song, Li Sun, Mingli Song (mine the underlying category relationships from a trained teacher network and determines the appropriate layers on which specialized branches grow)
      • [Paper]
    • "Distilling a Neural Network Into a Soft Decision Tree", AI*IA, 2017
      • Nicholas Frosst, Geoffrey Hinton (use a trained NN to provide soft targets for training a fuzzy NDT)
      • [Paper] [Code]
    • "TNT: An Interpretable Tree-Network-Tree Learning Framework using Knowledge Distillation", Entropy, 2020
      • Jiawei Li et al. (transfer knowledge between tree models and DNNs)
      • [Paper]
    • "KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation", arXiv, 2021

3.2 Expert NDTs (NDTs without Class Hierarchies)

NDTs without class hierarchies restrain themselves little and perform arbitrary predictions at the leaves. Their leaf nodes are usually classifiers rather than determined classes or distributions.

  • Hierarchical Mixtures of Experts (HME)

    • "Hierarchical Mixtures of Experts and the EM Algorithm", Neural computation, 1994
      • Michael I Jordan, Robert A Jacobs (the original HME, a tree-structured model for regression and classification.)
      • [Paper]
    • "Classification using hierarchical mixtures of experts", NNSP, 1994
      • Steve R Waterhouse, Anthony J Robinson (each leaf expert is non-linear and performs multi-way classification)
      • [Paper]
    • "Bayesian Hierarchical Mixtures of Experts", arXiv, 2012
      • Christopher M Bishop, Naonori Ueda, Steve Waterhouse (bayesian treatments of the HME model to prevent the severe overfitting caused by maximum likelihood)
      • [Paper]
  • Generalized HMEs in advanced frameworks

    • "Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization", CVPR, 2020
      • Ruyi Ji et al. (incorporate convolutional operations along edges and use attention transformer modules to capture discriminative features)
      • [Paper] [Code]
    • "NDT: Neual Decision Tree Towards Fully Functioned Neural Graph", arXiv, 2017
      • Han Xiao (reformulate the non-differentiable information gain in the form of Dirac symbol and approximate it as a continuous function)
      • [Paper]
    • "Decision Forests, Convolutional Networks and the Models in-Between", arXiv, 2016
      • Yani Ioannou et al. (hybrid model between decision forests and convolutional networks)
      • [Paper]
    • "Deep Neural Decision Trees", arXiv, 2018
      • Yongxin Yang, Irene Garcia Morillo, Timothy M Hospedales (bin each feature of the input instance and determine the leaf node it will arrive)
      • [Paper] [Code]
    • "ViT-NeT: Interpretable Vision Transformers with Neural Tree Decoder", ICML, 2022
      • Sangwon Kim, Jaeyeal Nam, Byoung Chul Ko (transformer version of ProtoTree with expert leaves)
      • [Paper] [Code]
  • Expert NDTs with architecture search phase

    • "Adaptive Neural Trees", ICML, 2019
      • Ryutaro Tanno et al. (greedily choosing the best option between going deeper and splitting the input space)
      • [Paper] [Code]
    • "Self-Born Wiring for Neural Trees", ICCV, 2021
      • Ying Chen et al. (NDTs that are self-born from a large search space)
      • [Paper] [Code]