Why do Decision Trees Work?

Win-Vector Blog 2017-01-11

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.

Introduction

Newcomers to data science are often disappointed to learn that the job of the data scientist isn't tweaking and inventing new machine learning algorithms.

In the “big data” world supervised learning has been a solved problem since at least 1951 (see [FixHodges1951] for neighborhood density methods, see [GordonOlshen1978] for k-nearest neighbor and decision tree methods). Some reasons this isn't as well known as one would expect include:

  • Statisticians (the writers of the mentioned references) use different terminology than computer scientists (the typical current audience). Essentially any time you see a statistician use terms like “consistent”, “efficient”, or “asymptotic” you need to check if they are talking about provably optimal estimation procedures.
  • “Big data” didn't arrive until much later, so we have to wait for [HalevyNorvigPereira2009] and many others to re-observe that supervised machine learning is easy whey you have a lot of data.
  • Prospective machine learning tinkerers do not want this to be true. They seem to hope to have the pleasure of getting paid to use their programming chops to re-implement all of machine learning from scratch in their favorite language.

Decision Trees obviously continued to improve after [GordonOlshen1978]. For example: CART's cross-validation and pruning ideas (see: [BreimanEtAl1984]). Working on the shortcomings of tree-based methods (undesirable bias, instability) led to some of the most important innovations in machine learning (bagging and boosting, for example see: [HastieTibshiraniFriedman2009]).

In [ZumelMount2014] we have a section on decision trees (section 6.3.2) but we restrict ourselves to how they work (and the consequences), how to work with them; but not why they work. The reason we did not discuss why they work is the process of data science, where practical, includes using already implemented and proven data manipulation, machine learning, and statistical methods. The “why” can be properly delegated to implementers. Delegation is part of being a data scientist, so you have to learn to trust delegation at some point.

However, we do enjoy working through the theory and exploring why different machine learning algorithms work (for example our write-up on support vector machines: how they work here Mount2011, and why they work here Mount2015).

In this note we will look at the “why” of decision trees. You may want work through a decision tree tutorial to get the “what” and “how” out of the way before reading on (example tutorial: [Moore] ).

Decision Trees

Decision trees are a type of recursive partitioning algorithm. Decision trees are built up of two types of nodes: decision nodes, and leaves. The decision tree starts with a node called the root. If the root is a leaf then the decision tree is trivial or degenerate and the same classification is made for all data. For decision nodes we examine a single variable and move to another node based on the outcome of a comparison. The recursion is repeated until we reach a leaf node. At a leaf node we return the majority value of training data routed to the leaf node as a classification decision, or return the mean-value of outcomes as a regression estimate. The theory of decision trees is presented in Section 9.2 of [HastieTibshiraniFriedman2009] (available for free online).

Figure 6.2 from Practical Data Science with R ([ZumelMount2014]) below shows a decision tree that estimates the probability of an account cancellation by testing variable values in sequence (moving down and left or down and right depending on the outcome). For true conditions we move down and left, for falsified conditions we move down and right. The leaves are labeled with the predicted probability of account cancellation. The tree is orderly and all nodes are in estimated probability units because Practical Data Science with R used a technique similar to y-aware scaling ([Zumel2016]).

Rpart

*Practical Data Science with R* Figure 6.2 Graphical representation of a decision tree

It isn't too hard to believe that a sufficiently complicated tree can memorize training data. Decision tree learning algorithms have a long history and a lot of theory in how they pick which variable to split and where to split it. The issue for us is: will the produced tree work about as well on future test or application data as it did on training data?

Fitting Training Data

One of the first things we have to convince ourselves is that decision trees can even do well on training data. Decision trees return piece-wise constant functions: so they are bad at extrapolation and need a lot of depth to model linear relations. Fitting on training data is performed through sophisticated search, scoring, and cross-validation methods that a lot of ink has been spilled writing about.

We can illustrate some of the difficulty by attempting to regress the function \(y=x\) using a decision tree in R ([RCoreTeam2016]).

Decision Tree Fit Example

library("rpart")library("ggplot2")d <- data.frame(x=1:100, y=1:100)model <- rpart(y~x, data=d)print(model)
## n= 100 ## ## node), split, n, deviance, yval##       * denotes terminal node## ##  1) root 100 83325.0 50.5  ##    2) x< 50.5 50 10412.5 25.5  ##      4) x< 25.5 25  1300.0 13.0  ##        8) x< 12.5 12   143.0  6.5 *##        9) x>=12.5 13   182.0 19.0 *##      5) x>=25.5 25  1300.0 38.0  ##       10) x< 37.5 12   143.0 31.5 *##       11) x>=37.5 13   182.0 44.0 *##    3) x>=50.5 50 10412.5 75.5  ##      6) x< 75.5 25  1300.0 63.0  ##       12) x< 62.5 12   143.0 56.5 *##       13) x>=62.5 13   182.0 69.0 *##      7) x>=75.5 25  1300.0 88.0  ##       14) x< 87.5 12   143.0 81.5 *##       15) x>=87.5 13   182.0 94.0 *
d$pred <- predict(model, newdata= d)ggplot(data=d, mapping=aes(x=pred, y=y)) +   geom_point() +   geom_abline(color='blue') +  ggtitle("actual value as a function of predicted value")

Dtle 1

Most write-ups on decision trees spend all of their time describing how (and how heroically) the decision tree is derived. It can be difficult: having too many variables can defeat simple subdivision, and useful individual variables may not be obvious to simple greedy algorithms (see for example [Mount2016]). So tree optimization is non-trivial, in fact it is NP complete, see [HyafilRivest1976].

In this write-up we are going to skip tree construction entirely. We are going to assume the training procedure is in fact quite difficult and well worth the cost of installing the relevant packages. We will concentrate on conditions, that if enforced, would ensure good out of sample model performance. The division is: the training data is the machine learning package's responsibility, and true production performance is the data scientist's responsibility.

Interactive Decision Tree Example

We will leave the detailed discussion of decision tree fitting techniques to others (it takes whole books) and also recommend the following demonstration that allows the user to interactively grow a decision tree attempting to predict who survived the Titanic sinking: [Smith2016].

Over-Fit

The sequential or recursive nature of the tree drives the potential problem. After the first node (or root) the data is conditioned by the node examinations. This potentially introduces a huge bias in that this conditioning depends on the training data, and not on future test or application data. This breaks exchangeability of training and test (or future application) data. It could be the case that even if the decision tree performs well on training data it may fail on new data. This is called “excess generalization error.” The why of decision trees is working out under what conditions we do not experience severe over-fit.

An important point to remember is that the expected excess generalization error can depend not only on the tree our tree construction algorithm picks, but also involves all of the trees the algorithm is optimizing over (or even potentially could have picked from). This is called a multiple comparison problem, and correctly estimating the significance of a reported training fit requires what is called a Bonferroni correction. Roughly if I let you pick the best tree over 1000 candidate trees I expect you to find a fairly good (in fact a chances 1 in 1000 good) tree even if there is no actual relation to fit and even if you were clever and only directly examined 10 trees to solve the optimization problem. So if I want to reliably determine if the returned tree really does represent an actual useful (and generalizable) found relation, I need to correct for how much “venue shopping” your fitting algorithm had available to itself.

Preventing Over-Fit

What strictures or conditions will guarantee we don't have over-fit (or large excess generalization error)? A naive argument might only allow trees of logarithmic depth, which are unlikely to be able to capture realistic effects even on training data.

[GordonOlshen1978] solved the problem by restricting trees to have only nodes with a non-negligible fraction of the training data (though “p-quantile cuts” and restricting to trees where all nodes have at least \(m^{5/8}\) of the \(m\) training examples). Notice this scheme does allow fairly deep trees. The arguments are correct, but not in the notation a computer scientist would use. The argument used (fast asymptotic convergence of empirical distributions) relies on Glivenko–Cantelli style continuity arguments, which are formally equivalent to the Vapnik–Chervonenkis (VC dimension) theory argument we will use.

The argument

A decision tree is actually a very concise way of representing a set of paths or conjunctions (every example that works down a decision tree path represents the “and” of all the relevant conditions). Each datum uses a single path to land in exactly one tree leaf which then determines the prediction. So if we could bound the chance that no tree leaf has large excess generalization error, then in turn no tree built from these leaves has large excess generalization error.

We will need a concentration inequality to do the heavy lifting for us. For convenience let's use Hoeffding's inequality (instead of something more detailed such as Chernoff bounds):

If \(\bar{X}\) is an average of a sample of \(k\) i.i.d. items (drawn from a larger ideal population) each of which is bounded between zero and one (such as the 0/1 indicator of being in our target classification class or not) then the probability of the observed average \(\bar{X}\) being far away from its theoretical or ideal expected value \(E[\bar{X}]\) falls exponentially fast with \(k\). In fact we can bound the probability of seeing a difference of \(t\) by:

\[P[|\bar{X} – E[\bar{X}]| \geq t] \leq 2 e^{-2 k t^2}\]

Notice there is no use of “Big-O” notation (or Bachmann–Landau notation or asymptotic notation). We can apply this bound immediately.

Suppose we have \(m\) training examples each labeled positive or negative and containing features from \(R^{n}\). Let our tree construction/training/optimization procedure (no matter how complicated it is) obey the simple law that it only considers trees with all leaf nodes containing at least \(m^a\) training examples (\(0 < a < 1\), \(a\) to be picked later).

We are going to look a bit at the nature of leaf nodes in a tree. A leaf node may be reached by a long path such as “\((x>2) \wedge (x>5) \wedge (x<7)\)". This conjunction ("and-statement") representing each leaf can be reduced or re-written as a conjunction involving each variable at most twice. This means the concepts represented by leaf-nodes of decision trees are essentially axis aligned rectangles (with some ends allowed be open, an inessential difference; for details see [Schapire2013]). This means there are no more than \((m+3)^{2 n}\) possible tree leaves derived from our training data (assuming we cut between our \(m\) data points; the ”\(+3\)“ is from us adjoining symbols for \(+\inf\), \(-\inf\), and no-comparison).

By Hoeffding's inequality the probability of a given leaf mis-estimating its prediction probability by more than \(t\) is no more than \(2 e^{-2 m^a t^2}\). We can apply the so-called "union bound” that the probability of any one of a number of bad events happening is no more than the sum of the probabilities of each bad event happening (a potential over count as this excludes the favorable possibility of bad events clumping up). So worst-case the odds of any leaf being off by more than \(t\) is no more than \(p = (m+3)^{2 n} 2 e^{-2 m^a t^2}\). If we pick \(m\) such that the bound on the probability of a given leaf being too far off (\(2 e^{-2 m^a t^2}\)) is minuscule, then even the larger probability of any possible leaf being to far off (\((m+3)^{2 n} 2 e^{-2 m^a t^2}\)) will be small. So we say: for a given pair of goals \(p\), \(t\) pick \(a\) and \(m\) large enough that \(p \ge (m+3)^{2 n} 2 e^{-2 m^a t^2}\) (that is such that the probability \(p\) we are willing to accept for failure is at least as large our bound on the probability of failure).

As ugly as it is, the bound \(p \ge (m+3)^{2 n} 2 e^{-2 m^a t^2}\) is something we can work with. Some algebra re-writes this as \(m \ge (-log(p/2) + 2 n log(m+3))^{1/a}/(2 t^2)^{1/a}\). We can use the fact that for \(a, b, k \ge 0\) we have \((a+b)^k \le \max((2 a)^k , (2 b)^k)\) to find a slightly looser, but easier to manipulate bound: \(m \ge \max((-2 log(p/2))^{1/a} , (4 n log(m+3))^{1/a})/(2 t^2)^{1/a}\) (that itself implies our original bound). Such \(m\) satisfies the previous sequence of bounds, so is a training set size large enough to have all the properties we want. Notice we have \(m\) on both sides of the inequality, so finding the minimum \(m\) that obeys the bound would require plugging in a few values. This isn't really an essential difficulty, it is just similar to the observation that while equations like \(y = m/log(m)\) can be solved in terms of \(m\), the solution involves notationally inconvenient functions such as the Lambert W function.

For a given fixed \(a\), \(t\), and \(\widehat{p}\) we can easily pick a training set size \(m\) such that \(p \leq \widehat{p}\) for all training sets of size at least \(m\). For example we can pick \(a=2/3\) and \(m\) such that \(m \ge max((-2 log(\widehat{p}/2))^{3/2}/t^{3}, (4 n log(m+3))^{3/2})/t^3\). For such \(m\) if we only consider trees where each leaf node has at least \(m^{2/3}\) training examples: then with probability at least \(1-\widehat{p}\) no leaf in any tree we could consider has a probability estimate that is off by more than \(t\). That is: at some moderate training set size we can build a fairly complex tree (i.e., one that can represent relations seen in the training data) that generalizes well (i.e., one that works about as well in practice as it did during training).

The argument above is essentially: the probability of error of each of the sub-concepts we are considering (the tree-leaves or reduced conjunctive expressions) is decreasing exponentially fast in training data set size. So a learning procedure that doesn't consider too many constituent hypotheses (less than the reciprocal of the error probability) will (with very high probability) pick a reliable model (one that has similar test and training performance). The Bonferroni correction (multiplying by the number of possible concepts considered) is growing slower than our probability of error falls, so we can prove we have a good chance at a good overall estimate.

Allowing some complexity lets us fit the training data, and bounding the complexity (by not allowing negligible sized tree leaves) ensures low excess generalization error.

Computational Learning Theory

The above direct argument is rarely seen as it is more traditional to pull the finished result from a packaged argument. This packaged argument is based on Vapnik–Chervonenkis (VC) dimension.

The theoretical computer science equivalent to the statistical Glivenko–Cantelli style theorems is VC dimension as used in the “Probably Approximately Correct” (or PAC) model found in computational learning theory ([KearnsVazirani1994], [Mitchell1997]). This theory is currently not as in vogue as it was in the 1990s, but it remains correct. Some of the formulations are very approachable, in particular the Pajor variation of the Sauer–Shelah lemma formulation [WikipediaSauerShelah]. The argument we just demonstrated in the previous section is essentially the one you would get by observing the VC dimension of axis-aligned rectangles is no more than \(2 n\) (something so simple we could argue it directly, but for details see [Schapire2013]). The theory would then immediately give us a bound of a form similar to what we wrote down, except with the form properly re-factored so \(m\) is only on one side of the inequality.

The above is usually presented as a fairly impenetrable “prove a bound on a weird quantity called VC dimension, using a weird argument called shattering, and the references then give you a very complicated bound on sample size.”

Of course much of the power of VC dimension arguments are they also apply when there are continuous parameters leading to an uncountable number of possible alternate hypothesis (such as the case with linear discriminants, logistic regression, perceptions, and neural nets).

As a side note: the elementary inductive proof of Pajor's formulation of the Sauer–Shelah lemma (variously credited to Noga Alon or to Ron Aharoni and Ron Holzman) is amazingly clear (and reproduced in its entirety in [WikipediaSauerShelah] (at least as of 1-1-2017)).

A Daydream Alternative

When teaching decision trees one is often asked why node decisions are thresholds on single variables. It seems obvious that you could cobble up a more powerful tree model by using thresholds against arbitrary many variable linear functions. The idea would be to run something like a logistic regression or linear discriminant analysis at each node, split the data on the learned relation, and build more nodes by recursion.

But the above isn't a popular machine learning algorithm. Our suspicion is that everyone tries their own secret implementation, notices it severely over fits on small data, and quietly moves on. Computational Learning Theory indicates indicates early overfit is a large potential problem for such a model.

The path/leaf concepts for trees built out of arbitrary linear thresholds are convex sets. Arbitrary convex sets have infinite VC dimension for even \(n=2\) (two variable or two dimensional) problems. We don't have the ability to simplify paths into bounded depth as we did with axis aligned rectangles. The VC dimension isn't unbounded for a fixed \(m\) and \(n\), but it certainly isn't polynomial in \(m\). So we can't drive as sharp bounds on moderate data set sizes. Though with an additional depth restriction (say \(n^{1/3}\)) you may have a system that works well on large data sets (just not on the small data sets people tend to tinker with).

Conclusion

We now have set up the terminology to state the reason (or “why”) decision trees work.

Roughly it is that properly constrained decision trees (those with a non-negligible minimum leaf node size) are absolutely continuous and of moderate complexity.

Properly constrained decision trees are complex enough to memorize their training data, yet simple enough to ensure low excess generalization error. With a fixed feature set and a non-negligible leaf size constraint: the number of possible decision tree leaves grows only polynomially in the size of the training set, while the odds of any one leaf being mis-estimated shrinks exponentially in the size of the training set.

Bibliography

[BreimanEtAl1984] Leo Breiman , Jerome Friedman, R.A. Olshen, Charles J. Stone, Classification and Regression Trees, Chapman and Hall/CRC, 1984 (link).

[FixHodges1951] Evelyn Fix, Joseph Lawson Hodges, “Discriminatory analysis, Nonparametric discrimination: Consistency Properties”, Project Number 21-49-004, Report 4, USAF School of Aviation Medicine, Randolph Field Texas , February 1951 (link).

[GordonOlshen1978] Louis Gordon, Richard A. Olshen, “Asymptotically Efficient Solutions to the Classification Problem”, The Annals of Statistics, 1978, Vol. 6, No. 3, pp. 515-533 (link).

[HalevyNorvigPereira2009] Alon Halevy, Peter Norvig, and Fernando Pereira, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, 2009, pp. 8-12 (link).

[HastieTibshiraniFriedman2009] Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning 2nd Edition, Springer Verlag, 2009 (link).

[KearnsVazirani1994] Michael J. Kearns, Umesh Vazirani, An Introduction to Computational Learning Theory, MIT Press, 1994 (link).

[HyafilRivest1976] Laurent Hyafil, Ronald L. Rivest, “Constructing optimal binary decision trees is NP-complete”, Information Processing Letters, Volume 5, Issue 1, May 1976, pp. 15-17 (link).

[Mitchell1997] Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997 (link).

[Moore] Andrew Moore, “Decision Trees”, CMU (link).

[Mount2011] John Mount, “Kernel Methods and Support Vector Machines de-Mystified”, Win-Vector Blog, 2011, (link).

[Mount2015] John Mount, “How sure are you that large margin implies low VC dimension?”, Win-Vector Blog, 2015, (link).

[Mount2016] John Mount, “Variables can synergize, even in a linear model”, Win-Vector Blog, 2016 (link).

[RCoreTeam2016] R Core Team “R: A language and environment for statistical computing”, 2016, R Foundation for Statistical Computing, Vienna, Austria (link).

[Schapire2013] Rob Schapire, “COS 511: Theoretical Machine Learning”, 2013 (link).

[Smith2016] David Smith, “Interactive decision trees with Microsoft R (Longhow Lam's demo)”, Revolutions blog, 2016, (link).

[WikipediaHoeffding] Wikipedia, “Hoeffding's inequality”, 2016 (link).

[WikipediaSauerShelah] Wikipedia, “Sauer–Shelah lemma”, 2016 (link).

[WikipediaVCDimension] Wikipedia, “VC dimension”, 2016 (link).

[Zumel2016] Nina Zumel, “Principal Components Regression, Pt. 2: Y-Aware Methods”, Win-Vector blog, 2016 (link).

[ZumelMount2014] Nina Zumel, John Mount, Practical Data Science with R, Manning 2014 (link).