Ensemble Learning: The Wisdom of Crowds (of Machines)

0

No comments posted yet

Comments

Slide 15

We can see that after 5 classifiers have been queried we get a 4-1 that independtly of the output class of the last classifierit wont change the final decision of the ensemble But but can go further, pe after 3 classifiers have been queried we hava a 3-0 in favor of class 1. This seem to indicate that with a high probability the final class is going to be 1

Slide 23

There are 5 examples belongs to + class and 5 examples belongs to – class The picture shows a good classifier which correctly classifies all the examples We’ll show how a simple learning algorithm and boosting work to make a good classifier like this

Slide 25

H1 is the first classifier produced by a learning algorithm. The learning algorithm explores the sample space and learns that putting the boundary at the position shown on the picture minimizes the error rate. E1 is the minimum error given the distribution and the learning algorithm. Alpha1 is the importance of this classifier determined by the value of e1

Slide 26

The examples wrongly classified gets more weight and the ones correctly classified gets lower weights relatively So that at the next round, these wrongly classified ones will get more attention from the weak learner (It’s only description of how it works)

Slide 28

For round 2, these examples have higher weights are classified correctly since it reduces error.

Slide 1

Ensemble Learning: The Wisdom of Crowds (of Machines) Lior Rokach Department of Information Systems Engineering Ben-Gurion University of the Negev

Slide 2

About Me Prof. Lior Rokach Department of Information Systems Engineering Faculty of Engineering Sciences Head of the Machine Learning Lab Ben-Gurion University of the Negev Email: liorrk@bgu.ac.il http://www.ise.bgu.ac.il/faculty/liorr/ PhD (2004) from Tel Aviv University 2

Slide 3

The Condorcet Jury Theorem If each voter has a probability p of being correct and the probability of a majority of voters being correct is M, then p > 0.5 implies M > p. Also M approaches 1, for all p > 0.5 as the number of voters approaches infinity. This theorem was proposed by the Marquis of Condorcet in 1784

Slide 4

Francis Galton Galton promoted statistics and invented the concept of correlation. In 1906 Galton visited a livestock fair and stumbled upon an intriguing contest. An ox was on display, and the villagers were invited to guess the animal's weight. Nearly 800 gave it a go and, not surprisingly, not one hit the exact mark: 1,198 pounds. Astonishingly, however, the average of those 800 guesses came close - very close indeed. It was 1,197 pounds.

Slide 5

The Wisdom of Crowds Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations Under certain controlled conditions, the aggregation of information in groups, resulting in decisions that are often superior to those that can been made by any single - even experts. Imitates our second nature to seek several opinions before making any crucial decision. We weigh the individual opinions, and combine them to reach a final decision

Slide 6

Committees of Experts “ … a medical school that has the objective that all students, given a problem, come up with an identical solution” There is not much point in setting up a committee of experts from such a group - such a committee will not improve on the judgment of an individual. Consider: There needs to be disagreement for the committee to have the potential to be better than an individual.

Slide 7

Does it always work? Not all crowds (groups) are wise. Example: crazed investors in a stock market bubble.

Slide 8

Key Criteria Diversity of opinion Each person should have private information even if it's just an eccentric interpretation of the known facts. Independence People's opinions aren't determined by the opinions of those around them. Decentralization People are able to specialize and draw on local knowledge. Aggregation Some mechanism exists for turning private judgments into a collective decision.

Slide 9

Teaser: How good are ensemble methods? Let’s look at the Netflix Prize Competition…

Slide 10

Supervised learning task Training data is a set of users and ratings (1,2,3,4,5 stars) those users have given to movies. Construct a classifier that given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars $1 million prize for a 10% improvement over Netflix’s current movie recommender/classifier (MSE = 0.9514) Began October 2006

Slide 12

Learning biases Occam’s razor “among the theories that are consistent with the data, select the simplest one”. Epicurus’ principle “keep all theories that are consistent with the data,” [not necessarily with equal weights] E.g. Bayesian learning Ensemble learning 12

Slide 13

Strong and Weak Learners Strong (PAC) Learner Take labeled data for training Produce a classifier which can be arbitrarily accurate Objective of machine learning Weak (PAC) Learner Take labeled data for training Produce a classifier which is more accurate than random guessing

Slide 14

Ensembles of classifiers Given some training data Inductive learning L: Dtrain  h(), where h(): X  Y Ensemble learning L1: Dtrain  h1() L2: Dtrain  h2() ... Ensemble: LT: Dtrain  hT ()  {h1(), h2(), ... , hT ()} 14

Slide 15

Accumulated votes: 2 1 5 4 3 2 1 Classification by majority voting New Instance: x t  T=7 classifiers 0 0 t2 t1 15 Alberto Suárez (2012)

Slide 16

Popular Ensemble Methods

Slide 17

Boosting Learners Strong learners are very difficult to construct Constructing weaker Learners is relatively easy Strategy Derive strong learner from weak learner Boost weak classifiers to a strong learner

Slide 18

Construct Weak Classifiers Using Different Data Distribution Start with uniform weighting During each step of learning Increase weights of the examples which are not correctly learned by the weak learner Decrease weights of the examples which are correctly learned by the weak learner Idea Focus on difficult examples which are not correctly classified in the previous steps

Slide 19

Combine Weak Classifiers Weighted Voting Construct strong classifier by weighted voting of the weak classifiers Idea Better weak classifier gets a larger weight Iteratively add weak classifiers Increase accuracy of the combined classifier through minimization of a cost function

Slide 20

AdaBoost (Adaptive Boosting) (Freund and Schapire, 1997) Generate a sequence of base-learners each focusing on previous one’s errors (Freund and Schapire, 1996)

Slide 21

AdaBoost

Slide 22

Example Training Combined classifier

Slide 23

Example of a Good Classifier + - + + + + - - - -

Slide 24

The initial distribution

Slide 25

Round 1 of 3 + - + + + + - - - - e1 = 0.30 a1=0.42

Slide 26

How the distribution has changed?

Slide 27

+ - + + + + - - - - e2 = 0.21 a2=0.65 Round 2 of 3

Slide 28

Round 2 How the distribution has changed?

Slide 29

+ - + + + + - - - - e3 = 0.14 a3=0.92 STOP Round 3 of 3

Slide 30

Round 2 How the distribution has changed?

Slide 31

Final Hypothesis Hfinal = sign[ 0.42(h1? 1|-1) + 0.65(h2? 1|-1) + 0.92(h3? 1|-1) ]

Slide 32

Training Errors vs Test Errors Performance on ‘letter’ dataset (Schapire et al. 1997) Training error Test error Training error drops to 0 on round 5 Test error continues to drop after round 5 (from 8.4% to 3.1%)

Slide 33

Adaboost Variants Proposed By Friedman LogitBoost

Slide 34

Adaboost Variants Proposed By Friedman GentleBoost

Slide 35

BrownBoost Reduce the weight given to misclassified example Good (only) for very noisy data.

Slide 36

Bagging Bootstrap AGGregatING Employs simplest way of combining predictions that belong to the same type. Combining can be realized with voting or averaging Each model receives equal weight “Idealized” version of bagging: Sample several training sets of size n (instead of just having one training set of size n) Build a classifier for each training set Combine the classifier’s predictions This improves performance in almost all cases if learning scheme is unstable.

Slide 37

Wagging Weighted AGGregatING A variant of bagging in which each classifier is trained on the entire training set, but each instance is stochastically assigned a weight.

Slide 38

Random Forests 1. Choose T—number of trees to grow. 2. Choose m—number of variables used to split each node. m ≪ M, where M is the number of input variables. m is hold constant while growing the forest. 3. Grow T trees. When growing each tree do the following. (a) Construct a bootstrap sample of size n sampled from Sn with replacement and grow a tree from this bootstrap sample. (b) When growing a tree at each node select m variables at random and use them to find the best split. (c) Grow the tree to a maximal extent. There is no pruning. 4. To classify point X collect votes from every tree in the forest and then use majority voting to decide on the class label.

Slide 39

Variation of Random Forests Random Split Selection (Dietterich, 2000) Grow multiple trees When splitting, choose split uniformly at random from K best splits Can be used with or without pruning Random Subspace (Ho, 1998) Grow multiple trees Each tree is grown using a fixed subset of variables Do a majority vote or averaging to combine votes from different trees

Slide 40

DECORATE (Melville & Mooney, 2003) Change training data by adding new artificial training examples that encourage diversity in the resulting ensemble. Improves accuracy when the training set is small, and therefore resampling and reweighting the training set has limited ability to generate diverse alternative hypotheses.

Slide 41

Base Learner Overview of DECORATE Training Examples Artificial Examples Current Ensemble - - + + +

Slide 42

C1 Base Learner Overview of DECORATE Training Examples Artificial Examples Current Ensemble - - + + +

Slide 43

C1 C2 Base Learner Overview of DECORATE Training Examples Artificial Examples Current Ensemble - + + + - - - + + +

Slide 44

Error-Correcting Output Codes

Slide 45

Ensemble Taxonomy (Rokach, 2009)

Slide 46

Combiner Weighting methods Majority Voting Performance Weighting Distribution Summation Gating Network Meta-Learning Stacking Arbiter Trees Grading

Slide 47

Mixtures of Experts

Slide 48

Stacking Combiner f () is another learner (Wolpert, 1992)

Slide 49

Members Dependency Dependent Methods: There is an interaction between the learning runs (AdaBoost) Model-guided Instance Selection: the classifiers that were constructed in previous iterations are used for selecting the training set in the subsequent iteration. Incremental Batch Learning: In this method the classification produced in one iteration is given as prior knowledge (a new feature) to the learning algorithm in the subsequent iteration. Independent Methods (Bagging)

Slide 50

Cascading Use dj only if preceding ones are not confident Cascade learners in order of complexity

Slide 51

Diversity Manipulating the Inducer Manipulating the Training Sample Changing the target attribute representation Partitioning the search space - Each member is trained on a different search subspace. Hybridization - Diversity is obtained by using various base inducers or ensemble strategies.

Slide 52

Measuring the Diversity Pairwise measures calculate the average of a particular distance metric between all possible pairings of members in the ensemble, such as Q-statistic or kappa-statistic. The non-pairwise measures either use the idea of entropy or calculate a correlation of each ensemble member with the averaged output.

Slide 53

Kappa-Statistic

Slide 54

How crowded should the crowd be? Ensemble Selection Why bother? Desired accuracy Computational cost Predetermine the ensemble size Use a certain criterion to stops training Pruning

Slide 55

Cross Inducer Inducer-dependent (like RandomForest). Inducer-independent (like bagging)

Slide 56

Multi-strategy Ensemble Learning Combines several ensemble strategies. MultiBoosting, an extension to AdaBoost expressed by adding wagging-like features can harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. produces decision committees with lower error than either AdaBoost or wagging.

Slide 57

Some Insights

Slide 58

Why using Ensembles? Statistical Reasons: Out of many classifier models with similar training / test errors, which one shall we pick? If we just pick one at random, we risk the possibility of choosing a really poor one Combining / averaging them may prevent us from making one such unfortunate decision Computational Reasons: Every time we run a classification algorithm, we may find different local optima Combining their outputs may allow us to find a solution that is closer to the global minimum. Too little data / too much data: Generating multiple classifiers with the resampling of the available data / mutually exclusive subsets of the available data Representational Reasons: The classifier space may not contain the solution to a given particular problem. However, an ensemble of such classifiers may For example, linear classifiers cannot solve non-linearly separable problems, however, their combination can.

Slide 59

The Diversity Paradox

Slide 60

There’s no real Paradox… Ideally, all committee members would be right about everything! If not, they should be wrong about different things.

Slide 61

No Free Lunch Theorem in Machine Learning (Wolpert, 2001) “Or to put it another way, for any two learning algorithms, there are just as many situations (appropriately weighted) in which algorithm one is superior to algorithm two as vice versa, according to any of the measures of "superiority"

Slide 62

So why developing new algorithms? The science of pattern recognition is mostly concerned with choosing the most appropriate algorithm for the problem at hand This requires some a priori knowledge – data distribution, prior probabilities, complexity of the problem, the physics of the underlying phenomenon, etc. The No Free Lunch theorem tells us that – unless we have some a priori knowledge – simple classifiers (or complex ones for that matter) are not necessarily better than others. However, given some a priori information, certain classifiers may better MATCH the characteristics of certain type of problems. The main challenge of the patter recognition professional is then, to identify the correct match between the problem and the classifier! …which is yet another reason to arm yourself with a diverse set of PR arsenal !

Slide 63

Ensemble and the No Free Lunch Theorem Ensemble combine the strengths of each classifier to make a super-learner. But … Ensemble only improves classification if the component classifiers perform better than chance Can not be guaranteed a priori Proven effective in many real-world applications

Slide 64

Ensemble and Optimal Bayes Rule Given a finite amount of data, many hypothesis are typically equally good. How can the learning algorithm select among them? Optimal Bayes classifier recipe: take a weighted majority vote of all hypotheses weighted by their posterior probability. That is, put most weight on hypotheses consistent with the data. Hence, ensemble learning may be viewed as an approximation of the Optimal Bayes rule (which is provably the best possible classifier).

Slide 65

Bias The hypothesis space made available by a particular classification method does not include sufficient hypotheses Variance The hypothesis space made available is too large for the training data, and the selected hypothesis may not be accurate on unseen data Bias and Variance Decomposition

Slide 66

Bias and Variance from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007. Decision Trees Small trees have high bias. Large trees have high variance. Why?

Slide 67

For Any Model (Not only decision trees) Given a target function Model has many parameters Generally low bias Fits data well Yields high variance Model has few parameters Generally high bias May not fit data well The fit does not change much for different data sets (low variance)

Slide 68

Bias-Variance and Ensemble Learning Bagging: There exists empirical and theoretical evidence that Bagging acts as variance reduction machine (i.e., it reduces the variance part of the error). AdaBoost: Empirical evidence suggests that AdaBoost reduces both the bias and the variance part of the error. In particular, it seems that bias is mostly reduced in early iterations, while variance in later ones.

Slide 69

Illustration on Bagging x y

Slide 70

Occam's razor The explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory

Slide 71

Contradiction with Occam’s Razor Ensemble Contradicts with Occam’s Razor More rounds -> more classifiers for voting -> more complicated With the 0 training error, a more complicated classifier may perform worse

Slide 72

Two Razors (Domingos, 1999) First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself. On the other hand, within KDD Occam's razor is often used in a quite different sense, that can be stated as: Second razor: Given two models with the same training­set error, the simpler one should be preferred because it is likely to have lower generalization error. Domingos: The first one is largely uncontroversial, while the second one, taken literally, is false.

Slide 73

Summary “Two heads are better than none. One hundred heads are so much better than one” Dearg Doom, The Tain, Horslips, 1973 “Great minds think alike, clever minds think together” L. Zoref, 2011. But they must be different, specialised And it might be an idea to select only the best of them for the problem at hand

Slide 74

Additional Readings

URL:
More by this User
Most Viewed
Previous Page Next Page
When Cyber Security Meets Machine Learning
When Cyber...
 
 
 
Previous Page Next Page