$Id: README,v 1.27 2008-05-09 21:25:25 narsky Exp $
 
General description of the package:
-----------------------------------

Author:  Ilya Narsky narsky@hep.caltech.edu

Web page: http://www.hep.caltech.edu/~narsky/spr.html

The latest code can be obtained from the SourceForge CVS repository:
cvs -z3 -d:pserver:anonymous@statpatrec.cvs.sourceforge.net:/cvsroot/statpatrec
The CVS head may not be stable; for that reason it is recommended to
use tagged package releases.

List of all classifiers currently implemented in the package:
------------------------------------------------------------

- decision split (or stump)
- decision tree (2 flavors)
- bump hunter (PRIM)
- LDA and QDA
- logistic regression
- boosting an arbitrary sequence of classifiers; 
	boosting flavors: discrete and real AdaBoost, epsilon-Boost
- bagging an arbitrary sequence of classifiers
- random forest
- arc-x4
- range booster
- backprop neural net with a logistic activation function 
    (original implementation)
- combiner (mixture of experts)
- interfaces to two SNNS neural nets (without training): 
    * backprop neural net with a logistic activation function 
    * RBF
- multiclass learner
    * Allwein-Schapire-Singer algorithm
    * binary encoder

Contents
--------
1.  Documentation
2.  Classifiers
	2.1  Boosting
	2.2  Bagging
	2.3  Decision trees
	2.4  Top-down trees
	2.5  Bump hunting
	2.6  Fisher and logistic regression
	2.7  Neural networks
	2.8  Combining classifiers
	2.9  Recommendations: What classifiers to use and when
	2.10 How to choose parameters for boosted and bagged decision trees
	2.11 Cross-validation
	2.12 Gene Expression Programming
2a. Multi class learning
	2a.1 Algorithm by Allwein, Schapire and Singer
	2a.2 Binary encoder
	2a.3 Treatment of missing values
	2a.4 Indicator matrix generation for Allwein-Schapire-Singer
3.  Data and filters
4.  Imposing simple cuts on input variables
5.  Variable transformations and feature extraction
6.  Treatment of input classes
7.  Estimation of variable importance
	7.1 Correlation with the class label
	7.2 Counting decision splits
	7.3 Variable importance for neural networks
	7.4 Estimation of variable importance by random permutation 
		of class labels
	7.5 "Add N Remove R" method
	7.6 Variable interactions
	7.7 Which variable selection method is best?
8.  Handling input variables
9.  Data input and output
	9.1 Supported I/O types: Ascii and Root
	9.2 Splitting data into training and test sets
	9.3 SprIOTestApp: displaying data content
10. Computing output for trained classifiers
11. Installation instructions
12. How to use trained classifiers in your C++ code
13. Working in Root-free environment
14. Working in Root environment
15. Missing values
16. Example: Cleveland heart data
17. Memory management
18. Summary
19. Release notes


1. Documentation 
---------------- 

The math formalism behind implemented classifiers, as well as
idealistic and practical examples, can be found in two physics archive
notes:

physics/0507143 "StatPatternRecognition: A C++ Package for Statistical
Analysis of High Energy Physics Data"

physics/0507157 "Optimization of Signal Significance by Bagging
Decision Trees"

The first 4 references in physics/0507143 give a list of introductory
books on pattern classification. Although familiarity with the
material discussed in the books is not necessary to be able to use
this package, I recommend them to every physicist.

One might also want to check out the SLUO lectures on Machine
Learning:
http://www-group.slac.stanford.edu/sluo/Lectures/Stat2006_Lectures.html

The full set of links to papers and talks on StatPatternRecognition
can be found on the web page,
http://www.hep.caltech.edu/~narsky/spr.html .

My talk at the workshop at the BaBar Collaboration Meeting in December
2005 has useful tips for running the package:
http://www.hep.caltech.edu/~narsky/SPR/SPR_Workshop_Dec2005.pdf . Some
of this info may be outdated as the package has gone through
development cycles since then.


2. Classifiers and classes
--------------------------

There are two types of classifiers in the package: trained and trainable.

All trainable classifiers inherit from SprAbsClassifier. The major
method in this class is train(). The purpose of every trainable
classifier is to optimize certain parameters for efficient
signal-background separation. After training has been performed, a
trainable classifier invokes the makeTrained() method to produce a
corresponding trained classifier. A trained classifier inherits from
SprAbsTrainedClassifier. The purpose of this class is to produce a
numeric outcome for a data point, classifying it through the
response() method.

At the moment the following trainable classifiers are implemented with
corresponding trained counterparts:

- linear and quadratic discriminant analysis (SprFisher)
- logistic regression (SprLogitR)
- a simple binary split in a chosen dimension (SprBinarySplit)
- boosting algorithms (SprAdaBoost): Discrete AdaBoost, Real AdaBoost 
    and epsilon-Boost
- bagging and random forest algorithm (SprBagger)
- arc-x4, yet another boosting algorithm (SprArcE4)
- range boosting, see Section 2.2 (SprRangeBooster)
- decision tree with a user-supplied optimization criterion
  (SprDecisionTree and SprTopdownTree)
- bump hunter (PRIM) with a user-supplied optimization criterion 
    (SprBumpHunter)
- feedforward backpropagation neural net with a logistic activation function
    (SprStdBackprop)
- multi-class learner (SprMultiClassLearner)
- combiner of classifiers, see Section 2.8 (SprCombiner)

At present most implemented classifiers deal only with two
categories. The exception is the multi-class learner (see Section
2a). The default class assignment is 0 for background and 1 for
signal. This choice, 0 and 1, is assumed throughout the package. The
user can choose any other pair of classes or group classes in any two
sets using syntax described in Section 6 and SprAbsFilter.hh.

The output of all implemented classifiers also by default ranges from
0 to 1. If you want to impose a naive cut that crudely chooses
signal-like events over background-like events, you can simply cut at
0.5 on the classifier output. To impose a cut on the classifier
output, use the SprAbsTrainedClassifier::setCut() method. The accept()
method of every trained classifier returns true if the output of the
classifier satisfies the imposed cut and false otherwise. For all
trained classifiers the cut on the output is by default set at
0.5. There are 3 classifiers in the package that can optionally change
their output range from [0,1] to (-infty,+infty): AdaBoost,
Fisher and LogitR. The useStandard() method sets the range to
(-infty,+infty) and useNormalized() sets it to [0,1]. These methods
also change the cut on the output of the classifier: useStandard()
sets it to 0 and useNormalized() sets it to 0.5.

The package also offers a set of transformations included in
SprTransformation.hh for converting the classifier output to a
desirable range. For example, to transform the conventional output of
AdaBoost ranging from -infty to +infty to [0,1], logitDouble()
transformation is applied.

SprTrainedBinarySplit produces a binary response, that is, 0 for
background-like and 1 for signal-like events. The same holds for the
bump hunter. Decision trees can return either discrete or continuous
values depending on the chosen configuration. The response of
SprTrainedAdaBoost is given by a sum of either binary (0 or 1) or
continuous (from -infty to +infty) responses of individual classifiers
multiplied by corresponding beta coefficients (see below section on
AdaBoost).  The response of SprTrainedBagger is given by a sum of
either binary (0 or 1) or continuous (from 0 to 1) responses of
individual classifiers divided by the total number of individual
classifiers. For a large number of individual classifiers, the output
of SprTrainedAdaBoost and SprTrainedBagger looks practically
continuous, even if each classifier returns discrete output.

Classifiers implemented in the package are described below in more
detail.


2.1 AdaBoost
------------

AdaBoost combines individual classifiers by applying them
consecutively. First AdaBoost reweights events to enhance weights of
those that were misclassified by the last applied subclassifier and to
reduce weights of events that were correctly classified by the last
subclassifier. Then AdaBoost applies a new subclassifier to the
reweighted events.

SprAdaBoost currently implements 3 different algorithms: Discrete
AdaBoost, Real AdaBoost and epsilon-Boost. Specifics of event
reweighting and computing the relative importance (beta coefficients)
for the individual weak classifiers vary.

The implementation of AdaBoost in StatPatternRecognition can combine
any classifiers that inherit from SprAbs(Trained)Classifier. The user
can supply any combination of trained and trainable
classifiers. AdaBoost goes through supplied trained classifiers first
and calculates beta weights for those. These trained classifiers are
not included in the number of training cycles; AdaBoost will always go
through all trained classifiers supplied by the user, no matter how
many training cycles have been requested (the special case is n=0
training cycles which is discussed below). After AdaBoost went through
all trained classifiers, it goes through trainable
classifiers. AdaBoost first trains each trainable classifier on the
reweighted sample, makes a corresponding trained classifier and stores
it in the final list of trained classifiers. AdaBoost goes through
trainable classifiers consecutively in the order in which they were
added to AdaBoost by the user and stops either when it is not possible
to find a weak classifier improving the overall classification power
(this stopping rule varies for different boosting flavors) or when the
number of iterations through trainable classifiers reaches the number
of training cycles requested by the user. For example, if the user
supplies 2 trained and 3 trainable classifiers to AdaBoost and
requests 10 training cycles, AdaBoost will first go through the two
trained classifiers, then go through all 3 trainable classifiers 3
times consecutively (9 training cycles), then do one more training
cycle for the 1st trainable classifier and exit. In this example,
AdaBoost will produce a list of 12 trained classifiers with
corresponding beta weights.

Currently 4 AdaBoost executables are implemented:

SprBoosterApp              - for boosting an arbitrary sequence of classifiers
				 in the package

SprAdaBoostBinarySplitApp  - AdaBoost with binary splits in each dimension

SprAdaBoostDecisionTreeApp - AdaBoost with decision trees

SprStdBackpropApp          - AdaBoost with neural nets

SprBoosterApp was introduced in tag V05-00-00. In principle,
SprBoosterApp alone should satisfy the needs of a regular user. It is
capable of boosting any sequence of classifiers implemented in the
package. It takes an input file with weak classifier configurations
specified by the user. Use booster.config as an example. SprBoosterApp
reproduces, to a large extent, the functionality of the 3 other
executables. It offers even more flexibility, as the user can combine
any weak classifiers in any sequence. For example, if you put a
decision tree on the first line of the input config file and a neural
net on the second line, AdaBoost will use the decision tree for odd
boosting cycles and the neural net for even boosting cycles.

The other 3 executables will be kept in the package. They offer some
features that SprBoosterApp does not, for example cross-validation and
saving intermediate training weights. They are also kept for
historical reasons: people who have been using the package for a while
may be accustomed to their syntax.

Use "-M" option of the executable to choose among available boosting
algorithms (default=Discrete AdaBoost). At the moment, 3 flavors of
boosting are implemented: Discrete, Real and Epsilon. For description,
please refer to the statistics literature.  Discrete AdaBoost and
epsilon-Boost can be used with any weak classifier. Real AdaBoost
requires that the output of the supplied weak learner be continuous
and range from 0 to 1. Since the output of all implemented classifiers
by default ranges from 0 to 1, any weak classifier can be used with
Real AdaBoost; however, the user should be careful not to convert the
[0,1] output range of the weak classifier to (-infty,+infty).

setEpsilon() method of SprAdaBoost (-E command-line option for the
executable) sets the epsilon value. This value is of no relevance for
Discrete AdaBoost but affects Real and Epsilon algorithms. The meaning
of epsilon is different for Real and Epsilon boosting. For the Epsilon
boost, epsilon defines the rate at which weights are updated for
misclassified events, specifically exp(2*epsilon). For the Real boost,
epsilon defines the margin of the weak classifier response. The
response of each weak classifier, p, for Real boosting must range from
0 to 1 and represent the probability of an event at this point being
signal. Weights of events are then multiplied by p/(1-p) for
background and (1-p)/p for signal. For p=0 or p=1, the updated weight
goes to infinity. epsilon defines the minimal and (1-epsilon) defines
the maximal allowed value of p to avoid infinite weights. Increasing
epsilon improves stability of the Real boost but may require more
training cycles. It is the other way around for the Epsilon boost -
decreasing epsilon improves stability and results in more training
cycles.
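The reweighting rules described above can be sketched in a few lines
of C++. This is only an illustration of the quoted formulas; the
function names are invented for the example and are not part of
SprAdaBoost:

```cpp
#include <algorithm>
#include <cmath>

// epsilon-Boost: the weight of a misclassified event grows by exp(2*epsilon).
double epsilonBoostWeight(double w, bool misclassified, double epsilon)
{
  return misclassified ? w*std::exp(2.0*epsilon) : w;
}

// Real AdaBoost: clip the weak-classifier response p to
// [epsilon, 1-epsilon] to avoid infinite weights, then multiply
// the weight by p/(1-p) for background and (1-p)/p for signal.
double realBoostWeight(double w, double p, bool isSignal, double epsilon)
{
  p = std::min(std::max(p, epsilon), 1.0-epsilon);
  return isSignal ? w*(1.0-p)/p : w*p/(1.0-p);
}
```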

When you train AdaBoost on multidimensional data (e.g., hundreds of
dimensions) and large data sets, you will likely need to save an
intermediate AdaBoost configuration to a file and then resume training
later. The AdaBoost configuration file keeps only parameters of the
trained classifiers but not the event weights adjusted due to AdaBoost
application. Your resume job can therefore spend a significant amount
of time reweighting events by applying the trained classifiers from
the saved configuration. To reduce the amount of time spent on initial
reweighting, one can store training data with updated weights into a
new data file at the end of the training routine and then read the
training data with already adjusted weights from this file before
resuming. For example:

SprAdaBoostDecisionTreeApp -a 1 -n 50 -M 1 -l 2000 -f tree_n50.spr -u reweighted_gauss2_uniform_2d_train.pat gauss2_uniform_2d_train.pat

And then:

SprAdaBoostDecisionTreeApp -a 4 -M 1 -n 50 -l 2000 -r tree_n50.spr -f tree_n100.spr -e reweighted_gauss2_uniform_2d_train.pat 

The "-e" option will force AdaBoost to skip the event reweighting;
this option is necessary because events in
reweighted_gauss2_uniform_2d_train.pat have been reweighted
already. The final AdaBoost configuration stored in tree_n100.spr will
have 50 classifiers copied from tree_n50.spr and then 50 new
classifiers added. Note that now the "-a 4" option is necessary
because the format of the input data has changed due to addition of
weights.

AdaBoost output has a nice probabilistic interpretation: 

P(Y=1|X)/P(Y=-1|X) = exp(2*F(X))

where P(Y=1|X) and P(Y=-1|X) are probabilities for an event with
coordinates X to come from class 1 (signal) or class -1 (background),
respectively, and F(X) is the conventional AdaBoost output ranging
from -infty to +infty. By default trained AdaBoost returns the
probability of an event at point X being signal:
1/(1+exp(-2*F(X))). If you want to use raw AdaBoost output unmapped to
the [0,1] range, use "-s" option of the AdaBoost executables.
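This mapping from the raw output F(X) to the signal probability
follows directly from the formula above (a minimal sketch, not the
package code):

```cpp
#include <cmath>

// Map the conventional AdaBoost output F in (-infty,+infty) to the
// signal probability P(Y=1|X) = 1/(1+exp(-2*F(X))) in [0,1].
double adaBoostProbability(double F)
{
  return 1.0/(1.0 + std::exp(-2.0*F));
}
```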


2.2 Bagging
-----------

Bagging (Bootstrap AGGregatING) means that many classifiers are
trained on bootstrap replicas of a training data set. To classify a
new event, responses of individual classifiers are summed up and
divided by the total number of individual classifiers.
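The aggregation rule can be sketched as follows (an illustration only,
not the SprTrainedBagger implementation):

```cpp
#include <numeric>
#include <vector>

// Bagged response: sum the responses of the individual classifiers
// and divide by their total number.
double baggedResponse(const std::vector<double>& responses)
{
  if( responses.empty() ) return 0.0;
  return std::accumulate(responses.begin(), responses.end(), 0.0)
         / responses.size();
}
```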

The user can bag an arbitrary sequence of classifiers using
SprBaggerApp. To specify classifier configurations, use the same
syntax as booster.config mentioned in the previous section.

For historical reasons, I will keep another executable in the package
for bagging decision trees and random forest:
SprBaggerDecisionTreeApp.

One can bootstrap not only training points but also input dimensions
(random forest technology). This method can be invoked using the "-s"
option of SprBaggerDecisionTreeApp or specifying an appropriate
parameter in the input configuration file for SprBaggerApp.

If individual classifiers optimize a certain criterion, e.g., signal
significance, bagging these classifiers is equivalent to optimizing
the same criterion on bootstrap replicas of the training sample. This
is a desirable feature if the user is interested only in optimization
of this specific criterion.

A typical application of bagged decision trees would look like this:

SprBaggerDecisionTreeApp -a 1 -c 2 -n 50 -l 100 -o training.root -f tree.spr gauss2_uniform_2d_train.pat

(builds 50 decision trees, with at least 100 entries per terminal
node, optimizing the signal significance, saves the classifier
configuration into tree.spr and saves classified training data into
training.root)

SprBaggerDecisionTreeApp -a 1 -n 0 -o validation.root -r tree.spr gauss2_uniform_2d_valid.pat

(reads the stored bagger from tree.spr, computes classifier
response for the validation data and saves ntuple to validation.root)

The two most important training parameters are "-s" and "-l". The
optimal number of sampled input features and the minimal size of leaf
nodes should be determined by examining the bagger performance on
(cross-)validation data. Breiman
(http://www.stat.berkeley.edu/users/breiman/RandomForests/) recommends
choosing the number of sampled input features small compared to the
dimensionality of the problem. I would recommend trying various values
of "-s" increasing exponentially. For example, if you have
100-dimensional data, you might try 1, 2, 4, 8, 16, 32, 64, and
100. The minimal leaf size needs to be optimized for each value
independently. Typically, the minimal leaf size is fairly small - you
can start with "-l 5" events per leaf and assume that this could be
fairly close to optimal.
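The recommended exponential scan over "-s" could be scripted along
these lines (a sketch only; the training file and the other options
are taken from the examples above and would need adjusting):

```shell
# Exponentially spaced values of the number of sampled features ("-s")
# for 100-dimensional data; "-l" should be optimized separately for
# each value by checking performance on validation data.
for s in 1 2 4 8 16 32 64 100; do
  echo "SprBaggerDecisionTreeApp -a 1 -n 50 -s $s -l 5 -f tree_s${s}.spr gauss2_uniform_2d_train.pat"
done
```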

Note that this example, as well as the example in the AdaBoost
section, is mostly qualitative. The power of boosted and bagged
decision trees is most obvious in high-dimensional problems; an
example with 2D data is not intended to demonstrate the power of these
classifiers.

Another classifier implemented in SprBaggerDecisionTreeApp is a
boosting algorithm, arc-x4. Due to some technicalities, the
implementation of this algorithm is closer to that of the bagging
algorithm than that of the boosting algorithm. This is why SprArcE4
inherits from SprBagger and the trained counterpart for this algorithm
is just SprTrainedBagger.  To use this algorithm, choose "-B" option
of SprBaggerDecisionTreeApp or SprBaggerApp. Recommendations for the
choice of "-s" and "-l" parameters often hold for this algorithm as
well.

Another boosting algorithm included in SprBaggerApp is the range
booster. In practical applications, the user is often interested in
the region of high signal purity (and therefore low signal
efficiency). The range booster lets you discard events that lie
outside of the interesting range as you accumulate more training
cycles. For example, instead of bagging 200 decision trees, you might
opt to run 4 baggers each using 50 decision trees and discard events
that lie outside of the interesting range after training each
bagger. To do this, you would specify 

Bagger 50 0
TopdownTree 5 0 2 0

in the configuration file for SprBaggerApp (50 bagged decision trees
optimizing Gini index with 2 events per leaf) and use -R option of
SprBaggerApp to select range boosting. Then specify '-n 4' on the
command line to train 4 baggers.  The supplied training set is
randomly split into two halves for each bagger, with the first half used
for training and the second half used for event reweighting. -S
specifies the signal efficiency range that you are interested in; the
weights of all events outside this range will be reduced after each
bagger is finished. -E option specifies the factor by which the weight
of every event outside this signal efficiency range is reduced. -T
specifies the weight threshold below which the event is permanently
discarded. For example, to apply an aggressive event reduction
algorithm, specify -E value less than -T; this means that events
outside of the interesting efficiency range will be discarded
immediately.

Imagine, for example, that you only care about the classifier
performance at the signal efficiency 0.2 or lower. In that case, you
can specify '-S 0.5' (it is recommended to keep some margin for
safety), '-E 0.01' and '-T 0.1'. This will let the classifier focus on
the interesting range as it trains more baggers. This effectively
implements the so-called "cascade" method of classification advertised
in some recent HEP papers.

In general, I would not expect any improvement in performance out of
this method; this is just a way to save CPU time if you need to train
on huge datasets and don't want to waste CPU cycles on events well
outside of the interesting range. Note, however, that one can play
many games here. The syntax for the configuration file lets you
specify as many classifiers as you like. You can use bagged decision
trees for the first cycle and boosted decision trees (or a neural net,
or anything else) at the second step of the "cascade" cycle etc. You
can use as many cascade cycles as you like.


2.3 Decision tree
-----------------

A decision tree recursively divides input space into rectangular
regions (nodes). For each node the tree attempts to find the optimal
split by choosing among all possible splits for all input
dimensions. The split yielding the largest figure of merit, according
to the user-supplied optimization criterion, is used to divide this
node into halves. If the largest figure of merit is less than the
original FOM computed for the unsplit node, this node becomes a leaf
(terminal node), classified either as signal (1) or background (0),
depending on the FOM value. The output of the tree training procedure
is a list of signal leaf nodes. After the leaves have been found, the
tree sorts the leaves by signal purity in the descending order and, if
it was requested by the user, merges the leaves to maximize the
overall FOM. The leaves included in the overall maximal-FOM set are
considered as signal regions, while the rest of the signal leaves not
included in the overall FOM set are re-classified to background
nodes. This selection procedure does not enforce continuity, and in
the end the signal leaves may be far apart from each other.

Decision trees were introduced in the statistics literature two
decades ago and have become a popular data analysis tool. The
predictive power of an individual decision tree is typically worse
than that of flexible classifiers such as neural nets or boosted and
bagged decision trees. The main advantage of the decision tree is its
interpretability - rectangular regions can be easily visualized in
many dimensions.

The decision tree implemented here is different from popular
commercial implementations such as CART and C4.5. It is simpler
because (at least for now) it can deal only with two
categories, signal and background. A conventional decision tree
implementation also involves a pruning algorithm for merging nodes
that belong to the same tree branch. A pruning algorithm typically
introduces a penalty on the tree complexity and attempts to reduce the
number of nodes at the cost of obtaining a slightly worse FOM with a
decreased complexity penalty. No pruning is attempted in this
implementation. On the other hand, the SPR implementation offers
advanced features missing from conventional trees.  A conventional
tree treats both categories symmetrically and carries on training as
long as the FOM with respect to any input category can be
improved. The tree implemented here can work in two modes, symmetric
and asymmetric. In the asymmetric mode, it only finds optimal signal
nodes, but no optimization is attempted for background. This makes it
possible to use such inherently asymmetric FOMs as the signal
significance S/sqrt(S+B) or a 90% upper limit. In the symmetric mode,
the tree works in the same way as a conventional tree would - it finds
each decision split by minimizing the fraction of misclassified
events, the Gini index or the cross-entropy.
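The symmetric criteria mentioned above can be written out explicitly
(illustrative formulas; the package's criterion classes may use a
different normalization):

```cpp
#include <cmath>

// Gini index of a node with background weight w0 and signal weight w1:
// 2*p*(1-p) with purity p = w1/(w0+w1); smaller means purer.
double giniIndex(double w0, double w1)
{
  double p = w1/(w0+w1);
  return 2.0*p*(1.0-p);
}

// Cross-entropy of the same node: -p*log(p) - (1-p)*log(1-p).
double crossEntropy(double w0, double w1)
{
  double p = w1/(w0+w1);
  if( p <= 0.0 || p >= 1.0 ) return 0.0;  // pure node
  return -p*std::log(p) - (1.0-p)*std::log(1.0-p);
}
```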

The stopping rule for the decision tree implemented here is set by the
minimal number of events per tree node. This parameter has to be
supplied by the user at tree construction. Depending on the chosen
optimization criterion, the tree performance can crucially depend on
this parameter. Try, for example

SprDecisionTreeApp -a 1 -c 3 -n 1 -m -v 1 gauss2_uniform_2d_train.pat

The tree will produce an insane number of nodes and in the end display
a message:

"Included 5318 nodes in category 1 with overall FOM=1 with 10000
signal and 0 background weights."

It means that the tree separated 10,000 signal events from background
events and grouped these signal events into 5318 nodes with 1.9 events
per node on average. This may seem impressive, but if you run this tree
on validation data, you will find that the validation FOM is quite
poor because the tree is over-trained.

If you do now

SprDecisionTreeApp -a 1 -c 3 -n 5000 -m -v 1 gauss2_uniform_2d_train.pat

you will get

"Included 1 nodes in category 1 with overall FOM=0.915234 with 4632 signal and 429 background weights."

Since the FOM is now significantly less than 1, the tree is most
likely under-trained. For this exercise, the optimal number of
events per node happens to be around 200, which you can easily
establish by looking at FOM computed on validation data:

SprDecisionTreeApp -a 1 -c 3 -n 200 -m -t gauss2_uniform_2d_valid.pat -v 1 gauss2_uniform_2d_train.pat

Note that without merging ("-m" option on the command line) these
results would be different.

For some optimization criteria, specifying 1 as the minimal number of
events per node can force the tree to create a separate node for each
event and lead to memory exhaustion.


2.4 Top-down tree
-----------------

The topdown tree (SprTopdownTree) implements the same training
algorithm as SprDecisionTree described in 2.3 but stores the
classifier configuration in a different way. While SprDecisionTree
stores the trained classifier as a list of rectangular nodes, the
topdown tree stores a full path from the tree root. This makes the
topdown tree less interpretable but faster for classification of new
events. To classify a new event with SprDecisionTree, it takes O(D*M)
operations, where D is the dimensionality of the problem and M is the
number of terminal nodes. The height of a balanced binary tree with M
terminal nodes is log(M)+1. Hence, the topdown tree reduces the
classification time by O(D*M/log(M)). In reality, this factor can be a
lot less because a) the topdown tree is not necessarily balanced, and
b) each terminal node may be defined by only a small subset of input
variables (which would reduce the classification time for
SprDecisionTree).
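These estimates are easy to check numerically (a rough illustration
under the balanced-tree assumption stated above):

```cpp
#include <cmath>

// Height of a balanced binary tree with M terminal nodes: log2(M)+1.
double balancedHeight(double M)
{
  return std::log2(M) + 1.0;
}

// Rough classification speedup of the topdown tree over the node-list
// scan: O(D*M) operations versus O(log(M)) comparisons.
double roughSpeedup(double D, double M)
{
  return D*M/std::log2(M);
}
```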

The topdown tree also offers optional continuous output instead of the
default discrete output (0/1) for the regular decision tree. The
continuous output is computed as w1/(w0+w1), where w1 and w0 are
signal and background weights in a terminal node.
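In code, the continuous output of a terminal node is just its signal
purity (a trivial sketch of the formula above):

```cpp
// Continuous output of a terminal node: signal purity w1/(w0+w1),
// where w1 and w0 are the signal and background weights in the node.
double leafPurity(double w0, double w1)
{
  return w1/(w0+w1);
}
```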

Topdown trees should be the preferred choice if the user is interested
mostly in the classifier performance and not so much in the
interpretability of results. Topdown trees are currently the default
choice for the two main executables, SprAdaBoostDecisionTreeApp and
SprBaggerDecisionTreeApp. If you wish to use slower regular decision
trees, you can use "-j" option of the two executables. If you wish
discrete tree output for the bagging algorithm, you can use "-k"
option. The Discrete AdaBoost and epsilon-Boost algorithms generally
require that each individual boosted decision tree produce discrete
output; minimization of the exponential loss has been derived under
the assumption of discreteness of the base classifier. For Real
AdaBoost it is necessary that each tree produce continuous
output. The tree options are switched automatically by the executable.

The single decision tree executable, SprDecisionTreeApp, continues
using regular decision trees by default. In this case, the primary
concern is the ease of interpretation, not speed.

Saved configuration files with topdown trees for the Bagger do not
retain information about whether decision trees were chosen to be
discrete or continuous. The default choice for SprTrainedBagger is
"continuous". If you wish to change this, you will need to do this
manually after you read the saved configuration. For example:

  SprTrainedBagger* bagger 
    = SprClassifierReader::readTrained<SprTrainedBagger>("saved_bagger.txt",
                                                         "Bagger");
  assert( bagger != 0 );
  bagger->setDiscrete();


2.5 Bump hunting
----------------

The idea behind bump hunting is very similar to that of a decision
tree - we search for rectangular boxes by optimizing a chosen figure
of merit. To find such a box, we look at all possible binary splits in
all dimensions. However, instead of recursively splitting input space
into multiple boxes of shrinking size (nodes), we attempt to find one
box with an overall optimal FOM. This is done in two steps:

a) Shrinking. At this stage, we attempt to shrink the signal box in
each dimension to optimize FOM. This shrinking crucially depends on
the "peel" parameter, a fraction of events that can be peeled off the
signal box in one iteration. The user should be very careful about
choosing this "peel" parameter. A low "peel" value makes the
algorithm slower and more conservative, while a high "peel" value
makes it faster but less robust. Note that the method
may fail to find statistically significant bumps both due to excessive
conservatism (low peel) and due to the lack of it (high peel).
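One shrinking step in one dimension can be sketched as follows (an
invented helper for illustration, not the SprBumpHunter code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Raise the lower edge of the box so that a fraction "peel" of the
// points inside it is peeled off. A small peel gives small,
// conservative steps; a large peel shrinks the box aggressively.
double peelLowerEdge(std::vector<double> x, double peel)
{
  std::sort(x.begin(), x.end());
  std::size_t nPeel = static_cast<std::size_t>(peel*x.size());
  return x[nPeel];  // new lower edge of the signal box
}
```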

b) Expansion. After the signal box has shrunk to its minimal size, we
attempt to relax the bounds to increase the FOM.

After the shrinking and expansion stages have been completed, we
remove events in the found box from the data set and repeat the
algorithm on the obtained subset to find a next box.

The outcome of this algorithm is a set of rectangular regions (bumps)
with the associated FOMs. The user can request any number of bumps,
and then the algorithm finds them consecutively. In the case of
multiple bumps, the output of the bump hunter can be a bit confusing
because the bumps can overlap and even be enclosed one inside
another. Keep in mind that each found bump is removed from the data
before training is resumed to find a next bump; that is, in reality
the bumps don't overlap because each bump is merely a piece of empty
space from the point of view of bumps with higher indices.

This method is intended for exploratory analysis, e.g., a blind search
for new physics phenomena and any analysis where the user wants to
find a rectangular signal box.

The ready-to-go executable is SprBumpHunterApp. 

Both SprBumpHunterApp and SprDecisionTreeApp store individual box
numbers into ntuples.


2.6 Fisher and logistic regression
----------------------------------

The executable is SprFisherLogitApp. It can store both training and
validation data into ntuples simultaneously. For example, you can do

SprFisherLogitApp -A -a 2 -m 3 -l -o training.out -t lambda-test.pat -p validation.out lambda-train.pat

(trains both linear and quadratic Fisher's (-m 3), as well as logistic
regression (-l), on lambda-train.pat and stores the training
tuple into training.out; validates Fisher on
lambda-test.pat and stores ntuple into validation.out)

Command-line options -e, -u and -i are valid only for logistic
regression. By default the starting point for logistic regression is
set to coefficients obtained by LDA (aka Fisher). If LDA fails, the
initial estimates for the regression coefficients are automatically
reset to zero. You can avoid using LDA for the initial estimate by
selecting "-i" option of the executable. If logistic regression fails
to converge, try reducing the update factor by choosing "-u" less than
1 (for example, 0.5 is a good starting choice). This reduces the rate
of convergence and makes the algorithm more stable.

For consistency with AdaBoost and in the general spirit of preferring
classifier output normalized to the [0,1] interval, L(Q)DA and LogitR
by default return output normalized to [0,1]: first the conventional
output ranging from -infty to +infty is computed and then it is mapped
onto [0,1] by applying the 1/(1+exp(-X)) transformation. To see the
conventional output, use the "-s" option of the executable or the
useStandard() method.
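
In other words, the default output is just the logistic map applied to
the conventional discriminant. A one-line sketch (the helper name
normalizeToUnit is made up for illustration):

```cpp
#include <cmath>

// Map the conventional output X in (-inf,+inf) onto [0,1], as L(Q)DA and
// logistic regression do by default; "-s" / useStandard() skips this step.
inline double normalizeToUnit(double x) { return 1. / (1. + std::exp(-x)); }
```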


2.7 Neural networks
-------------------

The package implements a trainable backprop neural net and interfaces
to two SNNS classifiers, the standard backprop net and the radial
basis function net. The user does not have access to the training
mechanism supplied by SNNS; however, if you train these classifiers
with the SNNS package, you can save the trained net into a *.net file. Then
you can run either SprStdBackpropApp or SprRBFNetApp, depending on the
type of the net you chose, to compute NN output for user-supplied data
and store the data and the NN output into an ntuple.

SprTrainedRBF implements the radial basis function net. It can be used
only for reading a saved SNNS configuration and applying it to
data. Spr(Trained)StdBackprop provides an original backprop neural net
implementation and also has the ability to read saved SNNS
configurations.

SprStdBackpropApp implements both single backprop neural net and
boosted backprop neural nets. You can access the single neural net by
using "-n 1" option of the executable (one AdaBoost training
cycle). If you use n>1, you will train n boosted neural nets. If you
use n=0, the executable expects that you provide an input file with a
stored AdaBoost configuration.

The most important parameters of a neural net are its structure (-N),
learning rate (-L), and learning rate for initialization (-I). See
SprStdBackprop.hh for details on structure specification.

At initialization the neural net is trained on randomly selected
events using randomly generated learning rates. This is done to break
the initial neural net symmetry.

Training events are by default randomly permuted before training.

Here are a few examples of how the executable can be used:

SprStdBackpropApp -a 1 -n 1 -N '2:4:2:1' -l 100 -L 0.1 -d 5 -t gauss2_uniform_2d_valid.pat -f net.spr gauss2_uniform_2d_train.pat

Trains a single NN with 100 cycles from scratch. The NN configuration
is: 2 input nodes (must be always equal to the dimensionality of the
data!!!), 4 hidden nodes in the 1st hidden layer, 2 hidden nodes in
the 2nd hidden layer, and 1 output node (must be always 1 for
now!!!). The learning rate is 0.1. Display quadratic loss on
validation data every 5 cycles. Save NN configuration into net.spr.
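
For illustration, a structure string such as '2:4:2:1' can be split
into layer sizes as below; parseStructure is a hypothetical helper,
and SprStdBackprop.hh remains the authoritative reference for the
format.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a colon-separated structure string into layer sizes:
// first entry = input nodes (must match the data dimensionality),
// last entry = output nodes (1 for now), the rest = hidden layers.
inline std::vector<int> parseStructure(const std::string& s) {
  std::vector<int> layers;
  std::istringstream in(s);
  std::string tok;
  while (std::getline(in, tok, ':')) layers.push_back(std::stoi(tok));
  return layers;
}
```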

Suppose you trained SNNS on these data earlier and now want to
convert SNNS configuration into SPR configuration. Do:

SprStdBackpropApp -a 1 -n 1 -S net.snns -l 0 -f net.spr gauss2_uniform_2d_train.pat

This simply converts SNNS format net.snns into SPR format net.spr.

You can resume training from SNNS format directly:

SprStdBackpropApp -a 1 -n 1 -S net.snns -l 100 -L 0.01 -f net200.spr -d 5 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

or from the converted SPR net:

SprStdBackpropApp -a 1 -n 1 -R net.spr -l 100 -L 0.01 -f net200.spr -d 5 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

You can use the obtained NN configuration as a starting point for
boosting:

SprStdBackpropApp -a 1 -n 10 -M 2 -E 0.1 -g 2 -R net200.spr -l 50 -L 0.01 -f boosted_net.spr -d 1 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

This takes the saved network net200.spr and uses it as a first
classifier in the boosting sequence. 10 more neural nets are trained
using the same network architecture and the 11 neural nets are saved
to boosted_net.spr. We now monitor exponential loss (-g 2) because it
is more appropriate for boosting.

Read the saved AdaBoost from a configuration file and run it on
validation data (results are saved into save.out):

SprStdBackpropApp -a 1 -n 0 -r boosted_net.spr -o save.out gauss2_uniform_2d_valid.pat

Do the same thing for a single NN you obtained earlier:

SprStdBackpropApp -a 1 -n 1 -l 0 -R net200.spr -o save.out gauss2_uniform_2d_valid.pat

If you would like to resume training from a saved AdaBoost
configuration, the executable forces you to choose a network structure:

SprStdBackpropApp -a 1 -n 10 -M 2 -g 2 -E 0.1 -r boosted_net.spr -f boosted_net_2.spr -L 0.01 -l 50 -N '2:6:3:1' -d 1 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

Note that the new structure 2:6:3:1 is different from 2:4:2:1 used
earlier. 10 more neural nets will be trained using 2:6:3:1 and
combined with the 11 nets trained earlier using 2:4:2:1. Also note
that -M option is redundant - the AdaBoost mode is read from
boosted_net.spr and set to Real, no matter what you specify on the
command line.


2.8 Combining classifiers
-------------------------

After several classifiers have been trained independently, it is
possible to combine them to obtain a better overall
performance. Spr(Trained)Combiner provides an interface for this
combining. The idea is to train a new powerful classifier on the
outputs of the combined classifiers. In the statistics literature,
this is sometimes referred to as "mixture of experts".

Starting with tag V06-00-00, the reworked combiner can combine an
arbitrary number of classifiers trained on arbitrary subsets of input
variables. For example, your input data is 6D with input variables
named x0, x1, ..., x5. You know that variables x0 and x2 need to be
grouped together (for example, they come from one detector
sub-system), variables x1, x3, and x5 form another group (another
detector sub-system), and variable x4 should be ignored. The input data
file train.pat has all variables filled out for each event with
missing values modeled as values outside the allowed range.  For
instance, the meaningful range for variable x3 extends from -1 to 1
and from 5 to 10; for events where you cannot measure variable x3
(missing value), you model it as a number somewhere outside the range,
e.g., -13.

The recommended sequence of actions looks like this:

1) Train your favorite classifiers on the subsets:

SprBaggerDecisionTreeApp -a 2 -n 10 -l 10 -V 'x0,x2' -f save_1.spr lambda-train.pat

For the 2nd classifier, you have to modify the executable to exclude
unreasonable values of x3. At the top of the executable code, add your
functions (see the example in src/exampleUserCuts.cc):

// accept all classes for input data
vector<int> myClasses() {
  return vector<int>();
}

// Define a list of variables used for user selection requirements.
vector<string> myVars() {
  vector<string> vars(1);
  vars[0] = "x3";
  return vars;
}

// Define cuts for specified variables using the specified order of variables.
// Return true if the point passes the cuts, false otherwise.
bool myCuts(const vector<double>& v) {
  if( (v[0]>-1. && v[0]<1.) || (v[0]>5. && v[0]<10.) ) return true;
  return false;
}

Following the example in src/exampleUserCuts.cc, put the following in
main():

  // prepare pre-filter
  SprPreFilter pre;
  if( !pre.setSelection(myVars,myCuts,myClasses) ) {
    cerr << "Unable to set pre-filter." << endl;
    return 1;
  }

and modify the reader constructor to apply the constraint:

  auto_ptr<SprAbsReader> 
	reader(SprRWFactory::makeReader(inputType,readMode,&pre));

Rebuild the executable by running 'make install' and run it:

SprBaggerDecisionTreeApp -a 2 -n 10 -l 10 -V 'x1,x3,x5' -f save_2.spr lambda-train.pat

Check the messages returned by the executable to make sure that you
see as many input events after the pre-selection cuts are applied as
you expect.

If you want to avoid the hassle of editing C++ code, you can simply
prepare two separate input files, one with variables x0 and x2 for the
1st sub-classifier and another one with variables x1, x3 and x5 and
with proper cuts on x3 imposed for the 2nd classifier. The two
training sets do not need to have anything in common - they can use
different input variables in a different order. The two essential
requirements are: a) an input variable must have the same name in all
input files for proper variable matching, and b) all variables used to
train sub-classifiers must be present in the training and test
datasets for the combiner.

2) Define input configuration files for the combiner. You need to
define one config file for each trained classifier plus a config file
for the global classifier to be trained on outputs of these trained
classifiers:

unix> cat classifier_1.config
Path:        save_1.spr
Name:        C1
Default:     0.5
Constraints: 0

Path needs to specify the path to the saved classifier config file
obtained in step 1. Name is an arbitrary name for the
classifier. Default is the default response value returned by the
classifier if input variables do not satisfy input constraints. The
output of the bagger ranges from 0 to 1. Therefore, if the failure to
satisfy the input constraints does not qualify this event as
background or signal, choose 0.5 as the default value. In this case,
there are no constraints, as you can see on the constraints line, and
so the chosen default value is irrelevant.

For the 2nd classifier, you need to specify the constraint on variable
x3:

unix> cat classifier_2.config
Path:        save_2.spr
Name:        C2
Default:     0
Constraints: 1
x3  2  -1. 1.   5. 10.

First, you specify the number of constraints (1 in this case because
you impose constraints only on variable x3). On the next lines you
need to specify the constraints. Put the name of the input variable
first, then the number of allowed intervals and all allowed intervals
on the same line. If you need to impose constraints on another input
variable, change 1 to 2 and insert another constraint line at the bottom.

In this case, if the value of x3 is outside of the allowed range, this
is considered as a symptom of the event being background. The default
response value is set to 0.
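
The interval check behind the Constraints lines can be sketched as
follows; passesConstraint is a hypothetical helper, not
Spr(Trained)Combiner code. When the check fails, the combiner
substitutes the classifier's Default value.

```cpp
#include <utility>
#include <vector>

// One constraint = a list of allowed [lo,hi] intervals for a variable,
// mirroring the config line "x3  2  -1. 1.   5. 10." (2 intervals for x3).
inline bool passesConstraint(double value,
                             const std::vector<std::pair<double,double> >& intervals) {
  for (const auto& iv : intervals)
    if (value >= iv.first && value <= iv.second) return true;
  return false;
}
```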

Make a global config file for the combiner:

unix> cat global.config
StdBackprop 2:4:2:1 50 0.1 0 0.1

This uses the same syntax as SprBoosterApp and SprBaggerApp; see
booster.config for details. In this case, you would like to train a
neural net with 2 input units because you have trained 2 classifiers
on data subsets.

3) Train the combiner:

SprCombinerApp -a 2 -f global.spr 'classifier_1.config,classifier_2.config' global.config lambda-train.pat

4) Study the performance of the combiner on test data:

SprOutputWriterApp -a 2 global.spr lambda-test.pat save.out

Note that you do not need to specify explicitly input variables in
steps 3 and 4. SprCombinerApp and SprOutputWriterApp know that
variable x4 has been excluded because neither of the trained
sub-classifiers uses it.

Also note that in the example above the same training data,
lambda-train.pat, has been used for training classifiers on subsets
and for the combiner. If
the classifiers on subsets severely overtrain (which in fact may be
the case for the bagger), training the combiner on the overtrained
responses may be far from optimal. In general, for this exercise you
would need to prepare 2 training sets, one to train on subsets and
another one to train the combiner.


2.9 Recommendations: What classifiers to use and when
-----------------------------------------------------

These recommendations should be viewed as highly approximate. It is certainly not my
intention to impose my views of what is a "better" or "worse"
classifier on other physicists. Moreover, these recommendations are
correct on average but not necessarily for the specific problem at
hand.

Every classifier has a set of features: predictive power, speed and
robustness of the training procedure, interpretability of results,
performance in high-dimensional problems etc. There is no such thing
as the "best" classifier; every tool is suited for its purpose.

If you are mostly concerned with speed, I suggest
boosted binary splits. This classifier trains very quickly and
produces a decent, although hardly optimal, separation of signal from
background.

If, due to whatever reason, you want to find a rectangular signal
region, you should use the bump hunter. This method should be also
good in situations when you perform a blind search for an excess of
data above the simulated process.

If you are mostly concerned with the quality of signal/background
separation, you should use a neural net, boosted decision trees
or random forest (Bagger).

If your input data live in a low-dimensional space (dimensionality is
less than 10), if all input variables are continuous and there are no
strong correlations among them, and if the structure of the input data
is more or less simple (i.e., there are no multiple peaks), a standard
feedforward backpropagation neural net should do a great job for
you. This is not to imply that the neural net should perform better
than the boosted decision trees or random forest. I would make no
predictions for what classifier of the mentioned three would perform
better in a specific low-dimensional problem.

The neural net suffers from a variety of problems in high-dimensional
settings. When you add more dimensions, you generally need to expand
the hidden layer as well to capture the increasing complexity of the
data. If the size of the hidden layer is proportional to that of the
input layer, the training and response time of a neural net increase
quadratically. With more than one hidden layer, this dependence is
even worse. Also, the neural net can get confused if it is presented
with strongly correlated variables or variables of the mixed type
(continuous and discrete).

This is why in high-dimensional settings (with dimensionality 20 or
greater), boosted decision trees and random forest are faster and more
reliable classification tools. The training and response time of these
classifiers generally scales linearly with dimensionality.

If you are mostly interested in obtaining a good _average_ separation
between signal and background, e.g., you would like to get two
distinct one-dimensional likelihood shapes and use them in a fit to
data, or if you would like to access the whole "signal purity vs
signal magnitude" curve, I suggest that you use either
boosted decision trees or random forest with symmetric optimization 
criteria 5 or 6, that is, Gini index and cross-entropy.
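
For reference, the two symmetric criteria can be written as below (one
common convention; SPR's internal normalization of criteria 5 and 6
may differ). Here p is the signal purity of a tree node; both measures
are symmetric under p <-> 1-p, which is what makes them favor a good
average separation rather than one specific working point.

```cpp
#include <cmath>

// Node impurity as a function of signal purity p (common textbook forms).
inline double giniIndex(double p)    { return 2. * p * (1. - p); }
inline double crossEntropy(double p) {
  if (p <= 0. || p >= 1.) return 0.;        // pure node: no impurity
  return -p * std::log(p) - (1. - p) * std::log(1. - p);
}
```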

If you are only interested in one specific criterion such as the
maximal signal significance or the minimal 90% upper limit, I suggest
that you try SprBaggerDecisionTreeApp and optimize this criterion
directly by specifying an appropriate "-c" option.

Do not overdo it!!! For many problems, you can get fairly good results
with quick and robust classifiers such as boosted binary splits or a
single decision tree. For example, running boosted trees or a random forest to
separate two bivariate Gaussians from uniform background
(gauss2_uniform_2d_train.pat) does not make much sense (even though
this dataset is used for illustration here).

SprInteractiveAnalysisApp has been provided to help people quickly
compare performance of various classifiers on relatively small
datasets in not too many dimensions. Syntax should be
self-explanatory.


2.10 How to choose parameters for boosted and bagged decision trees
------------------------------------------------------------------

The most important input parameter for boosted decision trees is the
minimal number of events per terminal node of a decision tree.
SprAdaBoostDecisionTreeApp gets this parameter from the "-l" option on
the command line; SprBoosterApp reads it from the configuration
file. The easiest way to choose an optimal value for
this parameter is to monitor the classifier performance on validation
data.

AdaBoost is designed to minimize the exponential loss. To choose the
optimal leaf size, monitor the exponential loss for the validation
data:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 20 -l 100 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat

Validation Loss=0.52714 at cycle 1
Validation Loss=0.46702 at cycle 2
Validation Loss=0.44809 at cycle 3
Validation Loss=0.449999 at cycle 4
Validation Loss=0.455101 at cycle 5
Validation Loss=0.462101 at cycle 6
Validation Loss=0.474901 at cycle 7
Validation Loss=0.488699 at cycle 8
Validation Loss=0.502885 at cycle 9
Validation Loss=0.50625 at cycle 10
Validation Loss=0.509496 at cycle 11
Validation Loss=0.535736 at cycle 12
Validation Loss=0.549812 at cycle 13
Validation Loss=0.565217 at cycle 14
Validation Loss=0.595657 at cycle 15
Validation Loss=0.611038 at cycle 16
Validation Loss=0.636845 at cycle 17
Validation Loss=0.659387 at cycle 18
Validation Loss=0.683962 at cycle 19
Validation Loss=0.700659 at cycle 20

The exponential loss increases. Probably L=100 is not a good
choice. Let us try a larger L:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 20 -l 8000 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat

Validation Loss=0.815399 at cycle 1
Validation Loss=0.803473 at cycle 2
Validation Loss=0.771938 at cycle 3
Validation Loss=0.756562 at cycle 4
Validation Loss=0.747749 at cycle 5
Validation Loss=0.741897 at cycle 6
Validation Loss=0.737569 at cycle 7
Validation Loss=0.73411 at cycle 8
Validation Loss=0.730116 at cycle 9
Validation Loss=0.727139 at cycle 10
Validation Loss=0.724745 at cycle 11
Validation Loss=0.723074 at cycle 12
Validation Loss=0.721264 at cycle 13
Validation Loss=0.720109 at cycle 14
Validation Loss=0.718357 at cycle 15
Validation Loss=0.716691 at cycle 16
Validation Loss=0.715792 at cycle 17
Validation Loss=0.714532 at cycle 18
Validation Loss=0.713479 at cycle 19
Validation Loss=0.712872 at cycle 20

Exponential loss decreases but we can do better. One more try:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 20 -l 1500 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat

Validation Loss=0.650332 at cycle 1
Validation Loss=0.516325 at cycle 2
Validation Loss=0.482314 at cycle 3
Validation Loss=0.460987 at cycle 4
Validation Loss=0.454166 at cycle 5
Validation Loss=0.449057 at cycle 6
Validation Loss=0.435003 at cycle 7
Validation Loss=0.426111 at cycle 8
Validation Loss=0.422804 at cycle 9
Validation Loss=0.416456 at cycle 10
Validation Loss=0.40981 at cycle 11
Validation Loss=0.407647 at cycle 12
Validation Loss=0.406386 at cycle 13
Validation Loss=0.406077 at cycle 14
Validation Loss=0.405301 at cycle 15
Validation Loss=0.40665 at cycle 16
Validation Loss=0.40524 at cycle 17
Validation Loss=0.403948 at cycle 18
Validation Loss=0.403166 at cycle 19
Validation Loss=0.399208 at cycle 20

The loss values are better. By trying various values of the "-l"
parameter, one can find that L=1500 is about optimal.

For this problem, two bivariate Gaussians on top of uniform
background, separating signal from background is easy. You certainly
don't need many boosted trees for that - in fact, you can separate
signal from background fairly well with a single decision tree! This
example is merely for illustration.
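
For reference, the exponential loss monitored in the runs above is the
average of exp(-y*f) over validation events; the sketch below assumes
labels y = +/-1 and a classifier output f on the same scale, and is
not SPR's actual bookkeeping.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Average exponential loss, the quantity AdaBoost is designed to minimize.
inline double exponentialLoss(const std::vector<int>& y,
                              const std::vector<double>& f) {
  double sum = 0.;
  for (std::size_t i = 0; i < y.size(); ++i) sum += std::exp(-y[i] * f[i]);
  return sum / double(y.size());
}
```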

Let us now find the optimal leaf size for the bagger. Suppose we are
interested in the quadratic loss (-g 1 option of the executable).
Naively, one could assume that the optimal value should be the same as
for the booster:

unix> SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -l 1500 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat

Validation FOM=0.101408 at cycle 1
Validation FOM=0.0958078 at cycle 2
Validation FOM=0.0941453 at cycle 3
Validation FOM=0.0911666 at cycle 4
Validation FOM=0.0873188 at cycle 5
Validation FOM=0.0847348 at cycle 6
Validation FOM=0.0843446 at cycle 7
Validation FOM=0.0890755 at cycle 8
Validation FOM=0.0895145 at cycle 9
Validation FOM=0.0910927 at cycle 10
Validation FOM=0.089789 at cycle 11
Validation FOM=0.0879379 at cycle 12
Validation FOM=0.0879443 at cycle 13
Validation FOM=0.0864982 at cycle 14
Validation FOM=0.085486 at cycle 15
Validation FOM=0.0849834 at cycle 16
Validation FOM=0.0847154 at cycle 17
Validation FOM=0.0843778 at cycle 18
Validation FOM=0.0840525 at cycle 19
Validation FOM=0.0838303 at cycle 20

Let us try another choice for the leaf size: 50.

unix> SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -l 50 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat

Validation FOM=0.084895 at cycle 1
Validation FOM=0.0770321 at cycle 2
Validation FOM=0.0680597 at cycle 3
Validation FOM=0.0643376 at cycle 4
Validation FOM=0.0611588 at cycle 5
Validation FOM=0.0618565 at cycle 6
Validation FOM=0.0613833 at cycle 7
Validation FOM=0.0652366 at cycle 8
Validation FOM=0.0638901 at cycle 9
Validation FOM=0.0647727 at cycle 10
Validation FOM=0.0630297 at cycle 11
Validation FOM=0.0622102 at cycle 12
Validation FOM=0.0626738 at cycle 13
Validation FOM=0.0616809 at cycle 14
Validation FOM=0.0608163 at cycle 15
Validation FOM=0.0616452 at cycle 16
Validation FOM=0.0625021 at cycle 17
Validation FOM=0.0616811 at cycle 18
Validation FOM=0.0610225 at cycle 19
Validation FOM=0.0603946 at cycle 20

Quadratic loss is smaller than in the previous trial. Most certainly,
50 is a better option than 1500. Let us try even smaller leaves:

unix> SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -l 5 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat

Validation FOM=0.0875863 at cycle 1
Validation FOM=0.080353 at cycle 2
Validation FOM=0.0714745 at cycle 3
Validation FOM=0.0670627 at cycle 4
Validation FOM=0.0638389 at cycle 5
Validation FOM=0.0640173 at cycle 6
Validation FOM=0.0635259 at cycle 7
Validation FOM=0.0666818 at cycle 8
Validation FOM=0.0652588 at cycle 9
Validation FOM=0.0661904 at cycle 10
Validation FOM=0.0645742 at cycle 11
Validation FOM=0.0637672 at cycle 12
Validation FOM=0.0640231 at cycle 13
Validation FOM=0.0630114 at cycle 14
Validation FOM=0.0623262 at cycle 15
Validation FOM=0.0630325 at cycle 16
Validation FOM=0.0637906 at cycle 17
Validation FOM=0.0629989 at cycle 18
Validation FOM=0.0623234 at cycle 19
Validation FOM=0.06177 at cycle 20

These FOM values are a little worse than in the previous example. So
probably 50 is about optimal.

Note the dramatic difference between boosted and bagged decision
trees. While boosted trees typically want to have large terminal
nodes, bagged trees usually benefit from making the terminal node
size fairly small.


2.11 Cross-validation
---------------------

Cross-validation has been implemented for 4 executables:
SprCrossValidatorApp, SprAdaBoostDecisionTreeApp,
SprBaggerDecisionTreeApp, and SprDecisionTreeApp. The purpose of
cross-validation is to choose an optimal set of the classifier
parameters if the input dataset is small. In this case, the supplied
data are split into several equal pieces (as specified by "-x"
option). These pieces are then excluded one by one from the training
sample. The classifier is trained on the training data with one piece
excluded and then its performance is evaluated on the excluded
piece. At the end, the FOM obtained by averaging the FOMs from all
pieces is returned to the user. A typical application of the
cross-validation algorithm would look like this:

SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -x 5 -q '5,10,20,50,100,1000,2000,5000' gauss2_uniform_2d_train.pat

This trains 20 bagged decision trees for each specified minimal node
size. That is, the executable trains 8 baggers, each using 20 bagged
decision trees. In the end you get (quadratic loss displayed):

Cross-validated FOMs:
Node size=       5      FOM= 0.0591947
Node size=      10      FOM= 0.0604726
Node size=      20      FOM= 0.0595813
Node size=      50      FOM= 0.0586453
Node size=     100      FOM= 0.0621193
Node size=    1000      FOM= 0.0830538
Node size=    2000      FOM=  0.104705
Node size=    5000      FOM=  0.130985

This cross-validation exercise tells you that L=50 is close to the
optimal value.
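
The scheme above can be sketched as a generic k-fold loop; the
crossValidate template and its train/eval callables are stand-ins for
the SPR machinery, not the SprCrossValidator code.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Split events into k pieces, hold out each piece in turn, train on the
// rest, evaluate the FOM on the held-out piece, and return the average FOM.
template <typename Train, typename Eval>
double crossValidate(const std::vector<int>& events, int k,
                     Train train, Eval eval) {
  std::vector<double> foms;
  for (int fold = 0; fold < k; ++fold) {
    std::vector<int> trainSet, testSet;
    for (std::size_t i = 0; i < events.size(); ++i)
      (int(i % k) == fold ? testSet : trainSet).push_back(events[i]);
    train(trainSet);                 // retrain from scratch on k-1 pieces
    foms.push_back(eval(testSet));   // evaluate on the held-out piece
  }
  return std::accumulate(foms.begin(), foms.end(), 0.) / k;
}
```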

The "-g" option of the two main executables,
SprAdaBoostDecisionTreeApp and SprBaggerDecisionTreeApp, lets you
choose an average per-event loss as the cross-validation
figure-of-merit. If no "-g" option is specified, the FOM specified by
the -C option is monitored on validation or cross-validation data. 

2.12 Gene Expression Programming
--------------------------------

Unfortunately, the authors do not have time to write
documentation. SprGEPApp gives handles to all model parameters and
supplies reasonable default values for those. Try it out and send
feedback to Julian or Ilya (see AUTHORS).


2a Multi class learning
-----------------------

The executable for multi-class learning implemented in the package is
SprMultiClassApp. It lets the user choose from implemented multiclass
algorithms using '-C' option. The user can adjust class weights on the
fly by applying '-w' option followed by a list of weights. The list of
weights must have as many entries as there are classes in the input
data; these entries must be separated by commas, for example: -w
'1,5,2' sets the total weight of class 1 to 1, the total weight of
class 2 to 5, and the total weight of class 3 to 2. The total weight
settings apply to both training and validation (if supplied) data.

2a.1 Algorithm by Allwein, Schapire and Singer
----------------------------------------------

Spr(Trained)MultiClassLearner implements this multi-class learning
algorithm by training a series of binary classifiers for two-class
problems. The reduction of the multi-class problem to a series of
two-class problems is defined by an indicator matrix of the size K*L,
where K is the number of classes and L is the number of binary
classifiers to be built. Matrix elements can take only 3 values: -1, 0
and 1. For each column of the matrix, 1 means that this class is
treated as signal, -1 means that this class is treated as background,
and 0 means that this class does not participate in this binary
problem. Two popular training strategies have been implemented:
one-vs-all and one-vs-one. For the one-vs-all strategy, the indicator
matrix is of size K*K with diagonal elements equal to 1 and
off-diagonal elements equal to -1. For the one-vs-one strategy, the
indicator matrix is of size K*L, where L=K*(K-1)/2; in each column
there is one 1 and one -1, while all other elements are zeros. The
user can specify any other indicator matrix. For example, if one has a
problem with background (class 0) and two similar signals (classes 1
and 2), one could train one classifier to separate the joint signal
from the background and then the two signals from each other. In this
case the indicator matrix would look like this: 

Row/Column |  0  1
------------------
 0         | -1  0
 1         |  1  1
 2         |  1 -1
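
The two standard templates can be generated as follows; oneVsAll and
oneVsOne are illustrative helpers, not the SprMultiClassLearner API.

```cpp
#include <vector>

// Rows = K classes, columns = binary classifiers; entries are -1, 0 or 1.
typedef std::vector<std::vector<int> > IndicatorMatrix;

// One-vs-all: K columns, +1 on the diagonal, -1 elsewhere.
inline IndicatorMatrix oneVsAll(int K) {
  IndicatorMatrix m(K, std::vector<int>(K, -1));
  for (int k = 0; k < K; ++k) m[k][k] = 1;
  return m;
}

// One-vs-one: L = K*(K-1)/2 columns, each with a single +1 and a single -1.
inline IndicatorMatrix oneVsOne(int K) {
  const int L = K * (K - 1) / 2;
  IndicatorMatrix m(K, std::vector<int>(L, 0));
  int col = 0;
  for (int i = 0; i < K; ++i)
    for (int j = i + 1; j < K; ++j, ++col) {
      m[i][col] = 1;    // class i is signal in this column
      m[j][col] = -1;   // class j is background
    }
  return m;
}
```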

To provide an arbitrary indicator matrix, the user must construct
SprMultiClassLearner in the User mode. The two other modes, OneVsAll
and OneVsOne, will force regeneration of the indicator matrix
according to the described templates. Running SprMultiClassApp, the
user needs to choose "-e 3" for the user-defined mode and use "-i"
option to specify an indicator matrix in an input file. See
indicator.config for an example.

After training is completed, new data are classified using the
user-specified per-event loss for each row of the indicator
matrix. Each of the L classifiers is used to compute response f_l for
a given point; l=1,...,L. Then average loss is evaluated for each row
of the indicator matrix. For example, for quadratic loss we have
Loss_k = \sum_{l=1}^L (Ikl-f_l)^2, where Ikl is an element of the
indicator matrix (k-th row and l-th column). The point is classified
as category k if Loss_k is the least of loss values computed for all
rows. SprDataFeeder allows the user to save loss values computed for
each row of the indicator matrix as well as the overall integer
classification label for each data point.
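
For the quadratic loss this classification rule reduces to picking the
row with the smallest Loss_k; a minimal sketch (classifyByLoss is a
made-up name, not an SPR method):

```cpp
#include <cstddef>
#include <vector>

// Loss_k = sum_l (I_kl - f_l)^2 over the L binary responses f_l;
// the predicted class is the row k with the smallest loss.
inline std::size_t classifyByLoss(const std::vector<std::vector<int> >& indicator,
                                  const std::vector<double>& f) {
  std::size_t best = 0;
  double bestLoss = -1.;
  for (std::size_t k = 0; k < indicator.size(); ++k) {
    double loss = 0.;
    for (std::size_t l = 0; l < f.size(); ++l) {
      const double d = indicator[k][l] - f[l];
      loss += d * d;
    }
    if (bestLoss < 0. || loss < bestLoss) { bestLoss = loss; best = k; }
  }
  return best;
}
```

With the 3x2 matrix from the example above and responses (0.9, -0.8),
row 2 gives the smallest loss, so the point is assigned to class 2.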

SprMultiClassApp lets you use any classifier for binary class
training. The restriction at the moment is that this has to be the
same classifier for each pair of classes, that is, you cannot use, for
example, a neural net to separate class 1 from class 2 and then a
decision tree to separate class 1 from class 3. You can either use a
neural net in both cases or a decision tree in both cases. The
classifier is specified in a config file, similar to booster.config
except that it takes only one topmost entry. Due to certain
implementation specifics, this executable does not allow resuming
training from a saved configuration file. Therefore, the "-r"
command-line option automatically assumes zero training cycles and can
be used only to read the saved configuration file and compute the
classifier response for new data. The "-y" option is mandatory - the
user must specify all classes
to be included in the multi-class learning algorithm.

As an example, try:

> SprMultiClassApp -a 1 -e 1 -y '0,1,2,3,4' -c multi.config -f save.spr -t gauss4_uniform_2d_valid.pat -v 1 gauss4_uniform_2d_train.pat

where multi.config consists of 2 lines:

AdaBoost 20 1 0 0.01
TopdownTree 5 0 1000 0

This will train 20 boosted decision trees for each column of the
indicator matrix. AdaBoost and Bagger require 2 input lines in the
config file instead of one - the 2nd line must specify the weak
classifier to be used with AdaBoost.

Also, try

> SprMultiClassApp -a 1 -e 2 -y '0,1,2,3,4' -c multi.config -f save.spr -t gauss4_uniform_2d_valid.pat -v 1 gauss4_uniform_2d_train.pat

and then

> SprMultiClassApp -a 1 -y '0,1,2,3,4' -r save.spr -o save.out -v 1 gauss4_uniform_2d_valid.pat

to see distributions for validation data.

Two options, -j and -W, are included to provide more flexibility for
multiclass separation. The -j option tells the trained multiclass
learner to ignore zero elements of the indicator matrix when computing
the loss contribution associated with each classifier. If this option
is chosen, loss values for each row of the indicator matrix are
computed by summing weighted classifier responses:

Loss_k = \sum_{l=1}^L I(Ikl<>0)*Wl*(Ikl-f_l)^2 / \sum_{l=1}^L Wl

where I(Ikl<>0) gives 1 for every element of the indicator matrix Ikl
not equal to 0 and gives 0 otherwise, and Wl is the weight of the
classifier for column l. By default, Wl is determined by the total
weight of all classes used for training this classifier, that is, all
classes labeled as -1 or 1 in column l. If the user does not like the
default setting, he can supply his own weights through '-W'
option. There must be as many weights in the list supplied by '-W' as
there are classifiers, or equivalently columns of the indicator
matrix.

The package also allows you to train binary classifiers separately and
combine them into a multiclass learner using SprAddColumnsForMCLApp. The
user can train different binary classifiers for different columns of
the matrix and fine-tune their parameters separately. As an example,
do:

> SprDecisionTreeApp -a 1 -y '0:1' -n 100 -T -F 0vs1.spr -t gauss4_uniform_2d_valid.pat gauss4_uniform_2d_train.pat

> SprDecisionTreeApp -a 1 -y '0:2' -n 100 -T -F 0vs2.spr -t gauss4_uniform_2d_valid.pat gauss4_uniform_2d_train.pat

> SprDecisionTreeApp -a 1 -y '1:2' -n 1000 -T -F 1vs2.spr -t gauss4_uniform_2d_valid.pat gauss4_uniform_2d_train.pat

Now prepare a file add.config with the following lines:

0 1 2
0:1   1   0vs1.spr
0:2   1   0vs2.spr
1:2  10   1vs2.spr

The first line lists all classes. The following lines specify
classifiers. The first entry on each line specifies what classes are
used; the usual syntax with commas and colons applies - you can use,
for example, '0,1:2' to separate classes 0 and 1 from class 2. The
2nd entry is the weight of this classifier. The 3rd entry is the
configuration file to which this classifier was saved. The example
above implements the one-vs-one strategy, but you can use any
strategy of your choice.

Now run

> SprAddColumnsForMCLApp add.config mc_3classes.spr

The resulting mc_3classes.spr file stores the multiclass learner
configuration. You can apply it to test data:

> SprMultiClassApp -a 1 -y '0,1,2' -r mc_3classes.spr gauss4_uniform_2d_valid.pat

to display the loss table.

2a.2 Binary encoder
-------------------

Another multiclass algorithm included in the package is "binary
encoder".  It encodes multiclass data with D features, K classes and N
instances as binary data with D+1 features, 2 classes and N*K
instances. For example, a vector X of class 2 for a problem with 3
classes is encoded as 3 vectors:

X,1    X,2    X,3

where X,2 is labeled as class 1 and the two others are labeled as
class 0. To compensate for the fact that for every binary signal event
this method supplies (K-1) binary background events, the weights of
all binary signal events are multiplied by K.

The disadvantage of this method is the K-fold increase in memory
consumption. The advantage is that the user can monitor the
performance of the multiclass learner on validation data versus the
number of training cycles using the machinery of binary
classifiers. Just specify a validation dataset and the frequency of
print-outs of the validation FOM, in the same way as it is done for
binary classifiers:

> SprMultiClassApp -a 1 -e 1 -y '0,1,2,3,4' -c multi.config -f save.spr -t gauss4_uniform_2d_valid.pat -v 1 -C 2 -d 1 gauss4_uniform_2d_train.pat

'-C 2' invokes the binary encoder and '-d 1' monitors validation loss
at each boosting cycle.

The trained binary encoder computes the class label for a new event by
assuming every class from the multiple class list in turn and choosing
the one for which the underlying binary classifier gives the largest,
most signal-like response value.

Note: due to subtleties of the implementation, this method does not
work with binary splits.

2a.3 Treatment of missing values
--------------------------------

The two multiclass learners implemented in the package are capable of
processing missing values for each input class separately. This
functionality is unique to the multiclass learners; it has not been
implemented for binary classifiers. When running SprMultiClassApp,
the user has two choices for dealing with missing values:

1) Just like with any other method, multiclass or binary, the user can
compute replacements for the missing values by training a 'Missing'
transformation with SprTransformationApp. This transformation
computes replacement values (medians or averages) irrespective of the
class. The '-Q' option of SprMultiClassApp then applies this
class-blind transformation to the input data.

2) The user can apply the '-L' option of SprMultiClassApp without the
'-Q' option. In this case, all values below the cutoff specified by
'-L' are treated as missing. The replacements for these missing values are
computed for each input class separately. The multiclass learner
(whichever is chosen by the user) is trained on data with missing
values replaced by the computed values for the specific class, that
is, if a point is from class 2 and has a missing value in coordinate
x5, it is replaced by the median in variable x5 computed for points in
the training data from class 2. The trained multiclass learner, of
course, does not know the true class of a new point. To compute the
class label for this point, the trained learner assumes missing values
for every input class in turn (assuming the point has some values
missing) and records computed loss for the assumed class with the
corresponding missing values. It then chooses the class which gives
the minimal loss.

In the 2nd approach, missing values in the training data are replaced
by medians of the 1D marginal distributions in the multiclass learner
constructor. These changes cannot be applied to the original supplied
data for the simple reason that these data can be re-used for other
tasks, e.g., cross-validation. The multiclass learner therefore
creates a temporary copy of the input data in memory with missing
values replaced, which doubles memory consumption. For that reason,
the user should be careful when applying this algorithm. If it is
used in conjunction with the binary encoder, the total memory
consumption increases by a factor of (K+1), where K is the number of
input classes: first the data are doubled and then the replica of the
data is multiplied by a factor of K.

2a.4 Indicator matrix generation for Allwein-Schapire-Singer
------------------------------------------------------------

Although one-vs-all and one-vs-one multi-class separation strategies
are popular, they are not necessarily the most powerful. Work rooted in
error-correcting output code (ECOC) algorithms suggests that matrices
with maximized Hamming distance between rows (classes) or maximized
diversity between columns (classifiers) can be more efficient. For a
brief review, we recommend Ludmila Kuncheva "Using diversity measures
for generating error-correcting output codes in classifier ensembles",
Pattern Recognition Letters 26 (2005), pp. 83-90.

SprIndicatorMatrixApp implements several approaches for generation of
such matrices. All input parameters for the executable are set in an
input configuration file; see samplematrixindicator.cfg for an
example. The generated matrix can be plugged into SprMultiClassApp
using the -i option together with -e 3.

The "ovoova" algorithm simply generates one-vs-all and one-vs-one
matrices. This functionality is already provided by
SprMultiClassApp. The "random" algorithm generates a user-specified
number of random matrices of a given size and chooses the one with the
best measure of row and/or column separation. The "random hill"
algorithm implements simulated annealing, aka random hill climbing:
changing the sign of one randomly chosen matrix element defines one
optimization step, and the probability of accepting a change that
leads to a worse measure of separation is given by
probKeepBadChange. The "exhaustive" algorithm generates all possible
class assignments for K classes and produces a K*L matrix, where
L=2^(K-1)-1 is the number of classifiers. Obviously, it is not
recommended for a large number of classes.


3. Data and filters
-------------------

Each data point, SprPoint, is described by a vector of its
coordinates, its class and its index (a non-negative number assigned
by the user to distinguish points). SprData is a collection of
SprPoint's.

In many applications we need to weight data points and impose
cuts. This can be accomplished by imposing a filter. Every filter that
can be used with the package must inherit from SprAbsFilter. Note that
pretty much all classes, e.g., classifiers, that need access to data
do not talk to data directly; instead they get access to a filter
applied to this data. If you don't need any filtering, you can use
SprEmptyFilter.

SprAbsFilter itself is capable of filtering on 
1) event class 
2) event index 
The index here applies not to the user-defined index of each
point but to the flat index in the list of data points ranging from 0
to (length-1) given by the order in which the points were added to
SprData. User-defined indexes of SprPoint's are used simply to give
each point a unique id number.

A filter can also apply weights to events. These weights are not
normalized to unit sum. If you wish to keep weights normalized, you
need either to call normalizeWeights() method after filter
construction and after every filter() operation or supply normalized
weights to the filter constructor.

SprAbsFilter::pass() is a pure virtual method that must be implemented
by specific filters. It determines if a data point satisfies the
requirements imposed by the filter. 

At the moment, four filters are implemented:

1) SprEmptyFilter, which does not impose any cuts and simply
reproduces the base SprAbsFilter functionality;

2) SprBoxFilter, which imposes rectangular cuts on the input space;

3) SprTransformerFilter, which applies variable transformations
defined by SprAbsVarTransformer to input data;

4) SprVarSelectorFilter, which selects a subset of input variables.

To put filtering in effect, the user needs to run the filter() method. The
applied requirements and weights can then be removed with
clear(). Note that clear() in SprAbsFilter restores the filter to its
original state, that is, all requirements are removed and the weights
are reset to the weights with which the filter was constructed. The
exception to this design is SprTransformerFilter. After transform() is
applied, there is no way of going back to the original variables.

To reduce memory consumption, the user can apply the
irreversibleFilter() method. If irreversibleFilter() is used, events
that do not satisfy the imposed requirements are removed permanently
and cannot be restored by clearing the filter. This method cannot be
applied if the SprData owned by the filter owns its SprPoint's.

Each filter implementation should also have a
copy-constructor. However, for SprAbsFilter the copy-constructor does
not do the usual deep copying. Its purpose in this case is to provide
a consecutive filter. SprAbsFilter's copy-constructor copies the
current filter's state, that is, weights and all imposed
requirements. Only events that satisfy these requirements are fed into
the new filter. If you create filter 2 out of filter 1 using a
copy-constructor, apply some extra filtering to filter 2 and then call
clear() on filter 2, filter 2 will restore weights and requirements
used at its construction. That is, you will never get access to the
original weights and data used by filter 1 - unless, of course, you
call clear() on filter 1 before you copy-construct filter 2.

Note that the original SprEmptyFilter produced by SprSimpleReader owns
the input data. If you impose consecutive filters, they will hold
pointers to the data owned by the original filter. Therefore, you are
not allowed to delete the original filter - as long as you need access
to the input data. If you delete the original filter, consecutive
filters will point to a non-existent piece of memory.


4. Imposing simple cuts on input variables
------------------------------------------

Cuts are imposed using the SprCut typedef specified in SprDefs. SprCut
is a vector of pairs; each pair represents lower and upper bounds. The
cut on a variable is satisfied if the value of the variable lies
between the lower and upper bounds of any pair in the vector. A simple
one-sided cut is modeled as a vector with only one pair whose lower or
upper bound is set to the numeric limit on double (see SprUtils).


5. Variable transformations and feature extraction
--------------------------------------------------

There are two ways of applying variable transformations in the
package:

1) While input data is being read from disk to memory. This
functionality is provided by SprPreFilter and discussed in Section 8.

2) After input data has been read into memory. This functionality is
provided by SprTransformerFilter and SprAbsVarTransformer and discussed
here.

SprAbsVarTransformer provides an interface for variable
transformations. Conceptually, variable transformation and feature
extraction are the same thing; this is why all transformers include
a train() method used to find an optimal transformation.

The user can apply an arbitrary sequence of transformations using the
SprTransformationApp executable. The list of desired transformations
is specified as an input string, for example, 'PCA,Normalize'. The
transformations are applied from left to right, that is, in this case
PCA will be followed by Normalize. After training, the overall
'PCA,Normalize' transformation can be saved into a file and then read
back using the -Q command-line option of most executables.

The package implements 3 specific transformations:

- PCA (Principal Component Analysis). It transforms data into a new
coordinate system in which the input variables are uncorrelated.
After the transformation, it labels the new input variables as pc0,
pc1 etc.

- Normalize. This transforms every input variable X to (X-<X>)/Sx,
where <X> is the mean and Sx is the standard deviation of the sample.

- Missing. Replacement of missing values with either median or average
values of the corresponding 1D marginal distributions. For more
detail, see the section on missing values.

If you wish to use transformations interactively from a Root session,
see root/spr_transform.C for examples.

To use a trained variable transformation in your C++ code, use syntax
similar to that of trained classifiers:
	
  const SprAbsVarTransformer* t 
    = SprVarTransformerReader::read("saved_trans.spr");
  assert( t != 0 );
  if( !t->ready() ) {
    cerr << "Transformer not ready." << endl;
    abort();
  }
  vector<string> oldVars, newVars;
  t->oldVars(oldVars); // variables before transformation
  t->newVars(newVars); // variables after transformation
  vector<double> v = ....;
  vector<double> transformed;
  t->transform(v,transformed);
  delete t;// delete when you are done


6. Treatment of input classes
-----------------------------

Classes in the input data must be coded as integers. For the purpose
of classification, however, the user can group classes in any way
s/he likes. This functionality is implemented in SprClass. Each
SprClass has a vector of integers and a negation flag. For instance,
suppose the input data have 5 classes numbered from 0 to 4. If the
user wants to treat classes 1 and 3 as signal and all the others as
background, s/he should make two SprClass objects. There are several
ways of making them:

a) 1st SprClass object has integers 1 and 3 and the negation flag is
set to false. 2nd SprClass object has integers 0, 2, and 4, and the
negation flag is also false.

b) 1st SprClass object has integers 1 and 3 and the negation flag is
set to false. 2nd SprClass object has integers 1 and 3, and the
negation flag is set to true.

c) ...

In case (b), the meaning of the 2nd SprClass object is "any class but
1 and 3". For 5 input classes, this is equivalent to "classes 0, 2,
and 4".

Most executables have "-y" command-line option which lets the user
make this choice interactively. The corresponding C++ method is
SprAbsFilter::chooseClassesFromString(). For description of syntax,
see SprAbsFilter.hh.

The content of each SprClass object is printed out using the "<<"
operator. It shows all classes as a list of integers separated by
commas and the negation flag in parentheses as either +1 (no negation)
or -1 (negation is in effect).

In this approach, original event classes specified in input data are
stored, unchanged, into an output ntuple. This allows the user to see
what specific processes contribute to specific regions of the
classifier output. The ntuple does not carry information about what
class was treated as background and what class was treated as signal;
in most situations it is fairly easy to figure this out just by
looking at the correlation between the class number and the classifier
output, but in principle the user has to keep an independent record of
how the classes were split into signal and background for the purpose
of this classification job.


7. Estimation of variable importance
------------------------------------

The ultimate goal of variable selection is to find a subset of
variables of lowest dimensionality without significant loss of
predictive power.

Tools for variable selection fall into two groups: 1) methods that
work with any classifier and 2) methods that work only with one
specific classifier.

7.1 Correlation with the class label
------------------------------------

The classifier-independent tool implemented in the package
(SprExploratoryAnalysisApp) computes correlations between variables
and the class label. SprDataMoments implements two methods:
correlClassLabel() to compute corr(X,Y) and absCorrelClassLabel() to
compute corr(|X-EX|,Y), where X is the variable in question, EX is its
expectation and Y is the class label. In principle, the larger the
magnitude of the correlation, the more powerful the variable is for
classification. In reality, this method of variable selection is not
very reliable. It is easy to construct an example where corr(X,Y)=0
but in fact X is a powerful classification variable. It is harder (but
possible) to construct an example where both corr(X,Y) and
corr(|X-EX|,Y) are close to zero but variable X is powerful
nevertheless. The reverse is not true: If one of these correlations is
close to 1 or -1, this is most likely a powerful classification
variable.

7.2 Counting decision splits
----------------------------

For decision trees, you can count the number of decision splits on
each input variable and estimate the change in the split criterion FOM
due to this variable. The change in the FOM is a more reliable measure
of the variable importance than the split counting. Both these options
are invoked by the "-i" command-line option with the
decision-tree-based executables.

7.3 Variable importance for neural networks
-------------------------------------------

For neural networks, you can estimate the importance of each input
variable by summing weights of links connected to the corresponding
input node over all hidden nodes. Imagine a simple neural network with
one hidden layer and one output node. Let us denote the weight of a
link from an input node d to a hidden node h as w(d,h). Let us denote
the weight of a link from this hidden node to the output node as
w(h,o). The importance of input variable d is then estimated as 

\sum_h abs(w(d,h)*w(h,o))

where the sum is taken over all hidden nodes. For a more complex case
with several hidden layers, this formula would be modified
accordingly, in a straightforward manner.

This method is applicable only if the input data have been normalized
to the same input range. It is strongly recommended to apply the
Normalize transformation to data before estimating variable importance
using this recipe.

To invoke this method, use "-j" option of SprStdBackpropApp.

7.4 Estimation of variable importance by random permutation of class labels
---------------------------------------------------------------------------

SprVariableImportanceApp estimates the importance of each input
variable for any applied classifier. It randomly permutes class labels
in the provided dataset for each input variable in turn and estimates
an increase in the classification loss due to this mislabeling. The
more important the variable, the greater the increase in the
classification loss. The user can specify the number of permutations
per variable over which the increase in loss is averaged. The default
is one permutation but it is recommended to use more than one for more
accurate results. The quadratic loss is chosen as the default;
outputs of classifiers that range from -infty to +infty are
automatically mapped onto [0,1] before the loss is computed. For a
multi-class learner, the default loss is the misidentified fraction
of events.

It is recommended to run this method on data not used for training
the classifier. Because classifiers like random forest or boosted
decision trees can achieve perfect separation on training data, the
increase in loss computed on the training data can be misleading.

7.5 "Add N Remove R" method
---------------------------

A more powerful approach is the "add n remove r" method implemented in
SprAddNRemoveRApp. This executable uses the user-defined classifier
and loss to find an optimal combination of input variables by adding n
and removing r variables at each optimization step. This method is
CPU-intensive and should only be tried on small datasets. For really
small datasets, the user can improve the accuracy of variable
selection by cross-validating on the training set (a test data set is
not supplied in this case).

This method works very well with decision-tree-based classifiers. But
this method cannot be used with neural networks because one needs to
change the neural network structure whenever variables are added or
removed. This part of the algorithm could, of course, be automated
too, but I am reluctant to provide too many magic red buttons that
spit out some numbers as soon as they are pressed.

Set the verbosity of SprAddNRemoveRApp to "-v 1" to see in what order
variables are added and eliminated.

For the multi-class algorithms, the "add n remove r" method is
implemented as part of SprMultiClassApp. Again, use it with "-v 1".

7.6 Variable interactions
-------------------------

SprVariableImportanceApp can also estimate an interaction between two
subsets of input variables, S1 and S2, for the selected
classifier. The interaction is defined as

corr(F(S1),F(S2))

where F(S1) is the response of the classifier at a given point
averaged over all variables not included in S1, similarly for F(S2). A
weak correlation indicates that one can build a classifier on S1 and
an independent classifier on S2, while building a common classifier on
the union of S1 and S2 will likely not increase the overall
classification power.

The input syntax for defining variable subsets is similar to the
syntax used for grouping integers into classes. The two subsets of
variables are separated by a colon, and variables within one subset
are separated by commas. A single dot '.' means "all variables but
those included in the other group"; for example, 'x1:.' means that the
method will compute an interaction between x1 and all variables
excluding x1. If an empty string is entered, the method computes
interaction between each variable known to the trained classifier and
all other variables.

The -s option of SprVariableImportanceApp invokes an algorithm that
sequentially selects variables with the largest interactions. First,
it selects the variable that interacts most with all other variables,
then the 2nd variable that interacts most with all other variables
except the variable selected at step 1, then the 3rd variable that
interacts most with all other variables except variables 1 and 2,
etc. The outcome is a list of variable subsets with
corresponding interactions. The variable subsets in this list are
ordered by size: the 1st subset consists of one variable, the 2nd
subset consists of 2 variables etc, with each subset being the most
powerful subset among the subsets of the same size based on the
strength of interaction. This method is less CPU-intensive than the
Add N Remove R algorithm. The hope is that it can rank input
variables reliably enough to let the user choose a subset of lowest
dimensionality, and can thus serve as a CPU-cheap substitute for the
Add N Remove R algorithm. This method is currently being tested on
real-world datasets.

7.7 Which variable selection method is best?
--------------------------------------------

The methods above only provide crude estimates. The best (and most
time-consuming) approach is to try all or most variable combinations
for each classifier and select the combination that gives the best
performance.

Because boosted decision trees and random forest perform well in high
dimensions, the usual strategy is to start with all variables for
these classifiers and remove variables as long as the classification
power does not deteriorate beyond a tolerable level.

For neural networks, one usually deploys the opposite strategy: start
with a few powerful and well-understood variables and add more as long
as you see a noticeable gain in performance.

Note that, unlike some other implementations, SPR makes no attempt to
normalize variable importance values to add up to one. This lets the
user monitor how the importance of a certain variable changes on the
absolute scale due to addition or removal of other variables.


8. Handling input variables
---------------------------

If input data are read from a Root file, the user specifies explicitly what
variables are taken from the input ntuple. If input data are read from an Ascii
file, all variables included in the Ascii file are by default read in. If you
want to exclude some of them, you can do so by using "-z" option of the
executables. 

"-V" command-line option of most executables lets the user include
only specified variables. Again, it works only for Ascii input.

Generally, it is the responsibility of the user to keep track of what
variables were used for training and supply the same variables in the
same order to a trained classifier. That is, if you exclude certain
variables from training using "-z" option, you have to remember to
exclude the same variables when you compute classification values for
this dataset. Note that the list of variables used for training is
saved at the bottom of each trained classifier configuration file.

To make your life easier, some executables have the "-M" option. This
option maps the variables used for training (listed at the bottom of
the classifier configuration file) onto the variables available in
the input data. If the "-M" option is provided, you don't have to
remember what variables were used for training; you only have to make
sure that identical variables have identical names in the training
and test datasets. This option is always turned on for classifiers
used via SprRootAdapter.

Sometimes you want to exclude variables from training but store them
into the output root (or Ascii) file. This can be done using "-Z"
option. A typical application of the "-Z" option would look like this:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 100 -z 'var1,var2' -f save.spr train.pat

and then

unix> SprAdaBoostDecisionTreeApp -a 1 -n 0 -Z 'var1,var2' -r save.spr -o test.root test.pat

This will train boosted decision trees using all input variables except var1
and var2 and then compute the classifier response for the test data and store
these variables into test.root as well. For comparison,

unix> SprAdaBoostDecisionTreeApp -a 1 -n 0 -z 'var1,var2' -r save.spr -o test.root test.pat

would do the same thing without storing var1 and var2 into the output ntuple.

Starting with SPR-6, the user can filter data, apply variable
transformation and compute the class label for each event defined as a
function of the event coordinates. All these operations are performed
as the data is being read from a file. An example is shown in
exampleUserCuts.cc. Keep in mind that selection requirements are
applied before transformations - you always cut on the untransformed
values. These methods are implemented in SprPreFilter and they have
nothing to do with generic filtering described in Section "3. Data and
filters". SprAbsFilter described in section 3 is applied to data that
has been stored in memory and converted into SPR format. Most of
SprAbsFilter methods hide pieces of data from the user but do not
remove them permanently from memory; these can be reversed and the
user can get access to the full dataset again. SprPreFilter selection
is applied while data is being read from input files. SprPreFilter
requirements and transformations have a permanent effect - for
example, events that do not pass these requirements never get stored
in memory.


9. Data input and output
------------------------------------

9.1 Supported I/O types: Ascii and Root
---------------------------------------

Input is implemented from Ascii and Root files.

The Ascii input format is close to that of SNNS. There are two
essential rules:
- everything to the right of '#' is treated as a comment
- an input line must contain at least as many numbers as the dimensionality
  of the input space; these are considered point coordinates. Class
  can be supplied either on the same line or on a separate line.

For details and acceptable Ascii formats, see SprSimpleReader.hh and
examples in data/.

Ascii input format 7 allows reading input data from output files
produced by SprOutputWriterApp. This is useful for sequential training
and for splitting data into training and test sets.

For details on Root input, see SprRootReader.hh and examples in data/.

If the package is built against Root, the user can choose between
input and output from/to Ascii and Root files. For Ascii input, the
user must specify the -a flag with an appropriate integer for all
executables. If -a is omitted or "-a 0" is specified, the executable
assumes that input data must be fetched from Root files. For output,
the default choice is Root. To output data into Ascii files, the user
must specify the -A flag on the command line.

9.2 Splitting data into training and test sets
----------------------------------------------

SprSplitterApp lets the user split input data into training and test
sets in the specified proportion using the specified random seed. If
you save data in Ascii format, it will be saved in format 7. You can
read the split data back by running any executable with "-a 7". If
you save the split data in Root, you have to prepare a new *.pat
file for input from Root (see examples in data/). This should be
straightforward.

Note that most executables include the -K option for splitting input
data into training and test sets on the fly. This option splits input
data keeping events in the same order; for example, '-K 0.5' will
treat the top 50% of events of each class as training data and the
bottom 50% as test data. The -D option in combination with -K randomizes
splitting. However, the user does not have any control over the random
seed. Depending on subtleties of the implementation, the behavior can
be non-deterministic, that is, if you apply -K and -D options
tomorrow, you may get a different splitting. In contrast, results
obtained by SprSplitterApp with a user-defined random seed are
expected to be reproducible (to the extent of reproducibility of the
random number generator output for different compilers and platforms).

Splitting into training and test datasets with different random seeds
is a valuable tool for the analysis of small datasets, if you want to
investigate how sensitive your conclusions about classifier
performance are to the choice of training data.

9.3 SprIOTestApp: displaying data content
-----------------------------------------

SprIOTestApp displays info about all classes found in the input data
and their content. It is also a simple executable to play with if you
just want to test the format of your input files or to test storing
data in one of the available formats.


10. Computing output for trained classifiers
--------------------------------------------

Data, event weights and classifier outputs can be stored into ntuples
using SprDataFeeder. The feeder can work with any SprAbsWriter. At
present, two writers are implemented: SprRootWriter, which stores
data into Root files, and SprAsciiWriter, which stores data into
Ascii files. The feeder feeds all data points obtained from an input
filter, together with classifier outputs, to the writer. You can add
as many classifiers to the output as you like using the
addClassifier() method of the feeder.

Spr{Root,Ascii}Writer saves the input coordinates of each point, with
column names extracted from the input data file, and the output of all
classifiers. In addition, it always fills the following 3 columns:
- "index" for the user-specified event index
- "i" for the event class ("classification" in case of SprRootWriter)
- "w" for the event weight ("weight" in case of SprRootWriter)

SprOutputWriterApp reads the saved configuration of any trained
classifier, computes classifier responses for the supplied data and
saves the output into a file. Prior to tag V05-00-00 this
functionality was distributed among several executables. For example,
to read a saved AdaBoost configuration from a file and apply it to
data, the user would have to specify zero training cycles "-n 0" and
a file to read from using the "-r" option of the corresponding
AdaBoost executable. SprOutputWriterApp is the unified, recommended
replacement.


11. Installation instructions
-----------------------------

See INSTALL.


12. How to use trained classifiers in your C++ code
---------------------------------------------------

This syntax has changed in tag V05-00-00. Now all trained classifiers
are read in using the unified interface of SprClassifierReader. For
example, to apply the saved AdaBoost configuration to data, the user
would have to do:

  SprAbsTrainedClassifier* ada 
    = SprClassifierReader::readTrained("saved_adaboost.spr");
  assert( ada != 0 );
  vector<double> input = ...;
  double ada_output = ada->response(input);
  ....
  delete ada;// delete after you are done

The same syntax works for Bagger.

If you know which classifier is saved in the configuration file, you
might want to use the type-safe syntax:

  SprTrainedAdaBoost* ada
    = SprClassifierReader::readTrained<SprTrainedAdaBoost>(
                             "saved_adaboost.spr","AdaBoost");
  assert( ada != 0 );

In this case the read will succeed only if the saved classifier is
indeed AdaBoost. (For this to work, make sure that the template
instantiations in SprClassifierReader are uncommented! Since they
break package compilation for some compilers, I may keep them
commented out in the source code.)

For the multi-class learner, you need to use SprMultiClassReader:

  SprAbsTrainedMultiClassLearner* multiclass =
    SprMultiClassReader::readTrained("saved_multiclass.spr");
  assert( multiclass != 0 );
  vector<double> input = ...;
  map<int,double> output;
  int category = multiclass->response(input,output);
  ....
  delete multiclass;// delete after you are done


13. Working in Root-free environment
------------------------------------

SprInteractiveAnalysisApp and SprOutputAnalyzerApp are the two
executables that allow one to largely avoid using any graphics
package. If you built the package without Root, you can use the
following sequence to assess the performance of classifiers:

a) Train a bunch of classifiers, for example:

SprBaggerDecisionTreeApp -a 1 -n 100 -l 5 -f bagger.spr train.pat
SprAdaBoostDecisionTreeApp -a 1 -n 100 -l 1000 -f ada.spr train.pat

b) Compute output for test data:

SprOutputWriterApp -a 1 -A -C 'ada,bag' 'ada.spr,bagger.spr' test.pat test.out

c) Analyze the Ascii output:

SprOutputAnalyzerApp -C 'ada,bag' test.out 

See comments in SprOutputAnalyzerApp.cc.

If you built the package with Root, you cannot use step (c); but, of
course, you can and should use steps (a) and (b) (this time without
the "-A" flag).

For small data sets in not too many dimensions, you can combine all
these steps into one and compare several classifiers using
SprInteractiveAnalysisApp:

SprInteractiveAnalysisApp -a 1 train.pat test.pat


14. Working in Root environment
-------------------------------

Starting with SPR release 7, the user can run SPR interactively from a
Root session. See root/spr_tutorial.C and other examples in root/. The
package needs to be built against Root. This functionality is provided
by the SprRootAdapter class. See the interface in
include/StatPatternRecognition for a list of all available options.
This interface implements wrappers to most SPR methods and gives the
user quick access to Root graphics.

I strongly advise against running CPU-intensive jobs within the Root
environment, whether interactive or batch. The SPR Root wrapper is
provided to let the user run interactive analysis on small datasets in
not too many dimensions and to give the user access to graphics. If
you need to train on datasets of appreciable size, do it by running
the SPR executables. To enjoy graphics, you can then load test data
and the saved classifier configuration into an interactive Root
session using the loadData..() and loadClassifier() methods.


15. Missing values
------------------

By convention, missing values must be encoded as values below the
allowed physical range. There are two ways of dealing with missing
values:

- Treat them as normal entries. This works fine for robust algorithms
such as boosted decision trees and random forest.

- Replace them with medians or averages of the corresponding 1D
marginal densities prior to running operations on data. This can be
done by computing a 'Missing' transformation through
SprTransformationApp.

Replacement of missing values works like any other input variable
transformation. First, you need to train it by running
SprTransformationApp. The 'Missing' transformation requires a lower
cutoff and optionally takes the transformation mode. All values below
the specified cutoff are treated as missing. The mode lets the user
choose between computing medians or averages. After training is
completed and the transformation is stored into a file, it can be
applied to any data using the '-Q' option of almost any executable.
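
The effect of the 'Missing' transformation in median mode can be
sketched in a few lines of Python (an illustration of the idea, not
the SPR code; the cutoff argument mirrors the lower cutoff described
above):

```python
# Sketch of the 'Missing' transformation in median mode: values below
# the cutoff are treated as missing and replaced by the column median
# computed from the remaining (valid) entries.
def impute_medians(columns, cutoff):
    imputed = []
    for col in columns:
        valid = sorted(v for v in col if v >= cutoff)
        n = len(valid)
        median = 0.5 * (valid[(n - 1) // 2] + valid[n // 2])
        imputed.append([v if v >= cutoff else median for v in col])
    return imputed

# Values below -10 count as missing, as with the cutoff of -10 used
# later in this README. The median of the valid entries {1, 3, 5}
# is 3, so -11 becomes 3.
cols = impute_medians([[1.0, -11.0, 3.0, 5.0]], -10.0)
```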

Note that the missing values are always replaced by medians or
averages computed for all classes chosen by the user from the input
data. The exception is multiclass algorithms accessed through
SprMultiClassApp. By specifying the '-L' option for SprMultiClassApp,
the user invokes class-specific computation of the medians of the
marginal distributions. The missing values are then replaced by the
computed medians assuming every class hypothesis in turn, and the
classification label is given by the class hypothesis that yields the
smallest loss for a given row of the indicator matrix. For more
detail, see the section on multiclass learning.

Different packages have different conventions for coding missing
values, for example "?", "NaN" etc. If someone gives me a convincing
argument why this is better than coding them as large negative
numbers, I may one day implement a similar convention as well. The
advantages of my approach are that 1) it does not require any extra
code for I/O, and 2) missing values can be easily visualized.


16. Example: Cleveland heart data
---------------------------------

I'd like to give an example of the full analysis chain using Ascii
I/O. This is intended as a quick overview of what can be done using
the package.

See how much data we got in total:

[narsky@hepsubmit share]$ ../bin/SprIOTestApp -a 1 cleveland.data
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Found 5 classes in data:
  Class:          0(1)    Points:        164
  Class:          2(1)    Points:         36
  Class:          1(1)    Points:         55
  Class:          3(1)    Points:         35
  Class:          4(1)    Points:         13
No input classes specified in string  Will use 0 and 1 by default.
Training data filtered by class.
Points in class 0(1):   164
Points in class 1(1):   55

This gives you the breakdown of classes in the input data.  (The last
4 lines are irrelevant.)

Let us group the classes into two groups. Class 0 (no disease) is
treated as background, and classes 1-4 (various stages of disease) are
treated as signal:

[narsky@cithep97 share]$ ../bin/SprIOTestApp -a 1 -y '0:.' cleveland.data
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Training data filtered by class.
Points in class 0(1):   164
Points in class 0(-1):   139

Split input data into training and test subsets in equal proportion:

[narsky@cithep97 share]$ ../bin/SprSplitterApp -a 1 -A -K 0.5 -S 72727 -y '0:.' cleveland.data clev_train.pat clev_test.pat
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Input data filtered by class.
Points in class 0(1):   164
Points in class 0(-1):   139
Splitting input data with factor 0.5
Data re-filtered:
Training data:
Points in class 0(1):   82
Points in class 0(-1):   69
Test data:
Points in class 0(1):   82
Points in class 0(-1):   70
Warning: file clev_train.pat will be deleted.
Warning: file clev_test.pat will be deleted.
Warning: no classifiers specified for SprDataFeeder. Will save data only.
Feeder storing point 0 out of 151
Writer successfully closed.
Warning: no classifiers specified for SprDataFeeder. Will save data only.
Feeder storing point 0 out of 152
Writer successfully closed.

Normalize inputs for application of a neural net. There are a few
missing values in the data that are encoded as -11.

[narsky@cithep97 share]$ ../bin/SprTransformationApp -a 1 -y '0:.' -L '-10' 'Missing,Normalize' cleveland.data clev_norm.spr
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Training data filtered by class.
Points in class 0(1):   164
Points in class 0(-1):   139
Transformation sequence has been trained.
Stored the trained transformation

Train a neural network (I omit the part where I find an optimal
configuration of the neural net):

[narsky@cithep97 share]$ ../bin/SprStdBackpropApp -a 7 -y '0:.' -Q clev_norm.spr -N '13:6:3:1' -n 1 -l 400 -L 0.1 -i 10 -g 1 -d 10 -t clev_test.pat clev_train.pat
Read data from file clev_train.pat for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 151
Training data filtered by class.
Points in class 0(1):   82
Points in class 0(-1):   69
Read validation data from file clev_test.pat for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 152
Points in class 0: 82 1: 70
Validation data filtered by class.
Points in class 0(1):   82
Points in class 0(-1):   70
Variable transformation from file clev_norm.spr has been applied to training and validation data.
Per-event loss set to Quadratic loss (y-f(x))^2
Will train Discrete AdaBoost.
Classes for StdBackprop are set to 0(1) 0(-1)
StdBackprop initialized with classes 0(1) 0(-1) nCycles=400 structure=13:6:3:1 LearningRate=0.1
Validation Loss=0.249513 at cycle 0
Validation Loss=0.248784 at cycle 10
Validation Loss=0.248775 at cycle 20
Validation Loss=0.248765 at cycle 30
Validation Loss=0.248752 at cycle 40
Validation Loss=0.248734 at cycle 50
Validation Loss=0.248704 at cycle 60
Validation Loss=0.24864 at cycle 70
Validation Loss=0.248459 at cycle 80
Validation Loss=0.247667 at cycle 90
Validation Loss=0.240805 at cycle 100
Validation Loss=0.186157 at cycle 110
Validation Loss=0.146592 at cycle 120
Validation Loss=0.14351 at cycle 130
Validation Loss=0.145346 at cycle 140
Validation Loss=0.147809 at cycle 150
Validation Loss=0.14999 at cycle 160
Validation Loss=0.152153 at cycle 170
Validation Loss=0.154612 at cycle 180
Validation Loss=0.157342 at cycle 190
Validation Loss=0.159825 at cycle 200
Validation Loss=0.161757 at cycle 210
Validation Loss=0.163274 at cycle 220
Validation Loss=0.164617 at cycle 230
Validation Loss=0.166054 at cycle 240
Validation Loss=0.167951 at cycle 250
Validation Loss=0.170806 at cycle 260
Validation Loss=0.174859 at cycle 270
Validation Loss=0.179328 at cycle 280
Validation Loss=0.183079 at cycle 290
Validation Loss=0.186059 at cycle 300
Validation Loss=0.188555 at cycle 310
Validation Loss=0.190724 at cycle 320
Validation Loss=0.192646 at cycle 330
Validation Loss=0.19437 at cycle 340
Validation Loss=0.195929 at cycle 350
Validation Loss=0.197348 at cycle 360
Validation Loss=0.198646 at cycle 370
Validation Loss=0.199838 at cycle 380
Validation Loss=0.200934 at cycle 390
Validation Loss=0.201945 at cycle 400
Training done.

The minimum of the quadratic loss for the test data is somewhere
around 150 training cycles. Let us use that and save the trained
neural net into a file:

[narsky@cithep97 share]$ ../bin/SprStdBackpropApp -a 7 -y '0:.' -Q clev_norm.spr -N '13:6:3:1' -n 1 -l 150 -f clev_nn.spr -L 0.1 -i 10 -g 1 -d 10 -t clev_test.pat clev_train.pat

Train boosted decision trees now. Discrete and Real AdaBoost do not
show stable performance for this dataset. Epsilon-Boost with eps=0.01
performs reasonably well. Let us find the optimal leaf size by
cross-validation:

[narsky@cithep97 share]$ ../bin/SprAdaBoostDecisionTreeApp -a 1 -y '0:.' -Q clev_norm.spr -g 2 -n 200 -M 3 -x 3 -q '5,10,20,30,40,50,60,70,80,90,100,110,120,130' cleveland.data
......
Cross-validated FOMs:
Node size=       5      FOM=  0.973268 +- 0.00993318
Node size=      10      FOM=  0.972704 +-  0.0278197
Node size=      20      FOM=  0.952098 +-  0.0457402
Node size=      30      FOM=   0.94095 +-  0.0463437
Node size=      40      FOM=   0.94597 +-  0.0368492
Node size=      50      FOM=  0.953193 +-  0.0464553
Node size=      60      FOM=  0.951837 +-  0.0612336
Node size=      70      FOM=  0.944358 +-  0.0629308
Node size=      80      FOM=  0.959559 +-  0.0516297
Node size=      90      FOM=   0.99134 +-  0.0474166
Node size=     100      FOM=   1.07064 +-  0.0638777
Node size=     110      FOM=   1.14793 +-  0.0244583
Node size=     120      FOM=   1.14793 +-  0.0244583
Node size=     130      FOM=   1.14793 +-  0.0244583

Use 30 as the optimal node size. We need to multiply it by 3/2
(because we split the data into 3 parts for cross-validation and used
2 out of 3 for training) and divide by 2 (because the training set is
1/2 of the whole input dataset). Therefore, the optimal leaf size for
training is roughly 20:
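
This rescaling is just a proportionality argument: the minimal leaf
size should scale with the number of training events. A quick check of
the arithmetic (illustration only):

```python
# The minimal leaf size is proportional to the training sample size.
# The CV run trains on 2/3 of the full data; the final run trains on
# 1/2 of it.
def rescale_leaf(leaf_cv, cv_fraction, train_fraction):
    return leaf_cv * train_fraction / cv_fraction

leaf = rescale_leaf(30, cv_fraction=2.0 / 3.0, train_fraction=0.5)
# 30 * (1/2) / (2/3) = 22.5, i.e. roughly 20
```

The same arithmetic applied to the bagged trees later in this example
turns the cross-validated leaf size 10 into 7.5, i.e. roughly 8.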

[narsky@cithep97 share]$ ../bin/SprAdaBoostDecisionTreeApp -Q clev_norm.spr -a 7 -y '0:.' -g 2 -t clev_test.pat -n 200 -d 10 -l 20 -M 3 -f clev_bdt.spr clev_train.pat
........
Validation Loss=0.963527 at cycle 10
Validation Loss=0.930953 at cycle 20
Validation Loss=0.899423 at cycle 30
Validation Loss=0.872137 at cycle 40
Validation Loss=0.849918 at cycle 50
Validation Loss=0.82902 at cycle 60
Validation Loss=0.813038 at cycle 70
Validation Loss=0.80077 at cycle 80
Validation Loss=0.788923 at cycle 90
Validation Loss=0.780752 at cycle 100
Validation Loss=0.77765 at cycle 110
Validation Loss=0.774673 at cycle 120
Validation Loss=0.766489 at cycle 130
Validation Loss=0.764717 at cycle 140
Validation Loss=0.764703 at cycle 150
Validation Loss=0.76869 at cycle 160
Validation Loss=0.77181 at cycle 170
Validation Loss=0.771408 at cycle 180
Validation Loss=0.775252 at cycle 190
Validation Loss=0.77887 at cycle 200
AdaBoost finished training with 200 classifiers.

Try bagging decision trees now. Again, cross-validate to get the
optimal leaf size:

../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 1 -y '0:.' -n 100 -g 1 -x 3 -q '1,2,3,4,5,10,15,20,25,30' -v 1 cleveland.data
............
Cross-validated FOMs:
Node size=       1      FOM=  0.149607 +-  0.0190036
Node size=       2      FOM=  0.144665 +- 0.00993419
Node size=       3      FOM=  0.146089 +-  0.0121881
Node size=       4      FOM=  0.141243 +-  0.0101155
Node size=       5      FOM=  0.141316 +- 0.00127799
Node size=      10      FOM=  0.137164 +- 0.00451656
Node size=      15      FOM=   0.14209 +-  0.0054432
Node size=      20      FOM=  0.147212 +- 0.00248721
Node size=      25      FOM=  0.145218 +- 0.00316222
Node size=      30      FOM=  0.150421 +- 0.00406612

Let us choose leaf size 10 and use 8 for training:

../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 7 -y '0:.' -n 100 -l 8 -g 1 -d 5 -t clev_test.pat -f clev_rf.spr clev_train.pat
............
Validation FOM=0.170303 at cycle 5
Validation FOM=0.164707 at cycle 10
Validation FOM=0.159462 at cycle 15
Validation FOM=0.155348 at cycle 20
Validation FOM=0.154199 at cycle 25
Validation FOM=0.154731 at cycle 30
Validation FOM=0.15558 at cycle 35
Validation FOM=0.154422 at cycle 40
Validation FOM=0.153872 at cycle 45
Validation FOM=0.154465 at cycle 50
Validation FOM=0.152767 at cycle 55
Validation FOM=0.152698 at cycle 60
Validation FOM=0.152226 at cycle 65
Validation FOM=0.152953 at cycle 70
Validation FOM=0.151939 at cycle 75
Validation FOM=0.151536 at cycle 80
Validation FOM=0.150761 at cycle 85
Validation FOM=0.150968 at cycle 90
Validation FOM=0.150515 at cycle 95
Validation FOM=0.150619 at cycle 100
Bagger finished training with 100 classifiers.

Compare the performance of the 3 classifiers now:

[narsky@cithep97 share]$ ../bin/SprOutputWriterApp -a 7 -A -y '0:.' -Q clev_norm.spr -C 'nn,bdt,rf' 'clev_nn.spr,clev_bdt.spr,clev_rf.spr' clev_test.pat save.out
[narsky@cithep97 share]$ ../bin/SprOutputAnalyzerApp -y '0:.' -C 'nn,bdt,rf' -n save.out
Optimization criterion set to Gini index  -1+p^2+q^2
Output of classifier "nn" found in position 17
Output of classifier "bdt" found in position 18
Output of classifier "rf" found in position 19
Read 152 points from input file save.out   Bgrnd Nevents=82   Bgrnd Weight=82   Signal Nevents=70   Signal Weight=70
Input signal efficiency values for which background will be estimated [ 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ]
===========================================
Table of surviving background fractions (* shows minimal value in a row; Cut on classifier output and FOM are shown in parentheses)
===========================================
Signal Eff   \  Classifiers  |                           bdt |                            nn |                            rf |
------------------------------------------------------------------------------------------------------------------------------
              0.1000         | *     0.00000 +-      0.00000 |       0.00000 +-      0.00000 |       0.00000 +-      0.00000 |
              0.2000         | *     0.00000 +-      0.00000 |       0.00000 +-      0.00000 |       0.00000 +-      0.00000 |
              0.3000         |       0.02439 +-      0.01725 |       0.01220 +-      0.01220 | *     0.00000 +-      0.00000 |
              0.4000         | *     0.03659 +-      0.02112 |       0.03659 +-      0.02112 |       0.03659 +-      0.02112 |
              0.5000         |       0.06098 +-      0.02727 | *     0.04878 +-      0.02439 |       0.04878 +-      0.02439 |
              0.6000         |       0.08537 +-      0.03227 |       0.07317 +-      0.02987 | *     0.04878 +-      0.02439 |
              0.7000         |       0.13415 +-      0.04045 | *     0.09756 +-      0.03449 |       0.09756 +-      0.03449 |
              0.8000         |       0.30488 +-      0.06098 | *     0.23171 +-      0.05316 |       0.28049 +-      0.05849 |
              0.9000         |       0.35366 +-      0.06567 | *     0.34146 +-      0.06453 |       0.50000 +-      0.07809 |
===========================================
Histogram classifier output? y/n [y] n
Continue? y/n [y] n

It looks like bagged decision trees have an edge in the region of high
signal purity, but the neural net performs better in the region of
high signal efficiency. It might be interesting to bag the neural
network and see if that brings a noticeable improvement, but I am not
doing that here.
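
The numbers in such a table come from scanning a cut on each
classifier's output: for a target signal efficiency, find the cut that
keeps that fraction of signal and count the background that survives.
A toy Python sketch of the computation (an illustration, not the SPR
implementation):

```python
# For a target signal efficiency, find the cut that keeps that
# fraction of signal and report the background fraction above it.
def bg_fraction_at_signal_eff(signal, background, eff):
    s = sorted(signal, reverse=True)
    k = max(1, int(round(eff * len(s))))
    cut = s[k - 1]                      # keep the top-k signal scores
    passed = sum(1 for b in background if b >= cut)
    return passed / len(background)

# Toy classifier scores: signal tends to score higher than background.
sig = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
bkg = [0.45, 0.35, 0.25, 0.15, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01]
```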

Let us estimate input variable importance for the bagged decision
trees using two methods: one specific to decision trees and one
classifier-independent.

To count decision splits for the bagged trees, we need to re-train
using the "-i" option:

[narsky@cithep97 share]$ ../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 7 -y '0:.' -n 100 -l 8 -g 1 -d 5 -t clev_test.pat -i clev_train.pat | sort -grk 7,7
Variable                        thalach    Splits         169    Delta FOM    28.61687
Variable                             ca    Splits          85    Delta FOM    24.33387
Variable                           thal    Splits          75    Delta FOM    20.36938
Variable                            age    Splits         146    Delta FOM    18.85921
Variable                             cp    Splits          54    Delta FOM    15.83812
Variable                        oldpeak    Splits          67    Delta FOM    13.76227
Variable                           chol    Splits          96    Delta FOM    11.71196
Variable                          exang    Splits          34    Delta FOM     6.68017
Variable                       trestbps    Splits          48    Delta FOM     5.76838
Variable                            sex    Splits          13    Delta FOM     1.58061
Variable                          slope    Splits           5    Delta FOM     0.97370
Variable                        restecg    Splits           9    Delta FOM     0.89813
Variable                            fbs    Splits           2    Delta FOM     0.19316

Now estimate the decrease in quadratic loss due to each variable:

[narsky@cithep97 share]$ ../bin/SprVariableImportanceApp -y '0:.' -a 7 -Q clev_norm.spr -n 10 -g 2 clev_rf.spr clev_test.pat | sort -grk 2,2 
                               thal         0.0302867259 +-    0.0093382963         0.0000000000 +-    0.0000000000
                                 ca         0.0300654846 +-    0.0059497151         0.0000000000 +-    0.0000000000
                                 cp         0.0138796084 +-    0.0029744952         0.0000000000 +-    0.0000000000
                            oldpeak         0.0065700839 +-    0.0018971035         0.0000000000 +-    0.0000000000
                            thalach         0.0064980518 +-    0.0065570920         0.0000000000 +-    0.0000000000
                                age         0.0043183276 +-    0.0031195336         0.0000000000 +-    0.0000000000
                              exang         0.0034915037 +-    0.0021649321         0.0000000000 +-    0.0000000000
                               chol         0.0018462538 +-    0.0015761285         0.0000000000 +-    0.0000000000
                                sex         0.0010740878 +-    0.0003132361         0.0000000000 +-    0.0000000000
                              slope         0.0001578955 +-    0.0002165354         0.0000000000 +-    0.0000000000
                                fbs        -0.0000150207 +-    0.0000932678         0.0000000000 +-    0.0000000000
                            restecg        -0.0000162495 +-    0.0002991600         0.0000000000 +-    0.0000000000
                           trestbps        -0.0005304460 +-    0.0006093024         0.0000000000 +-    0.0000000000
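
The classifier-independent estimate works by randomly permuting the
values of one variable at a time and measuring how much the loss
grows; a variable the classifier does not rely on can be scrambled at
no cost. A toy Python sketch of the idea (an illustration, not the SPR
implementation):

```python
import random

# Permutation importance: shuffle one column, re-evaluate the loss,
# and report the increase. A larger increase means a more important
# variable.
def permutation_importance(predict, rows, labels, var, loss, rng):
    base = loss([predict(r) for r in rows], labels)
    shuffled = [r[var] for r in rows]
    rng.shuffle(shuffled)
    permuted = [dict(r, **{var: v}) for r, v in zip(rows, shuffled)]
    return loss([predict(r) for r in permuted], labels) - base

def quad_loss(preds, labels):
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

# Toy model that depends only on "x": permuting "x" hurts, "z" does not.
rows = [{"x": float(i % 2), "z": float(i)} for i in range(100)]
labels = [r["x"] for r in rows]
model = lambda r: r["x"]
rng = random.Random(0)
imp_x = permutation_importance(model, rows, labels, "x", quad_loss, rng)
imp_z = permutation_importance(model, rows, labels, "z", quad_loss, rng)
```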

My guess is that slope, fbs, restecg and trestbps can be safely
removed without losing any predictive power. To verify that, let us
run an "add 2 remove 1" test. I prepared a file my.config with two
lines in it:

Bagger 100 0
TopdownTree 5 0 10 0

Because I am going to cross-validate on the full input data using 3
pieces, I set the optimal leaf size to 10.

[narsky@cithep97 share]$ ../bin/SprAddNRemoveRApp -a 1 -Q clev_norm.spr -y '0:.' -g 2 -n 2 -r 1 -x 3 -v 1 my.config cleveland.data

N_of_variables      Loss
    1                  0.196326 +-  0.0028473
    2                   0.16178 +-  0.0140857
    3                    0.1455 +-  0.0151866
    4                  0.138903 +-  0.0292912
    5                  0.131255 +-  0.0191462
    6                  0.129631 +-  0.0150008
    7                  0.128226 +-  0.0126571
    8                  0.127219 +- 0.00732298
    9                  0.129644 +-  0.0122463
   10                  0.125565 +-  0.0127497
   11                  0.127843 +- 0.00206423
   12                  0.129265 +- 0.00707255
   13                   0.14691 +-  0.0129123
Variables:
    1        thal
    2        ca   thal
    3        ca   cp   thal
    4        ca   cp   sex   thal
    5        ca   cp   oldpeak   sex   thal
    6        ca   cp   exang   oldpeak   sex   thal
    7        ca   cp   exang   oldpeak   restecg   sex   thal
    8        ca   cp   exang   oldpeak   restecg   sex   thal   trestbps
    9        ca   chol   cp   exang   fbs   oldpeak   restecg   sex   thal
   10        age   ca   cp   exang   fbs   oldpeak   restecg   sex   slope   thal
   11        age   ca   chol   cp   exang   fbs   restecg   sex   slope   thal   trestbps
   12        age   ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   trestbps
   13        age   ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach   trestbps

There is an obvious improvement in performance when we include the 2nd
and 3rd variables. Beyond that, the improvement is questionable.

Sort variables by interaction now:

[narsky@cithep97 share]$ ../bin/SprVariableImportanceApp -s -a 1 -Q clev_norm.spr -y '0:.' -v 1 clev_rf.spr cleveland.data

N_of_variables      Interaction
    1                  0.514851 +-          0
    2                  0.454187 +-          0
    3                  0.403137 +-          0
    4                  0.345686 +-          0
    5                  0.359334 +-          0
    6                  0.351943 +-          0
    7                  0.283103 +-          0
    8                  0.167429 +-          0
    9                  0.108733 +-          0
   10                 0.0499255 +-          0
   11                 0.0201312 +-          0
   12                -0.0659972 +-          0
Variables:
    1        thalach
    2        thalach   cp
    3        thalach   cp   oldpeak
    4        thalach   cp   oldpeak   ca
    5        thalach   cp   oldpeak   ca   slope
    6        thalach   cp   oldpeak   ca   slope   exang
    7        thalach   cp   oldpeak   ca   slope   exang   sex
    8        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps
    9        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age
   10        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age   restecg
   11        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age   restecg   chol
   12        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age   restecg   chol   thal

To summarize, the Add 2 Remove 1 algorithm predicts that only 3-4
variables are needed to get the full predictive power. Here are the
predictions for the 4 most powerful variables from each method:

Add 2 Remove 1         ca        cp   sex       thal
Delta Gini             thalach   ca   thal      age
Class permutation      thal      ca   cp        oldpeak
Interaction            thalach   cp   oldpeak   ca

Estimate the performance of random forest on these subsets using
3-fold cross-validation:

../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 cleveland.data

Node size=      10      FOM=  0.142855 +-  0.0278957

../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'ca,cp,sex,thal' cleveland.data

Node size=      10      FOM=  0.140762 +-  0.0134891

../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'thalach,ca,thal,age' cleveland.data

Node size=      10      FOM=  0.157264 +-   0.020884

../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'thal,ca,cp,oldpeak' cleveland.data

Node size=      10      FOM=  0.134336 +-  0.0140065

../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'thalach,cp,oldpeak,ca' cleveland.data

Node size=      10      FOM=  0.159046 +-  0.0237228

The conclusion is that any variable subset gives pretty much the same
predictive power within statistical uncertainty.

***

The performance of the binary classifiers above was estimated by
looking at the full ROC curve on a test dataset. Because the test
dataset is small, statistical errors are substantial. As an
alternative, we can compute a gross FOM (the correctly classified
fraction) by cross-validation:
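
The correctly-classified fraction under k-fold cross-validation is
easy to emulate: split the data into k folds, fit on k-1 of them,
score the held-out fold, and average. A toy Python sketch using a
trivial majority-vote "classifier" in place of the bagged trees:

```python
# k-fold estimate of the correctly-classified fraction, using a
# majority-vote classifier as a stand-in for the real model.
def kfold_accuracy(labels, k):
    folds = [labels[i::k] for i in range(k)]
    accs = []
    for i, test in enumerate(folds):
        train = [y for j, f in enumerate(folds) if j != i for y in f]
        majority = max(set(train), key=train.count)
        accs.append(sum(1 for y in test if y == majority) / len(test))
    return sum(accs) / len(accs)

# 60 background vs 40 signal labels: majority vote gets roughly the
# background fraction right.
acc = kfold_accuracy([0] * 60 + [1] * 40, 3)
```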

../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 1 -y '0:.' -n 100 -x 3 -l 10 -C 1 -v 1 cleveland.data

Optimization criterion set to Gini index  -1+p^2+q^2 
Monitoring criterion set to Fraction of correctly classified events 
No per-event loss is chosen. Will use the default.
Decision tree initialized with minimal number of events per node 10
Using a Topdown tree.
Will cross-validate by dividing training data into 3 subsamples.
Will cross-validate for trees with minimal node sizes: 10 
Decision tree initialized with minimal number of events per node 10
Using a Topdown tree.
Classes for Bagger are set to 0(1) 0(-1)
Bagger initialized with classes 0(1) 0(-1) with cycles 100
Will cross-validate using 3 subsamples: 
Subsample 0  W1=45  W0=54  N1=45  N0=54
Subsample 1  W1=44  W0=55  N1=44  N0=55
Subsample 2  W1=50  W0=55  N1=50  N0=55
Cross-validator processing classifier 0
Cross-validator processing sub-sample 0 for classifier 0
Cross-validator processing sub-sample 1 for classifier 0
Cross-validator processing sub-sample 2 for classifier 0
Cross-validated FOMs:
Node size=      10      FOM=  0.799038 +-  0.0186373

Let us try 10-fold CV and double the number of decision trees:

Bagger initialized with classes 0(1) 0(-1) with cycles 200
Will cross-validate using 10 subsamples: 
Subsample 0  W1=11  W0=16  N1=11  N0=16
Subsample 1  W1=12  W0=14  N1=12  N0=14
Subsample 2  W1=12  W0=16  N1=12  N0=16
Subsample 3  W1=12  W0=16  N1=12  N0=16
Subsample 4  W1=14  W0=15  N1=14  N0=15
Subsample 5  W1=13  W0=17  N1=13  N0=17
Subsample 6  W1=14  W0=17  N1=14  N0=17
Subsample 7  W1=15  W0=16  N1=15  N0=16
Subsample 8  W1=14  W0=18  N1=14  N0=18
Subsample 9  W1=22  W0=19  N1=22  N0=19
Cross-validated FOMs:
Node size=      10      FOM=  0.835614 +-   0.068813

As expected, this produces a performance estimate that is less biased
but has a larger variance. 80-84% accuracy is the ballpark we are
looking at.

***

So far, we have been dealing with the binary problem "no disease vs
some kind of disease". For fun, let us approach this as a multiclass
problem with 5 classes and see what kind of classification power we
can get.

Let us start with boosted binary splits and try the two popular
multiclass algorithms, one-vs-one and one-vs-all. Prepare a file
multi.config with 2 lines in it:

AdaBoost 1300 1 0 0.01
BinarySplit
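
Before running, it helps to recall how these strategies reduce a
K-class problem to binary problems via an indicator matrix (one row
per class, one column per binary classifier). A Python sketch of the
two encodings (an illustration of the idea, not SPR code):

```python
# Indicator matrices for reducing a K-class problem to binary ones.
# One-vs-all: one column per class (+1 for that class, -1 for the rest).
def one_vs_all(k):
    return [[1 if i == j else -1 for j in range(k)] for i in range(k)]

# One-vs-one: one column per class pair (+1/-1 for the pair, 0 = unused).
def one_vs_one(k):
    cols = [(a, b) for a in range(k) for b in range(a + 1, k)]
    return [[1 if i == a else (-1 if i == b else 0) for (a, b) in cols]
            for i in range(k)]
```

For 5 classes, one_vs_all(5) reproduces the 5x5 matrix printed in the
log below, and one_vs_one(5) has 10 columns, one per class pair.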

Try 10-fold cross-validation using the one-vs-all strategy:

[narsky@cithep97 share]$ ../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 1 -x 10 -v 1 -c multi.config -G 1 cleveland.data 
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Training data filtered by class.
Points in class 0(1):   164
Points in class 1(1):   55
Points in class 2(1):   36
Points in class 3(1):   35
Points in class 4(1):   13
Per-event monitoring loss set to Misid rate int(y==f(x)) 
Reading configuration for classifier AdaBoost
Classes for AdaBoost are set to 0(1) 1(1)
AdaBoost initialized with classes 0(1) 1(1) with cycles 1300
Adding AdaBoost with: nCycle=1300 AdaBoostMode=1 BagInput=0 Epsilon=0.01
Finished reading 1 classifiers from file multi.config
Multi class learning mode set to OneVsAll.
Indicator matrix:
 Classes/Classifiers : 5 5
=========================================================
                   0 :  1 -1 -1 -1 -1 
                   1 : -1  1 -1 -1 -1 
                   2 : -1 -1  1 -1 -1 
                   3 : -1 -1 -1  1 -1 
                   4 : -1 -1 -1 -1  1 
=========================================================
Multiclass algorithm set to Allwein-Schapire-Singer.
Will cross-validate using 10 subsamples: 
=====================================
Overall validation loss = 0.471947
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.8415 |     0.1280 |     0.0183 |     0.0122 |     0.0000 |                164.0000 |
    1                       |     0.4909 |     0.1455 |     0.1273 |     0.1818 |     0.0545 |                 55.0000 |
    2                       |     0.3611 |     0.3056 |     0.1389 |     0.1667 |     0.0278 |                 36.0000 |
    3                       |     0.1143 |     0.3143 |     0.2571 |     0.2286 |     0.0857 |                 35.0000 |
    4                       |     0.2308 |     0.2308 |     0.1538 |     0.3077 |     0.0769 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

Most events are misclassified as class 0. Try the one-vs-one algorithm:

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 cleveland.data

=====================================
Overall validation loss = 0.448845
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.8293 |     0.1280 |     0.0244 |     0.0122 |     0.0061 |                164.0000 |
    1                       |     0.4909 |     0.2000 |     0.0909 |     0.1818 |     0.0364 |                 55.0000 |
    2                       |     0.2500 |     0.2778 |     0.1944 |     0.1944 |     0.0833 |                 36.0000 |
    3                       |     0.0571 |     0.2857 |     0.1714 |     0.3429 |     0.1429 |                 35.0000 |
    4                       |     0.1538 |     0.2308 |     0.2308 |     0.3077 |     0.0769 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

Somewhat better, but still not impressive. Use a random forest instead
of boosted splits. Put these two lines in multi.config:

Bagger 100 0
TopdownTree 5 0 10 0

and run

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 cleveland.data

=====================================
Overall validation loss = 0.409241
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.9268 |     0.0488 |     0.0244 |     0.0000 |     0.0000 |                164.0000 |
    1                       |     0.6000 |     0.1455 |     0.0909 |     0.1636 |     0.0000 |                 55.0000 |
    2                       |     0.2778 |     0.2778 |     0.2222 |     0.2222 |     0.0000 |                 36.0000 |
    3                       |     0.1714 |     0.3429 |     0.1714 |     0.3143 |     0.0000 |                 35.0000 |
    4                       |     0.0769 |     0.3077 |     0.3077 |     0.3077 |     0.0000 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

See if we can eliminate some variables. Reduce the number of decision
trees in multi.config from 100 to 50 to make the code run
faster. Then run

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 -N 2 -R 1 cleveland.data

=========================================================
N_of_variables      Loss
    1                  0.429043 +-          0
    2                  0.389439 +-          0
    3                  0.386139 +-          0
    4                   0.39604 +-          0
    5                  0.379538 +-          0
    6                  0.372937 +-          0
    7                  0.386139 +-          0
    8                  0.389439 +-          0
    9                   0.39604 +-          0
   10                  0.376238 +-          0
   11                  0.392739 +-          0
   12                   0.40264 +-          0
   13                   0.39604 +-          0
Variables:
    1        ca   
    2        ca   cp   
    3        ca   cp   slope   
    4        ca   cp   sex   slope   
    5        ca   cp   sex   slope   thal   
    6        ca   cp   restecg   sex   slope   thal   
    7        ca   cp   exang   restecg   sex   thal   thalach   
    8        ca   cp   exang   oldpeak   restecg   sex   thal   thalach   
    9        age   ca   cp   exang   oldpeak   sex   slope   thal   thalach   
   10        age   ca   chol   cp   exang   oldpeak   restecg   sex   slope   thal   
   11        ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach   
   12        ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach   trestbps   
   13        age   ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach   trestbps   
=========================================================

Looks like 5 variables should be enough to achieve the full separation
power. As a cross-check, let us run the multiclass learner again with
the 5 best variables selected and 100 decision trees in each forest:

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 -V 'ca,cp,sex,slope,thal' cleveland.data

=====================================
Overall validation loss = 0.389439
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.9390 |     0.0366 |     0.0061 |     0.0183 |     0.0000 |                164.0000 |
    1                       |     0.5273 |     0.2727 |     0.0909 |     0.1091 |     0.0000 |                 55.0000 |
    2                       |     0.3611 |     0.3333 |     0.0556 |     0.2500 |     0.0000 |                 36.0000 |
    3                       |     0.1143 |     0.4286 |     0.0571 |     0.4000 |     0.0000 |                 35.0000 |
    4                       |     0.1538 |     0.3077 |     0.1538 |     0.3846 |     0.0000 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

61% accuracy is probably fairly close to the best we can squeeze out
of the multiclass learner with 5 classes. Note that only classes 0 and
3 are identified with some success, and class 4 is totally hopeless.

I most certainly do not want to give the impression that 61% is the
best achievable multiclass accuracy. Various things can be tried to
improve the separation further: customized indicator matrices,
grouping classes together, getting rid of class 4, etc.

Note that cross-validation of the multiclass learner does not produce
error estimates. This is a limitation of the package, perhaps to be
fixed in the future.


17. Memory management
---------------------

In SPR-8 I added a new constant to SprDefs: SprEnforceLowMemory. By
default it is set to true, which tells the package to minimize memory
consumption at the expense of not cloning trained classifiers after
optimization is completed. If you plan to work with the package only
through the provided executables or through SprRootAdapter, use this
default setting. If you plan to write your own C++ code that involves
training classifiers, I advise building the package with this option
set to false. If you keep the option at its default value, you will
need to keep track of the ownership of trained classifiers and
destroy them in the correct sequence; that is, you will need to
understand the SPR implementation in detail. This option has no
effect on the memory consumed by input data or by classifiers during
training; it only affects the amount of memory consumed after
training is completed.


18. Summary
-----------

Feedback on the package use is very much appreciated. Send email to
the address shown on top of this file.


19. Release notes
-----------------

04.02.2008 Tag V08-00-00 changes the format used for storing trained
multiclass learner configurations. Multiclass learner configuration
files produced with earlier versions can no longer be read. Please
re-train.

02.04.2007 Tag V05-00-00 introduces a new interface for reading saved
classifier configurations from files. This required multiple changes
to the format. The change is backwards-incompatible: configuration
files produced prior to this tag can no longer be read, and all
classifiers need to be re-trained and saved into files again. See
README for a detailed description of the new syntax and of the new
package features made possible by the unified classifier
configuration format.

10.14.2006 Due to the addition of Real and Epsilon AdaBoost, the
output format for AdaBoost configuration files has changed. To read
configuration files created by releases prior to V03-05-00 with a
post-V03-05-00 package release, add the following line to the
configuration file:

Mode: 1

This must be the 2nd line in the file, right after the
"Trained AdaBoost:" line. To keep the code clean and maintainable, I
decided not to implement backwards compatibility in the code itself.

Also, the default AdaBoost output in the range [0,1] has changed. It
is now computed by taking the raw AdaBoost output X and transforming
it to [0,1] via 1/(1+exp(-2X)). This directly returns the probability
that an event at X is signal. The algorithm used before V03-05-00
produced smoother distributions between 0 and 1; the new
distributions peak more sharply near 0 and 1.

09.07.2006 From now on SPR will be released as a standalone package.

06.29.2006 If you are working with data samples that are not too big
and not too high-dimensional, try SprInteractiveAnalysisApp first. It
lets you quickly compare various classifiers.

01.18.2006 Tag V03-00-00 introduces a more flexible treatment of input
classes which allows arbitrary grouping. See Section 6 of this file
and SprAbsFilter.hh for details.

11.12.2005 Starting with tag V01-18-00, the default options for the
two main executables, SprBaggerDecisionTreeApp and
SprAdaBoostDecisionTreeApp, have been reversed. By default the two
executables now use topdown trees (SprTopdownTree) instead of regular
decision trees (SprDecisionTree), because topdown trees are faster.
The bagger executable also produces continuous tree output by default
instead of discrete output. To obtain the old behavior, use the "-j"
option for SprAdaBoostDecisionTreeApp and both "-j" and "-k" for
SprBaggerDecisionTreeApp. Old classifier configuration files cannot
be read unless the "-j" option is used.

09.11.2005 Tag V01-00-00 introduced a new format for configuration files
used to store boosted and bagged decision trees. The new
SprAdaBoostDecisionTreeApp and SprBaggerDecisionTreeApp will not be
able to read from old configuration files.
