This is an alpha version of an Extremely Randomized Trees (ExtraTrees, a derivative of Random Forest) classifier. Please see the Wikipedia Random Forest page for references.
This does binary classification and multi-class classification, and does not currently support regression.
This has the following modes of operation:
- Train on the --input csv and test on the --test csv.
- Do K-fold validation on the --input csv.
||Displays usage information and exits.|
||Displays version information and exits.|
||Ignores the rest of the labeled arguments following this flag.|
||Turns on debug verbosity.|
||Turns on quiet mode.|
|-i , --input
||The input csv file. This is used for training, so it must contain the true classification for each sample. Reading from stdin is not implemented (yet). The first row of the csv must be the header row with the column names. The first column must be the identity of each sample, for your use; the default name is 'index', but see the --id argument. The last column must be the true class (as an integer) for each sample; the default name is 'class', but see the --truth argument. Missing feature values are interpreted as 0.0.|
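For illustration, a minimal training csv under the default column names ('index' for the sample ID, 'class' for the truth) might look like the following; the feature column names here are hypothetical:

```csv
index,featureA,featureB,class
1,0.5,1.2,0
2,0.0,3.4,1
3,2.1,0.0,0
```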
|-t , --test
||The input csv file containing test samples. This must meet the same requirements as the training csv. See description of --input.|
|-o , --output
||Output csv file. This is written with the probability of class 1 for each input sample. You can specify stdout using a dash (e.g. -o -).|
||Display feature set: just load, parse, and display info about the feature set.|
||Feature set export: read the input file, do column exclusion and subsetting, and write the resulting csv to the specified file.|
||Specifies that the feature values are relatively sparse, meaning many values are missing or identical. If this is specified then a different code path is executed that is optimized for sparse feature values.|
|-n , --nTrees
||Specify how many trees to construct. Default is 100.|
|-s , --speed
||Specify a speed-vs-accuracy level. Speed 0 is slowest and highest accuracy. Speed 10 is fastest and lowest accuracy. Default is 10.|
This value affects the number of features tried and number of feature values tried for each tree node split decision. At speed 10 those values are left at the baseline values.
The baseline value for number of features tried is min(24, sqrt(nFeatures)).
The baseline value for number of feature values tried is 2. The number of additional features tried is (10 - speed) / 3.
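As a sketch of how the speed setting might combine with the baselines above (this helper is illustrative, and integer truncation of both the square root and the division is an assumption the text does not state explicitly):

```python
import math

def n_features_tried(n_features: int, speed: int) -> int:
    """Features tried per split: baseline min(24, sqrt(nFeatures)),
    plus (10 - speed) / 3 additional features at lower speeds.

    Truncation behavior is assumed; the actual tool may round differently.
    """
    baseline = min(24, int(math.sqrt(n_features)))
    extra = (10 - speed) // 3  # assumed integer division
    return baseline + extra

# At speed 10 the baseline is used unchanged; lower speeds try more features.
fastest = n_features_tried(100, 10)
slowest = n_features_tried(100, 0)
```

At speed 10 with 100 features this yields the baseline of 10; at speed 0 it adds 3 more.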
|-x , --maxDepth
||Specify max depth of any tree. Default is 30.|
|-m , --metric
||Specify the metric used to quantify the benefit of each node split. Possible values are 'gini' and 'mse'. Default is 'mse'.|
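Both metrics are standard impurity measures. A sketch of how each could be computed over the class labels at a node (these helpers are illustrative, not the tool's internals):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    impurity = 1.0
    for c in set(labels):
        p = labels.count(c) / n
        impurity -= p * p
    return impurity

def mse(labels):
    """Mean squared error of the labels around their mean.

    For 0/1 labels this equals p * (1 - p), so in the two-class
    case it behaves like a half-scaled Gini impurity.
    """
    n = len(labels)
    mean = sum(labels) / n
    return sum((y - mean) ** 2 for y in labels) / n
```

A split is beneficial when the weighted impurity of the child nodes is lower than the parent's.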
|-z , --nodeSize
||Specify the target node size. When building trees this is used to determine when to stop splitting nodes in the tree based on number of samples represented by that node. If a node has this many samples or fewer then it is not split. Default is 1.|
||Number of threads. 0 means whatever each classifier prefers (whatever Intel Threading Building Blocks decides). Default is 0.|
||Specify the random seed integer for the random number generator.|
||Number of repetitions. Runs the classification multiple times with incrementing seed values. Default is 1. The final results are per-sample averages.|
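The per-sample averaging across repetitions can be sketched as follows (a simplified model of the final-result step; the structure of one repetition's output is assumed to be a list of per-sample class-1 probabilities):

```python
def average_repetitions(runs):
    """Average per-sample class-1 probabilities across repetitions.

    `runs` holds one list of per-sample probabilities per repetition
    (e.g. one per incremented seed value).
    """
    n_reps = len(runs)
    n_samples = len(runs[0])
    return [sum(run[i] for run in runs) / n_reps for i in range(n_samples)]

# Three repetitions (seeds s, s+1, s+2), two samples each:
final = average_repetitions([[0.2, 0.9], [0.4, 0.7], [0.3, 0.8]])
```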
Feature Column Spec
|--id
||Specify the column of the input csv that is the sample ID, by name. The default is 'index'. If there is no ID column, one is generated from the 1-based sample index.|
|--truth
||Specify the column of the input csv that is the truth class, by name. The default is 'class'.|
||Exclude columns, by name or index. You can specify multiple columns using comma-delimited format. |
||Include columns, by name or index. You can specify multiple columns using comma-delimited format. |
||Number of folds for K-fold validation. A value of 1 means all samples are used for both training and testing. Default is 4.|
||Specify to use random folding (default is simple blocked folding), and specify the random seed integer for the random number generator. Seed must be greater than or equal to 0.|
||Turns on striped and spread folding. Spreading distributes class 1 samples (for two-class classification) across all folds, avoiding an uneven distribution of potentially rare class 1 samples; for example, if all class 1 samples were adjacent in the csv, the default folding scheme would place them in a single fold. Striping puts consecutive samples into different folds, avoiding correlation within folds caused by sample position in the csv.|
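The difference between the default blocked folding and striped folding can be sketched as fold-index assignments (a simplified model; the tool's exact assignment, and its spreading of class 1 samples, may differ):

```python
def blocked_folds(n_samples, k):
    """Default scheme: consecutive samples land in the same fold."""
    return [i * k // n_samples for i in range(n_samples)]

def striped_folds(n_samples, k):
    """Striped scheme: consecutive samples land in different folds."""
    return [i % k for i in range(n_samples)]

# With 8 samples and 4 folds, four adjacent class 1 samples at the start
# of the csv land in only two blocked folds, but in all four striped folds.
blocked = blocked_folds(8, 4)
striped = striped_folds(8, 4)
```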