QuickRT Help

QuickRT 0.0.0

This is an alpha version of an Extremely Randomized Trees (ExtraTrees, a derivative of Random Forest) classifier. Please see the Wikipedia Random Forest page for references.

QuickRT supports binary and multi-class classification. Regression is not currently supported.

This has the following modes of operation:


Standard

-h, --help Displays usage information and exits.
--version Displays version information and exits.
--, --ignore_rest Ignores the rest of the labeled arguments following this flag.
-d, --debug Turns on debug verbosity.
-q, --quiet Turns on quiet mode.

Main

-i , --input The input csv file. This is used for training, so it must contain the true classification for each sample. Reading from stdin is not yet implemented. The first row of the csv must be the header row with the names of the columns. The first column must be the sample ID, which is for your own reference. Its default name is 'index', but see the --id argument. The last column must be the true class (as an integer) for each sample. Its default name is 'class', but see the --truth argument. Missing feature values are interpreted as 0.0.
-t , --test The input csv file containing test samples. This must meet the same requirements as the training csv. See description of --input.
-o , --output Output csv file. The probability of class 1 is written for each input sample. You can specify stdout using a dash (e.g. -o -).
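
The csv layout described for --input can be sketched as follows. This is only an illustration, not part of QuickRT: the helper name write_training_csv, the feature names f1 and f2, and the choice to represent a missing value as an empty cell are all assumptions made for the example.

```python
import csv
import io


def write_training_csv(rows, feature_names, out):
    """Write samples as: header row, ID column first, integer class column last.

    rows: iterable of (sample_id, feature_values, class_label).
    This helper is hypothetical and only mirrors the layout described above.
    """
    writer = csv.writer(out, lineterminator="\n")
    # Default column names from the help text: 'index' for the ID, 'class' for the truth.
    writer.writerow(["index", *feature_names, "class"])
    for sample_id, features, label in rows:
        # Assumption: a missing feature value is written as an empty cell.
        cells = ["" if value is None else value for value in features]
        writer.writerow([sample_id, *cells, int(label)])


buf = io.StringIO()
write_training_csv(
    [("s1", [0.5, None], 1), ("s2", [1.25, 3.0], 0)],
    ["f1", "f2"],
    buf,
)
csv_text = buf.getvalue()
print(csv_text)
```

If the ID or truth column has a different name in your data, the --id and --truth arguments are the documented way to point QuickRT at them.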

Run Modes

--dfs Display Feature Set: Just load, parse, and display info about feature set.
--export Feature set export: Read the input file, apply column exclusion and subsetting, and write the resulting csv to the file specified.
--sparse Specifies that the feature values are relatively sparse, meaning that many values are missing or many values are identical. If this is specified, a different code path is executed that is optimized for sparse feature values.

Classifier Params

-n , --nTrees Specify how many trees to construct. Default is 100.
-s , --speed Specify a speed-vs-accuracy level. Speed 0 is slowest and highest accuracy. Speed 10 is fastest and lowest accuracy. Default is 10.
This value affects the number of features tried and the number of feature values tried for each tree node split decision. At speed 10 both are left at their baseline values.
The baseline value for number of features tried is min(24, sqrt(nFeatures)).
The baseline value for number of feature values tried is 2. The number of additional features tried is (10 - speed) / 3.
-x , --maxDepth Specify max depth of any tree. Default is 30.
-m , --metric Specify the metric used to quantify the benefit of each node split. Possible values are 'gini' and 'mse'. Default is 'mse'.
-z , --nodeSize Specify the target node size. When building trees this is used to determine when to stop splitting nodes in the tree based on number of samples represented by that node. If a node has this many samples or fewer then it is not split. Default is 1.
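
The arithmetic behind --speed can be worked through with a short sketch. It uses only the two formulas stated above. The function name speed_settings is hypothetical, and because the help text does not say how fractional results are rounded, the values are returned unrounded.

```python
import math


def speed_settings(n_features, speed):
    """Return (baseline features tried, additional features tried) for a speed level.

    Hypothetical helper: it evaluates the formulas from the help text as written.
    Rounding of fractional results is not specified there, so none is applied here.
    """
    if not 0 <= speed <= 10:
        raise ValueError("speed must be between 0 and 10")
    baseline = min(24, math.sqrt(n_features))
    additional = (10 - speed) / 3
    return baseline, additional


for speed in (10, 4, 0):
    print(speed, speed_settings(100, speed))
```

With 100 features the baseline is 10 features per split decision, and the cap of 24 only takes effect above 576 features.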

Run Options

--nThreads Number of threads. 0 means whatever each classifier prefers (whatever Intel Threading Building Blocks decides). Default is 0.
--seed Specify the random seed integer for the random number generator.
--nReps Number of repetitions. Runs the classification multiple times with incrementing seed values. Default is 1. The final results are per-sample averages.
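
The per-sample averaging done by --nReps can be pictured with the sketch below. It is not QuickRT's code: run_once stands in for one full classification run that returns a probability per sample, and the seed sequence seed, seed + 1, ... is an assumption about what "incrementing seed values" means.

```python
def average_over_reps(run_once, base_seed, n_reps):
    """Average per-sample probabilities over n_reps runs.

    run_once(seed) is a hypothetical callable returning one probability per sample.
    Assumption: repetition r uses seed base_seed + r.
    """
    if n_reps < 1:
        raise ValueError("n_reps must be at least 1")
    totals = None
    for rep in range(n_reps):
        probs = run_once(base_seed + rep)
        if totals is None:
            totals = list(probs)
        else:
            totals = [t + p for t, p in zip(totals, probs)]
    return [t / n_reps for t in totals]


def fake_run(seed):
    # Stand-in for a real classification run, used only to exercise the averaging.
    return [0.25 * seed, 1.0]


averaged = average_over_reps(fake_run, base_seed=0, n_reps=2)
print(averaged)
```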

Feature Column Spec

--id Specify the column of the input csv that is the sample ID, by name. The default is 'index'. If there is no ID column then one is generated from the 1-based sample index.
--truth Specify the column of the input csv that is the truth, by name. The default is 'class'.
--exclude Exclude columns, by name or index. You can specify multiple columns using comma-delimited format.
--select Include columns, by name or index. You can specify multiple columns using comma-delimited format.

Folding

--nFolds Number of folds for K-fold validation. A value of 1 is allowed and means all samples are used for both training and testing. Default is 4.
--foldRand Use random folding instead of the default simple blocked folding. The value given is the random seed integer for the random number generator. Seed must be greater than or equal to 0.
--spread Turn on striped and spread folding. This spreads class 1 samples (for two-class classification) across all folds, and stripes samples across folds to separate them. Spreading avoids an uneven distribution of potentially rare class 1 samples among the folds. For example, if the class 1 samples were all together in the csv, the default folding scheme would put them into a single fold. Striping puts consecutive samples into different folds to avoid correlation within a fold caused by sample position in the csv.
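
The difference between blocked, striped, and spread folding can be illustrated with a small sketch. QuickRT's exact assignment rules are not documented here, so the three functions below are assumptions that only show the general idea: each returns a fold number for every sample, in csv order.

```python
def blocked_folds(n_samples, n_folds):
    """Simple blocked folding: consecutive samples share a fold."""
    return [i * n_folds // n_samples for i in range(n_samples)]


def striped_folds(n_samples, n_folds):
    """Striped folding: consecutive samples go to different folds."""
    return [i % n_folds for i in range(n_samples)]


def spread_folds(labels, n_folds):
    """Stripe class 1 samples and all other samples separately.

    Assumed scheme: each group is dealt round-robin, so every fold receives
    a near-equal share of the class 1 samples.
    """
    assignment = []
    seen_pos = 0
    seen_other = 0
    for label in labels:
        if label == 1:
            assignment.append(seen_pos % n_folds)
            seen_pos += 1
        else:
            assignment.append(seen_other % n_folds)
            seen_other += 1
    return assignment


# All class 1 samples sit together at the start of the file.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(blocked_folds(len(labels), 4))
print(striped_folds(len(labels), 4))
print(spread_folds(labels, 4))
```

In this example blocked folding puts the four class 1 samples into the first two folds only, while the spread assignment gives each of the four folds exactly one.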