Brief comments on decision tree program 12-9-91 C.S. Wallace, Computer Science, Monash University. 28-2-01: Updated by Sarah George for new distribution. Synopsis: This file describes the dtree program in full. If you only want a quick introduction, try reading quickstart.txt For technical programming documentation, try the doc/ directory after you have compiled the program. Compiling the program: Read the file INSTALL for instructions on installing DTree. What the program trys to do: We assume there is a body of data about N 'things'. For each thing, the data comprises K values, one for each of K 'attributes'. An attribute may be real-valued (eg., a height, weight, income etc) or discrete. A discrete attribute (like sex, marital status etc) may only take values in a discrete, unordered range, each of which is identified or labelled by an integer in the range 1 ... V, where V is the number of possible values for the attribute. For discrete attributes, no numeric or ordinal significance applies to these integer values. Thus, if the attribute 'marital status' has possible values Single, Widowed, Married, Divorced, the four values could be labelled 1,2,3,4 in the order given here, or any other order: any consistent labelling will do. We futher assume that each thing also has a serial or identifying number which is an integer, the set of serials not necessarily being compact or starting from 1, but all serials are > 0. Finally, assume each thing is known to belong to one of a small number of "Classes", labelled 1 .. C where C is the number of classes. The program tries to find a function of the attributes (strictly, attribute values) of a thing which well predicts its class. The kind of function found is restricted to a "Decision Tree". A decision tree partions all the possible combinations of attribute values into a number of "Categories", usually somewhat more numerous than the number of classes. Any thing whose attribute values are known may be placed in that category which contains the thing's combination of attribute values. In a Decision Tree, the partion of attribute value patterns into categories is representable by a rooted tree. The root node corresponds to the total set of all possible value patterns. The root node can 'split' on an attribute, giving a number of daughter nodes. If the split is on a discrete attribute with V possible values, there will be (up to) V+1 daughters, one for each possible value of the attribute and one for the special value "Unknown". Each daughter node thus corresponds to a subset of the possible value patterns, namely a subset where the attribute split on has a particular value. If the split is on a real-valued attribute, the will be (up to) 3 daughters. Daughter 1 comprises all values less than a "split value" S, daughter 2 comprises all values >= S, and daughter 0 (if there is one), the "Unknown" value. The split value S is part of the definition of the tree, and is chosen by the program. Daughter nodes may similarly be split on attributes, and their daughters and so on. Ultimately, one comes to "leaf" nodes which are not split. Each leaf defines, and corresponds to, a category of the partition. The decision tree is a good predictor of class if each category (leaf) comprises things which are predominantly of one class, ideally all of one class. However, the tree may be a useful predictor even if leaves are not "pure": some prdictive power exists if, on average, the leaf categories are more homogeneous in terms of class than is the population as a whole. Given data (attribute valeues and classes) for a sample of things, the program tries to find a tree which partitions the sample into nearly-pure leaf categories. It may not produce a tree with single-class leaves even when the data would allow such a tree to be constructed. Rather, the splitting of nodes is stopped at a level when further splitting, while it might give purer categories, would improve the purity only by an amount which is consistent with what would be expected from random effects even when the class was not dependent on any of the available attributes. That is, a split is only done if the resulting improvement in leaf purity is statistically significant. The form of the test is based in a general theory of inductive inference called "Minimun Message Length" (or sometimes, Minimum Description Length). In the present case, we imagine the following information coding problem: A Sender has knowlege of the sample data, including both attribute values and class for each thing. The Sender wishes to transmit to a Receiver the class of each thing, where the Receiver already knows the attribute values for each thing, but not its class. How should the Sender's message be encoded so as to make it as short as possible (in, say, binary digits)? To encode a set of class labels, one for each thing, standard coding theory gives a way of doing it which is of optimal brevity under reasonable assumptions. It turns out that the leng of the coded set of labels will be short if most labels are the same, and longest if every class is equally frequent. That is, class purity leads to brevity. Exploiting this fact, we suppose that the Sender constructs a message comprising two parts. The first part describes a decision tree. On reciept of this part, the Receiver can construct the tree, and then (knowing already the attribute values of each thing) determine the category of each thing. The second part of the message then gives, for each category in turn and independently, the class labels of the sample things in that category. That is, the second part is itself a series of messages, one for each leaf of the tree, giving the classes of the things assigned to that leaf. If the leaf categories are fairly pure, each of these little messages will be short, and so the whole message, first and second parts together, may well be shorter than a message which simple gave the class labels for all the things in one lump. The program tries to find the tree which leads to the shortest total message. It can be proved that, if the attributes do not significantly predict the class, no tree will be found better than the Null tree consisting of a single node which is both root and leaf. Further, if some partition based on the attribute values is an optimal predictor for the population from which the things were sampled, it can be proved that the MML process will give a tree corresponding to that partition, given a large enough sample of things. Data format: The data must be prepared as an ASCII file. The first line contains two integers, the number of classes C and the number of attributes K, space or comma separated. Then, begining on the next line and if necessary extending over several lines, come K integers, one for each attribute. This integer should be 1 if the attribute is real-valued, or V if it is a discrete attribute with V possible values coded 1 ... V. The integers must be space or comma separated. Now comes the actual sample data, thing by thing. The data for each thing must begin on a new line. The various items for a thing may extend over several lines, with space or comma separation (or NewLine also counts as separation, here and throughout the data file). For a sample thing, we first have a Serial integer, which simply serves to identify the thing in output, and so should be different for each thing. Its magnatude has no significance, but it may be entered as a negative number, which means that the thing will NOT be used in building the tree, but will be used as cross-validation data after the tree is built. Next comes the class of the thing, an integer in 1 ... C. Next come the K attribute values of the thing. These are actually read as real numbers, and so may have signs and decimal points. Values of discrete attributes are rounded to the nearest integer as they are read in, and would normally appear as unsigned integers in the data file. If the thing's value for some attribute is unknown, it should be entered as 9999, or if the attribute is discrete, it may also be entered as 0. Note that no value for a real attribute may exceed 9999. The end of the data is signalled by the end of the file: no special termination is needed. Running the program: Use the normal system command to start the program. It will ask for the name of the data file, which you should then type in, with any ".dat" or similar suffix. It will then ask for the name for a results file. The program will then say 'Enter options or "help" '. If you type "help", the program will type a brief description of the options available, and repeat the prompt. Normally, the only options of interest will be: f Causes the unpruned tree to be printed as well as the pruned tree. In building the tree, the program first builds it to the maximum detail permitted by the data, i.e. it will keep on splitting nodes until either all leaves are pure, or it runs out of attributed to spli on. It then "prunes" the twigs back if they are not significant by the message length criterion to give a pruned tree which minimizes the message length, but whose leaves may not be pure. The unpruned tree may be of interest in suggesting futher splits which, while not significantly supported by the data, may have meaning in the problem context. <0-9> A decimal digit in 0 .. 9 can be used to control how thoroughly the program searches for the "best" tree. "0" will give a quick and dirty answer, "1", (the default), trys a lot harder, "2" or "3" will really chew up computer time but occasionally find a slightly better tree. Note that an increased digit is not guaranteed to yield a better, or even an as good, tree as a smaller digit, but it usually will, so whatever is the highest value your purse and patience can afford, all smaller values should also be tried. The additional time cost is small, and you may luck out. t It was mentioned that things entered with a negative serial no. will not be used in building the tree, but will then be used as test data for validating the tree. That is, the program will report on how well the tree classifies the test things. The "t" option also provides for some things to be held back as test things. If this option is used, the program will prompt for a percentage of the data to reserve for testing. The percent entered will include all the negative-serial things, and, if there are none or not enough to amount to the requested %age, a random seelection of the positive-serial things. The program will prompt for an integer "seed" for its random number generator, so different random selections can be used on different runs. p Used only if some serials are negative, or with "t" option. Will cause program to output its prediction about each test thing, in format: Thing Class Cat Default Cost H-cost l (small L) Will list the serials of all things assigned to each leaf category. q Quit the program. After a tree is built, the program will return to the "Enter options or help" prompt. Enter q if you do not want another go on this data. Less useful options are: s Use a code for describing the tree corresponding to that used by Quinlan and Rivest in their paper suggesting the MML approach to tree building. This code is less efficient than the default option, and will normally give a worse tree. There is almost no difference, however, if all the attributes are binary. n The code assumed for coding the classes of things at a leaf (in a category) assumes that a priori the distribution of class frequencies in a category is expected to be non-uniform and is modelled by a generalized Beta prior with parameter Alfa < 1. The origial Quinlan & Rivest paper assumed something equivalent (for 2 classes) to a Beta prior with Alfa = 1, that is expectation of a uniform distribution of class frequencies in the leaves. Option n forces this assumption, which speeds up the program but gives poorer trees, as any worthwhile tree will of course have nearly pure categories, with very non-equal class frequencies in each category. d If option "n" is not used, the program estimates the best value of the parameter Alfa. By default, it assumes the same parameter value apples to all categories, but if option "d" is used, it will expect Alfa to be smaller (expecting greater purity) in leaves at greater tree depth. This was an experiment based on the naive thought that as the tree became deeper, and more attributes were tested, better purity might result. Real-life data does not seem to support this idea. This option assumes that the Beta parameter Alfa should vary with the depth of a leaf as Alfa = P ** (depth - 1) (the root is depth 1). The program asks for an initial value of the parameter P, but will attempt to improve on it. If option "n" is also selected, P is not adjusted. The required option characters should be typed in in one line, without spaces and in any order, ended with RETURN. If you don't want to change any of the default choices, just hit RETURN. Output: All output goes to Standard Output (Fortran Logical Unit 6) and so will appear on your terminal unless redirected to a file by an appropriate system command. Input of options comes from Standard Input, your keyboard unless you redirect it or run the program in a batch queue. A copy of all output is written to the results file, overwriting any previous content. The output begins with some garbage of no real interest about how the program is getting on with its search for a good tree. Then it writes somethying like PRUNED DECISION TREE ************* BIAS = 0.48 (BIAS is the best value for ALFA or, if option d, of the parameter P) Then it tells us the size in nodes and depth of the full (unpruned) and pruned trees. The latter is the one of interest. Then: Tree cost xxx.x nits = yyy.y bits. This gives the best (shortest) message length achieved in bits (binary digits) and nits (1 bit = natural log 2 bits). Costs (message lengths) in the output are cited in nits except where otherwise stated. Then comes a listing of the tree in a rather verbose and detailed form. It is a node-by-node listing in left-child depth-first order, starting at the root. The record for a node looks something like this: size v cut Sc Wc Bc Then on a new line: D Then on a new line: C If the node is a leaf we find: LEAF bits/thng at prior bias Then on a new line: D If the "l" option was chosen, this line is followed by one or more lines, each beginning with "L", listing the serials of the things assigned to this category. Then comes a Short Form listing of the tree which reproduces some of this info in a more compact, and I hope fairly obvious, form. There is one line per node. If the node is not a leaf, the line says: Cat Spliton If it is a leaf, the line is the same save Spliton.... is replaced by Class If there were some negative serials in the data file, or if the "t" option were chosen, we then get a couple of lines reporting the predictive success of the tree on the test things. They give the number of test things, the number whose class was correctly predicted by the most numerous class in the category to which the thing was assigned, and the total length (in nits and bits) of a message giving the true classes of the test things using a Huffman or similar code based on the class probabilities in the leaf categories estimated from the training (non-test) things. This length or cost is minus the log of the likelihood of getting such a set of things if the population was as described by the tree. If there are 2 classes, a cost in bits much less than the number of test things means the tree is giving pretty good predictions. The cost is also printed in bits per thing. This cost is a more sensitive measure of predictive success than the raw ratio of correct predictions. Trial data: The file "xd6dat" is a correctly formatted data file. When the program is run on it with no options, one should get a (pruned) tree of 29 nodes, max depth 7, message length 441.4 bits. Other data files provided are: eg1dat, disdat, pdat Limitations: These are mostly set by paramter constants in the file "dtree.com", and can be changed if necessary. Recompilation of all code is then needed. Num of attributes: 197 Num of classes: 50 Num of distinct values for a discrete att.: 200 The total amount of data which can be used, and the max size of the tree, are together limited by the parameter constant "mstore", at present 300000. This can be raised as needed subject to system memory limits. Bugs: The only known bug is a flaky treatment of unknown or novel values in test data. The program cannot cope properly with a test thing which has a value for a discrete attribute which never appeared at that node in the training data, or an unknown value for any attribute which never appeared in the training data. This problem is not so much a bug as a problem with the very notion of a decision tree learnt from training data. It may be that some node of the tree splits on a discrete attribute with values 1,2,3, but no thing in the data set used to build the tree reaches this node with a value for the att. of either 2 or Unknown. The node will be built with only two children, one for value 1 and one for value 3. But the test set of things may well have a thing which reaches this node of the tree, and has value 2 (or Unknown) for the att.. Where should it go to next? The program uses the following rule for novel test values: If the node has a child for the Unknown value, send the thing there. If not, send the thing to the child which received the largest number of training things.