Snob Program for classification,mixture modelling,clustering,taxonomy,unsupervised pattern recognition. Copyright: Christopher S. Wallace, David L. Dowe Dept. Computer Science, Monash University, Clayton, Australia. Last update: Tue 17 Oct 1995 Wed 18 Jun 1996 Language: Fortran (f77) Platforms: Unix D.E.C. Vax 11 Documentation: This documentation file, snob.doc, has been developed by Chris S. Wallace since the late 1960's and has been updated by David L. Dowe in 1994+ . Further reading: See "REFERENCES" below. CONTENTS 1 OVERVIEW 2 DATA INPUT 2.1 Initial input (a) Format (b) Serial numbers (c) Inactive things (d) Inactive attributes (e) Value ranges 2.2 Binary data files 3 SAVEFILES 4 ITERATION (a) Overlap (b) Subclasses and splitting (c) Killing (d) Combining (e) Iteration control 5 CLASS NUMBERING 6 REPORT FILES 7 ACTIONS 7.1 Alfabetic list 8 OUTPUT FORMATS 8.1 Sum (summary) 8.2 prclas, repclas (class reports) 8.3 pratt, repatt (attribute reports) 8.4 prthng, repthng (thing reports) 8.4.1 Heading format 8.4.2 Format for prthng 0 n 8.4.3 Format for prthng 1 n 8.4.4 Format for prthng 2 n 8.5 prmemb, repmemb (membership of classes) 8.6 prcross, repcross (comparison between two classifications) 8.7 kldist 9 INPUT OF CLASSES 9.1 Using restore to input a classification by members. 10 COMPARISON OF CLASSIFICATIONS 11 THE SIGNIFICANCE OF MESSAGE LENGTHS 11.1 Caveats 12 EXAMPLE 13 LIMITATIONS 13.1 On size of problem - e.g. no. and arity of attributes; storage 13.2 On file names 13.3 On classes 13.4 On nature of problem ACKNOWLEDGEMENTS REFERENCES APPENDIX Sampling Using rdclas to describe class(es) by attributes or members. 1 Overview. Assume a potentially infinite multi-variate population of "things", where each thing is characterised by a set of values of the dimensions of the multi-variate space. Each variable or dimension is called an "attribute". Also, the value of an attribute variable characterising or possessed by a particular thing is also called an attribute, or, if confusion would arise, an attribute value. Assume the nature and possible range of each attribute, and the number of attributes, are known. Given the attribute values for each thing in a finite sample of things drawn from the population, Snob attempts to decide whether the population can usefully be considered to be the union of a number of "classes" of things, such that the distribution of things within a single class is fairly tight and has simple form, but such that different classes differ significantly in their attribute distributions. The problem addressed by Snob is known variously as "classification" (or, to distinguish it from the case when the number and nature of the classes are known a priori, "intrinsic classification"), "mixture modelling", "clustering", "unsupervised pattern recognition", and "numerical taxonomy". Most other programs for this problem start by computing, for each pair of things, a measure of their "similarity". Things are then sorted into classes so that, in some sense, members of the one class are similar to one another, but members of different classes are dissimilar. Satisfactory results for many purposes can be obtained by such methods. See, e.g., the C.S.I.R.O. "CLUSTAN" suite. However, they suffer from several practical and theoretical difficulties, some of which are not suffered by Snob. These include: (a) The definition of a similarity measure is somewhat arbitrary. There is no sound theoretical basis for choosing any one of the many which have been proposed and which have some intuitive appeal. (b) Missing values are difficult to handle consistently. In many practical instances of the problem, many things lack measured values for some attributes, for reasons ranging from observer mistake through specimen damage to logical impossibility. (c) The methods mostly fail to decide for themselves how many classes are present. This defect follows from the lack of a firm basis for the definition of "similarity": there is no clear way of determining the significance (in the sense of statistics) of the difference or average dissimilarity between two classes. Thus most such programs give an hierarchy of progressively finer partitions of the sample into smaller and smaller classes, ending with single things. It is left to the user to decide what level of partition best suits his needs (or argument). (More recently, see also the COBWEB and AutoClass programs. AutoClass gives very similar results to Snob when dealing with continuous attributes.) The approach used by Snob is based on a general theory of inductive inference which is consistent with conventional statistical inference. The data (i.e. the attribute values of the things) are considered to form a message. If some reasonable code is used for encoding the data, the length of this message can be computed. However, a more concise encoding may be possible if the sample is divided into classes. A message may then be constructed as follows: The message begins by stating the number of classes. It then has a segment for each class. This segment states the distribution of attribute values in the class. In the current version of Snob, this is done by assuming a simple form of distribution for each attribute: Normal for continuous attributes, Multinomial for discrete or "multi-state" attributes, Poisson for non-negative and theoretically unbounded integer attributes, and von Mises for circular-valued attributes. It is also currently assumed that within a class, attributes are uncorrelated. Thus the distribution within a class can be specified by simply stating: The relative abundance of the class (i.e. the fraction of the population that belongs to the class), For each continuous attribute: its mean and standard deviation, for each discrete attribute: the relative probability of each discrete value or "state", for each Poisson attribute: its rate, and for each circular-valued attribute, its mean and concentration parameter. Instead of the above distribution details for an attribute, the class description may rather say that an attribute is "not significant". In that case, the distribution of that attribute within the class is assumed to be the same as the attribute's distribution in the whole population. The message then has a segment for each thing. The segment starts by stating the class to which the thing is regarded as belonging. It then states the attribute values of the thing, using for each value a Huffman code which would be optimal in the sense of least expected code word length if the values of this attribute for members of this class indeed had the distribution stated in the class description segment. Missing values are omitted from the message. The form of message described above is an instance of a general form of message called an "explanation". An explanation encodes a body of data or observations by first stating a "theory" about the data, then encoding the data in a code which would be optimal were the theory true. In this case, the "theory" is the class structure asserted of the population. A general principle of inductive inference, which may be regarded as an operational form of Occam's razor, is to prefer that theory which yields the shortest explanation of the available data. Snob uses this principle. However, use of the principle does not require Snob actually to constuct the coded message or explanation. Instead, it need only compute the length the message would have if it were constructed. Given some supposed class structure, the length of the second part of the message, which states all the things' attribute values, is just - ln (prob. of getting these data values, given the assumed classes) where we use a unit of message length of (log e / log 2) bits, called the "nit". The length of the first part of the message, which describes the class structure, depends not only on the assumed structure but also on the code used for specifying classes. Snob assumes a Huffman code which would be optimal given the following prior expectations about class structures: (a) Any number of classes from 1 up to an unspecified limit is equally likely. (b) Any set of non-zero, normalised relative abundances is equally likely. (c) In any class, every attribute is equally likely to be "significant" or "not significant". (d) If a discrete attribute is significant in some class, its state probabilities within that class are equally likely to take any set of non-zero normalised values. (e) If a continuous attribute is significant in some class, its mean within that class is equally likely to take any value in a range of size equal to four times the standard deviation of the attribute over the whole population, and the log of its standard deviation in the class is equally likely to take any value in a certain range. The range of the log S.D. in a class is based on the assumption that the S.D. of an attribute in a class will not exceed 2.5 * the S.D. of the attribute over the whole population and will not be less than the precision of measurement (least count or probable measurement error) of that attribute. (f) If a Poisson attribute is significant in any class, its rate will have prior h(r) = 1/alpha . exp(-r/alpha), equivalent to sampling r = -alpha.ln(U) for U uniform in (0,1], where alpha = ratepp is the rate observed in the whole population. (g) If a von Mises attribute is significant in any class, its mean within that class will be equally likely to take any value in the range [0, 2*pi), independent of kappa. The prior on the concentration parameter, kappa, is h(kappa) = kappa/((1 + kappa^2)^1.5) . This prior is normalisable, "colourless" and uniform in (X,Y) = (kappa.cos(mu), kappa.sin(mu)) near the Cartesian origin. The above assumptions, which correspond to the prior probabilities of a Bayesian estimation, are intended to be as colourless as possible. Within reason, variations on these assumptions have little effect on Snob's results. 2 Data input 2.1 Initial input The data for Snob may be input from a file, or read from logical unit 5 (system input, e.g., your terminal.) The required format is the same in either case. The first record should contain only a single integer: the number of attributes. Then comes a record for each attribute, each record having 3 numbers. Active flag: 0=inactive, 1=active Kind : 1=continuous, 2=discrete, 3=Poisson, 4=circular Either : Number of states (if att. is discrete) or : Measurement error or least count (if continuous) or : No. of continuous base att. representing segment length (if Poisson) or : 1 or 0 for radians or degrees; Measurement error (if circular) (The above is elaborated upon in the Example in Section 12.) Then come one or more records for each thing. Each thing must start on a new record, but its data may be spread over as many records as you like. For each thing there must be: Serial number Value for att. 1, value for att. 2, etc for all attributes Finally, a record containing zero, to terminate data input. Notes on input data. (a) Format Input is free-format. Each number should be separated from the next by a space, a comma, or an end of record. Numbers may have an initial + or - sign, but no space may intervene between the sign and the number, nor may a number have any imbedded spaces. Any number may have a decimal fraction part, following a normal decimal point. No number should exceed 10**8 in magnitude. Lines longer than 80 characters should be split into several consecutive lines of no longer than 80 characters each. An isolated minus sign is read as -10**9, which is the internal representation of a missing data value. It is illegal in any other context. (b) Serial numbers. Each thing should have a unique integer serial number. These should not be unnecessarily large (due to innum.f calling inreal.f). If using values exceeding 10^7, it might be best to use 2 or more inactive attributes to represent a long serial number. (See (d) below.) (c) Inactive things. If a thing is given a negative serial number, it will be ignored in the calculations. However, it will be included in printouts showing the class to which each thing belongs. Thus, an inactive thing will be classified, but will not contribute to the estimation of the class structure. (See prthng, repthng) After data has been input, the active status of a thing may be changed by the thngflag action. This action requires a list of thing serials terminated by zero. If the serial is input as a positive number, the thing is made active. If the serial number is input as negative, the thing is made inactive. The change does not affect the raw or binary data file from which the data was read: the change is only for the duration of the current run. (d) Inactive attributes. If an attribute is declared inactive (active flag = 0), it will be ignored in the calculations. However, its distribution on each class will be included in printouts. (See prclas, repclas, pratt, repatt) After input, the active status of an attribute may be changed by using the attflag action. This action requires as input an attribute number followed by the new value for its active flag, 0 = inactive, 1 = active. The change does not affect the raw or binary data file read at the start of the run: it affects only the current run. Savefiles written before certain attributes or things had their active status changed may be restored after the changes, and will operate correctly. In particular, if a savefile is written when an attribute was inactive, and is restored when the attribute is active, the attribute will receive distribution parameters from the savefile accurately describing its distributions in the classes described in that savefile. (e) Value ranges. Discrete attribute values are integers in the range 1 to the number of states in the attribute. If a value is input with a fractional part, it is rounded to the nearest integer. Continuous attribute values should not exceed 10 ** 8 in magnitude, and preferably should not need to be written to more than 2 decimal places, as in some printouts, they will be printed to 2 decimal places. No input number may have more than 9 digits, including digits after the decimal point. 2.2 Binary data files. Input of data in the form described above is a fairly slow process. Because several Snob runs may be made on the same data, Snob saves a binary copy of the data on another file, which may be used as the input on later runs. When started up, Snob asks for the name of a data file (without ".raw" or ".bin"). If the response is: filename Snob will search for the file filename.bin . If successful, Snob will read in the binary data. (See Limitations in Section 13.2 .) If unsuccessful, Snob will search for the file filename.raw . If successful, Snob will read in the raw data and write to a binary file, filename.bin . If Snob is unsuccessful on both counts, it will exit. If Snob successfully finds either filename.bin or filename.raw, it will then seek to restore classes from filename.bst . If it fails, it will restore a 1-class classification consisting of the population class. Even for a short run with a small data set, Snob insists on writing a binary. Data read in in raw (decimal character file) form is sorted into thing serial number order before being written as a binary file. Binary files are therefore already sorted, with a further time saving. Sorting is by magnitude of serial number, so inactive things appear in the same position as if they were active. 3 Savefiles. On a large data set with a lot of structure, Snob may take hours to come close to the best class structure. We therefore provide automatic checkpointing of the calculation. A checkpoint provides a copy of the current class structure written in a binary form on a file called a savefile. A run may be restarted from such a savefile (after the data has been read in). See restore. As well as automatic checkpoints, a checkpoint may be taken at any point in the calculation. You may specify the name of the file to be used for automatic checkpoints. If you do not, the name snob.sav is assumed. The .sav suffix is a useful convention for all savefiles. When a checkpoint is written to a file, any previous content of the file is destroyed. Thus, a savefile can hold only one checkpoint at a time. However, you may use several savefiles in the one Snob run if you wish to recall intermediate steps in the run. The file specified for auto checkpoints is called the "set" savefile. It is rewritten with a new checkpoint periodically, and when the user gives a "save" command without specifying a filename, and when the user gives a "stop" command to stop the Snob run. If the user gives a "save" command specifying a filename, a checkpoint is written to the named file, not to the set savefile. The name of the set savefile is not affected. To set a new savefile, use the "savefile" command, giving the new set savefile name. Any previous set savefile is not destroyed by this command. The "savefile" command only sets the file name: no checkpoint is written. Savefiles are unformatted (binary) and cannot be printed. One may also save the class structure (approximately) on a printable report file using repthng or repmemb commands, which report the classification of each thing. The restore command will accept a report file containing such a report in lieu of a savefile. See restore. 4 Iteration. Snob basically proceeds by an iterative process. Starting from some specified class structure, all things are assigned to classes, each thing being assigned to that class most likely to contain such a thing (more or less... see later). Then the class relative abundances and attribute distribution parameters within each class are re-estimated from the statistics of the things assigned to the class. Both these actions are guaranteed to improve the classification (reduce the explanation length) or to do nothing. So the iteration converges. But not necessarily to the best classification. So a number of complications is introduced. (a) Overlap. It can be shown that the shortest message is achieved by assigning things to classes in a slightly un-obvious way. If a thing can be described far more briefly as a member of class X than as a member of any other class, it is assigned to class X. However, if class Y provides almost as good a description as class X, it may be better to assign the thing to Y. The actual choice turns out to be pseudo-random:- it depends on the attribute values of the next thing to be described. Since Snob is concerned not with actually constructing a message but with calculating the expected length of the message, the effect is that Snob does not definitely assign things to classes. Instead, it calculates the probability that the thing would be assigned to each class in an actual optimum message. These probabilities turn out to be the relative probabilities of finding such a thing in the different classes. The thing then contributes to the estimation of the parameters of each class with a weight equal to the relative probability of its being assigned to that class. The partial assignment of things to classes may be seen as a recognition of the possibility that the distributions of two or more classes may overlap in the attribute space. Snob allows for this possibility correctly. However, partial assignment is a nuisance in the early stages of the iteration, when the classes are not well distinguished. At this stage, most things are not very clearly classifyable, and so would get partially assigned to several classes. These classes would then all have statistics gathered, with only slightly different weights, from much the same set of things. Hence re-estimation of the class parameters from these statistics only slightly improves the distinctions among the classes. To speed matters up, partial assignment (overlap) is automatically inhibited in the early stages of the calculation. It may also be inhibited by you for a specified number of iteration cycles following input by you of a trial or suggested class structure. (See rdclas in Appendix.) Normally, however, the complications introduced by overlap provisions will not be of concern. (b) Subclasses. The basic iteration cannot increase the number of classes. So, Snob maintains in its class structure tables two "subclasses" for each class (or 'mainclass' to distinguish from subclasses). Subclasses are defined by relative abundances and attribute distribution parameters just as are mainclasses. When a thing is wholly or partially assigned to a mainclass, it or that part of it is also assigned to one of the subclasses of the mainclass, or partially assigned to both. The parameters of the subclasses are re-estimated in the iteration. Sufficient information about the subclasses is held to allow Snob to work out at the end of each iteration whether it would be worth while to split a class into its subclasses, thus increasing the number of classes in the classification. If so, the split is done, and each of the two new mainclasses is given a pair of subclasses, formed initially by making minor random variations in the parameters of the class itself. If a class survives 20 or so iterations without being split, it is unlikely that further iteration of its subclasses will lead to a split. However, it is possible that the class should be split, but not along the lines suggested by its subclasses, which are only a refined random split. So Snob will then destroy the subclasses, and try another pair starting from a different random split. Snob's calculation of whether a class should be split is conservative. It assumes that the two new classes would exactly replace the old class without any effect on other classes. This is unlikely, and any consequent adjustment of other classes can only improve on Snob's calculation. We therefore allow the user to force a split of a class. Following such a split, the new classes are allowed several iterations to "settle down" before any thought is given to recombining them. Splitting should be considered if Snob has found subclasses for a class which seem to make some kind of sense, and where the class is almost worth splitting according to Snob's calculation. Snob may repeatedly fail to detect a particular way of splitting a class which seems worth trying to you, because Snob only sets up a random division into subclasses, then tries to improve it by small adjustments. The 'spliton' command allows you to tell Snob to set up the subclasses of a class to correspond to a division on one attribute. Spliton requires a class number and an attribute number. If the attribute is continuous, it then reads a division value for that attribute, which serves more or less as the dividing line between the subclasses. If you specify this as zero the mean value for the whole class will be used. If the dividing attribute is discrete, spliton will read a list of one or more state values for the attribute, all on one line. Things with values in the list will be assigned to one subclass, others to the other subclass. The subclass 'ages' will be set to zero, so that Snob will for the next few iterations make an all - or - nothing assignment of the members of the class to the subclasses. Finally, spliton reads one character from a new line of input. If this is 'n', the action is complete. If it is 'y', the named class is split to form two new classes, each of the subclasses of the old class forming one of the new classes. Repeated use of spliton starting with a one-class model can be used to set up an arbitrary monothetic classification for Snob to work on. This may be easier than using the rdclas action. (c) Killing. As a result of iterative adjustment, a class may be reduced to having too few members to be sensible. If this happens, the class is killed off and its members distributed among the remaining classes. A class big enough to survive, but so small it cannot have viable subclasses, has its subs killed. In a larger class, if iteration reduces a subclass to too small a size the subclasses are killed and remade by random split. Classes may be killed off ("wiped") by your command. (d) Combining. Snob can reduce an excessive number of classes by killing off small ones. However, another tactic which is more effective is also used. Snob tries to estimate the effect on the message length of combining a pair of mainclasses into a single class. Unfortunately, an accurate estimate is impossible without actually constructing the proposed combination and seeing what gets assigned to it. It is not feasible to construct and try all possible combinations of pairs of classes. So Snob uses an approximate estimate of the effect of combining two classes (combine cost) to select a pair worth trying properly. However, even the approximate estimate is hard to compute. So Snob keeps a hash table in which it remembers class pairs which it has recently tried combining, or which it has recently estimated approximately as being too hopeless to try. Every few iterations, Snob looks for a class pair which is not in the "recent trial" table and which looks a possible candidate for combining on the approximate measure. Snob then constructs a "trial combine class" and sees how it goes during the next few iterations. If it proves to give a better message length than the two separate classes, Snob replaces the pair by their combination. During the trial of the combination, the normal iteration proceeds unaffected. The effect of the combination can be (conservatively) calculated without interfering with the two main classes being considered, or any others. Because the calculation of combine benefit is conservative, we allow forcible combining of a pair of classes. The class so formed will not be allowed to split again for several iterations, to give it a chance to prove itself useful. Whenever two classes are combined, forcibly or automatically, they become the initial subclasses of the combination. (e) Iteration control. Iterative improvement of the classification is started by the "adjust" command, with a parameter giving the number of re-estimation / re-assignment cycles to be done. During these cycles, Snob will test the possibility of killing or splitting a class after each cycle, and will also consider the possibility of effecting the combination being tried, if a trial combine class is defined. However, to save time, the possibility of creating new subclasses for classes which have none, or whose subclasses have been tried 20 times without success, is tested only every few cycles. Also, a new pair of classes to try combining is sought only after every few cycles. If any split, kill, subclass remake, or combine takes place, things are re-assigned to classes again before any re-estimation of class parameters is attempted. The "adjust" command always leaves the cycle after a re-assignment which did not result in any of the above drastic changes to class structure, and without re-estimating class parameters on the basis of the last re-assignment. It always begins the cycle by re-estimating. After each cycle, a cycle count (called a Step count) is printed, with the message lengths required (a) to describe the class structure, (b) to describe the things using a code optimised for the described structure, and (c) the total message length. The last should decrease, or at least not increase, on each step. (A small increase, of the order of 0.05 nit, may appear due to round-off errors.) If, after a cycle, Snob finds that it has been doing the current "adjust" command for at least 29 cycles, and that the current message length is not appreciably less than the average length obtained in the last 29 cycles, it will not continue with further cycles even if its step count has not reached the requested number. The possibility of setting Snob on long, continuous runs which repeat but with different random number seeds is given in the Example in Section 12. 5 Class numbering. Classes are identified by integers assigned sequentially by Snob. If a class is split, killed, combined or wiped, its number is not reused. Nor are numbers normally changed. Thus the same number will uniquely identify a class throughout its life. Classes are numbered from 3 up. At any time in a run, the set of valid class numbers may not be a compact set. Class 1 is the whole population. It is given a class number for convenience. Class 2, if it exists, is a trial combination of two real mainclasses. Subclasses do not have numbers of their own. Classes may be renumbered using a compact set of numbers starting at 3 by a user command. The new numbers will be in the same order as the old ones. 6 Report files. Various reports can be produced by Snob. If commands beginning "pr..." such as prclas are used, the report is output to system output (logical unit 6), typically your terminal. However, you may declare or change a "report file" at any time in a run. If a report file has been named, commands beginning "rep..." such as repclas will output to the report file. Report output is appended to the tail of the file. Old contents are unaffected. To open a new or existing report file, use the command repfile . It will close any open report file, then open the named file. If the file is new, it will write one heading line saying: >>Report If the file already exists, snob will check that it has such a heading line, then will move to the end of the file. The report file can be closed by using the closerep command, and will automatically be closed at the end of a run (stop), if another report file is opened, or if it is named in a restore command. 7 Actions When Snob is started, it first asks for a raw or binary data file to read. If given a raw file, it then asks for the name for a binary file, which it then writes. Thereafter, Snob asks for and performs "actions" one at a time as directed from the console or system input. Each action is specified by a lower-case word, and may have parameters. The parameter(s) may be entered on the same record (line) as the action word. If they are not, a prompt will usually be given, and the parameter(s) may then be entered on the next line. Where numeric parameters are required, the format is free as described under "data input". Where a text string (file name) is required, leading blanks are ignored, but the name may have no imbedded blanks or commas. 7.1 Alfabetic list of actions. adjust Perform n iterations of improvement, or until convergence. attflag Change attrib. n to active (f=1) or inactive (f=0). closerep Close report file. combine Forcibly combine classes m and n. file Read further actions and paras from named file, until E.O.F. help Print list of available commands. kldist (incomplete) gives symmetric class-by-class distance matrix pratt Print info on attribute(s). n= 0 -> all atts. prclas Print info on class(es). 0=all, -1=all mains, n.1=sub1, n.2=sub2, n.3=main and subs, 1=popln, 2=trial combine. prcross Cross-tabulate current classes against those in first 'repthng' or 'repmemb' report in report file named . prmemb Print members of class(es). n1=0 -> all classes. prthng Print info on thing(s). k=0 -> print best class, 1 -> all poss classes, 2 -> like 1 and print data. n=0 -> all things. random Add n random classes to class structure, usually to get started rdclas Add class(es) specified by attributes or typical members. renumber Renumber classes from 3 up. repatt Report to report file as for pratt. repclas Report to report file as for prclas. repcross Report crosstabulation as for prcross. repfile Open report file. repmemb Report to report file as for prmemb. repthng Report to report file as for prthng. restore Restore class structure from named savefile or repmemb/repthng report file. sample Set sampling fraction to f (0.0 < f <= 1.0). save Save class structure on set savefile. save Save structure on named file. Do not change set savefile. savefile Set name of savefile for auto and no-name structure saving. seed Set seed of random generator used by "random" and tie-breaker. setcross Meant to generate cross-product of classes, like prcross . Works when argument is "best" ; maybe not when arg is reportfile. split Split class n into its subclasses. spliton Set subclasses of class n by splitting on att. a. stop Save on set savefile, close report file and stop run. sum Print summary of class structure and message length. thngflag Change thing(s) to active (n pos) or inactive (n neg). End 0. wipe Destroy class n. 9999 = all classes. 8 Output formats 8.1 Format for sum (summary). *** SUMMARY WITH MAIN CLASSES *** Sample size is = * One-class length Currently Sequence Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost {then for each class from 2 up} {For class 2, the trial combine class, the info is slightly different. SubA, SubB give the numbers of the classes to be combined, not their sizes, and Spl.Ben is an estimate of the benefit of making the combination.} 8.2 Format for prclas, repclas. {If all classes or mainclasses requested, a summary is printed first.} Class report classes at sequence Length = {then for each class requested} Serial age Relab Size Class desc= = nits. {Here, names one of several quantities which may appear, {depending on the kind of class. For a normal (main) class: Attrib.len. = {For a subclass A: ThingBenefit= {For a subclass B: SplitBenefit= {For class 2, the trial combine class: CombineBenft= {For class 1, the whole population: Unclassified= {then for each attribute for which the class differs significantly from the whole population} 8.3 Format of attribute report (pratt, repatt) ATTRIBUTE REPORT classes at sequence Length = {The above two lines are omitted unless all attributes are reported} Class Popln Size {then for each attribute requested} Att Class Popln at nits on classes. Serials are: {The type field begins in col 1 and is one of: >>ST Thing {for a thing report with flag 0, showing most likely class} >>FT Thing {for a thing report with flag 1, showing all likely classes} >>DT Thing {for a thing report with flag 2, giving things' data} >>MB Member {for a member report}} {Things are listed in order of increasing magnitude of serial number.} 8.4.2 Format for prthng 0 n1.. {If n1 = 0, all things are reported. Otherwise only those things whose identifiers are given as n1 n2 ...} , , {etc., five things listed per line. List is terminated by:} 0 0, {If the classification of the thing is uncertain, its class is shewn as 1. "Uncertain" means that no one class has a 50% or better relative prob. of being correct.} 8.4.3 Format for prthng 1 n1.. Thing Length Class(Rel Prob) {then for each thing a line:} > () () {Here, descrip.len is the number of nits it takes to say which class the thing belongs to and to give its attribute values using a code optimised for the attribute distribution stated for that class. The line then contains a list of classes, up to 6, which could contain the thing. They are listed in order of decreasing probability of having such a thing, and each class number is followed in brackets by the relative prob. of the thing's being a member of that class. Probs. greater than 99% are shown as (99). No prob below 1% is shown. The line may be preceded by one or more question marks ?? to flag things whose classification is particularly dubious. Two queries "??" corresponds to "uncertain" in the sense of 9.4.2} {The report ends with:} > 0 0 End 8.4.4 Format for prthng 2 n1.. {This is the same as for prthng 1 n1.. (8.4.3) save that the line for each thing is followed by one or more lines giving its attribute values} 8.5 Format for prmemb, repmemb n1.. {If n1 = 0, all classes are shown in serial number order, beginning with a "class" of things, shewn as serial 1, whose class is uncertain in the sense of 8.4.2. The report then begins with a heading as described in 8.4.1. Otherwise, only the classes whose serials are mentioned as n1 n2 ... are dealt with. The pseudo-serial 1 may be used to get just those things whose class is uncertain. For each class dealt with:} Class {etc., up to ten things per line} {last thing in class is followed by:} 0 {If all classes reported, report ends:} 0 End of report 8.6 Format for prcross, repcross. This action constructs a cross-tabulation between the classification now current in the Snob run and a classification read in from a file. It writes to standard output a table with, basically, one column for each class found in one classification and one row for each class found in the other classification. The table entry for some column and some row is the number of things which the first classification puts into the class corresponding to that column and the second classification places in the class corresponding to that row. The file containing the comparison classification should contain a thing or member report, typically generated in this or an earlier Snob run on the same data. However, as described in section 9.1 on input of classes, the file need not exactly follow the format of a Snob repthng or repmemb report. Normally, the current classification will give the columns of the table and the comparison file the rows. However, if the former has more than 24 classes but the comparison classification does not, the current will give the rows and the comparison the columns. If both classifications have more than 24 classes, the crosstab action will not be done. The class serials of the two classifications appear as row and column labels. In addition, an extra row and column are included headed with ?? rather than a class serial. These correspond to the "uncertain" class 1 mentioned in 8.4.2, and show the numbers of things which, in the row or column classification, are given less than 50% posterior probability for any single class of the classification. Only two decimal digits are provided for each table entry. If any entry would exceed a thing count of 99, all entries are scaled down by the same factor to reduce the maximum entry to 99 or less and a message to this effect precedes the table. If any class serial in either classification exceeds 499, the crosstab is not done. This limit is set by a parameter mxser in routine compare.f. 8.7 Format for kldist kldist is currently under construction, currently dealing only with von Mises distributions. kldist gives two outputs. The first output from kldist is a symmetric class-by-class matrix whose entries are a symmetric distance measure based on the Kullback-Leibler distance. The measure is arrived at as follows: Take the two p.d.f.'s f_1 and f_2 corresponding respectively to the two Snob classes, and form a new p.d.f., 1/2(f_1 + f_2), consisting of a half each of these. We choose the corresponding parameter values best fitting this mixed distribution. These are essentially Maximum Likelihood estimates assuming the Snob classes to have been arbitrarily populous. Call this new distribution, g. The distance measure is the Kullback-Leibler distance from 1/2(f_1 + f_2) to g. This ends the first part of the output. We are then able to build a binary tree whose leaf nodes are the Snob classes. The iteration proceeds iteratively merging the two closest of the remaining nodes until we are left with just the root node. This is the second part of the kldist output. Backtracking gives the (binary) minimum spanning tree. 9 Input of classes. 9.1 Using restore to input a classification by members. The restore command provides a quick and relatively convenient means for setting up a classification which you want to evaluate or use as a starting point for the iteration. It wipes any existing class structure, then reads in a file which should contain a classification described in a format corresponding to a repthng or repmemb report. That is, it should contain either a list showing, for each thing in turn, the class to which it belongs, or a list showing, for each class in turn, the things it contains. Not all things need be mentioned, and unknown thing identifiers may be mentioned without causing error. Unknown things will of course be ignored, but the ability to accept their identifiers allows this command to set up classifications which may have been generated (as repthng or repmemb reports) on a slightly different data file. The relative abundance of each class is estimated initially from the number of members specified in the file. The properties of each class are estimated from the attributes of the input members. The classification is left so that attribute "significance" testing will be done after one adjust cycle. The specific format required for the file read by the restore command is now described. The first line of the file must be: >>Report Then there may appear any number of lines with any content until a line is found starting with one of: >>ST or >>FT or >>DT or >>MB Only the four characters ">>XX" matter, and must be at the start of the line. Characters 5 to 59 of the line may have any content, which will be printed by the restore command. Characters 60 onward must be blank unless the whole line conforms exactly to the format of the reports produced by Snob, in which case the name of the binary data file appearing in characters 60 to 79 will be checked for agreement with the current data filename. Next must come, starting on a new line but perhaps spread over several lines, the number of classes and the class serials. These integers are read in a free format which permits alpha, punctuation and other non- numeric characters to appear between numbers. Embedded blanks in numbers are not permitted. The class serials may be any set of distinct positive integers not less than 3. They need not be compact or ordered. The required format for the rest of the input depends on which of the four above flags appeared. Normally, one would use either >>ST or >>MB The others are to allow the input of reports generated by repthng 1 0 or repthng 2 0, and require a rather strict format. If >>ST occured, the rest of the data must be a list of free-format integers (format as described above). These will be read in pairs: a thing identifier followed by a class serial. The thing identifier may be negative, the minus sign being separated from the integer by zero or more blanks. Any such minus is ignored, the thing still contributing to the definition of the class. A class serial value of "1" is permitted. It is read as meaning the class of the thing is uncertain, and the thing will not contribute to the definition of any class. The list is terminated by a zero thing identifier. If >>MB occured, the rest of the data must be a list of free-format integers read as groups. Each group starts with a class serial, then has a list of (possibly signed) thing identifiers terminated by a zero. As for >>ST, a serial "1" indicates a group of things whose class is uncertain, and which will be ignored. The groups need not appear in serial number order, nor in the order in which the serials were listed at the beginning of the input. The end of the data is shown by a zero serial. If >>FT or >>DT occured, the rest of the data must be in a more rigid format. Lines not beginning with the character > will be skipped. Any line beginning with > will be read with the format (x,a2,i8,10x,i5) to yield a character*2 value, a thing identifier and a class serial. If the character*2 value is '??' the thing is ignored as being of uncertain class. The data is terminated by a line beginnining > and having zeros in both integer fields. See the output formats for repthng, repmemb. (See also the command 'rdclas' in the Appendix. Rdclas allows one or more classes to be added to an existing class structure from standard input (or a file named in a "file" command). It is rather tedious and computationally slow if used for the input of a complete classification.) 10 Comparison of classifications This section has been left blank. 11 The Significance of Message Lengths The message which Snob conceptually constructs has two major parts. The first descibes the inferred class structure, and the second describes the things in a code designed for minimum message length on the assumption that the inferred class structure is correct. It is easy to show that the length of the latter part is minus the logarithm of the probability of getting the observed data given the stated class stucture. Less clearly, the length of the former part is related to the prior probability that an arbitrary population would have the stated class structure. The assumptions on which the coding (and hence length) of the former part is based are given in some of the references and partly also in section 1. Accepting these assumptions as leading to an optimal code for the description of a class structure is equivalent to accepting that the length of the description of a class structure is minus the logarithm of the prior probability that we should wish to describe that structure, i.e., the probability that we should infer that structure from data drawn from an arbitrary population. Unfortunately, we cannot exactly identify the prior probability of inferring a particular class structure with the prior probability of a population's having that class structure. For a start, the former probability is distributed over the set of structures which may be inferred and stated (a discrete set of structures at most countably infinite), whereas the latter is a probability density distributed over a continuum of possible structures. Nevertheless, the length of the class description can be validly interpreted as the neg-log prior prob. that a population would have a class structure statistically indistinguishable (on the basis of the given sample size) from that stated. It can also be shown that the length of the whole message (structure description + encoded data) is the neg-log joint probability of finding a structure such as that stated, and getting from it the given data. It is therefore possible to regard the difference in the message lengths given by two different class structures as being the log of their posterior odds ratio. Thus, if structure A gives a message X nits longer than structure B, one may conclude that the latter is the more probable explanation of the data by a factor of the order of EXP (X). 11.1 CAVEATS. There are reservations which must be made in accepting message length differences as log odds ratios. (a) The difference, especially if the structures concerned are of different complexity (e.g. different numbers of classes), depends in part on the prior assumptions built into Snob about the distributions of class number, class abundances and attribute distribution parameters (See theory in some of the references or section 1). While these have been made fairly "colourless", there may be reason in some application to dispute them, and so to modify the message length for a given structure. We believe it is unlikely that any grounds for preferring different priors, short of extreme disagreement, would lead to more than a few nit's change in length. (b) The calculations of message length made by Snob use some approxi- mations valid for large samples, but not exact for small numbers. Some care has been taken to keep these errors small, and even for classes as small as 10 the errors will be much less than one nit. In summary, a message length difference exceeding say 5 nits may reasonably be taken as showing strong evidence in favour of the structure giving the shorter message. A difference exceeding 10 nits shows an over- whelming preference. Similarly, the "significance" values given to the attribute distributions within a class may be taken as log odds ratios in favour of the hypothesis that the attribute has a distribution in the class different form its overall distribution. In this case, the errors and prior assumptions can probably amount to only 3 or 4 nits at most, since only one attribute is involved. It must be always remembered that Snob can show preference for, and with luck find, the "best" classification ONLY WITHIN A LIMITED RANGE OF POSSIBLE MODELS. Snob, as at present written, can compare and find only models which have no intra-class attribute correlations, no hierarchic class structure, and so on. The message length function could in principle be extended to embrace models with these features, and we hope to do so later. The assumption of no intra-class attribute correlation is probably the most significant limitation on the present program. Multi-variate data may often have marked attribute correlations which are of no interest in studying the class structure of the population. For instance, if the things being classified are animals, and the attributes include a number of physical dimensions, it may be expected that all these dimensions will be positively correlated, as all depend to some extent on an overall "size" attribute. The hoped-for classification of the sample will probably be a classification of the animals into different "species" or "varieties", with size being just one of the attributes contributing to the definition of each class, and perhaps not even a very important one, since each "species" may be expected to contain individuals of various sizes, including juveniles. Snob will "notice" the correlation among the physical dimension attributes, but since it is not equipped to describe classes having within- class attribute correlations, it can exhibit the correlation only by dividing up the sample along size lines. If the sample is large enough, Snob may well discover a significant classification into "species", but it will be inclined to divide the things of each species into two or more classes, i.e., into small and large members of the "species". Where such a "nuisance" correlation is thought to be present in the data, it may be removed before the data is given to Snob. For instance, if a "size" factor is expected, one may choose or construct from the affected attributes an attribute representative of overall "size", and then divide all the affected attributes by it to remove (crudely) the size effect. The "size" attribute itself should be included in the data given to Snob. This device has been found to be quite satisfactory. A similar, but more elaborate method is to do a principle components analysis on the data, and give Snob the component scores for each thing rather than the raw attribute values. All components should be included, even those accounting for little variance, because not all attributes may be measured on commensurable scales, so "variance" may not be a very meaningful concept. This device has also been used successfully, but the cruder ad-hoc removal of factors seems to be just about as good. A theory of MML (Minimum Message Length) factor analysis has been developed (see more recent references), and, as above, it is desired that this be incorporated into Snob. 12 EXAMPLE This example shows the use of many snob features in classifying an artificial data set. The data was constructed by pseudo-random draws from a three-class population. Four additional inactive things were added by hand. Their serials are 1-4. The serials of the things drawn from the "true" population have as their first digit the "true" class from which they were drawn. The "true" population parameters are that all classes are equally abundant, and the attributes have the distributions within the classes shown below: Class || Att 1 (continuous)|| Att 2 (3-state) || Att 3 (2-state) || Mean S.D. || State1 | State2 | State3 || State1 | State2 || || | | || | || 1 || 0 10 || 0.8 | 0.1 | 0.1 || 0.5 | 0.5 || || || | | || | || 2 || 30 10 || 0.1 | 0.8 | 0.1 || 0.5 | 0.5 || || || | | || | || 3 || 50 10 || 0.2 | 0.1 | 0.7 || 0.5 | 0.5 || The pseudo-random sampling that was done to generate sd1.raw from this distribution does not guarantee that the sample of 300 things has sample statistics exactly agreeing with these parameters. Note that attribute 3 does not correlate with the class structure. Snob discovers this fact for itself. The data file is named sd1.raw . It is included in this distribution package. A partial copy follows. (N.B. Lines in this copy and copies of other files herein described which start with a sharp # are comment lines edited in while preparing this document. They are not part of the files proper, and should be removed before attempting to use the files. Comment may also follow a sharp on a real line and should be removed also. Unedited copies of all useful input files are included in the distribution package.) # RAW DATA FILE "sd1.raw" # 3 # number of attributes 1 1 .10 # first att. is active, continuous, and is measured to precision 0.1 1 2 3 # second att is active, discrete, and has three states 1 2 2 # third is active, discrete, with two states -1 20 2 2 -2 0 1 1 -3 50 3 2 -4 15 2 - # first 4 things are inactive. Thing 4 has no value for att. 3 2000 28.12 2.00 2.00 2001 43.71 1.00 2.00 2002 22.61 2.00 1.00 3003 61.50 3.00 2.00 3004 51.09 3.00 1.00 2005 20.98 2.00 2.00 2006 16.14 3.00 2.00 3007 56.70 3.00 1.00 1008 -8.83 3.00 - 1009 -2.02 - 1.00 # these "missing values" were edited in just to show what happens 3010 53.73 2.00 1.00 3011 38.10 1.00 2.00 3012 43.97 3.00 1.00 3013 54.10 1.00 2.00 3014 56.22 1.00 2.00 2015 23.35 1.00 2.00 1016 -4.40 1.00 1.00 # note discete vals may be input as reals, and are rounded to nearest integer # continuous vals may be input as integers. 2017 42.73 2.00 2.00 3018 46.95 3.00 2.00 3019 53.33 3.00 2.00 # etc. etc. to a total of 304 things, 300 being active. The file ends:- 1296 -1.83 1.00 2.00 2297 24.18 2.00 2.00 1298 9.25 2.00 1.00 1299 -5.12 1.00 2.00 0 # terminates data # # # END OF sd1.raw From the header, above, of the raw data-file, sd1.raw, we observe (from the second column) that the "kind" of a continuous attribute is 1 and the kind of a discrete attribute is 2. The kind of an unbounded non-negative (Poisson) integer attribute is 3, and that of a circular attribute is 4. A Poisson attribute (kind 3) needs a 3rd column in the header. This gives the no. of the continuous "base" attribute whose value is the time or length of the interval in which the Poisson counts were made. A circular attribute (kind 4) has a 3rd column in the header - 0 for degrees and 1 for radians. A fourth column gives the measurement accuracy in the specified units, as with the 3rd column for a continuous attribute. A run was done 10/08/95 on this data. The actual input to Snob (which was read in by Snob via data-set, sd1.raw, above) is shown below. A copy is included, without comments, as file "sd1.raw". We then show the output from this run further below. # INPUT FILE sd1.raw sd1 # response was sd1 file sd1.con # This command will cause Snob to read subsequent commands from the file # "sd1.con", which is shown later. An unedited copy of sd1.con is included. # After completing the commands in sd1.con, Snob returned to logical unit 5 # for command input, and read the commands as under:- repfile sd1.rep # Open file sd1.rep to add new reports # Opened with STATUS = "UNKNOWN", then advanced to End Of File. repclas 1 # Report whole population repclas -1 # Report all main classes repatt 0 # Report all attributes repthng 1 0 # Report without data values all things prthng 2 1 2 3 4 # Print to L.U. 6 with data things 1 to 4 stop # # END OF RUN . An example is given below of a control file, sd1.con . Examples exist of other control files, as described in the reference list. Placing the command "file sd1.con" at the end of the control file 'sd1.con' causes the control file to repeatedly call itself (with different random number seeds). COMMAND FILE sd1.con INVOKED BY file COMMAND IN snob.in # # While reading commands from such a file, Snob will not give the usual # prompts on L.U. 6. This may be seen on the output which is shown later. random 3 # Start with 3 random classes (a lucky guess!) savefile sd1.sav # Name the default savefile adjust 10 # Start work. Do 10 adjust cycles. sum # Summary to L.U. 6. save temp.sav # As will be seen from the output, after 10 adjust steps, a 3-class # structure had been found. This is saved on a savefile called temp.sav wipe 3 # Destroy class 3 (classes at this stage are numbered 3,4,5) adjust 4 # Four iterations on the remaining 2- class structure sum adjust 6 sum prclas 2 # have a look at the trial combine class (as it happened,) # (no trial combine class was defined at this stage) # prclas 3 would have a look at the class with serial 3 adjust 30 sum renumber # Class numbers were getting untidy sum # To show effects of renumbering save # On default file sd1.sav restore temp.sav # Restore the structure saved before wiping class 3 random 4 # Create a total of 4 random classes (thereby making the adjust 10 # (previous command pointless) sum # Not very exciting. Restore from default savefile restore sd1.sav # Return to system input for further commands # # END OF CONTROL FILE sd1.con The output of this run comprises two parts. During the run, prompts, summaries, and comments on what Snob was doing are output to L.U. 6, normally the user's terminal, or, in a batch run, system output. This output is shown below, with a few comments added. Some lines of routine progress info. have been deleted. In reading this, remember that on a terminal the prompts etc. will be properly interleaved with the input commands. LOGICAL UNIT 6 OUTPUT FROM Snob run # Enter data file name without ".raw" or ".bin" # reply was " sd1 " datfil = sd1.raw Length of unclassified description = 2588.40 nits # This length includes no description of the popln. attribute distributions Length of description of one class is 15.41 nits Message length for one-class model is 2603.81 nits Writing binary data to sd1.bin Data dated Thu Aug 10 23:37:02 1995 Room for 250 classes. # depends on length of data. if inadequate, must re-define space in # source common declarations file " snob.com " and recompile all routines Classification of 304 things of which 300. are active. Number of attributes is 3 Save dated Tue Aug 8 14:10:06 1995 Classes restored from sd1.bst *** SUMMARY WITH 3 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2494.5 Sequence 24 Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 6 103 0.343 14 27 76 13.2 843.6 -14.5 853.4 7 84 0.281 14 42 42 13.4 671.8 -14.5 687.5 8 113 0.376 11 52 61 13.3 936.6 -12.3 943.9 Enter action # reply was " file sd1.con " Control input file sd1.con Wiping all classes Adding 3 random classes, Seed = -1329867322 3 mains. Class cost= 44.45 Thing cost= 2763.77 Total= 2808.22 # automatic report of result of "random" command Opening save file sd1.sav Beginning 10 adjust cycles From setcom: Considered 0 pairs but none any good # reporting that all classes were so new that none were eligible for # consideration for combining together Killing class serial 3 Step 1 Class 31.6 Things 2528.4 Total length = 2560.00 nits. # after the first iteration, the total class descriptions cost 31.6 nits, # the encoded description of the (active) thing data cost 2528.4 nits. Step 2 Class 31.6 Things 2520.2 Total length = 2551.80 nits. Step 3 Class 31.7 Things 2513.8 Total length = 2545.49 nits. Step 4 Class 31.8 Things 2509.0 Total length = 2540.86 nits. From setcom: Considered 0 pairs but none any good # classes still too young Step 5 Class 32.0 Things 2506.0 Total length = 2537.98 nits. Step 6 Class 28.2 Things 2505.1 Total length = 2533.30 nits. Step 7 Class 28.1 Things 2504.8 Total length = 2532.83 nits. Remaking subclasses of 4 Remaking subclasses of 5 Step 8 Class 28.1 Things 2504.7 Total length = 2532.79 nits. Step 9 Class 28.1 Things 2504.7 Total length = 2532.79 nits. Step 10 Class 28.1 Things 2504.7 Total length = 2532.78 nits. *** SUMMARY WITH 2 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2532.8 Sequence 36 # the classification is much better than an unclassified description Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 4 194 0.645 10 107 87 13.1 1630.7 -16.2 1641.6 5 106 0.355 10 61 45 13.3 874.0 -14.5 884.0 Classes saved on temp.sav dated Thu Aug 10 23:37:11 1995 From wipe: 3 not a valid class # automatic report of result of "wipe 3" - in this case, there was no 3 , 2 mains. Class cost= 28.06 Thing cost= 2504.72 Total= 2532.78 Beginning 4 adjust cycles Step 1 Class 28.1 Things 2504.7 Total length = 2532.78 nits. From setcom: Considered 1 pairs. Trying 4 5 Est. ben.= -59.81 # setcom now has two classes old enough for it to look at, but no go. Splitting class 4 for benefit 1.4 New classes 6 and 7 Step 2 Class 45.4 Things 2486.0 Total length = 2531.36 nits. Step 3 Class 44.3 Things 2481.2 Total length = 2525.44 nits. Step 4 Class 43.6 Things 2478.4 Total length = 2522.01 nits. *** SUMMARY WITH 3 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2522.0 Sequence 41 Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 5 107 0.354 14 40 67 13.3 882.5 -14.2 892.5 6 106 0.362 2 ---- ---- 13.6 891.8 ---- 913.6 7 87 0.284 2 ---- ---- 14.3 704.0 ---- 717.2 Beginning 6 adjust cycles Step 1 Class 43.2 Things 2474.8 Total length = 2517.95 nits. From setcom: Considered 0 pairs but none any good # classes too young again Step 2 Class 42.9 Things 2470.1 Total length = 2512.98 nits. Remaking subclasses of 6 Remaking subclasses of 7 Step 3 Class 42.7 Things 2465.4 Total length = 2508.13 nits. Step 4 Class 42.6 Things 2461.8 Total length = 2504.37 nits. Step 5 Class 42.5 Things 2458.9 Total length = 2501.33 nits. Step 6 Class 42.4 Things 2456.1 Total length = 2498.42 nits. *** SUMMARY WITH 3 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2498.4 Sequence 48 Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 5 110 0.365 20 30 81 13.3 912.1 -10.5 919.9 6 90 0.307 8 31 59 13.1 728.1 -12.7 749.6 7 100 0.328 8 63 36 13.4 815.8 -13.8 826.7 From prclas: 2 is not a current class serial Beginning 30 adjust cycles # etc. etc. with classes being too young for setcom until:- From setcom: Considered 2 pairs. Trying 5 6 Est. ben.= -77.74 # from an estimation of the effects of combining a pair of classes, of all # 3 possible pairs and the 2 pairs considered, the 5-6 pair looked best, # so a trial combined class is set up as class 2 to see how it would go. Step 1 Class 42.3 Things 2453.7 Total length = 2496.06 nits. Step 2 Class 42.3 Things 2452.6 Total length = 2494.98 nits. Step 3 Class 42.4 Things 2452.3 Total length = 2494.69 nits. Step 4 Class 42.4 Things 2452.2 Total length = 2494.62 nits. From setcom: Trial combine of classes 5 6 had benefit -60.19 nits From setcom: Considered 1 pairs. Trying 5 7 Est. ben.= -129.30 Step 5 Class 42.4 Things 2452.2 Total length = 2494.60 nits. Step 6 Class 42.4 Things 2452.2 Total length = 2494.58 nits. Step 7 Class 42.4 Things 2452.2 Total length = 2494.57 nits. Kill subs of 5 because 99 Step 8 Class 42.4 Things 2452.1 Total length = 2494.57 nits. From setcom: Trial combine of classes 5 7 had benefit -97.09 nits # reported as a matter of interest before abandonning the pair in favour of # another trial pair From setcom: Considered 0 pairs but none any good Step 9 Class 42.4 Things 2452.1 Total length = 2494.56 nits. Step 10 Class 42.4 Things 2452.1 Total length = 2494.56 nits. Step 11 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 12 Class 42.4 Things 2452.1 Total length = 2494.55 nits. From setcom: Considered 0 pairs but none any good Step 13 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 14 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 15 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 16 Class 42.4 Things 2452.1 Total length = 2494.55 nits. From setcom: Considered 1 pairs. Trying 6 7 Est. ben.= -49.90 Kill subs of 6 because 99 Kill subs of 7 because 99 Step 17 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Remaking subclasses of 5 Remaking subclasses of 6 Remaking subclasses of 7 # because these subclasses have been around for at least 20 iterations, # and have not proved worth a split, Snob remakes the subclasses starting # from a new random division of the class into halves, in the hope of # finding something more interesting. Step 18 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 19 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 20 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 21 Class 42.4 Things 2452.1 Total length = 2494.55 nits. From setcom: Trial combine of classes 6 7 had benefit -39.47 nits From setcom: Considered 0 pairs but none any good Step 22 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 23 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 24 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 25 Class 42.4 Things 2452.1 Total length = 2494.55 nits. From setcom: Considered 0 pairs but none any good Step 26 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 27 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 28 Class 42.4 Things 2452.1 Total length = 2494.55 nits. Step 29 Class 42.4 Things 2452.1 Total length = 2494.55 nits. From setcom: Considered 1 pairs. Trying 5 6 Est. ben.= -78.94 Step 30 Class 42.5 Things 2452.1 Total length = 2494.55 nits. Cost has not improved in last 29 steps. Adjust stopped *** SUMMARY WITH 3 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2494.6 Sequence 79 Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 2 200 0.657 1 5 6 13.0 -61.4 1737.6 5 113 0.376 50 37 76 13.3 936.6 -11.8 943.9 6 84 0.281 38 75 10 13.4 671.8 -10.7 687.5 7 103 0.343 38 77 25 13.2 843.7 -14.1 853.5 # essentially the same classification as was restored from the .bst file # Renumbering classes from 3 to 5 *** SUMMARY WITH 3 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2494.6 Sequence 79 Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 3 113 0.376 50 37 76 13.3 936.6 -11.8 943.9 4 84 0.281 38 75 10 13.4 671.8 -10.7 687.5 5 103 0.343 38 77 25 13.2 843.7 -14.1 853.5 Classes saved on sd1.sav dated Thu Aug 10 23:37:19 1995 Save dated Thu Aug 10 23:37:11 1995 Classes restored from temp.sav Wiping all classes Adding 4 random classes, Seed = 1011443978 4 mains. Class cost= 56.99 Thing cost= 2633.66 Total= 2690.65 Beginning 10 adjust cycles Step 1 Class 61.4 Things 2473.5 Total length = 2534.90 nits. Step 2 Class 61.7 Things 2461.2 Total length = 2522.90 nits. Step 3 Class 61.7 Things 2457.8 Total length = 2519.52 nits. From setcom: Considered 0 pairs but none any good Step 4 Class 61.4 Things 2456.7 Total length = 2518.07 nits. Step 5 Class 60.9 Things 2456.5 Total length = 2517.42 nits. Step 6 Class 54.9 Things 2455.9 Total length = 2510.78 nits. Step 7 Class 47.8 Things 2453.3 Total length = 2501.12 nits. From setcom: Considered 6 pairs. Trying 4 5 Est. ben.= -11.04 # note that the trial combination of classes 4 and 5 is now showing a # "combine benefit" of -11.04, which, while still negative, is potentially # promising. Remaking subclasses of 3 Remaking subclasses of 4 Remaking subclasses of 6 Combining 4 5 into 7 benefit 4.20 # the potentially promising join has been seen to prove good! Step 8 Class 42.7 Things 2452.9 Total length = 2495.62 nits. Step 9 Class 42.8 Things 2452.2 Total length = 2494.97 nits. Step 10 Class 42.7 Things 2452.1 Total length = 2494.81 nits. *** SUMMARY WITH 3 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2494.8 Sequence 48 Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 3 105 0.352 10 67 38 13.2 865.4 -14.6 875.1 6 80 0.264 10 39 41 13.8 633.2 -14.0 647.7 7 115 0.384 3 95 20 13.3 953.4 -4.6 960.9 Save dated Thu Aug 10 23:37:19 1995 Classes restored from sd1.sav Control input file 5 # noting the end of sd1.con, and automatic return to L.U. 5 for more # commands. If the command 'file sd1.con' were at the end of the file # 'sd1.con', then this control file would repeatedly call itself. Enter action # Prompts will now appear, for terminal input. Opening report file sd1.rep Enter action repclas # note Snob repeats these command words. Every command results in some # output to L.U. 6. Enter action repclas Enter action repatt Enter action repthng Enter action Thing Length Class(Rel Prob) > -1 8.1 4(97) 3( 3) # Even inactive things are classified in a prthng or repthng output. # Thing 1 is shown as having a prob. of 97% of belonging to class 3 and # 3% of belonging to class 5. 20.00 2 2 # the data values for thing 1. > 0 0 End Thing Length Class(Rel Prob) > -2 7.5 3(99) 0.00 1 1 > 0 0 End Thing Length Class(Rel Prob) > -3 7.4 5(99) 50.00 3 2 > 0 0 End Thing Length Class(Rel Prob) >? -4 8.5 4(82) 3(18) 15.00 2 Missing > 0 0 End # the query ? shows that the most probable classification of thing 4 has # less than 90% prob. of being correct. # note also the Missing value. # Enter action Classes saved on sd1.sav dated Thu Aug 10 23:37:49 1995 End of run. # # END OF L.U. 6 OUTPUT The end of the run leaves the data file sd1.raw unaltered. A new file sd1.bin has been made holding binary data, and savefiles temp.sav and sd1.sav have been made or overwritten with saved class structures. The report file sd1.rep has been made, if it did not already exist, and reports have been appended to it. REPORTS APPENDED TO sd1.rep (a full unedited version is included as osd1.rep . If you want it and can't find it, please contact us.) # # The first report comes from "repclas 1" # it gives the attribute distribution parameters for the whole population. # the "sig" values are meaningless. # Class report 3 classes at sequence 78 Length= 2494.5 **---****---****---****---****---****---****---****---****---****---** Serial 1 age 1 Relab 1.000 Size 300. Class desc= 15.41 nits. Unclassified= 2588.40 nits Att 1 vals 300. description cost 7.51 sig 1.0 Mean 25.4432 S.D. 22.7260 Att 2 vals 299. description cost 4.90 sig 1.0 1: 0.3943 2: 0.2845 3: 0.3211 # these are the relative probs of the 3 states of this discrete variable # the "vals" shows that 299 active things had a value for this variable Att 3 vals 299. description cost 3.00 sig 1.0 1: 0.4917 2: 0.5083 # # # The next report comes from "repclas -1" which reports all main classes # Such a report is automatically preceded by a summary # *** SUMMARY WITH 3 MAIN CLASSES *** Sample size is 300 = 1.000 * 300 One-class length 2603.8 Currently 2494.5 Sequence 78 Class Size R/ab Age SubA SubB Descrip Att.Len Spl.Ben Rawcost 3 113 0.376 50 27 86 13.3 936.6 -10.1 944.0 4 103 0.343 37 37 66 13.2 843.9 -12.3 853.6 5 84 0.281 37 47 38 13.4 671.6 -13.7 687.3 Class report 3 classes at sequence 78 Length= 2494.5 **---****---****---****---****---****---****---****---****---****---** Serial 3 age 50 Relab 0.376 Size 113. Class desc= 13.29 nits. Attrib. len.= 936.64 nits .01% Att 1 vals 113. description cost 7.29 sig 100.9 Mean 0.6199 S.D. 10.7938 .01% Att 2 vals 112. description cost 4.82 sig 39.1 1: 0.8152 2: 0.0638 3: 0.1210 # the 0.01% shows that these distribution paras are very different from # those for the whole population. # Att 3 is not reported because its distribution in class 3 is like its # distribution in the whole population. **---****---****---****---****---****---****---****---****---****---** Serial 4 age 37 Relab 0.343 Size 103. Class desc= 13.19 nits. Attrib. len.= 843.85 nits .01% Att 1 vals 103. description cost 7.56 sig 120.5 Mean 49.8399 S.D. 7.5005 .01% Att 2 vals 103. description cost 4.39 sig 29.3 1: 0.2182 2: 0.0807 3: 0.7011 **---****---****---****---****---****---****---****---****---****---** Serial 5 age 37 Relab 0.281 Size 84. Class desc= 13.44 nits. Attrib. len.= 671.58 nits .01% Att 1 vals 84. description cost 7.47 sig 57.3 Mean 28.8991 S.D. 6.7829 .01% Att 2 vals 84. description cost 4.65 sig 52.9 1: 0.0500 2: 0.8267 3: 0.1232 # # # The next report comes from "repatt 0" and reports all attributes 1 ATTRIBUTE REPORT 3 classes at sequence 78 Length = 2494.55 Class Popln 3 4 5 Size 300. 113. 103. 84. # sizes of active popln. and classes Att 1 .01%s .01%s .01%s #Attribute 1 significance indicators Class Popln 3 4 5 Mean 25.4 0.6 49.8 28.9 S.D. 22.7 10.8 7.5 6.8 Att 2 .01%s .01%s .01%s Class Popln 3 4 5 1 0.394 0.815 0.218 0.050 # Probs of state 1 in popln and classes 2 0.285 0.064 0.081 0.827 3 0.321 0.121 0.701 0.123 Att 3 Not significant in any class # For distribution in popln., see repclas 1 report. # # # # The next report comes from "repthng 1 0" and reports without data vals # on all things. Note the things are sorted on serial number. # Since the serials (after 4) have "true class" as first digit, the Snob # classes come out more or less in order. >>FT Thing at Sun Oct 15 14:22:39 1995 2494.55 nits on sd1.bin 3 classes. Serials are: 3 4 5 Thing Length Class(Rel Prob) > -1 8.1 5(97) 3( 3) > -2 7.5 3(99) > -3 7.4 4(99) >? -4 8.5 5(82) 3(18) > 1008 9.1 3(99) > 1009 7.3 3(99) # "true" class 1 = Snob class 5 > 1016 7.6 3(99) > 1023 8.0 3(99) Thing Report 3 classes at sequence 48 Length= 2494.5 Class serials are 3 4 5 Thing Length Class(Rel Prob) 1016 7.6 5(99) 1023 8.0 5(99) # etc. etc. until :- >? 1051 9.1 3(81) 5(19) # a bit dubious # then more of the same until :- > 1077 7.8 5(99) # A missclassification! # the data for this thing are 22.26, 2, 1, which make it look like class 5 # A few such sampling accidents have occurred in generating the test data. # # then more until :- >? 1081 9.4 5(76) 3(24) # one on the borderline # etc. till :- > 1296 7.5 3(99) >? 1298 10.1 3(77) 5(23) > 1299 7.6 3(99) > 2000 7.3 5(99) > 2001 8.8 4(97) 5( 3) > 2002 7.8 5(99) > 2005 8.0 5(98) 3( 2) >? 2006 10.0 3(64) 5(36) >? 2015 9.3 3(67) 5(33) >? 2017 9.0 5(64) 4(36) > 2022 7.4 5(99) > 2025 7.7 5(98) 4( 2) > 2027 7.4 5(99) # we have hit members of "true" class 2. These are mostly put in Snob # class 5, but a few leak over into classes 3 or 4 >? 2028 9.6 5(81) 3(18) >? 2032 8.6 4(75) 5(25) > 2035 7.6 5(98) 4( 2) > 2037 8.1 3(99) > 2038 7.5 5(99) >? 2039 8.3 4(88) 5(12) >? 2042 9.1 3(83) 5(16) >? 2043 8.4 5(87) 4(13) > 2047 7.7 5(98) 4( 2) > 2049 7.4 5(99) > 2052 7.5 5(99) >? 2054 8.4 4(83) 5(17) > 2060 7.3 5(99) > 2064 7.4 5(99) > 2068 7.4 5(99) > 2080 7.9 5(98) 3( 2) > 2082 7.3 5(99) > 2084 7.6 5(98) 4( 2) > 2089 7.4 5(99) >? 2092 9.0 5(69) 4(30) > 2094 7.4 5(99) >? 2097 9.0 5(81) 4(18) > 2100 7.4 5(99) # etc. etc till :- >? 2169 9.7 5(66) 3(28) 4( 6) # This one is quite uncertain. Its data are 28.93, 1, 1 # then more till :- > 2282 7.7 5(99) >? 2283 9.8 5(50) 4(44) 3( 6) # gets one query because its highest prob is only 50%, # would get two querys if its highest prob were less than 50% > 2284 7.5 5(99) > 2290 8.8 5(91) 3( 9) >? 2295 9.7 4(57) 5(38) 3( 4) > 2297 7.5 5(99) > 3003 8.6 4(99) > 3004 7.4 4(99) > 3007 7.8 4(99) > 3010 9.7 4(99) >? 3011 9.5 4(77) 5(22) 3( 2) > 3012 7.7 4(98) 5( 2) > 3013 8.7 4(99) # "true" class 3 corresponds to Snob class 4 > 3014 8.9 4(99) > 3018 7.4 4(99) # and so on to :- > 3278 8.6 4(99) > 3281 7.5 4(99) > 3287 8.8 4(99) > 3291 7.8 4(97) 5( 3) > 0 0 End # finally, the "stop" command closes the report file after writing: End of run. # # END OF sd1.rep REPORT FILE 13 LIMITATIONS 13.1 ON SIZE OF PROBLEM (a) Number of attributes: 199 max. Set by parameter "most" in include file "snob.com". A change will require a recompilation of almost all the routines in Snob. (b) "Arity" of Snob discrete multi-state attributes: 50 max. Set by the parameter "mststt" in the file "MAIN.f". (c) Total storage required internally depends basically on a formula like: (atts + k1) * things + (atts + k2) * classes * k3 The space available is set by a parameter "lstore" in MAIN.f, at present set at 3000000 (32-bit words). In a virtual-memory paging operating system, this parameter can be above what is really needed for the problem in hand without incurring a cost. After reading raw data, Snob will say if the space is not enough or, if it is enough, will say how many classes there is room for. A change to the parameter should require only MAIN.f to be recompiled. A re-link is then needed. 13.2 ON FILE NAMES Limited to 20 characters. To change this will require finding all character*20 declarations in MAIN.f, smain.f and snob.com and changing them. There are not many. 13.3 ON CLASSES The number of main classes cannot exceed the value of the parameter "mstman" in MAIN.f (at present set at 100 or 250) or such lesser number as could be (but, at present, is not) announced immediately after the data is read in. An increase (with recompilation) of the store size will allow the maximum number of classes with a large data set to be lifted to, but not above, mstman. (The number of (main)classes cannot exceed mstman, and is also limited by the value assigned to the size of working store, lstore. lstore is, in turn, clearly limited by the memory capacity of the machine being used.) 13.4 ON NATURE OF PROBLEM This version of Snob is restricted to describing the population as a one-level (non-hierarchic) set of classes having in each class simple, uncorrelated attribute distributions. However, the limitations on most other classification programs are equally severe, often less realistic, and usually much more obscure. ACKNOWLEDGEMENTS This work was supported by Australian Research Council (ARC) Grant A49330656 and ARC 1992 small grant 9103169 . REFERENCES -- MAIN REFERENCES C S Wallace and D L Dowe (1994), "Intrinsic classification by MML - the Snob program", Proc. 7th Australian Joint Conference on Artificial Intelligence (UNE, Armidale, NSW, Australia, November 1994), World Scientific, pp37-44. Wallace, C.S. and D.L. Dowe (1996). MML Mixture Modelling of Multi-State, Poisson, von Mises Circular and Gaussian Distributions. p197, Proc. Sydney International Statistical Congress (SISC-96), Sydney, Australia, July 1996. C S Wallace and D L Dowe (1997), "MML mixture modelling of multi-state, Poisson, von Mises circular and Gaussian distributions", Proc. 6th International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, Florida, U.S.A., Jan. 1997, pp529-536. -- LONGER LIST OF REFERENCES Wallace, C.S. and D.L. Dowe (1996). MML Mixture Modelling of Multi-State, Poisson, von Mises Circular and Gaussian Distributions. p197, Proc. Sydney International Statistical Congress (SISC-96), Sydney, Australia, July 1996. D.L. Dowe, L. Allison, T.I. Dix, L. Hunter, C.S. Wallace and T. Edgoose, 'Circular clustering of protein dihedral angles by Minimum Message Length', Proc. 1st Pacific Symposium on Biocomputing (PSB-1), HI, U.S.A., 1996, pp242-255. D.L. Dowe, L. Allison, T.I. Dix, L. Hunter, C.S. Wallace and T. Edgoose, 'Circular clustering by Minimum Message Length of protein dihedral angles', Technical Report #95/237, Dept of Computer Science, Monash University, Clayton, Vic 3168, Australia. C.S. Wallace, `Multiple Factor Analysis by MML Estimation', Technical Report #95/218, Dept of Computer Science, Monash University, Clayton, Vic 3168, Australia. C S Wallace and D L Dowe, "Intrinsic classification by MML - the Snob program", Proc. 7th Australian Joint Conference on Artificial Intelligence (UNE, Armidale, NSW, Australia, November 1994), World Scientific, pp37-44 . C S Wallace and D L Dowe, `Estimation of the von Mises concentration parameter using Minimum Message Length', 12th Australian Statistical Society conf. 1994, Monash University, Melbourne, Australia, 1994 . C S Wallace and D L Dowe, "MML estimation of the von Mises concentration parameter", Technical Report #93/193, Department of Computer Science, Monash University, Clayton 3168, Australia. J.D. Patrick, 'Snob: A program for discriminating between classes', Tech Rept #91/151, Dept of Computer Science, Monash University, Clayton, Vic 3168, Australia. C.S. Wallace, `Classification by Minimum-Message-Length Inference', S.G. Akl et al (eds.) Advances in Computing and Information - ICCI'90, Niagara Falls, Lecture Notes in Computer Science, No.468, Springer-Verlag, pp 72-81, 1990. Wallace.C.S, Freeman.P.R., `Estimation and Inference by Compact Coding', The Journal of the Royal Statistical Society, Series B, Methodology, 49, 3, 1987, pp 223-265. Wallace.C.S., `An Improved Program for Classification', ACSC-9, vol 8, no 1, pp 357-366, February 1986. Wallace C.S. & Boulton, D.M., `An Information Measure for Classification' \fIComputer Journal\fP, Vol.11, No.2, 1968, pp 185-194. APPENDIX Ap.1 Sampling. Snob has a scheme intended to speed up the early stages of an analysis of a data set having a large number of things. A sampling fraction can be set by the user. If it is set less than one (the default is one), then Snob will on each iteration assign only that fraction of the sample to classes, and will ignore the rest of the things. The things used are chosen randomly from the whole sample, and a different selection is made on each iteration. Because most of Snob's time on a large data set is used in the assignment of things to classes, sampling cuts the time for each iteration. Sampling imposes a fairly stringent test of stability on the class structure, since classes estimated from one subset of things are used on the next iteration to classify a different subset. Also, the iterative process only converges in a stochastic sense. Thus sampling is probably useful only in the early stages of an analysis, while the broad features of the class structure are being discovered. The sample fraction should be reset to one to tidy up. Snob will not reduce the sample size to less than 100 things. Ap.2 Using rdclas to describe class(es) by attributes or members. Note that rdclas does not currently permit von Mises attributes (see label 700 in rdclas.f). To begin a Snob analysis, it is necessary to have some initial class structure to begin with. If nothing is known about the population which would suggest a possible classification, one should use the "random" action to create a few initial classes. A reasonable start can usually be got by making 3 or 4 random classes. If a possible classification can be suggested a priori, one may input the class structure describing it by using the rdclas action. Following a rdclas action word, Snob will prompt with: Enter relab for class and incode: 1=atts, 2=things The class number is assigned by Snob. The response to this prompt should be, in free format and beginning on a new record, a relative abundance for the class in the range >0 to 1.0, followed by 1 or 2. If a relative abundance of 1 is given, Snob will make the relative abundance equal to 1 (if this class will be the only class) or 1.0 / (number of extant classes). At the end of the rdclas action, all class relative abundances are re-normalised, so the effect of specifying a relab of 1 is to make the class the same size as the average of the other classes already present. The "incode", 1 or 2, says how you will specify the class. Incode 1: input by attributes: Snob will ask for an attribute number. Respond with an integer, from 1 up to the number of attributes. Snob will then ask for mean and S.D. if the att. is continuous, or a list of state probabilities if it is discrete. For a continuous att., you may give an S.D. of zero, in which case Snob will assume an S.D. of half the population S.D. for the attribute. Discrete state probs need not be normalised: Snob will do that. A zero state prob. will be replaced by 0.001. Note all state probs must be entered as one record. Snob will then ask for another attribute number. You can specify the distributions for as many atts. as you like, in any order. You may even respecify an att. already entered to correct an error. Respond with a zero att. number to terminate input of the class description. Atts. of the class not specified will be given the population-wide distributions. That is, they will be assumed to be insignificant for the class. Incode 2: input by things: Snob will ask for a list of thing serial numbers, to be entered one per line. You may enter as many valid serial numbers as you like, terminating with zero. Snob will then estimate a class distribution from the attribute values of the things you have named. However, Snob will not assume that the number of things you named is an indication of the abundance of the class: it takes the relative abundance figure you entered, and treats the named things as being only typical members of the class, not an exhaustive list. Having finished input of one class, Snob will ask for the relative abundance and incode for another class. In one rdclas action, you can enter as many classes as you like, with any mixture of incodes. Terminate input of classes by responding 0,0 When you terminate class input, Snob will ask for "Number of cycles to inhibit overlap". If you input , for the next n iteration cycles, Snob will assign each thing wholly to that class (and subclass) most likely to contain such a thing, rather than partially assigning it to all classes in proportion to their probabilities of holding such a thing. During these cycles the message length will not be correct, as all-or-none assignment does not give the shortest possible length. However, if you have input additional classes, or a whole class structure, with only a loose description of each class using, say, only ony or two attribute distributions for each class, or a small set of "typical" things, the normal (partial) assignment rule could lead to very slow convergence, or the killing of some of the new classes. If the input classes are loosely described, many things may plausibly belong to any of several different classes. Partial assignment of these things to all of these classes would leave the classes still poorly distinguished, and the most abundant could grab most of the things simply because it is the most abundant. Non-overlap assignment forces a clear distinction to be made among the classes, and leads to a rapid convergence towards a structure in which all classes are distinct. This structure is not, of course, exactly right, but it will be a good starting point for improvement when overlapped assignment is resumed. About 4 or 5 non- overlap cycles should be enough in these circumstances. On the other hand, you may have input a class structure which you wish to test on the message-length measure. In that case, you will want Snob to calculate the total message length correctly before it mucks about with your stucture trying to "improve" it. You should then ask for zero non-overlap cycles.