adtree and xadtree
user guide
F. De Comité, D. Devigne
1 Introduction
adtree is an implementation of Freund and
Mason's Alternating Decision Tree algorithm [FM99],
extended to multilabel
classification [CGT01].
xadtree is a
graphical interface designed to simplify the use of the algorithm.
2 Requirements
adtree is written in standard C, and should compile and run on any
platform, provided the following software is available:
- dot : the graph drawing program, downloadable at
http://www.research.att.com/sw/tools/graphviz
- display : an image processing and display program shipped with
almost all Linux distributions.
- gtk : xadtree was developed under gtk+ version 1.2.6
release 1 (libgtk+1.2-devel-1.2.10-25mdk).
3 Installation
- The program sources and documentation are in the file
http://www.grappa.univ-lille3.fr/ftp/Softs/ADTree.tgz.
- Uncompress the file : tar -xvzf ADTree.tgz
- This will create a directory (ADTree) containing two
subdirectories :
- doc : the documentation in PostScript and HTML.
- src : the program sources.
and a Makefile.
- Go into ADTree (cd ADTree), then compile both programs : make
- If compilation succeeds, both programs, adtree and xadtree, are now
in the ADTree directory. Put them with your other programs, for
example in ~/bin or /usr/local/bin.
In case of problems, send us an email with the error message : francesco.de-comite@univ-lille1.fr.
4 Data Format
The program needs at least two files :
- A first file describing classes and attributes, with
suffix .names.
- Another file containing the data, with suffix .data.
- A test file with other examples can also be provided; its
extension is .test. Its format is the same as that of the .data
file.
Two example datasets are provided in subdirectories :
- ./shortExample : the example described below.
- ./ReutersMini : 1100 examples from the Reuters corpus
(Reuters 21450),
formatted to be handled by adtree.
The file format is very close to Ross Quinlan's C4.5
format [Qui93]; in fact, all the files from the
UCI repository [MM98]
can be processed
by adtree. In addition, datasets in which examples may belong to
multiple classes are also recognized and handled by adtree.
The base name must be the same for all three files.
In all files, a vertical bar (|) marks the beginning of
a comment, which continues to the end of the line.
Comments and blank lines can be inserted anywhere.
A line can end with a '.', but this is not compulsory. If the last line
of the file ends with a '.', it must be followed by a carriage return.
4.0.1 Format of the .names file
The first line lists the names of the classes, separated by
commas. A class name can use any character except the comma; to put a
comma in a class name, one must precede it with a backslash (\).
Each of the following lines describes an attribute, in the same order
as they will appear in the data files.
An attribute description begins with the name of the attribute (any
sequence of characters), followed by ':' and the attribute type.
The attribute type is one of the following :
- ignore : this attribute will be used neither to construct
the tree nor to classify an example.
- discrete n : this attribute takes one of n values, which
remain unknown until the data file is read. An error message is
issued if more than n values are found while reading the data
file. Rarely used, but useful for a first approximation of tree
building.
- continuous : the attribute is a number; comparison tests are
possible on this type of attribute.
- the enumeration of all the discrete values the attribute can
take, comma separated.
- text : a text (with no commas in it). Tests consider the
occurrence or absence of a word in an example.
- scoredtext : a list of words, together with their number of
occurrences in the example. Handling of this attribute type is not
yet implemented; we plan to add it in order to be able to fully
compare adtree with boostexter [SS00].
| An example of .names file (this line is a comment)
velo,auto,ballon à gaz,fusée.
nombre de roues : continuous.
marque : ignore.
inconvenients : discrete 6.
destination : nulle part,ecole,la mer,ciel,lune,mars
4.0.2 Format of the .data file
An example consists of the list of its attribute values, in the
same order as they were declared in the .names file, comma
separated.
The attribute values are followed by the list of all the classes the
example belongs to, each class name separated from the next by a
comma (this is a difference between adtree and boostexter).
The following .data file corresponds to the above .names file.
The second example belongs to two classes; the sixth example belongs to no class.
2,Eddy Merckx,crevaison,ecole,velo.
4,Peugeot,crevaison,nulle part,velo,auto
4,Opel,consomme de l'essence,la mer,auto
0,Mongolfier,atterrit n'importe où,ciel,ballon à gaz,fusée
0,Saturn 5,coûte cher,lune,fusée
8,Aldi,pas garanti,nulle part.
5 adtree
5.1 Basic commands
Provided the adtree binary is in your path, the following command :
adtree -f filename
will build a ten-node tree and display it in text mode.
For the short example above, the first three rules are :
Number of rules : 10
-------------------------
Rule 0
if TRUE
(Number of items verifying the precondition : 6)
Class : velo auto ballon fusée
If TRUE then -0.34 -0.34 -0.80 -0.34
Else 0
-------------------------
Rule 1
if True
(Number of items verifying the precondition : 6)
Class : velo auto ballon fusée
If nombre de roues >= 1.00 then 0.00 0.00 -2.63 -2.63
Class : velo auto ballon fusée
If nombre de roues < 1.00 then -2.29 -2.29 0.00 2.29
Else 0
-------------------------
Rule 2
If nombre de roues >= 1.00000
(Number of items verifying the precondition : 4)
Class : velo auto ballon fusée
If inconvenients =crevaison then 2.67 0.00 -1.38 -1.38
Class : velo auto ballon fusée
If inconvenients !=crevaison then -2.67 0.00 -1.38 -1.38
Else 0
The following figure shows the first four rules of the tree
built from the example.
5.1.1 Reading the results
The decision structure is a tree; each rule corresponds to a branch, or
part of one.
Branches are made of alternating confidence nodes and test
nodes. The root is a confidence node.
Classifying an example means sending it down the tree, following the
tests it encounters on its way, and summing the confidence values it
meets.
Rule 0 corresponds to the root of the tree : each example is credited
with -0.80 for the class ``ballon'', and -0.34 for the other classes.
Rule 1 is directly under the root; the coefficients collected
depend on the attribute ``nombre de roues'' (number of wheels). If the number of wheels
is less than one, the chance that the example belongs to class ``velo'' (bicycle)
decreases, which is shown by the -2.29 coefficient. On the other
hand, the chances of a ``fusée'' (rocket) increase.
Rule 2 sits deeper in the tree : it is used only for
examples having at least one wheel.
When an example has a missing value for an attribute used in a test,
it is split into fractions, and those fractions follow all the branches under the test.
5.2 adtree command line options
The command :
adtree -h
gives the list of available options.
Options:
usage : adtree -f file_name
Options :
-u : Evaluation on test set
-b #n : number of boosting steps (default 10)
-v : verbose
-h : this text
-n : non normalized weights (default : normalized)
-d : drawing a postscript visualization of the tree
-t : enable n_ary tests
-B : boostexter behavior
-r : random choice of precondition to develop
-O : display One-Error
-H : display Hamming loss
-C : display confusion matrix and class by class results
-A : display coverage and avg prec
Detail :
- -u : Evaluate errors and statistics on the provided test set
(.test extension).
- -b #n : Number of boosting steps. Default : 10.
- -v : Add verbosity to the output : mainly used for debugging, not
very useful for the normal user.
- -h : a reminder of the options.
- -n : Each example has weights (one per class); choosing
this option prevents the algorithm from normalizing the weights after each
boosting step.
- -d : Display a graphical representation of the tree. The dot binary,
part of the graphviz package, must be reachable; you can download it at :
http://www.research.att.com/sw/tools/graphviz.
This option creates two files :
- file.dot : a text description of the tree, used by
dot to build the image.
- file.ps : a PostScript file containing the image.
- -t : Enable N-ary tests. The default behaviour for discrete
tests is to only ask questions like ``Does this example have the value
x for this attribute or not?''. N-ary tests add questions like ``What
is the value of the example for this attribute?''.
- -B : Force adtree to behave like boostexter
(http://www.research.att.com/~schapire/BoosTexter/), which means
that all tests will be direct sons of the root node (no
tree structure). Nevertheless, two main differences remain
between adtree in boostexter mode and boostexter itself :
- adtree computes confidence values for the root
node, while boostexter sets those values to zero.
- boostexter is able to handle scoredtext attributes,
which are not yet implemented in adtree.
- -r : Instead of looking for the best attribute to use at the
best place in the tree, this option randomly chooses the place where
the next node will be created.
- -O : Display One-Error. An example is considered
misclassified if adtree assigns it to a class it does not
belong to.
- -H : Displays Hamming loss.
- -C : Displays a confusion matrix at each boosting step.
- -A : Displays coverage and average
precision.
6 xadtree
xadtree is a graphical interface that calls adtree to create
trees, classify examples and display statistics.
xadtree can load files, choose adtree's options, build and
visualize trees, show how an example goes through a tree... xadtree was written by Damien Devigne (devigne@lifl.fr),
a Maîtrise d'Informatique student at the University of Lille.
Development is still in progress; please contact us with remarks, bugs
and wishes.
6.1 Using xadtree
The command xadtree will launch the program.
6.1.1 Main Window
When the first window opens, the only possible action is to enter a file
name, without extension, in the Data field.
(We plan to allow users to enter different file names, but this is not
yet implemented : button Use different filenames.)
Once the name is entered and validated, the .data and .names files,
and the .test file if present, are loaded.
One can then :
- Inspect the example values (Data and Test tabs), or
the class names and attribute definitions (Names tab).
- Create a tree (button New Adtree) : a new window opens
(see below).
- Choose the options to be used while creating a tree (button Configure).
Many buttons react when the mouse passes over them, so they
could later be used for new functionalities (attribute
characteristics, for example...).
6.2 Configuration window
This window allows the user to choose among adtree's options :
- -u : Compute errors on the test set (the test set must be
present).
- -n : No weight normalization.
- -t : Enable N-ary tests.
- -r : Random choice of precondition.
- -v : Verbosity.
- -B : Boostexter behavior.
- -b #n : Number of boosting steps (limited to 1000).
Once the options are chosen, the user can, from left to right :
- Quit the window and create a tree according to the chosen
options.
- Quit the window and save the options, without creating a tree.
- Quit the window without saving the options.
6.3 Result window
This screen can be reached in two ways : from button New Adtree in the
main window, or from button Save and Create new tree in the configuration
window.
It consists of two or three parts, depending on whether a test set is
available :
- The tree.
- A Data window listing the numbers of the training set examples.
- A Test window listing the numbers of the test set
examples (if any).
This page also contains four buttons :
- Test item : to see the path followed by an example through
the tree.
- Clear path : to clear the path information and come back to the
original tree.
- Create PostScript : to display and save a PostScript
representation of the tree.
- Toggle Tree Display / Stats Display : to toggle between the
tree window and the statistics window.
All these buttons and windows are described below.
6.3.1 The tree window
Blue lines are confidence nodes, green lines are test nodes.
A confidence line is split into three zones. From left to
right :
- The rule number (the order in which the rules were created).
- The attribute used in this rule.
- The number of examples reaching this node.
A test line contains :
- A row describing the condition.
- For each class, the coefficient to apply at this point.
6.3.2 Data and Test Windows
Both windows are simply lists of increasing numbers : the ranks
of the examples in the data files.
Selecting an example makes the button Test item active.
6.3.3 Test Item
When an example is selected in either the Data or the Test window,
one can visualize the path it follows through the tree.
Green branches are followed by the example; red branches are not.
If an example has a missing value for a test in the tree, all the
branches are green : the example follows all of them.
The value inside the brackets (at the top of the tree) is the maximum
value of the coefficient vector. If the classifier is required to
give only one class per example, this corresponds to the
class found by adtree.
6.3.4 Clear path
Clears the path information and returns to the original tree.
6.3.5 Create PostScript
Calls a function which :
- Creates a text description of the tree, usable by dot (in
file filename.dot).
- Displays a graphical view of the tree, with the same information
as in the tree window.
- Saves the image in a PostScript file (file.ps).
6.3.6 Toggle Tree Display / Stats Display
Toggles between the tree window and the statistics window.
The statistics window displays One-Error, Hamming loss, coverage
and average precision.
Confusion matrices also give class-by-class information.
References
- [CGT01] F. De Comité, R. Gilleron, and M. Tommasi.
Learning multi-label alternating decision trees and applications.
In Gilles Bisson, editor, Proceedings of CAP'01 : Conférence en
Apprentissage Automatique, pages 195--210, 2001.
- [FM99] Yoav Freund and Llew Mason.
The alternating decision tree learning algorithm.
In Proc. 16th International Conf. on Machine Learning, pages
124--133, 1999.
- [MM98] C.J. Merz and P.M. Murphy.
UCI repository of machine learning databases, 1998.
- [Qui93] J. R. Quinlan.
C4.5: Programs for Machine Learning.
Morgan Kaufmann, San Mateo, CA, 1993.
- [SS00] Robert E. Schapire and Yoram Singer.
BoosTexter: A boosting-based system for text categorization.
Machine Learning, 39(2/3):135--168, 2000.
This document was translated from LaTeX by HEVEA.