Document clustering

History

To demonstrate the benefits of using XML as part of my PhD work (report.ps.gz), I wrote a program in 1998 to cluster AML (Astronomical Markup Language) documents. It used both the meaningful links between the documents and the keywords associated with them, applied a noising partitioning technique, and displayed the result on a topic map. The documents could be retrieved automatically from various sources, starting from an initial document and following the AML links to the related documents. It was a success but, like much cool PhD software, it disappeared from the web because it could no longer be maintained.

In 2004, I needed a program to cluster other documents and couldn't find any free software for this simple task. I decided to resurrect the project, and found a way to specify the list of documents, keywords and links in an external XML document. This way, it now works for any collection of documents, even non-XML ones.

Using the program

Here is a sample document list, with keywords and links. The DTD is included in the package.

<DOCLIST>
    <DOCUMENT id="108">
        <URL>section2_1_2_7_APPRENDRE.html</URL>
        <TITLE>Vitesse orbitale</TITLE>
        <KEYWORDS>
            <KEYWORD>KEPLER</KEYWORD>
            <KEYWORD>MASSE</KEYWORD>
            <KEYWORD>MOUVEMENT</KEYWORD>
            <KEYWORD>TRAJECTOIRE</KEYWORD>
            <KEYWORD>VITESSE</KEYWORD>
        </KEYWORDS>
        <LINKS>
            <LINK toid="110"/>
        </LINKS>
    </DOCUMENT>
</DOCLIST>
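
For readers who want to generate or check such a list from a script, here is a minimal sketch in Python that reads the sample above with the standard library. It is not part of the package: the function name parse_doclist and the dictionary layout are assumptions made for illustration, and only the element and attribute names shown in the sample are relied upon.

```python
import xml.etree.ElementTree as ET

# The sample document list from above, embedded so the sketch is self-contained.
SAMPLE = """<DOCLIST>
    <DOCUMENT id="108">
        <URL>section2_1_2_7_APPRENDRE.html</URL>
        <TITLE>Vitesse orbitale</TITLE>
        <KEYWORDS>
            <KEYWORD>KEPLER</KEYWORD>
            <KEYWORD>MASSE</KEYWORD>
            <KEYWORD>MOUVEMENT</KEYWORD>
            <KEYWORD>TRAJECTOIRE</KEYWORD>
            <KEYWORD>VITESSE</KEYWORD>
        </KEYWORDS>
        <LINKS>
            <LINK toid="110"/>
        </LINKS>
    </DOCUMENT>
</DOCLIST>"""


def parse_doclist(xml_text):
    """Hypothetical helper: map each document id to its URL, title, keywords and links."""
    root = ET.fromstring(xml_text)
    docs = {}
    for doc in root.findall("DOCUMENT"):
        docs[doc.get("id")] = {
            "url": doc.findtext("URL", default=""),
            "title": doc.findtext("TITLE", default=""),
            "keywords": [k.text for k in doc.findall("KEYWORDS/KEYWORD")],
            "links": [link.get("toid") for link in doc.findall("LINKS/LINK")],
        }
    return docs


docs = parse_doclist(SAMPLE)
print(docs["108"]["title"], docs["108"]["links"])
```

A check of this kind can catch a LINK whose toid does not match any DOCUMENT id before the clustering program is launched.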

When the document list is ready, the clustering program can be launched (just double-click on Clustering.jar).

clustering.png

The clustering algorithm first spreads the documents randomly on the grid, then moves them to progressively reduce the "cost". After a while, it stops and records the result in a grid.xml file.
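
The general idea of random placement followed by cost-reducing moves can be sketched in Python as below. This is only an illustration under stated assumptions, not the program's actual method: this page does not define the cost function, so the sketch assumes a similarity of "shared keywords, plus 1 for a link" and a cost equal to similarity-weighted grid distance, and it uses a plain greedy random-move descent rather than the noising technique. The keywords and links of the demo documents are made up.

```python
import random


def similarity(a, b):
    """Assumed measure: number of shared keywords, plus 1 if either document links to the other."""
    shared = len(set(a["keywords"]) & set(b["keywords"]))
    linked = a["id"] in b["links"] or b["id"] in a["links"]
    return shared + (1 if linked else 0)


def cost(docs, pos):
    """Assumed cost: similarity-weighted Manhattan distance, summed over all document pairs."""
    total = 0
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            (x1, y1), (x2, y2) = pos[docs[i]["id"]], pos[docs[j]["id"]]
            total += similarity(docs[i], docs[j]) * (abs(x1 - x2) + abs(y1 - y2))
    return total


def cluster(docs, width, height, steps=2000, seed=0):
    """Greedy descent: keep a random move or swap only if it does not increase the cost."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(width) for y in range(height)]
    # Step 1: spread the documents randomly on the grid, one per cell.
    pos = dict(zip((d["id"] for d in docs), rng.sample(cells, len(docs))))
    current = cost(docs, pos)
    for _ in range(steps):
        # Step 2: try moving one document to a random cell, swapping if the cell is occupied.
        doc_id = rng.choice(docs)["id"]
        target = rng.choice(cells)
        old = pos[doc_id]
        occupant = next((k for k, v in pos.items() if v == target), None)
        pos[doc_id] = target
        if occupant is not None:
            pos[occupant] = old
        new = cost(docs, pos)
        if new <= current:
            current = new
        else:
            # The move made things worse: put everything back.
            pos[doc_id] = old
            if occupant is not None:
                pos[occupant] = target
    return pos, current


# Made-up demo data: only the ids 108 and 110 appear in the sample document list.
demo_docs = [
    {"id": "108", "keywords": ["KEPLER", "VITESSE"], "links": ["110"]},
    {"id": "110", "keywords": ["KEPLER", "MASSE"], "links": []},
    {"id": "200", "keywords": ["GALAXIE"], "links": []},
]
positions, final_cost = cluster(demo_docs, width=4, height=4)
print(positions, final_cost)
```

With this assumed cost, documents that share keywords or links are pulled into neighbouring cells, while unrelated documents can end up anywhere on the grid. A pure greedy descent can get stuck in a local minimum on larger collections, which is the kind of problem that noising and similar techniques are meant to address.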

The grid.xml file can then be displayed with the DispGrid applet, from an HTML file containing this code:

<applet code="dispgrid.DispGrid" archive="DispGrid.jar" width="100" height="100">
    <param name="url" value="http://server/grid.xml">
</applet>

dispgrid.png

Example

Here is an example of clustering applied to the documents of the Fenêtres sur l'Univers training course, with their links and keywords.

Topic map for the F.S.U. demo website

Download

The software is available under the GPL licence.

Clustering.tar.gz

Author: Damien Guillaume