ISSA is a system integrating different structural mining tasks (only the sequencial case for the moment) by using a common unifying framework: a Galois lattice adapted to sequential data. After characterizing the Galois lattice, it is possible to calculate either partial orders, or frequent sequential patterns, or also a new notion of association rules with order by just traversing its nodes. Our aim is providing a visual tool based on a sound theory that can be useful in different contexts of the analys of structures.


ISSA Performance

The preliminary implemented version of  ISSA is programmed in C++ using STL libraries, under a unix platform. For the moment ISSA is just a command line program with options allowing the calculation of the different mining tasks. Note that in most of the cases it is necessary the visualization of the results, such as the final partial orders, or the hierarchical clusters of the input sequences. For those cases, ISSA outputs the necessary files in a stardard GraphML format (see http://graphml.graphdrawing.org/) , that can be visualized through any tool for graph visualizations, such as  Yed-Java Graph Editor (see http://www.yworks.com/en/products_yfiles_about.htm) , that we are currently using for our experimentations.

We distinguish here three phases to analyze the performance of the system: 1/ identifying the frequent stable sequences from the lattice; 2/ organizing stable sequences into the lattice and outputting results for the desired mining task; and finally  3/ visualizing the results. The first phase is the most I/O intensive since it requires to examine a computationally explosive number of patterns before identifying the stable ones. We are currently running our own implementation for this phase, which is an Apriori-like algorithm; however, there are other strategies that get better results thanks to the use of wise data structures, such as CloSpan of (Yan, Han, and Afshar, 2003) or BIDE of (Wang and Han, 2004). Any of these other algorithms can be used for this first phase, and we refer to those papers for a thorough comparison of their memory management and performance. The second phase of our approach requires the organization of the stable sequences into the lattice structure. This is not an intensive step: the input to be examined is only the set of frequent stable sequences, and we do not require to combine those patterns, just organizing them in the lattice structure; moreover, the operations performed to get such lattice are simply standard operations of sets, such as inclusion, intersection and so on. The construction of the lattice and calculation of the necessary outputs takes usually a few seconds and a not significant use of memory. Finally, the third phase is the visualization of the results such as the partial orders or the clustering of input sequences (that is, the lattice). ISSA outputs the necessary GraphML file, so that we can visualize it by means of  the Yed-Java Graph Editor, whose performance is not dependant of our program.

EXPERIMENTAL RESULTS

We evaluate our approach by performing experiments on two different discrete sequential databases: a first database of 50000 transactions of synthetic data, generated by means of a probabilistic context-free grammar; and a second smaller database of 1382 transactions of corresponding to the command history of a unix computer user (downloaded from the UCI repository at http://kdd.ics.uci.edu/summary.data.type.html).

Our first data comes from running a context-free grammar as a generative model of sentences in Spanish language. We start from an initial set of symbols that our synthetic generator will combine to construct the sentences following the grammar; some of these symbols are: A meaning a feminine noun, B is a masculine noun, C is an intransitive irregular verb, D is a transitive regular verb, J is an adverb, K is a preposition, L is an exclamation, M is a modifier, X is a feminine adjective coming from a verb, V indicates the past tense of a verb and so on. An example of a sequence generated by this grammar is:  < MA  F   X   A   Q   D   N   BES   HOS   SE   F >; so, each one of the items might be instantiated with the proper vocabulary to provide meaning to the sequence combination. Indeed, each one of our sequences can be seen as a very coarse analogue to the grammatical analysis performed by the NLP (Natural Language Processing) when tagging a document text with the corresponding grammatical codes.

A first set of numerical comparisons obtained with the synthetic data are showed in the following table 6. We compare, for different minimum supports, the number of frequent sequences found in the data, the number of stable sequences (closed sequences) and the number of partial orders coming from grouping stable sequences into nodes. These are not forming necessarily a closure system; in case we wanted a closure system we still would need to compute the intersections, as mentioned above.


Minimum support

Frequent sequences

Stable sequences

Partial orders

10000

33

29

28

5000

130

107

104

3000

407

343

326

2000

935

733

705

1000

3415

2647

2517

Tab 6. Counting patterns for the synthetic data.

We observe that  the number of obtained partial orders is always less than or equal to the  number of closed sequential patterns; moreover, as we decrease the minimum support and we get more frequent sequences, the difference with the partial orders gets  still higher. Actually, we observed that, at a small enough minimum support, it may happen that the number of frequent sets and closed sequences becomes larger than the number of transactions in this database; however, the number of partial orders gets more stable with this kind of data and it does not beat this limit.

A portion of the lattice containing the partial orders generated by ISSA is showed in figure 7. Each one of the nodes represents the partial order that can be extracted when considering the corresponding stable sequences as their maximal paths. The number indicated along each node corresponds to the number of transactions where the partial order is compatible (since depicting the tid list, as usual, may be quite large in this case). In this small portion of the lattice, we can see that the transitive verb is the starting point of any order; then it is followed by the combination of a masculine adjective coming from a verb, and a masculine noun in plural.

Fig 7. Example of a small portion of the lattice with synthetic data

We also performed experimentation with the command unix data. The same statistics are showed in following table 8.

Minimum support

Frequent sequences

Stable sequences

Partial orders

50

176

175

175

30

782

765

762

20

3150

2892

2729

Tab 8. Counting patterns for the command unix data.

In this case we did not force the minimum support condition less than 20, since the number of discovered patterns clearly overcame the 1382 initial transactions. As it was to be expected, the number of partial orders coming from the nodes of the lattice, not forming a closure system, is always less than the number of stable sequences and frequent sequences. Some partial orders extracted from this dataset are showed in the piece of lattice of figure 9.

Fig 9. Example of a small portion of the lattice with unix command data


Interestingly enough, when dealing with unix user data, the frequent closed partial orders may be later used as the normal user profile for the intrusion detection systems (such as proposed in by Lee W., Stolfo S.J., and Mok K.W. (1999). However, these closed posets may have other utilities in the field of knowledge discovery depending on the context.


CONCLUSIONS

In this paper we analyze some of the most important tasks for the analysis of sequential data by means of a Galois lattice.  We have seen that this properly generated lattice provides a convenient unifying framework: it allows the generation of the most specific partial orders summarizing the data, the set of association rules with order, and a way of clustering input sequences, by just traversing the nodes of the lattice. Given the theoretical power of this lattice we decided to integrate all the results in a system named ISSA. Basically, this system constructs the lattice by means of the set of frequent stable sequences, and it allows the extraction of different knowledge that can be later used to take decisions over the specific domain. As the most important task, ISSA allows the derivation of the most specific partial orders summarizing the input sequences without mining the data, just by overlapping wisely the positions of the sequences in a closed set. For the moment, our system is a simple command line program, which keeps the lattice in a plain text file to be used as a bridge from one task to another. All the graphical outputs of ISSA, such as the partial orders or the clusters of input sequences, can be viewed through any standard graph visualization tool accepting GraphML as input. Finally, we evaluated the performance of ISSA with preliminary experiments, which show a promising behavior of our proposals.


FUTURE WORK

The next step to complete our proposal is to incorporate into ISSA the visualization subsystem, in order to compute the results and seamlessly presenting them to the user in a graphical form. This will not affect the functionality of the system, but it is likely to improve the usability, for instance the comparison of the different forms of knowledge.  We would like ISSA to be a fully functional tool to analyze sequential data, of which there are many to analyze numerical or symbolic data (e.g. Weka, SAS, DBMiner …). Once this basic tool is available, an important natural step is to allow incremental updates of the lattice with the arrival of new input sequences to the domain. We want the previous computed results to be incremented just with the new knowledge, without the need of recomputing the whole lattice from the start.

Finally, an important extension of our work is to allow the same framework for other complex structured input data, such as graphs, molecules and so on. We think that a Galois lattice storing just the important information, such as paths of graphs, can be used in the same way we showed here for sequences. We would like that just the combination of information of the nodes would lead to characterizations in generally structured contexts.


REFERENCES

Agrawal R., and Srikant R. (1995). Mining Sequential Patterns. Proceedings of the 11th Int. Conference on Data Engineering. 3-14.

Balcázar J.L., and Casas-Garriga G. (2005). On Horn axiomatizations for sequential data. Proceedings of the 10th Int. Conference on Database Theory. 215-229.

Carpineto C., and Romano G. (2004). Concept Data Analysis: Theory and Applications. John Wiley & Sons, Ltd.

Carpineto C., and Romano G. (1993). GALOIS: An Order-Theoretic Approach to Conceptual Clustering. Proceedings of the Int. Conference on Machine Learning. 33-40.

Casas-Garriga G. (2003). Towards a formal framework for mining general patterns from ordered data. KDD Int. Worshop on Multirelational Data Mining. 14-26.

Casas-Garriga G., and Balcázar J.L. (2004). Coproduct Transformations on Lattices of Closed Partial Orders. In 2nd International Conference on Graph Transformation. 336-352.

Casas-Garriga G. (2005). Summarizing sequential data with closed partial orders. To appear in 5th Int. Conference on SIAM Data Mining.

Davey B.A and Priestly H.A. (2002). Introduction to Lattices and Order. Cambridge.

Ganter B., and Wille R. (1998). Formal Concept Analysis. Mathematical Foundations. Springer.

Garofalakis M., Rastogi R., and Shim K. (1999). Spirit: Sequential pattern mining with regular expression constraints. Proc. of the 25th Int. Conf. Very Large Data Bases. 223–234.

Guralnik V., and Karypis G. (2001). A Scalable Algorithm for Clustering Sequential Data. Proceedings of Int. Conference on Data Mining. 179-186.

Lee W., Stolfo S.J., and Mok K.W. (1999). A data mining framework for building intrusion detection models. Proc. of the Symposium on Security and Privacy. 120-132

Mannila H., and Meek C. (2000) Global partial orders from sequential data. Proceedings of the Int. Conference on Knowledge discovery and data mining. 161-168.

Mannila H., Toivonen H., and Verkamo A.I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), 1997.

Özden B., Ramaswamy S., and Silberschatz A. (1998). Cyclic association rules. Proceedings of the 14th Int. Conference on Data Engineering. 412-421.

Pei J., Han J., Mortazavi-Asl B., Pinto H., Chen Q., Dayal U., and Hsu M-C. (2001). PrefixSpan: Mining Sequential Patterns Efficiently by PrefixProjected Pattern Growth. Proceedings of the 17th Int. Conference on Data Engineering. 215-224.

Srikant R., and Agrawal R. (1996). Mining sequential patterns: Generalizations and performance improvements. Proc. of the 5th Int. Conf. Extending Database Technology. 3-17.

Wang J., and Han J. (2004). BIDE: Efficient mining of frequent closed sequences. Proc. of the 20th Int. Conference on Data Engineering. 79-90.

Yan X., Han J., and Afsjar R. (2003). CloSpan: Mining Closed Sequential Patterns in Large Datasets. Proceedings of the Int. Conference SIAM Data Mining. 166-177.

Zaki H.J., and Ogihara M. (1998). Theoretical Foundations of Association
Rules. SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

Zaki M.J. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences Machine Learning Journal (Doug Fisher, ed.), 42(1/2). 31-60.