ISSA is a system integrating different structural mining tasks (only the sequencial case for the moment) by using a common unifying framework: a Galois lattice adapted to sequential data. After characterizing the Galois lattice, it is possible to calculate either partial orders, or frequent sequential patterns, or also a new notion of association rules with order by just traversing its nodes. Our aim is providing a visual tool based on a sound theory that can be useful in different contexts of the analys of structures.

Introduction

AN EXAMPLE IMAGE

Mining structured data is a challenging but very relevant problem in the area of knowledge discovery. Beyond the plain relational model, the most basic structure that data can exhibit corresponds to the sequential categorical domain, that is, the input consists of a list of transactions where each transaction is a symbolic sequence. Such description is common in many contexts such as web usage data, DNA sequences, plan execution traces, texts, recurrent illnesses and so on. Formally, this data was first introduced by Agrawal and Srikant (1995), and it can be viewed as a set of input sequences, where each sequence is a list <(i1) ... (in)> such that ii is an item (or attribute) occurring before another item ij when i < j. A simple example of this data is presented in figure 1, with four input sequences D={d1,d2,d3,d4} where items are presented in capital letters.

Seq id

Sequence

d1

<(A)(C)(D)(A)(A)(C)(A)>

d2

<(C)(B)(C)(A)(C)>

d3

<(C)(B)(A)(B)(C)(C)(A)(A)>

d4

<(A)(B)(C)(D)(C)(C)(A)(A)>

 

Fig 1. Example of a collection of sequential data

Associated to the analysis of these sequences there are different tasks, whose results provide a global knowledge about the domain.  Among all the tasks that one could imagine on such data, we are interested in the following ones: mining frequent sequential patterns and the so-called closed sequences, summarizing the data by means of the most specific partial orders, mining association rules with order, or clustering input sequences.  We introduce each one of these tasks in the next subsections.


Mining Sequential Patterns

The classical sequential mining problem aims at discovering frequent subsequences in the database, that is, all those subsequences whose number of occurrences is over a user-specified threshold. The number of occurrences of a sequence, also called support, is counted as the number of input sequences (transactions) containing that sequence; e.g. in figure 1, <(B)(C)(A)> occurs in the second, third and fourth transaction, so that it has support 3. These frequent patterns can be useful in many domains, for instance in anomaly detection for computer security (Lee, Stolfo, and Mok, 1999).

Managing sequential patterns and counting their support in a database is a quite challenging task, since one needs to examine a combinatorially explosive number of possible frequent patterns. Many studies have contributed with algorithms to this problem (e.g. Garofalakis, Rastogi, and Shim, 1999; Pei et al, 2001; Srikant and Agrawal, 1996; or Zaki, 2001, among many others). Unfortunately, there are important cases where the number of frequent patterns is too large for a thorough examination and the algorithms face several computational problems; these include the cases of considering a very low threshold or a dense database. A proper solution to this problem was initially proposed in (Yan, Han, and Afshar, 2003), and it consists on mining just a compact and more significative set of sequential patterns called the frequent closed sequential patterns. Closed sequential patterns are those not extendable to others with the same support. For e.g. <(A)(C)> is a closed sequence for data from figure 1 since it is maximal with support 4; however, <(B)(C)> is not closed since it can be extended to the longer <(B)(C)(A)> with the same support, 3. 

Mining Partial Orders

Alternatively to the mining of sequential patterns, another approach was proposed in (Mannila, Toivonen, and Verkamo, 1997); it consists in describing portions of the input data with so-called episodes. An episode is described as a directed acyclic graph, where the relation between nodes is reflexive, antisymmetric and transitive; moreover, there is a labelling function assigning names (corresponding to items) to each one of the nodes. Episodes can be classified into serial episodes (total orders), parallel episodes (trivial orders) and hybrid episodes (indeed, general partial orders). An example of a hybrid episode, thus a general partial order, is depicted in following figure 2.

Fig 2. Example of a partial order


Partial orders will be depicted here by means of its transitive reduction to make them more understandable, but we must consider that all the edges of the transitive closure are included in the formalization. The graphical representation of a partial order is particularly useful for displaying results: we display the poset by using arrows between the connected labelled vertices, and the symbol úú (parallel) to indicate trivial order among the different components of a partial order.

Broadly speaking, we say that a partial order is compatible with a set of input sequences if the labels of the nodes from the poset occur in each one of the input sequences in the corresponding order given by the edges. For instance, the partial order given in figure 2 is compatible with the first, third and fourth input sequences from figure 1. The support of a partial order is the number of input sequences that the poset is compatible with, and a poset is frequent when its support is over a user-specified value. A poset p is said to be more specific than another poset p’ when it is possible to transform p into p’ by removing edges or vertices. This can be formalized also by defining an injective label-preserving function that maps edges and vertices of p’ to edges and vertices of p respectively. E.g. the partial order from figure 2 is more specific than the trivial order ||A,C||, but more general than the total order A→C→A→C→A; moreover, this poset in figure 2 is also the most specific one compatible with first, third and fourth input sequences from data in figure 1, i.e., there is no other poset summarizing with more information these three input sequences.

The goal is, then, to describe the input sequential data by means of the most specific partial orders with support over the user-specified threshold. However, the problem arises with the complexity of managing these structures and the combinatorial explosion to tackle all the cases. As a way of example, the classical approach called Winepi, presented in (Mannila, Toivonen, and Verkamo, 1997), looks for such structures in a Apriori fashion: i.e., with each complete pass along the data, the algorithm computes the support of current candidates and, after each pass, it generates new partial orders as long as the antimonotonicity property of support keeps them alive.  This natural approach performs two complex operations: first, generating new candidates to partial orders out of smaller ones, and second, identifying partial orders in each transaction to update the support. So, the algorithm obviously incurs in a substantial overhead. In a similar line, another work summarizing sequential data with partial orders is (Mannila and Meek, 2000). There, partial orders are seen as generative models for sequences, and the resulting ordering relationships get a global overview of the input data. However, the authors must restrict the problem by just dealing with the specific case of serial-parallel partial orders and avoid the repetitions of items in the input sequences to make the task more tractable. In general, finding partial orders directly from the data  is a complex task. Later we will show how to address such problems with the Galois lattice.

Mining Association Rules with Order

Another natural way of extracting knowledge from data is to look for causal relations, where the presence of a fact suggests another fact. Indeed, mining association rules is a very popular task for binary databases (i.e., data is unordered and each transaction can be regarded as a bit vector of 0/1 assigments to attributes); for example, in market basket data it is quite natural to represent the purchasing patterns of the costumers by means of logical implications. The general interest is to find groups of products that, bought together, imply the purchase of another group of products with a certain confidence. However, although mining association rules is a very succesful task, little work has been done to provide a similar notion in case of having categorical sequences.  Usually, the studies with order in association rules involve the mining for temporal implications, such as in (Özden, Ramaswamy, and Silberschatz, 1998), where only a categorized temporal attribute is added to the binary data. However, here we are interested in dealing directly with sequences, that is, attributes ordered along a time line. A first antecedent to our proposal is the work of (Mannila, Toivonen, and Verkamo, 1997), where total orders can be seen as sequences, and then an association rule is a fragment of order implying another fragment of order, so that the antecedent is a subsequence of the consequent. Here we will consider a generalization of this idea, and we will present a novel notion of association rules with order where a set of fragments of order implies an individual fragment of order; moreover, in our proposal the sequences in the antecedent may not be a subsequence of the consequent, which makes more informative the predictive rule.

Clustering Sequences

Clustering is the task of grouping together input objects into meaningful subclasses. In this work we want to focus on clustering sequential data, i.e. each object is represented as a sequence of items. This is an interesting task for example, when considering order in the subsequent shopping bags of the same customer in the case of market basket data, so that the different final clusters lead to groups of customers with similar purchasing patterns. One of the key steps in clustering algorithms is the method for computing the similarity between the objects being clustered. The work in (Guralnik and Karypis, 2001) considers as discriminating features all the sequential patterns over a certain support with length between two values given by the user. The critical step is then to project the new incoming sequences into the new feature space; to avoid this overhead, the authors in the mentioned work decide to restrict the space to a reduced set of features. Our proposal will be to achieve the clustering of input sequences naturally, by means of the Galois lattice; we will show that each one of the nodes of our lattice represents in fact a set of discriminating features, and this will lead to an organization of the input sequences into hierarchical clusters.

Nowadays, these tasks of the sequence analysis are used independently (if at all), that is, they are not integrated in a system with a common framework that eases the comparison of different results. Our next study will present a unifying framework for the study of the different presented tasks. This framework is only based on a closure system of sequences, that is, a Galois lattice adapted to ordered data. We will see that this system provides a tool, named ISSA, for the analysis of sequential categorical data.


Next: GALOIS LATTICE FOR ORDERED CONTEXTS