
ISSA is a system integrating different structural mining tasks (only the sequencial case for the moment) by using a common unifying framework: a Galois lattice adapted to sequential data. After characterizing the Galois lattice, it is possible to calculate either partial orders, or frequent sequential patterns, or also a new notion of association rules with order by just traversing its nodes. Our aim is providing a visual tool based on a sound theory that can be useful in different contexts of the analys of structures.
Introduction
Mining structured data is a challenging but very relevant problem in the area of knowledge discovery. Beyond the plain relational model, the most basic structure that data can exhibit corresponds to the sequential categorical domain, that is, the input consists of a list of transactions where each transaction is a symbolic sequence. Such description is common in many contexts such as web usage data, DNA sequences, plan execution traces, texts, recurrent illnesses and so on. Formally, this data was first introduced by Agrawal and Srikant (1995), and it can be viewed as a set of input sequences, where each sequence is a list <(i1) ... (in)> such that ii is an item (or attribute) occurring before another item ij when i < j. A simple example of this data is presented in figure 1, with four input sequences D={d1,d2,d3,d4} where items are presented in capital letters.
|
Seq id |
Sequence |
|
d1 |
<(A)(C)(D)(A)(A)(C)(A)> |
|
d2 |
<(C)(B)(C)(A)(C)> |
|
d3 |
<(C)(B)(A)(B)(C)(C)(A)(A)> |
|
d4 |
<(A)(B)(C)(D)(C)(C)(A)(A)> |
Fig 1. Example of a
collection of sequential data
Associated to the analysis of
these sequences there are different tasks,
whose results provide a global knowledge about the domain. Among all the tasks that
one could imagine on
such data, we are interested in the following ones: mining frequent
sequential
patterns and the so-called closed sequences, summarizing the data by
means of
the most specific partial orders, mining association rules with order,
or
clustering input sequences. We
introduce
each one of these tasks in the next subsections.
Mining
Sequential Patterns
The classical sequential mining
problem aims at discovering frequent
subsequences in the database, that is, all those subsequences whose
number of
occurrences is over a user-specified threshold. The number of
occurrences of a
sequence, also called support, is
counted as the number of input sequences (transactions) containing that
sequence; e.g. in figure 1, <(B)(C)(A)> occurs in the second, third and
fourth
transaction, so that it has support 3. These frequent patterns can be
useful in
many domains, for instance in anomaly detection for computer security
(Lee,
Stolfo, and Mok, 1999).
Mining
Partial Orders
Alternatively to the mining of
sequential patterns, another approach was
proposed in (Mannila, Toivonen, and Verkamo, 1997); it consists in
describing
portions of the input data with so-called episodes. An episode is
described as
a directed acyclic graph, where the relation between nodes is
reflexive,
antisymmetric and transitive; moreover, there is a labelling function
assigning
names (corresponding to items) to each one of the nodes. Episodes can
be
classified into serial episodes (total orders), parallel episodes
(trivial
orders) and hybrid episodes (indeed, general partial orders). An
example of a
hybrid episode, thus a general partial order, is depicted in following
figure
2.

Fig 2. Example of a partial order
Partial orders will be depicted
here by means of its transitive reduction
to make them more understandable, but we must consider that all the
edges of
the transitive closure are included in the formalization. The graphical
representation of a partial order is particularly useful for displaying
results: we display the poset by using arrows between the connected
labelled
vertices, and the symbol úú (parallel) to indicate trivial
order
among the different components of a partial order.
Broadly speaking, we say that a
partial order is compatible with a
set of input sequences if the labels of the nodes
from the poset occur in each one of the input sequences in the
corresponding
order given by the edges. For instance, the partial order given in
figure 2 is
compatible with the first, third and fourth input sequences from figure
1. The
support of a partial order is the number of input sequences that the
poset is
compatible with, and a poset is frequent when its support is over a
user-specified value. A poset p is
said to be more specific than
another
poset p’ when it is
possible to
transform p into p’
by removing edges or vertices. This can be formalized also by
defining an injective label-preserving function that maps edges and
vertices of
p’ to edges and vertices
of p respectively. E.g. the
partial order
from figure 2 is more specific than the trivial order ||A,C||, but more
general
than the total order A→C→A→C→A;
moreover, this poset in figure
2 is
also the most specific one compatible with first, third and fourth
input
sequences from data in figure 1, i.e., there is no other poset
summarizing with
more information these three input sequences.
The goal is, then, to describe the
input sequential data by means of the
most specific partial orders with support over the user-specified
threshold.
However, the problem arises with the complexity of managing these
structures
and the combinatorial explosion to tackle all the cases. As a way of
example,
the classical approach called Winepi, presented in (Mannila, Toivonen,
and
Verkamo, 1997), looks for such structures in a Apriori fashion: i.e.,
with each
complete pass along the data, the algorithm computes the support of
current
candidates and, after each pass, it generates new partial orders as
long as the
antimonotonicity property of support keeps them alive.
This natural approach performs two complex
operations: first, generating new candidates to partial orders out of
smaller
ones, and second, identifying partial orders in each transaction to
update the
support. So, the algorithm obviously incurs in a substantial overhead.
In a
similar line, another work summarizing sequential data with partial
orders is
(Mannila and Meek, 2000). There, partial orders are seen as generative
models
for sequences, and the resulting ordering relationships get a global
overview
of the input data. However, the authors must restrict the problem by
just
dealing with the specific case of serial-parallel partial orders and
avoid the
repetitions of items in the input sequences to make the task more
tractable. In
general, finding partial orders directly from the data
is a complex task. Later we will show how to
address such problems with the Galois lattice.
Mining
Association Rules
with Order
Another natural way of extracting
knowledge from data is to look for causal
relations, where the presence of a fact suggests another fact. Indeed,
mining
association rules is a very popular task for binary databases (i.e.,
data is
unordered and each transaction can be regarded as a bit vector of 0/1
assigments to attributes); for example, in market basket data it is
quite
natural to represent the purchasing patterns of the costumers by means
of
logical implications. The general interest is to find groups of
products that,
bought together, imply the purchase of another group of products with a
certain
confidence. However, although mining association rules is a very
succesful
task, little work has been done to provide a similar notion in case of
having
categorical sequences. Usually,
the
studies with order in association rules involve the mining for temporal
implications, such as in (Özden, Ramaswamy, and Silberschatz,
1998), where only a categorized
temporal attribute is added to the binary data. However, here we are
interested
in dealing directly with sequences, that is, attributes ordered along a
time
line. A first antecedent to our proposal is the work of (Mannila, Toivonen, and Verkamo,
1997), where
total orders can be seen as sequences, and then an association rule is
a
fragment of order implying another fragment of order, so that the
antecedent is
a subsequence of the consequent. Here we will consider a generalization
of this
idea, and we will present a novel notion of association rules with
order where
a set of fragments of order implies an individual fragment of order;
moreover,
in our proposal the sequences in the antecedent may not be a
subsequence of the
consequent, which makes more informative the predictive rule.
Clustering
Sequences
Clustering is the task of grouping
together input objects into meaningful
subclasses. In this work we want to focus on clustering sequential
data, i.e.
each object is represented as a sequence of items. This is an
interesting task
for example, when considering order in the subsequent shopping bags of
the same
customer in the case of market basket data, so that the different final
clusters lead to groups of customers with similar purchasing patterns.
One of
the key steps in clustering algorithms is the method for computing the
similarity
between the objects being clustered. The work in (Guralnik and Karypis, 2001)
considers as
discriminating features all the sequential patterns over a certain
support with
length between two values given by the user. The critical step is then
to project
the new incoming sequences into the new feature space; to avoid this
overhead,
the authors in the mentioned work decide to restrict the space to a
reduced set
of features. Our proposal will be to achieve the clustering of input
sequences
naturally, by means of the Galois lattice; we will show that each one
of the
nodes of our lattice represents in fact a set of discriminating
features, and
this will lead to an organization of the input sequences into
hierarchical
clusters.
Nowadays, these tasks of the
sequence analysis are used independently
(if at all), that is, they are not integrated in a system with a common
framework that eases the comparison of different results. Our next
study will
present a unifying framework for the study of the different presented
tasks.
This framework is only based on a closure system of sequences, that is,
a
Galois lattice adapted to ordered data. We will see that this system
provides a
tool, named ISSA, for the analysis of sequential categorical data.