
ISSA is a system integrating different structural mining tasks (only the sequencial case for the moment) by using a common unifying framework: a Galois lattice adapted to sequential data. After characterizing the Galois lattice, it is possible to calculate either partial orders, or frequent sequential patterns, or also a new notion of association rules with order by just traversing its nodes. Our aim is providing a visual tool based on a sound theory that can be useful in different contexts of the analys of structures.
The preliminary implemented
version of
ISSA is programmed in C++ using STL libraries, under a
unix platform.
For the moment ISSA is just a command line program with options
allowing the
calculation of the different mining tasks. Note that in most of the
cases it is
necessary the visualization of the results, such as the final partial
orders,
or the hierarchical clusters of the input sequences. For those cases,
ISSA
outputs the necessary files in a stardard GraphML format (see http://graphml.graphdrawing.org/)
, that can be visualized through any tool for graph visualizations,
such
as Yed-Java Graph
Editor (see http://www.yworks.com/en/products_yfiles_about.htm)
, that we are currently using for our experimentations. We distinguish here three
phases to analyze the performance of the
system: 1/ identifying the frequent stable sequences from the lattice;
2/
organizing stable sequences into the lattice and outputting results for
the
desired mining task; and finally 3/
visualizing the results. The first phase is the most I/O intensive
since it
requires to examine a computationally explosive number of patterns
before
identifying the stable ones. We are currently running our own
implementation
for this phase, which is an Apriori-like algorithm; however, there are
other
strategies that get better results thanks to the use of wise data
structures,
such as CloSpan of (Yan, Han, and Afshar, 2003) or BIDE of (Wang and
Han,
2004). Any of these other algorithms can be used for this first phase,
and we
refer to those papers for a thorough comparison of their memory
management and
performance. The second phase of our approach requires the organization
of the
stable sequences into the lattice structure. This is not an intensive
step: the
input to be examined is only the set of frequent stable sequences, and
we do
not require to combine those patterns, just organizing them in the
lattice
structure; moreover, the operations performed to get such lattice are
simply
standard operations of sets, such as inclusion, intersection and so on.
The
construction of the lattice and calculation of the necessary outputs
takes
usually a few seconds and a not significant use of memory. Finally, the
third
phase is the visualization of the results such as the partial orders or
the
clustering of input sequences (that is, the lattice). ISSA outputs the
necessary GraphML file, so that we can visualize it by means of the Yed-Java Graph Editor,
whose performance
is not dependant of our program. We
evaluate our approach by performing experiments on two different
discrete
sequential databases: a first database of 50000 transactions of
synthetic data,
generated by means of a probabilistic context-free grammar; and a
second smaller
database of 1382 transactions of corresponding to the command history
of a unix
computer user (downloaded from the UCI repository at http://kdd.ics.uci.edu/summary.data.type.html). Our first data comes from
running a context-free grammar as a generative
model of sentences in Spanish language. We start from an initial set of
symbols
that our synthetic generator will combine to construct the sentences
following
the grammar; some of these symbols are: A meaning a feminine noun, B is
a
masculine noun, C is an intransitive irregular verb, D is a transitive
regular
verb, J is an adverb, K is a preposition, L is an exclamation, M is a
modifier,
X is a feminine adjective coming from a verb, V indicates the past
tense of a
verb and so on. An example of a sequence generated by this grammar is: < MA
F X A
Q D N
BES HOS SE F
>; so, each one of the items might be instantiated with the
proper
vocabulary to provide meaning to the sequence combination. Indeed, each
one of
our sequences can be seen as a very coarse analogue to the grammatical
analysis
performed by the NLP (Natural Language Processing) when tagging a
document text
with the corresponding grammatical codes. A first set of numerical
comparisons obtained with the synthetic data
are showed in the following table 6. We compare, for different minimum
supports, the number of frequent sequences found in the data, the
number of
stable sequences (closed sequences) and the number of partial orders
coming
from grouping stable sequences into nodes. These are not forming
necessarily a
closure system; in case we wanted a closure system we still would need
to
compute the intersections, as mentioned above. Minimum support Frequent sequences Stable sequences Partial orders 10000 33 29 28 5000 130 107 104 3000 407 343 326 2000 935 733 705 1000 3415 2647 2517 Tab 6. Counting patterns for
the
synthetic data. We observe that
the number of
obtained partial orders is always less than or equal to the number of closed
sequential patterns;
moreover, as we decrease the minimum support and we get more frequent
sequences, the difference with the partial orders gets
still higher. Actually, we observed that, at
a small enough minimum support, it may happen that the number of
frequent sets
and closed sequences becomes larger than the number of transactions in
this
database; however, the number of partial orders gets more stable with
this kind
of data and it does not beat this limit. A portion of the lattice
containing the partial orders generated by ISSA
is showed in figure 7. Each one of the nodes represents the partial
order that
can be extracted when considering the corresponding stable sequences as
their
maximal paths. The number indicated along each node corresponds to the
number
of transactions where the partial order is compatible (since depicting
the tid
list, as usual, may be quite large in this case). In this small portion
of the
lattice, we can see that the transitive verb is the starting point of
any
order; then it is followed by the combination of a masculine adjective
coming
from a verb, and a masculine noun in plural. Fig 7. Example of a small
portion of
the lattice with synthetic data Minimum support Frequent sequences Stable sequences Partial orders 50 176 175 175 30 782 765 762 20 3150 2892 2729 Tab 8. Counting patterns for
the
command unix data. Fig 9. Example of a small
portion of
the lattice with unix command data Interestingly
enough, when dealing with unix user data, the frequent closed partial
orders
may be later used as the normal user profile for the intrusion
detection
systems (such as
proposed in by Lee W., Stolfo S.J., and Mok K.W. (1999). However, these
closed
posets may have other utilities in the field of knowledge discovery
depending
on the context. In this paper we analyze some
of the most important tasks for the
analysis of sequential data by means of a Galois lattice. We have seen that this
properly generated
lattice provides a convenient unifying framework: it allows the
generation of
the most specific partial orders summarizing the data, the set of
association
rules with order, and a way of clustering input sequences, by just
traversing
the nodes of the lattice. Given the theoretical power of this lattice
we
decided to integrate all the results in a system named ISSA. Basically,
this
system constructs the lattice by means of the set of frequent stable
sequences,
and it allows the extraction of different knowledge that can be later
used to
take decisions over the specific domain. As the most important task,
ISSA
allows the derivation of the most specific partial orders summarizing
the input
sequences without mining the data, just by overlapping wisely the
positions of
the sequences in a closed set. For the moment, our system is a simple
command
line program, which keeps the lattice in a plain text file to be used
as a
bridge from one task to another. All the graphical outputs of ISSA,
such as the
partial orders or the clusters of input sequences, can be viewed
through any
standard graph visualization tool accepting GraphML as input. Finally,
we
evaluated the performance of ISSA with preliminary experiments, which
show a
promising behavior of our proposals. The next step to complete our
proposal is to incorporate into ISSA the
visualization subsystem, in order to compute the results and seamlessly
presenting them to the user in a graphical form. This will not affect
the
functionality of the system, but it is likely to improve the usability,
for
instance the comparison of the different forms of knowledge. We would like ISSA to be a
fully functional
tool to analyze sequential data, of which there are many to analyze
numerical
or symbolic data (e.g. Weka, SAS, DBMiner …). Once this
basic tool is
available, an important natural step is to allow incremental updates of
the
lattice with the arrival of new input sequences to the domain. We want
the
previous computed results to be incremented just with the new
knowledge,
without the need of recomputing the whole lattice from the start. Finally, an important extension
of our work is to allow the same
framework for other complex structured input data, such as graphs,
molecules
and so on. We think that a Galois lattice storing just the important
information, such as paths of graphs, can be used in the same way we
showed
here for sequences. We would like that just the combination of
information of
the nodes would lead to characterizations in generally structured
contexts. Agrawal R., and
Srikant R. (1995). Mining Sequential Patterns. Proceedings of the 11th
Int.
Conference on Data Engineering. 3-14. Balcázar J.L., and Casas-Garriga G.
(2005).
On Horn axiomatizations for sequential data. Proceedings of the 10th
Int.
Conference on Database Theory. 215-229. Carpineto C., and Romano G. (2004). Concept
Data Analysis: Theory and Applications. John Wiley & Sons, Ltd. Carpineto C., and Romano G. (1993). GALOIS:
An Order-Theoretic Approach to Conceptual Clustering. Proceedings of
the Int.
Conference on Machine Learning. 33-40. Casas-Garriga G. (2003). Towards a formal
framework for mining general patterns from ordered data. KDD Int.
Worshop on
Multirelational Data Mining. 14-26. Casas-Garriga G., and Balcázar J.L.
(2004). Coproduct
Transformations on Lattices of Closed Partial Orders. In
2nd International Conference on Graph Transformation. 336-352. Casas-Garriga G. (2005). Summarizing
sequential data with closed partial orders. To appear in 5th Int.
Conference on
SIAM Data Mining. Davey B.A and Priestly H.A. (2002).
Introduction to Lattices and Order. Cambridge. Ganter B., and Wille R. (1998). Formal
Concept Analysis. Mathematical Foundations. Springer. Garofalakis M., Rastogi R., and Shim K.
(1999). Spirit: Sequential pattern mining with regular expression
constraints.
Proc. of the 25th Int. Conf. Very Large Data
Bases. 223–234. Guralnik V., and Karypis G. (2001). A
Scalable Algorithm for Clustering Sequential Data. Proceedings of Int.
Conference on Data Mining. 179-186. Lee W., Stolfo S.J., and Mok K.W. (1999). A
data mining framework for building intrusion detection models. Proc. of
the
Symposium on Security and Privacy. 120-132 Mannila
H., and
Meek C. (2000) Global partial orders from sequential data. Proceedings
of the
Int. Conference on Knowledge discovery and data mining. 161-168. Pei J., Han J., Mortazavi-Asl B., Pinto H.,
Chen Q., Dayal U., and Hsu M-C. (2001). PrefixSpan: Mining Sequential
Patterns
Efficiently by PrefixProjected Pattern Growth. Proceedings of the 17th
Int.
Conference on Data Engineering. 215-224. Srikant R., and Agrawal R. (1996). Mining
sequential patterns: Generalizations and performance improvements.
Proc. of the
5th Int. Conf. Extending Database Technology. 3-17. Wang J., and Han J. (2004). BIDE: Efficient
mining of frequent closed sequences. Proc. of the 20th Int. Conference
on Data
Engineering. 79-90. Yan X., Han J., and Afsjar R. (2003). CloSpan:
Mining Closed Sequential Patterns in Large Datasets. Proceedings of the
Int.
Conference SIAM Data Mining. 166-177. Zaki H.J., and Ogihara M. (1998).
Theoretical Foundations of Association Zaki M.J. (2001). SPADE: An Efficient
Algorithm for Mining Frequent Sequences Machine Learning Journal (Doug
Fisher,
ed.), 42(1/2). 31-60.
ISSA
Performance
EXPERIMENTAL
RESULTS


CONCLUSIONS
FUTURE
WORK
REFERENCES
Mannila H., Toivonen H., and Verkamo A.I. (1997). Discovery of frequent
episodes in event sequences. Data Mining and Knowledge Discovery, 1(3),
1997.
Özden B., Ramaswamy S., and Silberschatz A. (1998). Cyclic
association rules.
Proceedings of the 14th Int. Conference on Data Engineering. 412-421.
Rules. SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery.