I can't log in
 
LSI
Accions del document

PhD Thesis

Announcements of the last step towards the PhD.


LogoDelicious  Digg!

Novel Computational Methods for Large Scale Genoma Comparison

PhD Candidate: Todd James Treanger

Advisor: Xavier Messeguer

Summary: The current wealth of available genomic data embodies a wide variety of species, spanning all domains of life, thus providing an unprecedented opportunity to compare and contrast evolutionary histories of closely and distantly related organisms. The identification of homologous DNA is a fundamental building block of comparative genomic and molecular evolution studies. To date, sequence alignment has proven to be a versatile tool for comparing closely and distantly related organisms, nucleotide at a time. For pair wise comparisons, optimal global and local alignment for a given scoring scheme requires O (m x n) time and space with respect to the lengths m,n of the genomes. This computational bottleneck becomes exponential for multiple genomes, O (nk) for k genomes, making large scale multiple genome alignment a daunting task using traditional methods.

The focus of this dissertation is on developing novel algorithms and software for efficient global and local comparison of multiple genomes and the application of these methods for a biologically relevant case study. The thesis research is organized following the main focus into three successive phases, specifically: (1) multiple genome alignment of closely related species, (2) local multiple alignments of interspersed repeats, and finally, (3) a comparative genomics case study of DNA uptake sequences (DUS) in Nesseria.

In Phase 1, we first develop an efficient algorithm and data structure for maximal substring match search in multiple genome sequences. Our approach for searching for maximal unique substrings yields significant improvements, in both time and space, over existing methods when dealing with multiple genome sequences. Specifically, given S1 … Sm genome sequences (where S1 is the smallest genome), we are able to find the maximal unique substrings among all of the sequence in linear time O(|S1|+ … + Sm|) and linear space O(|S1|). We then implement these contributions into an interactive multiple genome comparison and alignment tool, M-GCAT, which can efficiently construct multiple genome comparison frameworks in closely related species. In Phase 2, we present a novel computational method for local multiple alignments of interspersed repeats. Our method for local alignment of interspersed repeats features a novel method for gapped extensions of chained seed matches, joining global multiple alignments with a homology test based on a hidden Markov model. We implement our method for local alignment of interspersed repeats into the open source procrastAlign software. In Phase 3, using the results from the previous two phases we perform a case study of neisserial genomes by tracking the propagation of repeat sequence elements in attempt to understand why the important pathogens of the neisserial group have sexual exchange of DNA by natural transformation.

In conclusion, our global contributions in this dissertation have focused on comparing and contrasting evolutionary histories of related organisms via their genomes. We have proposed novel data structures, algorithms, and software for active areas of research in comparative genomics. We have experimentally demonstrated the efficiency and accuracy of the computational methods presented in this thesis for multiple alignment, and have implemented all data structures and algorithms in freely available software.

Date: 25th of June, 2008

Time: 11:00h

Place: La Sala del Llac, Campus Nord, edifici R.


Empirical Machine Translation and its Evaluation

PhD Candidate: Jesús Ángel Giménez Linares

Advisor: Dr. Lluís Màrquez Villodre

Summary: In this thesis we have exploited current Natural Language Processing technology for Empirical Machine Translation and its Evaluation.

On the one side, we have studied the problem of automatic MT evaluation. We have analyzed the main deficiencies of current evaluation methods, which arise, in our opinion, from the shallow quality principles upon which they are based. Instead of relying on the lexical dimension alone, we suggest a novel path towards heterogeneous evaluations. Our approach is based on the design of a rich set of automatic metrics devoted to capture a wide variety of translation quality aspects at different linguistic levels (lexical, syntactic and semantic). Linguistic metrics have been evaluated over different scenarios. The most notable finding is that metrics based on deeper linguistic information (syntactic/semantic) are able to produce more reliable system rankings than metrics which limit their scope to the lexical dimension, specially when the systems under evaluation are different in nature. However, at the sentence level, some of these metrics suffer a significant decrease, which is mainly attributable to parsing errors. In order to improve sentence-level evaluation, apart from backing off to lexical similarity in the absence of parsing, we have also studied the possibility of combining the scores conferred by metrics at different linguistic levels into a single measure of quality. Two valid non-parametric strategies for metric combination have been presented. These offer the important advantage of not having to adjust the relative contribution of each metric to the overall score. As a complementary issue, we show how to use the heterogeneous set of metrics to obtain automatic and detailed linguistic error analysis reports.

On the other side, we have studied the problem of lexical selection in Statistical Machine Translation. For that purpose, we have constructed a Spanish-to-English baseline phrase-based Statistical Machine Translation system and iterated across its development cycle, analyzing how to ameliorate its performance through the incorporation of linguistic knowledge. First, we have extended the system by combining shallow-syntactic translation models based on linguistic data views. A significant improvement is reported. This system is further enhanced using dedicated discriminative phrase translation models. These models allow for a better representation of the translation context in which phrases occur, effectively yielding an improved lexical choice. However, based on the proposed heterogeneous evaluation methods and manual evaluations conducted, we have found that improvements in lexical selection do not necessarily imply an improved overall syntactic or semantic structure. The incorporation of dedicated predictions into the statistical framework requires, therefore, further study.

As a side question, we have studied one of the main criticisms against empirical MT systems, i.e., their strong domain dependence, and how its negative effects may be mitigated by properly combining outer knowledge sources when porting a system into a new domain. We have successfully ported an Englishto-Spanish phrase-based Statistical Machine Translation system trained on the political domain to the domain of dictionary definitions.

The two parts of this thesis are tightly connected, since the hands-on development of an actual MT system has allowed us to experience in first person the role of the evaluation methodology in the development cycle of MT systems.

Date: outstanding

Time: outstanding

Place: outstanding

An i*-based Reengineering Framework for Requirements Engineering

PhD Candidate:
Gemma Grau Colom

Advisor:
Dr. Xavier Franch Gutiérrez

Summary:
Information Systems are a crucial asset of the organizations and can provide competitive advantages to them. However, once the Information System is built, it has to be maintained and evolved, which includes changes on the requirements, the technology used, or the business processes supported. All these changes are diverse in nature and may require different treatments according to their impact, ranging from small improvements to the deployment of a new Information System. In both situations, changes are addressed at the requirements level, where decisions are analysed involving less resources. Because Requirements Engineering and Business Process Reengineering methods share common activities, and the design of the Information System with the business strategy has to be maintained during its evolution, a Business Process Reengineering approach is adequate for addressing Information Systems Development when there is an existing Information System to be used as starting point.

The i* framework is a well-consolidated goal-oriented approach that allows to model Information Systems in a graphical way, in terms of actors and dependencies among them. The i* framework addresses Requirements Engineering and Business Process Reengineering but none of the i*-based existing approaches provide a complete framework for reengineering. In order to explore the applicability of i* for a reengineering framework, we have defined PRiM: a Process Reengineering i* Method, which assumes that there is an existing process that is the basis for the specification of the new Information System. PRiM is a sixphase method that combines techniques from the fields of Business Process Reengineering and Requirements Engineering and defines new techniques when needed. As a result PRiM addresses: 1) the analysis of the current process using sociotechnical analysis techniques; 2) the construction of the i* model by differentiating the operationalization of the process form the strategic intentionality behind it; 3) the reengineering of the current process based on its analysis for improvements using goal acquisition techniques; 4) the generation of alternatives based on heuristics and patterns; 5) the evaluation of alternatives by defining structural metrics; and, 6) the specification of the new Information System from the selected i* model.

There are several techniques from the Requirements Engineering and Business Process Reengineering fields, that can be used instead the ones selected in PRiM. Therefore, in order to not enforce the application of a certain technique we propose a more generic framework where to use and combine them. Method Engineering is the discipline that constructs new methods from parts of existing ones and, so, it is the approach adopted to define ReeF: a Reengineering Framework. In ReeF the six phases of PRiM are abstracted and generalized in order to allow selecting the most appropriate techniques for each of the phases, depending on the user expertise and the domain of application. As an example of the applicability of ReeF, the new method SARiM is defined.

The main contributions of this work are twofold. On the one hand, two i*-based methods are defined: the PRiM method, which addresses process reengineering, and SARiM, which addresses software architecture reengineering. On the other hand, we provide several i*-based techniques to be used for constructing i* models, generating alternatives, and evaluating them using Structural Metrics. These methods and techniques are based on exhaustive review of existing work and their validation is done by means of several formative case studies and an industrial case study. Tool support has been developed for the approach: REDEPEND-REACT supporting the graphical modelling of i*, the generation of alternatives and the definition of Structural Metrics; and, J-PRiM supporting all the phases of the PRiM method using a textual visualization of the i* models.

Date: outstanding

Time: outstanding

Place: outstanding

Applying Causal State Splitting Reconstruction Algorithm to Natural Language Processing Tasks

PhD Candidate: Muntsa Padró i Cirera

Advisor:
Dr. Lluís Padró

Summary:
This thesis is focused on the study and use of Causal State Splitting Reconstruction (CSSR) algorithm for Natural Language Processing (NLP) tasks. CSSR is an algorithm that captures patterns from data building automata in the form of visible Markov Models. It is based on the principles of Computational Mechanics and takes advantage of many properties of causal state theory. One of the main advantages of CSSR with respect to Markov Models is that it builds states containing more than one $n$gram (called history in computational mechanics), so the obtained automata are much smaller than the equivalent Markov Model.

In this work, we first study the behaviour of the algorithm when learning patterns related to NLP tasks but without performing any annotation task. These first experiments are useful to understand the parameters that affect the algorithm and to check that it is able to capture the patterns present in natural language sentences.

Secondly, we propose a way to apply CSSR to NLP annotation tasks. The algorithm is not originally conceived to use the hidden information necessary for annotation tasks, so we devised a way to introduce it into the system in order to obtain automata including this information that can be afterwards used to annotate new text. Also, some methods to deal with unseen events and a modification of the algorithm to make it more suitable for NLP tasks have been presented and tested. These three aspects conform the first line of contributions of this research, altogether with a deep experimental study of the proposed methods.

The experimental study of the proposed approach is performed in three different tasks: Named Entity Recognition in general and Biomedical domain and Chunking. The obtained results are promising in the two first tasks though not so good for Chunking. Nevertheless, it is not easy to improve the obtained performance following the same approach, since CSSR needs quite reduced feature sets to build correct automaton and that limits the performance of the developed system. For that reason, we propose to combine CSSR with graphical models, in order to enrich the features that the system can take into account.

This combination conforms the second line of contributions of this thesis. There is a variety of possible graphical models that can be used, but for the moment we propose to combine CSSR algorithm with Maximum Entropy (ME) models. ME models can be used as a way of introducing more information into the system, encoding it as features. In this line, we propose and test two methods for combining CSSR and ME models in order to improve the results obtained with original CSSR. The first method is simple and does not modify the automata building algorithm while the second one is more sophisticated and builds automata taking into account the ME features. We will see that though much simpler, the first method leads to an important improvement with respect to original CSSR but the second method does not.


Date: outstanding

Time: outstanding

Place: outstanding

Contact Press:
ilapuente@lsi.upc.edu

(Back to the Newsletter)
 
Darrera modificació: Juny 2008
© UPC. Technical University of Catalonia
Departament de Llenguatges i Sistemes Informàtics
About this web.