PhD Thesis
Announcements of the last step towards the PhD.
Empirical Machine Translation and its Evaluation
PhD Candidate: Jesús Ángel Giménez Linares
Advisor: Dr. Lluís Màrquez Villodre
Summary: In this thesis we have exploited current Natural Language Processing technology for Empirical Machine Translation and its Evaluation.
On the one hand, we have studied the problem of automatic MT evaluation. We have analyzed the main deficiencies of current evaluation methods, which arise, in our opinion, from the shallow quality principles upon which they are based. Instead of relying on the lexical dimension alone, we suggest a novel path towards heterogeneous evaluations. Our approach is based on a rich set of automatic metrics intended to capture a wide variety of translation quality aspects at different linguistic levels (lexical, syntactic and semantic). Linguistic metrics have been evaluated over different scenarios. The most notable finding is that metrics based on deeper linguistic information (syntactic/semantic) are able to produce more reliable system rankings than metrics which limit their scope to the lexical dimension, especially when the systems under evaluation are different in nature. However, at the sentence level, some of these metrics become significantly less reliable, which is mainly attributable to parsing errors. In order to improve sentence-level evaluation, apart from backing off to lexical similarity in the absence of parsing, we have also studied the possibility of combining the scores conferred by metrics at different linguistic levels into a single measure of quality. Two valid non-parametric strategies for metric combination have been presented. These offer the important advantage of not having to adjust the relative contribution of each metric to the overall score. As a complementary issue, we show how to use the heterogeneous set of metrics to obtain automatic and detailed linguistic error analysis reports.
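As a rough illustration of the two ideas mentioned above (backing off to lexical similarity when no parse is available, and combining metrics without tuning weights), the following Python sketch may help. It is only an assumption-laden example: the summary does not describe the two combination strategies of the thesis, so a plain uniform average is used here as one possible weight-free scheme, and the function names and scores are hypothetical.

```python
from statistics import mean


def with_backoff(linguistic_score, lexical_score):
    """Use the lexical score when the linguistic metric could not be
    computed (represented here, by assumption, as None)."""
    return lexical_score if linguistic_score is None else linguistic_score


def uniform_combination(scores_by_metric):
    """Combine metric scores with equal weight, so that no relative
    contribution has to be adjusted. Scores are assumed to lie in [0, 1]."""
    if not scores_by_metric:
        raise ValueError("at least one metric score is required")
    return mean(scores_by_metric.values())


# Hypothetical sentence-level scores; the parser failed for the syntactic metric.
lexical = 0.5
raw_scores = {"lexical": lexical, "syntactic": None, "semantic": 0.25}

# Apply the back-off first, then combine.
scores = {name: with_backoff(value, lexical) for name, value in raw_scores.items()}
combined = uniform_combination(scores)
```

Because every metric contributes equally, adding or removing a metric requires no re-estimation of weights, which is the practical appeal of a non-parametric combination.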
On the other hand, we have studied the problem of lexical selection in Statistical Machine Translation. For that purpose, we have constructed a Spanish-to-English baseline phrase-based Statistical Machine Translation system and iterated across its development cycle, analyzing how to improve its performance through the incorporation of linguistic knowledge. First, we have extended the system by combining shallow-syntactic translation models based on linguistic data views. A significant improvement is reported. This system is further enhanced using dedicated discriminative phrase translation models. These models allow for a better representation of the translation context in which phrases occur, effectively yielding an improved lexical choice. However, based on the proposed heterogeneous evaluation methods and the manual evaluations conducted, we have found that improvements in lexical selection do not necessarily imply an improved overall syntactic or semantic structure. The incorporation of dedicated predictions into the statistical framework therefore requires further study.
As a side question, we have studied one of the main criticisms against empirical MT systems, i.e., their strong domain dependence, and how its negative effects may be mitigated by properly combining external knowledge sources when porting a system into a new domain. We have successfully ported an English-to-Spanish phrase-based Statistical Machine Translation system trained on the political domain to the domain of dictionary definitions.
The two parts of this thesis are tightly connected, since the hands-on development of an actual MT system has allowed us to experience first-hand the role of the evaluation methodology in the development cycle of MT systems.
Date: 02/07/2008
Time: 10:00h
Place: Sala del Llac, Rectorat building, Campus Nord.
An i*-based Reengineering Framework for Requirements Engineering
PhD Candidate: Gemma Grau Colom
Advisor: Dr. Xavier Franch Gutiérrez
Summary: Information Systems are a crucial asset for organizations and can provide them with competitive advantages. However, once an Information System is built, it has to be maintained and evolved, which includes changes to the requirements, the technology used, or the business processes supported. These changes are diverse in nature and may require different treatments according to their impact, ranging from small improvements to the deployment of a new Information System. In either case, changes are addressed at the requirements level, where decisions can be analysed using fewer resources. Because Requirements Engineering and Business Process Reengineering methods share common activities, and the alignment of the Information System with the business strategy has to be maintained during its evolution, a Business Process Reengineering approach is adequate for addressing Information Systems Development when there is an existing Information System to be used as a starting point.
The i* framework is a well-consolidated goal-oriented approach that allows Information Systems to be modelled graphically, in terms of actors and the dependencies among them. The i* framework addresses Requirements Engineering and Business Process Reengineering, but none of the existing i*-based approaches provides a complete framework for reengineering. In order to explore the applicability of i* for a reengineering framework, we have defined PRiM: a Process Reengineering i* Method, which assumes that there is an existing process that is the basis for the specification of the new Information System. PRiM is a six-phase method that combines techniques from the fields of Business Process Reengineering and Requirements Engineering and defines new techniques when needed. As a result, PRiM addresses: 1) the analysis of the current process using sociotechnical analysis techniques; 2) the construction of the i* model by differentiating the operationalization of the process from the strategic intentionality behind it; 3) the reengineering of the current process based on its analysis for improvements using goal acquisition techniques; 4) the generation of alternatives based on heuristics and patterns; 5) the evaluation of alternatives by defining structural metrics; and 6) the specification of the new Information System from the selected i* model.
There are several techniques from the Requirements Engineering and Business Process Reengineering fields that can be used instead of the ones selected in PRiM. Therefore, in order not to enforce the application of a particular technique, we propose a more generic framework in which they can be used and combined. Method Engineering is the discipline that constructs new methods from parts of existing ones, and so it is the approach adopted to define ReeF: a Reengineering Framework. In ReeF, the six phases of PRiM are abstracted and generalized in order to allow the most appropriate techniques to be selected for each phase, depending on the user's expertise and the domain of application. As an example of the applicability of ReeF, the new method SARiM is defined.
The main contributions of this work are twofold. On the one hand, two i*-based methods are defined: the PRiM method, which addresses process reengineering, and SARiM, which addresses software architecture reengineering. On the other hand, we provide several i*-based techniques to be used for constructing i* models, generating alternatives, and evaluating them using Structural Metrics. These methods and techniques are based on an exhaustive review of existing work, and they are validated by means of several formative case studies and an industrial case study. Tool support has been developed for the approach: REDEPEND-REACT, supporting the graphical modelling of i*, the generation of alternatives and the definition of Structural Metrics; and J-PRiM, supporting all the phases of the PRiM method using a textual visualization of the i* models.
Date: 07/07/2008
Time: 12:00h
Place: Sala d’Actes of the Facultat d’Informàtica de Barcelona (FIB), building B6, Campus Nord.
Applying Causal State Splitting Reconstruction Algorithm to Natural Language Processing Tasks
PhD Candidate: Muntsa Padró i Cirera
Advisor: Dr. Lluís Padró
Summary: This thesis is focused on the study and use of the Causal State Splitting Reconstruction (CSSR) algorithm for Natural Language Processing (NLP) tasks. CSSR is an algorithm that captures patterns from data by building automata in the form of visible Markov Models. It is based on the principles of Computational Mechanics and takes advantage of many properties of causal state theory. One of the main advantages of CSSR with respect to Markov Models is that it builds states containing more than one $n$-gram (called a history in computational mechanics), so the obtained automata are much smaller than the equivalent Markov Model.
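The idea that one state may contain several $n$-grams can be illustrated with a small Python sketch. This is a deliberately simplified toy and not the CSSR algorithm itself: it assumes a fixed history length and merges histories only when their empirical next-symbol distributions are exactly equal, whereas CSSR, as far as we understand it, compares distributions with a statistical test and grows the history length step by step. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def next_symbol_distributions(sequence, n):
    """Map each length-n history to its empirical next-symbol distribution."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - n):
        history = sequence[i:i + n]
        counts[history][sequence[i + n]] += 1
    distributions = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        # Exact fractions make equal distributions compare equal.
        distributions[history] = frozenset(
            (symbol, Fraction(count, total)) for symbol, count in counter.items()
        )
    return distributions


def group_histories(sequence, n):
    """Put histories with identical next-symbol distributions in one state."""
    states = defaultdict(list)
    for history, dist in next_symbol_distributions(sequence, n).items():
        states[dist].append(history)
    return sorted(sorted(histories) for histories in states.values())


# Toy data: "a" and "b" are always followed by "c", so they share a state.
states = group_histories("acbcacbc", 1)
```

In this toy input there are three distinct histories but only two states, which is the sense in which the resulting automaton can be smaller than a Markov Model that keeps one state per $n$-gram.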
In this work, we first study the behaviour of the algorithm when learning patterns related to NLP tasks but without performing any annotation task. These first experiments are useful to understand the parameters that affect the algorithm and to check that it is able to capture the patterns present in natural language sentences.
Secondly, we propose a way to apply CSSR to NLP annotation tasks. The algorithm was not originally conceived to use the hidden information necessary for annotation tasks, so we devised a way to introduce it into the system in order to obtain automata that include this information and can afterwards be used to annotate new text. In addition, some methods to deal with unseen events and a modification of the algorithm to make it more suitable for NLP tasks have been presented and tested. These three aspects constitute the first line of contributions of this research, together with a thorough experimental study of the proposed methods.
The experimental study of the proposed approach is performed on three different tasks: Named Entity Recognition in the general domain, Named Entity Recognition in the biomedical domain, and Chunking. The obtained results are promising in the first two tasks, though not so good for Chunking. Nevertheless, it is not easy to improve the obtained performance following the same approach, since CSSR needs quite reduced feature sets to build correct automata, and that limits the performance of the developed system. For that reason, we propose to combine CSSR with graphical models, in order to enrich the features that the system can take into account.
This combination constitutes the second line of contributions of this thesis. There is a variety of possible graphical models that can be used, but for the moment we propose to combine the CSSR algorithm with Maximum Entropy (ME) models. ME models can be used as a way of introducing more information into the system, encoding it as features. Along this line, we propose and test two methods for combining CSSR and ME models in order to improve the results obtained with the original CSSR. The first method is simple and does not modify the automata building algorithm, while the second one is more sophisticated and builds automata taking into account the ME features. We will see that the first method, though much simpler, leads to an important improvement with respect to the original CSSR, whereas the second method does not.
Date: 18/07/2008
Time: 12:00h
Place: Aula de Teleensenyament, building B3, Campus Nord.
A Group Selection pattern for multiagent systems and its application to grid computing
PhD Candidate: Isaac Chao Andrade
Advisors: Drs. Ramon Sangüesa and Óscar Ardáiz
Summary: This thesis deals with groups and their self-organized management in engineering-friendly environments. That is, the mechanism provided must prove applicable in practical settings. Groups exist in nature, in societies and in artificial systems. Individuals in biological populations organize themselves in groups. Human beings show special sociability and a tendency to become organized in groups, from “ghettos” to firms, from neighborhood associations to online communities. Participants in today’s networked infrastructures (such as social networking communities on the Internet) also tend to form cliques, or groups of agents showing a special preference to interact among themselves, partially isolated from the rest of the network. Grids are part of the next generation of networked infrastructures, which are made up not only of information but also of resources and users (human or artificial agents). Grids also organize activities around groups called Virtual Organizations. State-of-the-art mechanisms used in multiagent systems to manage group formation (coalitions, congregations, etc.) tend to be static and computationally costly, while the systems being developed in reality (grids, P2P and other overlays on top of the Internet) require high adaptiveness and a dynamic view of the system. There is a need for emergent and self-organized management of the entities composing the system.
In this thesis we start from the study of coordination and social dilemmas in multiagent systems, and we introduce a Group Selection process which, coming from Sociobiology, meets exactly the requirements mentioned above. First, it provides a mechanism by which multiagent systems incorporating high levels of uncertainty and dynamicity can be handled. Second, the mechanism makes few assumptions about agents’ capabilities. In this thesis, the Group Selection process is formalized as an engineering pattern. Theoretical grounds in multiagent learning are provided. We propose several instantiations of the pattern in relevant coordination and social dilemma scenarios: pure coordination games, collective coordination games, the prisoner’s dilemma, and the N-player prisoner’s dilemma. As a technology application, we also provide several additional instantiations of the pattern in grid computing applications such as adaptive job scheduling, decentralized grid markets, and the coordination of resource-sharing policies in Virtual Organizations.
The results of applying the Group Selection pattern in both multiagent systems and grid scenarios are improved cooperation and coordination, incorporating the self-organized management of the system entities and their interactions. The conclusion drawn from these results is: “Dynamically partitioning a population of agents in small groups and further co-evolving these sub-groups through Group Selection improves coordination levels in both social dilemma-based and fully cooperative multiagent systems, including grids”. This research is highly interdisciplinary by nature: Biology, Sociology, multiagent systems and grids all play an important role in it. However, the Group Selection pattern, as we propose it, aims to be a general mechanism for the engineering of multiagent systems. Biology and Sociology are the roots inspiring the pattern and grid computing is a first application, but any artificial system structured in groups could benefit from the results of this thesis.
Date: 30/07/2008
Time: 12:00h
Place: Sala d'Actes of the Barcelona School of Informatics (FIB), building B6, Campus Nord.
Contact Press:
ilapuente@lsi.upc.edu
