framework of the project
This is a postdoctoral project funded by the Beatriu de Pinós program of AGAUR (Agència de Gestió d'Ajuts Universitaris i de Recerca de la Generalitat de Catalunya), to be carried out within the Natural Language Processing Group at the Department of Computing Sciences in the Engineering School of the Universidad de la República, in Uruguay, from July 2006 to August 2008.
objectives
The overall goal of this project is to obtain a data-driven, cross-lingual characterization of discourse markers that is both linguistically sound and useful for NLP applications.
Theoretical contributions:
- establish an inventory of basic discursive
meanings that are coherent across a subset of
Indo-European languages, based on empirical evidence
from naturally occurring instances of discourse markers.
- identify textual features that
help determine markerhood and
discursive meanings.
- a lexicon of discourse markers in
Catalan, Spanish and English, and their equivalents in
the other languages, if any.
- a small corpus in which instances of
discourse markers are manually annotated with their
meaning and scope.
- a parser to find the structure
created by discourse markers in unrestricted text.
For an extensive description of the project, see the project proposal (in Catalan).
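As an illustration of what the lexicon and the annotated corpus described above might contain, here is a minimal sketch. The class names, field names, the "contrast" meaning label and the example sentence are all hypothetical and chosen only for illustration; the project's actual inventory of meanings and its annotation format are not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LexiconEntry:
    """One discourse marker and its equivalents in other languages (hypothetical layout)."""
    marker: str
    language: str
    # Maps a language code to the equivalent markers in that language, if any.
    equivalents: Dict[str, List[str]] = field(default_factory=dict)


@dataclass(frozen=True)
class MarkerInstance:
    """One manually annotated occurrence of a discourse marker (hypothetical layout)."""
    marker: str
    meaning: str              # label from an assumed inventory of discursive meanings
    scope: Tuple[int, int]    # character offsets (start, end) of the text in its scope

    def scope_text(self, text: str) -> str:
        """Return the stretch of text that falls under the marker's scope."""
        start, end = self.scope
        return text[start:end]


# Illustrative data only.
entry = LexiconEntry(
    marker="however",
    language="en",
    equivalents={"es": ["sin embargo"], "ca": ["tanmateix"]},
)

text = "It rained. However, we went out."
scope_start = text.find("we went out")
instance = MarkerInstance(
    marker="However",
    meaning="contrast",  # hypothetical label
    scope=(scope_start, scope_start + len("we went out")),
)
```

Storing the scope as character offsets rather than as a copied string keeps each annotation tied to its position in the source text, which matters when the same words occur more than once.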
in progress
- compilation of a large number of naturally occurring
instances of discourse markers, from
  - a journalistic corpus (El Periódico de Catalunya)
  - the web (via searches)
The target is at least 1000 instances of each of the least frequent discourse markers under study, which means that the most frequent markers will have a very large number of occurrences.
Related tasks:
- represent graphically the distribution of
discourse markers in the corpus, comparing:
  - very frequent, frequent and rare discourse
  markers
  - these three frequency-based classes of
  discourse markers with the parallel classes of
  words
  - these distributions in a homogeneous
  journalistic corpus and in an unrestricted web
  corpus
- gathering features that characterize the occurrence
of discourse markers in text
  - words
  - tags
  - location in paragraph
  - layout tags
  - location in document
- selecting the features that are most useful to characterize
discourse markers, both to distinguish them from other
word classes and to characterize their discursive meaning
  - those with highest mutual information
    - for each discourse marker
    - across all discourse markers
  - those with highest distinctive power (discovered
  by decision trees)
- a Java-based parser exploiting the information
provided by discourse markers
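The feature selection step listed above ranks features by their mutual information with the class to be predicted. The following is a minimal sketch of that computation, not the project's implementation: the function names, the feature names (`paragraph_initial`, `prev_tag`) and the toy data are hypothetical. Labels can be assigned per discourse marker or pooled across all markers, matching the two variants in the list.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information, in bits, between a feature value and a class label.

    `pairs` is a list of (feature_value, label) tuples, one per instance.
    """
    n = len(pairs)
    joint = Counter(pairs)
    feature_counts = Counter(value for value, _ in pairs)
    label_counts = Counter(label for _, label in pairs)
    mi = 0.0
    for (value, label), count in joint.items():
        p_joint = count / n
        # p(x, y) / (p(x) * p(y)) rewritten in terms of raw counts
        ratio = p_joint * n * n / (feature_counts[value] * label_counts[label])
        mi += p_joint * math.log2(ratio)
    return mi


def rank_features(instances, labels):
    """Return (feature_name, score) pairs sorted by decreasing mutual information.

    `instances` is a list of dicts mapping feature names to values;
    a feature that is absent from an instance is treated as the value None.
    """
    names = sorted({name for inst in instances for name in inst})
    scores = {
        name: mutual_information(
            [(inst.get(name), label) for inst, label in zip(instances, labels)]
        )
        for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, for illustration only: four tokens, two of them discourse markers.
toy_instances = [
    {"paragraph_initial": True, "prev_tag": "PUNCT"},
    {"paragraph_initial": True, "prev_tag": "NOUN"},
    {"paragraph_initial": False, "prev_tag": "PUNCT"},
    {"paragraph_initial": False, "prev_tag": "NOUN"},
]
toy_labels = ["marker", "marker", "other", "other"]

ranking = rank_features(toy_instances, toy_labels)
```

In this toy data `paragraph_initial` separates the two classes perfectly and so ranks first, while `prev_tag` is independent of the label and scores zero. Mutual information scores each feature in isolation; the decision trees mentioned in the list can additionally expose features that are only distinctive in combination.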