




framework of the project



This is a postdoctoral project funded by the Beatriu de Pinós program of AGAUR (Agència de Gestió d'Ajuts Universitaris i de Recerca de la Generalitat de Catalunya), to be carried out within the Natural Language Processing Group at the Department of Computing Sciences in the Engineering School of the Universidad de la República, in Uruguay, from July 2006 to August 2008.


objectives



The overall goal of this project is to obtain a data-driven, cross-lingual characterization of discourse markers that is both linguistically sound and useful for NLP applications.

Theoretical contributions:

  • establish an inventory of basic discursive meanings that are coherent across a subset of Indo-European languages, based on empirical evidence from naturally occurring instances of discourse markers.
  • identify textual features that help determine markerhood and discursive meanings.


Applied contributions:

  • a lexicon of discourse markers in Catalan, Spanish and English, with their equivalents in the other languages, if any.
  • a small corpus in which instances of discourse markers are manually associated with their meaning and scope.
  • a parser to find the structure created by discourse markers in unrestricted text.

For an extensive description of the project, see the project proposal (in Catalan).


in progress


  • compilation of a large number of naturally occurring instances of discourse markers, from

    • a journalistic corpus (El Periódico de Catalunya)
    • the web (via searches)

    The target is to have at least 1000 instances of each of the least frequent discourse markers under study. This means that the most frequent markers will have a very large number of occurrences.
    Related tasks:

    • represent graphically the distribution of discourse markers in the corpus, comparing:

      • very frequent, frequent and rare discourse markers
      • these three frequency-based classes of discourse markers with the parallel frequency classes of words
      • these distributions in a homogeneous journalistic corpus and in an unrestricted web corpus


  • gathering features that characterize the occurrence of discourse markers in text

    • words
    • tags
    • location in paragraph
    • layout tags
    • location in document

  • selecting the features that are most useful to characterize discourse markers, both to distinguish them from other word classes and to characterize their discursive meaning

    • those with highest mutual information

      • for each discourse marker
      • across all discourse markers

    • those with highest distinctive power (discovered by decision trees)

  • a Java-based parser exploiting the information provided by discourse markers
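
The feature-selection step listed above (keeping the features with the highest mutual information) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the feature names (paragraph_initial, prev_tag), the labels ("marker", "other") and the four toy instances are all invented for the example, and each feature is treated as a discrete variable whose mutual information with the class label is estimated from raw counts.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in bits) between two discrete sequences,
    estimated from relative frequencies."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    p_feature = Counter(feature_values)
    p_label = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((p_feature[x] / n) * (p_label[y] / n)))
    return mi


def rank_features(instances, labels):
    """Return (feature name, MI) pairs sorted by decreasing MI.
    A feature missing from an instance is treated as the value None."""
    names = sorted({name for inst in instances for name in inst})
    scores = {
        name: mutual_information([inst.get(name) for inst in instances], labels)
        for name in names
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: one dict of features per candidate occurrence.
instances = [
    {"paragraph_initial": True, "prev_tag": "PUNCT"},
    {"paragraph_initial": True, "prev_tag": "PUNCT"},
    {"paragraph_initial": False, "prev_tag": "NOUN"},
    {"paragraph_initial": False, "prev_tag": "PUNCT"},
]
labels = ["marker", "marker", "other", "other"]

ranked = rank_features(instances, labels)
for name, score in ranked:
    print(f"{name}: {score:.3f} bits")
```

Running the same ranking on the instances of a single discourse marker, or on all markers pooled together, would correspond to the two settings listed above ("for each discourse marker" and "across all discourse markers"). With so few instances the estimates are of course unreliable; the sketch only shows the shape of the computation.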