parallel in Catalan, Spanish and English






description of the lexicon

This is the seed discourse marker lexicon used in the thesis "Representing discourse for automatic text summarization via shallow NLP techniques". The discourse markers listed here were the primary source of evidence for drawing the semantic maps from which an inventory of basic discursive meanings was obtained. This lexicon is also the basis for the implementation of a discourse segmenter and for the discourse analysis exploited by the e-mail summarizer Carpanta.

The lexicon is parallel in three languages: Catalan, Spanish and English. Therefore, in this starting version of the lexicon we have only included those discourse markers that have a near-synonym in one of the other languages. Those that do not have a near-synonym have been included in the extended version of the lexicon created by bootstrapping techniques applied to this starting lexicon.

The discourse markers that constitute the prototypical lexicon were obtained from previous work, mostly Knott (1996) and Marcu (1997), with the restriction that they are highly grammaticalized. We have also included in the lexicon some closed class words, obtained from the dictionary of the FreeLing morphosyntactic analyzer. We have discarded closed class words that are very vague and highly ambiguous discourse markers.

In this lexicon, discourse markers are characterized by their structural (continuation or elaboration) and semantic (revision, cause, equality, context) meanings, and they are also associated with a morphosyntactic class (part of speech, PoS): adverbial (A), phrasal (P) or conjunctive (C).
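
One way to picture an entry of this lexicon is as a small record holding the three parallel forms plus the three annotations. The sketch below is only illustrative: the names `LexiconEntry` and `find_entries` are hypothetical and are not part of the distributed lexicon, and using `None` for an underspecified structural meaning is an assumption. The two sample rows are copied from the revision table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LexiconEntry:
    """Hypothetical record for one row of the lexicon."""
    catalan: str
    english: str
    spanish: str
    structural: Optional[str]  # "elaboration", "continuation", or None (assumed: underspecified)
    semantic: str              # "revision", "cause", "equality" or "context"
    pos: str                   # "A" (adverbial), "P" (phrasal) or "C" (conjunctive)


# Two rows taken from the table of discourse markers signalling revision.
ENTRIES = [
    LexiconEntry("encara que", "although", "aunque", "elaboration", "revision", "P"),
    LexiconEntry("no obstant", "however", "no obstante", "continuation", "revision", "A"),
]


def find_entries(form: str, language: str,
                 entries: List[LexiconEntry] = ENTRIES) -> List[LexiconEntry]:
    """Return every entry whose form in `language` matches `form`, ignoring case."""
    if language not in ("catalan", "english", "spanish"):
        raise ValueError(f"unknown language: {language}")
    wanted = form.lower()
    return [entry for entry in entries if getattr(entry, language).lower() == wanted]
```

Looking up a form in one language gives access to its near-synonyms in the other two, which is the sense in which the lexicon is parallel.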

No information has been encoded about the reliability of discourse markers with respect to their discursive (vs. sentential) function. The only information of this kind that we provide is that discourse markers that are highly ambiguous with respect to their function are not included in the lexicon.

Sometimes a discourse marker is underspecified with respect to a meaning. We encode this with a hash. This tends to happen with structural meanings, because these meanings can well be established by discursive mechanisms other than discourse markers, and the presence of the discourse marker just reinforces the relation, whichever it may be.

Sometimes a discourse marker is ambiguous with respect to two meanings. In these cases, we write the predominant meaning in italics and the secondary meaning in parentheses, or both in italics if no predominant meaning can be determined. Resolving such ambiguities normally requires information about the context of occurrence, but we have not associated discourse markers with the contextual features that could help disambiguate them. Nevertheless, it seems that determining the appropriate meaning of a particular instance of a discourse marker can be handled well by general procedures, implemented directly in the algorithms that exploit the information stored in a lexicon (segmentation algorithms, discourse parsers, etc.).
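
Underspecification and ambiguity can be kept apart in a machine-readable version of the lexicon. The following is a minimal sketch, not the encoding actually used for the lexicon: the class `Meaning`, its fields and the method `candidates` are hypothetical, and an empty value is assumed to stand for the underspecified mark.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Meaning:
    """Hypothetical encoding of one dimension (structural or semantic) of a marker."""
    primary: Tuple[str, ...] = ()    # predominant meaning(s); two values = none predominates
    secondary: Optional[str] = None  # secondary meaning, written in parentheses in the tables

    @property
    def underspecified(self) -> bool:
        """True when the lexicon commits to no meaning at all on this dimension."""
        return not self.primary and self.secondary is None

    def candidates(self) -> Tuple[str, ...]:
        """All meanings a disambiguation procedure should consider, predominant first."""
        extra = (self.secondary,) if self.secondary is not None else ()
        return self.primary + extra


# "first of all": semantic meaning is context (or equality); structural is underspecified.
first_of_all_semantic = Meaning(primary=("context",), secondary="equality")
first_of_all_structural = Meaning()
```

Keeping the predominant meaning first lets a general procedure fall back on it when the context offers no further evidence.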

All in all, the lexicon consists of 84 discourse markers, representing different discursive meanings. Some discourse markers have been assigned more or fewer than one meaning per dimension, because they are ambiguous or underspecified, respectively.

                revision  cause  equality  context  total
elaboration            4      9        10       22     41
continuation           9      9         6        4     28
underspecified         1     --        10        4     15
total                 14     18        26       32     84







discourse markers signalling revision


Catalan         English          Spanish          structural    semantic  PoS
a pesar de      despite          a pesar de       elaboration   revision  P
encara que      although         aunque           elaboration   revision  P
excepte         except           excepto          elaboration   revision  P
malgrat         in spite of      pese a           elaboration   revision  P
no obstant      however          no obstante      continuation  revision  A
nogensmenys     nevertheless     sin embargo      continuation  revision  A
en realitat     actually         en realidad      continuation  revision  A
de fet          in fact          de hecho         continuation  revision  A
al contrari     on the contrary  al contrario     continuation  revision  A
el fet és que   the fact is      el hecho es que  continuation  revision  P
és cert que     it is true that  es cierto que    continuation  revision  P
però            but              pero             continuation  revision  P
tot i això      even though      con todo         continuation  revision  A
ara bé          well now         ahora bien       continuation  revision  A
de tota manera  anyway           de todos modos   --            revision  A
  • however differs from although in its structural value (continuation vs. elaboration). Although each of them can be used to rephrase the other in some contexts, however is attached to the segment that indicates continuation, whereas although is attached to the segment that indicates elaboration.
  • actually / in fact: their primary meaning is marking evidentiality [ex], but they tend to be structurally equivalent to however, as we have shown using multiple alignment techniques (Alonso et al. 2004). In English, their evidentiality meaning predominates over the revision meaning, so their contribution as discourse markers of revision is only reliable when they co-occur with other discourse markers that also signal revision [ex] or in certain punctuation contexts [ex], although they can also signal revision without any such further evidence [ex]. In Spanish and Catalan their primary meaning is revision ([ex] and [ex], respectively), comparable to it is true that. The kind of revision that these discourse markers tend to convey in Spanish and Catalan is correction (a prototypical example of correction would be: "This is not black, but white."). We can speculate that the revision meaning of these discourse markers is more primary in Spanish and Catalan than in English because in these languages the correction meaning tends to be expressed by discourse markers, as shown by the fact that it is lexicalized (sinó, sino), whereas in English it is covered by the all-purpose revision discourse marker but, and correction is only distinguished from other kinds of revision by other linguistic features.

    example:
    They could also help themselves by thinking through a problem before phoning the support desk.
    In many cases a user will actually solve his or her own problem while on the phone to Neptune!

    example:
        a Standardisation has never been the IT industry's strong point, and the answer is "probably not". However, they don't actually all do the same job.
        b He then argues that "it is not sufficient (for me) to tell the conference that there will be no return to mass picketing". Actually, I never mentioned picketing, mass picketing or otherwise, in my speech, but let that pass.

    example:
        a Amnesty warmly welcomed the release of prisoners of conscience and the repeal of certain articles, but has urged that the legislation be extended to include reform or repeal of further articles of the Turkish Penal Code, under which POCs may be held. The new law may in fact increase the already serious risk of torture facing political detainees.
        b La idea inicial de Maragall fue celebrar una exposición internacional, pero ese propósito falló cuando alguien de su gabinete descubrió que habían llegado tarde para obtener el reconocimiento internacional para un acontecimiento de este tipo. En realidad poco importaba qué se hiciera. Tanto Clos como Maragall perseguían en esencia poner una nueva fecha al futuro de la ciudad.
        c Tot va començar, com en les novel.les policíaques, amb un fiscal, entestat a treure a la llum el taló d'Aquil.les del president demòcrata. L'ham: una becària de 22 anys, grassoneta --usa la talla 46--, de pits exuberants i boca àmplia, una mica esbojarrada ja que creia tenir una relació sentimental quan en realitat va mantenir 10 trobades sexuals servides a domicili amb el senyor Clinton, qui, durant set mesos, es va obstinar a negar haver mantingut contacte físic amb ella.

  • it is true that: in contrast with actually or in fact, its primary meaning is revision, like en realidad and de hecho in Spanish or en realitat and de fet in Catalan.




discourse markers signalling cause


Catalan             English           Spanish            structural    semantic  PoS
donat que           given that        dado que           elaboration   cause     P
perquè              because           porque             elaboration   cause     P
degut a             due to            debido a           elaboration   cause     P
gràcies a           thanks to         gracias a          elaboration   cause     P
per si              in case           por si             elaboration   cause     P
per                 because of        por                elaboration   cause     P
per això            that's why        por eso            continuation  cause     A
en conclusió        in conclusion     en conclusión      continuation  cause     A
així que            thus              así que            continuation  cause     P/A/P
com a conseqüència  as a consequence  como consecuencia  continuation  cause     A
per                 in order to       para               continuation  cause     P
perquè              so that           para que           continuation  cause     P
per aquesta raó     for this reason   por esta razón     continuation  cause     A
per tant            so                por tanto          continuation  cause     A/C/A
en efecte           in effect         en efecto          continuation  cause     A
  • in conclusion: while it looks similar to in sum, this discourse marker tends to convey new information rather than rephrase it. Compare the following example with the example for in sum. With respect to its effects on coherence and relevance, it is comparable to consecutive discourse markers like that's why or so then, which can also signal relations that are not motivated by a causal relation in the real world but have the same rhetorical strength as those that are. It is comparable to in effect.

    example
    The European Court further ruled in this case that Arts 48 and 59 of the EC Treaty do not prevent a member state from requiring that the exercise of the profession of auditor in that state by a person qualified to carry on that profession in another member state be conditions which are objectively necessary to guarantee observation of professional rules concerning the permanence of the infrastructure in place for the completion of the work, the effective presence in the member state and assurance of the observation of professional ethics, unless respect for such rules and conditions is already guaranteed by a reviseur d'entreprises, whether a natural person or a firm, established and recognised in the state, and in whose service is placed, for the duration of the work, the person who intends to exercise the profession of auditor. In conclusion, one has to wonder whether the borders are in fact open.

  • perquè / per: in Catalan these discourse markers are underspecified with respect to structural meaning; they can be equivalent to so that / to [ex] or to because / because of [ex].

    example
         a Avui sento por perquè han declarat impunes tots els caps d'Estat. Today I feel frightened because all heads of State have been declared immune from punishment.
         b La Generalitat ha fet una crida a la solidaritat perquè s'ocupin aquestes cases. The Generalitat has made a call to solidarity so that these houses are occupied.





discourse markers signalling equality


Catalan             English            Spanish            structural    semantic  PoS
en resum            in sum             en resumen         elaboration   equality  A
concretament        specifically       concretamente      elaboration   equality  A
en essència         essentially        en esencia         elaboration   equality  A
en comparació       in comparison      en comparación     elaboration   equality  A
en altres paraules  in other words     en otras palabras  elaboration   equality  A
en particular       in particular      en particular      elaboration   equality  A
és a dir            that is to say     es decir           elaboration   equality  C
per exemple         for example        por ejemplo        elaboration   equality  A
precisament         precisely          precisamente       elaboration   equality  A
tal com             such as            tal como           elaboration   equality  P
en darrer lloc      lastly             por último         continuation  equality  A
per una banda       on the one hand    por un lado        continuation  equality  A
per altra banda     on the other hand  por otro lado      continuation  equality  A
a propòsit          by the way         a propósito        continuation  equality  A
no només            not only           no sólo            continuation  equality  P
sinó també          but also           sino también       continuation  equality  P
en dues paraules    in short           en dos palabras    --            equality  A
a més               moreover           además             --            equality  A
també               also               también            --            equality  A
a banda             besides            aparte             --            equality  A
encara més          what's more        aún es más         --            equality  A
fins i tot          even               incluso            --            equality  P
especialment        especially         especialmente      --            equality  A
sobretot            above all          sobre todo         --            equality  A
  • not only ... but also
  • lastly: unlike first of all or to begin with, and like secondly or thirdly, this discourse marker is not ambiguous, because it requires a context of sequence to be felicitous.
  • on the one hand / on the other hand: like lastly, they require a sequence context to be felicitous, so they are not ambiguous with respect to their structural or semantic meaning, but the ambiguity of their scope varies greatly. If they co-occur [ex], their scope can be determined by assuming that the scope of on the one hand extends up to the point where on the other hand occurs, and that the latter has a scope of equivalent size. However, if on the other hand occurs alone, its scope is very hard to determine automatically, and probably also for human judges.

    example
    It does occur to Fukuyama that religion might have some sort of unease to express with all this, but he appears to conceive of religion under only two modes. On the one hand, there is fundamentalist counter-ideology, the Islamic theocratic state. This, it is to be assumed, his liberal readers may take seriously as a threat, but hardly as an option. And on the other hand, there are "less organised religious impulses", religion as individual preference. This he knows can readily be accommodated another sort of consumer commodity, "within the sphere of personal life permitted in liberal societies".

  • in short is ambiguous with respect to continuation or elaboration, because the discourse unit to which the discourse marker is attached can sometimes contribute new information, as in the following example.

    example
    The authors maintain that the role of women in the Tigrayan society is still closely linked to their status in the feudal system. 1975 women were treated as children. They were not allowed to own land nor speak. In short women were at the bottom of the hierarchy of oppression with no rights of any kind.

  • in sum / essentially convey an elaborative relation because they repeat information that has already been given, even if in a shorter form. The utility of these discourse markers for automatic summarization is an ad-hoc property, dependent on the task and not on their effects on coherence and relevance assessment. Therefore, it has to be handled by a manually created special rule that overrides the general discursive rules.
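
The scope heuristic described above for co-occurring on the one hand / on the other hand can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the segmenter: the text is assumed to be already split into sentences, the markers are matched by plain lowercase substring search, and the function name `paired_scopes` is hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open range of sentence indices: [start, end)


def paired_scopes(sentences: List[str]) -> Optional[Tuple[Span, Span]]:
    """Return the scopes of 'on the one hand' and 'on the other hand'.

    The first scope runs from the sentence containing 'on the one hand' up to
    (not including) the later sentence containing 'on the other hand'. The
    second scope has the same size, truncated at the end of the text.
    Returns None when the markers do not co-occur in this order in
    distinct sentences.
    """
    lowered = [sentence.lower() for sentence in sentences]
    first = next((i for i, s in enumerate(lowered) if "on the one hand" in s), None)
    if first is None:
        return None
    second = next(
        (i for i, s in enumerate(lowered) if i > first and "on the other hand" in s),
        None,
    )
    if second is None:
        return None
    size = second - first
    return (first, second), (second, min(second + size, len(sentences)))
```

When on the other hand occurs alone the function returns `None`, mirroring the observation that its scope cannot be reliably determined in that case.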




discourse markers signalling context


Catalan           English             Spanish             structural    semantic  PoS
considerant       considering         teniendo en cuenta  elaboration   context   P
després           after               después             elaboration   context   P
abans             before              antes               elaboration   context   A
originalment      originally          originalmente       elaboration   context   A
a condició de     provided that       a condición de      elaboration   context   P
durant            during              durante             elaboration   context   P
mentre            while               mientras            elaboration   context   P
a no ser que      unless              a no ser que        elaboration   context   P
quan              when                cuando              elaboration   context   P
on                where               donde               elaboration   context   P
d'acord amb       in accordance with  de acuerdo con      elaboration   context   P
lluny de          far from            lejos de            elaboration   context   P
tan aviat com     as soon as          tan pronto como     elaboration   context   P
de moment         for the moment      por el momento      elaboration   context   A
entre             between             entre               elaboration   context   P
cap a             towards             hacia               elaboration   context   P
fins a            until               hasta               elaboration   context   P
mitjançant        by means of         mediante            elaboration   context   P
segons            following           según               elaboration   context   P
en qualsevol cas  in any case         en cualquier caso   continuation  context   A
aleshores         then                entonces            continuation  context   A
respecte de       with respect to     respecto a          continuation  context   P
en aquest cas     in that case        en ese caso         continuation  context   A
si                if                  si                  --            context   P
sempre que        whenever            siempre que         --            context   P
sens dubte        no doubt            sin duda            --            context   A
alhora            at the same time    a la vez            --            context   A
  • first of all / to begin with: like many discourse markers, these are lexically underspecified with respect to elaboration or continuation; they can reinforce progressive and elaborative relations that are actually signalled by means other than the discourse marker. They are also ambiguous between context and equality. If the marker is part of a sequence, as in example [ex], it signals equality; if not, as in example [ex], it signals context. By default, we ascribe it to context, and only ascribe it to equality if there is enough evidence.

    example
         a Police say that cars are being stolen to be resold in car-starved east European countries. To begin with, thieves went for the likes of Golf GTis and BMWs, but now bread-and-butter cars are also being taken.
         b In Four Saints Thomson's informality was given free reign since he first of all improvised the music at the piano then, when it stuck, wrote it down to a figured bass.

    example:
         a But there never was a threat to a new German-American special relationship, since there never was such a special relationship to begin with.
         b "We are a bit of a way from that. But I certainly believe, first of all, we have to give what help we can," said Mr Hurd.

  • in any case indicates continuation and context. Its effects seem comparable to revision, but it is hard to find what is denied. It seems to have contrastive functions, which are better attributed to the properties of continuation than to any possible revision. It is comparable to topic-based but; in that case, however, there seems to be more correlation with items signalling negative polarity, which seems to support an interpretation as revision. In this respect, it differs from anyway, which always conveys revision.

    example
    In truth, however, humble photocopying has been overtaken by the wonders of the fax and personal computers complete with printers. Whatever the price of these latter (20 times their cost in the West) and reinforced customs procedures for their import, they are finding their way in. The controls in any case are surely doomed to fail.

  • then is characterized by the two least marked meanings in each dimension, which makes it very close to narration.
  • unless: even though it has inherent negative polarity, it conveys context rather than revision, comparable to if or in case.
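
The default rule described above for first of all / to begin with (context unless there is evidence of a sequence) could be sketched as follows. What counts as "enough evidence" is not specified here, so the cue list `SEQUENCE_CUES` and the function `opener_meaning` are purely hypothetical: the sketch simply treats a later sequence cue in the same sentence as sufficient evidence.

```python
import re
from typing import Sequence

# Assumption: these expressions, when they follow the marker, indicate a sequence.
# The list is illustrative and is not taken from the lexicon.
SEQUENCE_CUES = ("secondly", "then", "but now", "lastly", "finally")


def opener_meaning(sentence: str, marker: str,
                   cues: Sequence[str] = SEQUENCE_CUES) -> str:
    """Return "equality" if a sequence cue follows `marker`, else the default "context"."""
    lowered = sentence.lower()
    position = lowered.find(marker.lower())
    if position == -1:
        raise ValueError(f"marker {marker!r} not found in sentence")
    rest = lowered[position + len(marker):]
    for cue in cues:
        # Whole-word match, so that "then" does not fire inside "strengthen".
        if re.search(r"\b" + re.escape(cue) + r"\b", rest):
            return "equality"
    return "context"
```

A real system would probably look beyond the sentence boundary and weigh several kinds of evidence; the sketch only shows the shape of a default with an override.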




Highly polysemic discourse markers


Catalan       Spanish         English        structural                 semantic
com           como            like           elaboration                equality, context, cause
com           como            since          elaboration                cause, context
des de        desde           since          elaboration                cause, context
sobre         sobre           about          continuation, elaboration  context
sobre         sobre           over           continuation, elaboration  context
abans de res  antes que nada  first of all   --                         context (or equality)
per començar  para empezar    to begin with  --                         context (or equality)




Closed class words with very vague meaning


Catalan  Spanish  English
i        y/e      and
ni       ni       nor/neither
o        o/u      or
que      que      that
amb      con      with
sense    sin      without
contra   contra   against
en       en       in
a        a        at/to