Some hints for the task of clustering to find equivalence classes (March 27 and 29)

The purpose of this exercise is for you to get to know the task of clustering: to think about how to transform linguistic objects into vectors, to decide which features to choose as characterizing attributes, to get a flavour of the results of a clustering process, etc.


  • you have to turn the words in a text into vectors that a clustering algorithm can handle. Here you have words2vectors, an example Perl script that turns the 1000 most frequent words in a corpus into vectors, using as attributes only the probability of occurrence of the previous word.
  • if you do not want to cluster words just as they come, try one of the following generalizations:

    • consider lemmas instead of words, not to generalize over the objects being clustered but over their characterizing features.
    • the same with PoS tags instead of words. For a coarser generalization, you can consider only the first n digits of the PoS tag (EAGLES style).
    • the same with combinations of lemma and (some digits of) the PoS tag.

  • you may also consider using other features instead of just the previous word, for example the previous and the following PoS, the word's syntactic head (if you have a syntactic analysis of the corpus), etc.
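
The vector construction described in the list above can be sketched in a few lines of Python. This is not the words2vectors script, only a minimal illustration under some assumptions: the corpus is already a list of (form, lemma, tag) tuples, and the names context_vectors, target and feature, as well as the toy corpus, are made up for this example.

```python
from collections import Counter, defaultdict


def context_vectors(tokens, top_k=1000,
                    target=lambda t: t[0], feature=lambda t: t[0]):
    """Return {object: {feature: P(feature of previous token | object)}}.

    tokens  -- list of (form, lemma, tag) tuples in corpus order (assumed format)
    target  -- what is clustered (default: the word form)
    feature -- how the previous token is described (default: its form)
    """
    # Keep only the top_k most frequent objects, as in the exercise.
    freq = Counter(target(t) for t in tokens)
    keep = {obj for obj, _ in freq.most_common(top_k)}

    # Count, for every kept object, the features of the token before it.
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        if target(cur) in keep:
            counts[target(cur)][feature(prev)] += 1

    # Turn counts into relative frequencies (sparse vectors).
    return {
        obj: {f: n / sum(ctx.values()) for f, n in ctx.items()}
        for obj, ctx in counts.items()
    }


# Toy annotated corpus, for illustration only.
corpus = [
    ("el", "el", "DA0MS0"), ("gato", "gato", "NCMS000"),
    ("come", "comer", "VMIP3S0"), ("la", "el", "DA0FS0"),
    ("sopa", "sopa", "NCFS000"), ("el", "el", "DA0MS0"),
    ("perro", "perro", "NCMS000"), ("come", "comer", "VMIP3S0"),
]

by_word = context_vectors(corpus)                             # previous word form
by_tag = context_vectors(corpus, feature=lambda t: t[2][:2])  # first 2 digits of previous tag

print(by_word["come"])  # {'gato': 0.5, 'perro': 0.5}
print(by_tag["come"])   # {'NC': 1.0}
```

Changing only the feature function (previous form, previous lemma, truncated tag, or a combination) keeps the clustered objects fixed while generalizing over their attributes, which is the point of the generalizations listed above.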

If you want the corpus annotated with the lemma and PoS of each word (this applies to the bigger corpus, since the smaller one is already annotated), you can use FreeLing or else download this automatically annotated version.

If you are using Weka, you can store the qualitative results of clustering in a file by right-clicking on the solutions window (bottom-left corner), choosing the option "Visualize cluster assignments" and then "Save" (remember to keep the box "Store clusters for visualization" checked while you are clustering). The output is an .arff file identical to your input file, with an extra column stating the cluster to which each word has been assigned.
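
Once the assignments are saved, a short script can list the words in each cluster. The sketch below is only an illustration: it assumes a dense .arff file whose values contain no commas, that the word is stored in an attribute called "form" (a hypothetical name), and that the cluster is the last attribute. The functions read_arff and words_by_cluster are made up for this example and are not part of Weka.

```python
from collections import defaultdict


def read_arff(lines):
    """Parse a dense ARFF file into (attribute_names, rows).

    Simplification: values must not contain commas, and attribute
    names must not contain spaces.
    """
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip().strip("'\"") for v in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


def words_by_cluster(names, rows, word_attr="form", cluster_attr=None):
    """Group the values of word_attr by cluster (last attribute by default)."""
    w = names.index(word_attr)
    c = names.index(cluster_attr) if cluster_attr else len(names) - 1
    groups = defaultdict(list)
    for row in rows:
        groups[row[c]].append(row[w])
    return dict(groups)


# Hand-written sample standing in for a saved file (illustrative values).
sample = """\
@relation words_clustered
@attribute form string
@attribute p_prev_the numeric
@attribute Cluster {cluster0,cluster1}
@data
cat,0.5,cluster0
dog,0.4,cluster0
sat,0.0,cluster1
""".splitlines()

names, rows = read_arff(sample)
groups = words_by_cluster(names, rows)
print(groups)  # {'cluster0': ['cat', 'dog'], 'cluster1': ['sat']}
```

For a real file, replace the sample with something like `open("clustered.arff", encoding="utf-8")` (the file name is again just a placeholder) and check the attribute names in its header first.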

To make evaluation easier, it is a good idea to include a "form of the word" attribute in each vector. This attribute should not be used for clustering, but it is useful for getting an idea of the contents of each class. You can also include a "PoS" attribute, which is helpful for carrying out the "Classes to clusters" evaluation of clustering, another way of inspecting the results of a clustering solution (more than a real evaluation). Remember not to use this attribute for clustering either!
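
To get a feeling for what such an inspection looks at, the cross-tabulation of PoS against cluster can also be computed by hand. The following is a simplified majority-label tabulation, not a reimplementation of Weka's evaluation. The function name classes_to_clusters and the toy pairs are assumptions made for this example.

```python
from collections import Counter, defaultdict


def classes_to_clusters(pairs):
    """Cross-tabulate (pos_tag, cluster_id) pairs.

    Returns (table, majority, agreement) where
      table[cluster][pos] -- number of words with that PoS in the cluster
      majority[cluster]   -- most frequent PoS in the cluster
      agreement           -- share of words whose PoS is their cluster's majority PoS
    """
    table = defaultdict(Counter)
    for pos, cluster in pairs:
        table[cluster][pos] += 1

    majority = {c: counts.most_common(1)[0][0] for c, counts in table.items()}
    total = sum(sum(counts.values()) for counts in table.values())
    correct = sum(counts[majority[c]] for c, counts in table.items())
    return table, majority, (correct / total if total else 0.0)


# Toy data: the PoS attribute was kept in the file but not used for clustering.
pairs = [
    ("N", "cluster0"), ("N", "cluster0"), ("V", "cluster0"),
    ("V", "cluster1"), ("V", "cluster1"),
]

table, majority, agreement = classes_to_clusters(pairs)
print(majority)   # {'cluster0': 'N', 'cluster1': 'V'}
print(agreement)  # 0.8
```

A high agreement only says that clusters are dominated by one PoS each; it is a way of looking at the solution rather than a proper evaluation.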

If Weka cannot cluster your datasets because of memory problems, try CLUTO.