Example application of Multiple Sequence Alignment (MSA) to linguistic phenomena

Here we provide an example of how the patterns of occurrence of the Spanish verb "venir" (to come) can be modelled via MSA.


First, a set of clauses whose main verb is venir is extracted from a Spanish newspaper corpus (an approximate English translation is provided).


Then the clauses are analyzed, so that the sequences of words are transformed into sequences of chunks. See, for example, the analysis of the first of the above clauses:


Sequences of chunks are transformed into sequences of letters, in the format required by Alphamalig, the multiple sequence aligner we employ. Each chunk is assigned a letter, but there need not be a one-to-one correspondence between the kinds of chunks and the letters that represent them. Different modelizations are useful for studying different aspects of the same input. For example, if one wants to study the distribution of noun phrases, chunks can be translated to letters as follows:

If, on the contrary, it is the distribution of prepositional phrases that is to be studied, the following equivalences could be used:
In this example, the equivalences used were the following:
This resulted in the following transformation of the above sentences:

Next, comparable sequences were grouped. The comparability criterion was length: two sequences were considered to be comparable when they had a similar length.
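
The encoding and grouping steps described so far can be sketched in Python as follows. This is only an illustration under stated assumptions: the chunk labels (`VERB`, `NP`, `PRON`, `PP_de`), the partial letter mapping, and the `width` parameter used to decide what counts as "similar length" are hypothetical and are not taken from the original study.

```python
from collections import defaultdict

# Hypothetical, partial mapping from chunk labels to letters (illustration only).
CHUNK_TO_LETTER = {
    "VERB": "v",
    "NP": "n",      # assumed: nominal noun phrase
    "PRON": "p",    # assumed: pronominal noun phrase
    "PP_de": "d",   # prepositional phrase introduced by "de"
}
OTHER = "x"  # any chunk that is not under study


def encode(chunks):
    """Translate a sequence of chunk labels into a string of letters."""
    return "".join(CHUNK_TO_LETTER.get(chunk, OTHER) for chunk in chunks)


def group_by_length(sequences, width=2):
    """Bucket sequences so that each group spans `width` consecutive lengths.

    The bucket size is an assumed stand-in for the "similar length" criterion.
    """
    groups = defaultdict(list)
    for seq in sequences:
        groups[len(seq) // width].append(seq)
    return dict(groups)
```

Because several chunk labels may map to the same letter, changing `CHUNK_TO_LETTER` is enough to switch between modelizations of the same input.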

The following similarity criterion was established to determine the score assigned to match, mismatch and gap insertion for every symbol of the chosen alphabet:
8
vnpadrx-
vnpadrx-
v  10000
n  -1000   1000
p  -1000   1000   1000
a  -1000      1      1   1000
d  -1000      1      1   1000   1000
r  -1000      1      1      1      1      1
x  -1000      0      0      0      0      0      0
- -10000  -1000  -1000  -1000  -1000  -1000  -1000 -10000

This criterion maximizes the match score for the verb, so that alignments tend to be built around an axis formed by the verb. On the other hand, the two kinds of noun phrases (nominal and pronominal) are considered to be more similar to each other than to prepositional phrases, and vice versa. This favours matches between the two kinds of noun phrases and disfavours matches between less similar kinds of chunks. Gap insertion, expressed in this similarity criterion as the score assigned to the match between each symbol and the gap, is assigned a very low score, in order to force correspondences between chunks.
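
To make these properties easier to check, the criterion can be loaded into a symmetric lookup table. The sketch below assumes that the rows of the criterion form a lower-triangular matrix over the alphabet `vnpadrx-`, with one more entry per row. The function names `load_scores` and `pair_score` are hypothetical, and the column-by-column sum is a simplification for illustration, not a description of how Alphamalig scores alignments internally.

```python
ALPHABET = "vnpadrx-"

# Lower-triangular scores, one row per symbol (assumed reading of the criterion).
LOWER_TRIANGLE = """
10000
-1000 1000
-1000 1000 1000
-1000 1 1 1000
-1000 1 1 1000 1000
-1000 1 1 1 1 1
-1000 0 0 0 0 0 0
-10000 -1000 -1000 -1000 -1000 -1000 -1000 -10000
"""


def load_scores(alphabet, text):
    """Expand a lower-triangular matrix into a symmetric lookup table."""
    scores = {}
    rows = [line.split() for line in text.strip().splitlines()]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            a, b = alphabet[i], alphabet[j]
            scores[a, b] = scores[b, a] = int(value)
    return scores


def pair_score(scores, s, t):
    """Sum the column scores of two equal-length, already aligned sequences."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have the same length")
    return sum(scores[a, b] for a, b in zip(s, t))


SCORES = load_scores(ALPHABET, LOWER_TRIANGLE)
```

Under this reading, aligning the verbs of two sequences (for example `nvd` against `pvd`) scores far higher than any arrangement that places a verb against another symbol or against a gap, which is what makes the verb act as the axis of the alignment.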

Sets of comparable sequences were aligned, and the following profile of the alignment was obtained:

--Vx-
This profile is not very informative, because the predominant symbol is the gap, which conveys no information. The other two predominant symbols are the verb and X, the symbol assigned to the categories that are not under study. These two symbols are also not very informative. The verb is uninformative because the high similarity score assigned to it strongly biases its alignment, which then becomes quite predictable. X, in turn, is a very frequent symbol, so the probability that it is representative of a profile is very high.
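
One simple way to read a profile of this kind is as the most frequent symbol in each column of the alignment. The sketch below assumes that interpretation. The function name `profile` and the toy alignment in the usage note are hypothetical and do not reproduce the data of the study.

```python
from collections import Counter


def profile(alignment):
    """Return the most frequent symbol of each column of an alignment.

    Ties are resolved in favour of the symbol that appears first in the column.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*alignment):
        symbol, _count = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)
```

For a toy alignment such as `["-nvd-", "--vx-", "x-vx-"]`, gaps and the frequent `x` symbol dominate most columns, which illustrates why a profile of this kind says little about the chunks under study.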


In order to obtain more informative alignments, the X symbol was removed from the sequences to be aligned, and also from the similarity criterion. Then, comparable sequences were aligned, and the following profile of the alignment was obtained:

--nVd-
This profile is much more informative, as it reflects the fact that, for a given set of clauses containing the verb venir, a recurrent pattern is the occurrence of a noun phrase before the verb and a prepositional phrase introduced by the preposition "de" (from) after the verb.
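
The removal step that precedes this second alignment can be sketched as follows. It assumes that sequences are plain strings and that the similarity criterion is held as a dictionary keyed by pairs of symbols. Both representations, and the function name `drop_symbol`, are hypothetical.

```python
def drop_symbol(sequences, scores, symbol="x"):
    """Remove `symbol` from every sequence and from the score table."""
    reduced_sequences = [seq.replace(symbol, "") for seq in sequences]
    reduced_scores = {
        pair: value for pair, value in scores.items() if symbol not in pair
    }
    return reduced_sequences, reduced_scores
```

Since removing a symbol shortens the sequences, the groups of comparable sequences would need to be recomputed before aligning again.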