Example application of Multiple Sequence Alignment (MSA) to linguistic phenomena

Here we provide an example of how the patterns of occurrence of the Spanish verb "venir" (to come) can be modelled via MSA.

First, a set of clauses whose main verb is venir is extracted from a Spanish newspaper corpus (an approximate English translation follows each clause).

El malestar en la policía local se viene arrastrando desde hace meses.
The uneasiness in the local police is unsolved (comes) since months ago.
viene de lejos.
it comes from long ago.
Los niños vienen de Marruecos.
Children come from Morocco.
La mayor parte de los ingresos del Comité vienen de los derechos de televisión.
Most of the income of the Committee comes from television rights.
Si en septiembre algún ministro viene a visitar la zona del Besòs,
If any minister comes to visit the Besos zone in September,
Los ancianos no vienen a ligar.
The elderly do not come to flirt.
viene
comes
Sólo después de la paz viene la gloria.
Only after peace, glory comes.
Esta costumbre viene del centro de Europa.
This tradition comes from the center of Europe.

Then, the clauses are analyzed so that the sequences of words are transformed into sequences of chunks. For example, the first of the above clauses is analyzed as follows:

[ word=El|malestar ] [ pos=sn ] [ lema=malestar ] [ gen=s ] [ num=m ] (The uneasiness)
[ word=en|la|policía|local ] [ pos=grup-sp ] [ anchor=en ] [ lema=policía ] (in the local police)
[ word=se ] [ pos=morfema-verbal ] [ lema=se ] (verbal morpheme)
[ word=viene|arrastrando ] [ pos=grup-verb ] [ lema=arrastrar ] [ gen=s ] [ pers=3 ] (is unsolved (comes))
[ word=desde|hace|meses ] [ pos=grup-sp ] [ anchor=desde ] [ lema=mes ] (since months ago)
[ word=punt ] [ pos=Fp ] [ lema=punt ] (full stop)

Sequences of chunks are transformed into sequences of letters, in the format required by Alphamalig, the MSAligner we employ. Each chunk is assigned a letter, but there need not be a one-to-one correspondence between the kinds of chunks and the letters that represent them. Different encodings are useful for studying different aspects of the same input. For example, if one wants to study the distribution of noun phrases, chunks can be translated to letters as follows:

V --> verb "venir" (to come)
N --> noun phrases with a nominal head
P --> noun phrases with a pronominal head
A --> noun phrases with a nominal head and an adjective
B --> noun phrases with a pronominal head and an adjective
X --> the rest of possible phrases

If, instead, the distribution of prepositional phrases is to be studied, the following equivalences could be used:

V --> verb "venir" (to come)
A --> prepositional phrases beginning with the prepositions a or hacia (to)
D --> prepositional phrases beginning with the prepositions de or desde (from)
P --> prepositional phrases beginning with the prepositions por or para (for)
L --> prepositional phrases beginning with the prepositions ante, bajo, en, entre, sobre or tras (location or temporal prepositions: before, under, in, between, over, behind/after)
C --> prepositional phrases beginning with the prepositions con or contra (with or against)
X --> the rest of possible phrases

In this example, the equivalences used were the following:

V --> verb "venir" (to come)
A --> prepositional phrases beginning with the prepositions a or hacia (to)
D --> prepositional phrases beginning with the prepositions de or desde (from)
N --> noun phrases with a nominal head
P --> noun phrases with a pronominal head
R --> adverbial phrases
X --> the rest of possible phrases

This resulted in the following transformation of the above clauses:

NXPVD
VD
NVD
NVD
CXNVAN
NVA
V
RRVN
NVD
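
The chunk-to-letter step can be sketched in Python as follows. This is only an illustration, not the authors' tool. It covers the chunk tags visible in the example analysis (sn, grup-sp, grup-verb, morfema-verbal, Fp), and the tags for pronominal noun phrases and adverbial phrases are not shown here, so they are left out. Three assumptions are made: every grup-verb is treated as the verb venir, punctuation is dropped, and the clitic "se" is encoded as P, a reading inferred from the NXPVD result above. The names chunk_to_letter and encode_clause are hypothetical.

```python
# Chunks of the first clause, copied from the analysis shown earlier.
FIRST_CLAUSE = [
    {"word": "El|malestar", "pos": "sn", "lema": "malestar"},
    {"word": "en|la|policía|local", "pos": "grup-sp", "anchor": "en", "lema": "policía"},
    {"word": "se", "pos": "morfema-verbal", "lema": "se"},
    {"word": "viene|arrastrando", "pos": "grup-verb", "lema": "arrastrar"},
    {"word": "desde|hace|meses", "pos": "grup-sp", "anchor": "desde", "lema": "mes"},
    {"word": "punt", "pos": "Fp", "lema": "punt"},
]

PUNCTUATION_TAGS = {"Fp"}  # assumption: punctuation chunks get no letter


def chunk_to_letter(chunk):
    """Map one chunk to a letter of the V/A/D/N/P/R/X encoding (partial sketch)."""
    pos = chunk.get("pos")
    if pos in PUNCTUATION_TAGS:
        return ""
    if pos == "grup-verb":
        return "V"  # assumption: the verb group of these clauses is "venir"
    if pos == "grup-sp":
        anchor = chunk.get("anchor")
        if anchor in {"a", "hacia"}:
            return "A"
        if anchor in {"de", "desde"}:
            return "D"
        return "X"  # other prepositions are not under study
    if pos == "sn":
        return "N"  # nominal head assumed; pronominal heads are not detected
    if pos == "morfema-verbal":
        return "P"  # assumption inferred from the NXPVD result
    return "X"


def encode_clause(chunks):
    """Concatenate the letters of all chunks in a clause."""
    return "".join(chunk_to_letter(chunk) for chunk in chunks)


print(encode_clause(FIRST_CLAUSE))
```

Changing the body of chunk_to_letter is all that is needed to switch to one of the other encodings listed above.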

Next, comparable sequences were grouped. The comparability criterion was length: two sequences were considered to be comparable when they had a similar length.
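
A minimal sketch of such a grouping is given below. The text does not say how similar two lengths must be, so the threshold max_diff is an assumption, as are the greedy strategy and the function name group_by_length.

```python
SEQUENCES = ["NXPVD", "VD", "NVD", "NVD", "CXNVAN", "NVA", "V", "RRVN", "NVD"]


def group_by_length(sequences, max_diff=1):
    """Greedily group sequences whose lengths differ by at most max_diff.

    Sequences are visited from shortest to longest; a sequence joins the
    current group if it is at most max_diff longer than the first (shortest)
    member of that group, otherwise it opens a new group.
    """
    groups = []
    for seq in sorted(sequences, key=len):
        if groups and len(seq) - len(groups[-1][0]) <= max_diff:
            groups[-1].append(seq)
        else:
            groups.append([seq])
    return groups


for group in group_by_length(SEQUENCES):
    print(group)
```

With max_diff=1 the nine sequences fall into three groups (lengths 1 to 2, 3 to 4, and 5 to 6). A different threshold would give a different partition.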

The following similarity criterion was established to determine the score assigned to match, mismatch and gap insertion for every symbol of the chosen alphabet:

8
vnpadrx-

	v	n	p	a	d	r	x	-

v	10000
n	-1000	1000
p	-1000	1000	1000
a	-1000	1	1	1000
d	-1000	1	1	1000	1000
r	-1000	1	1	1	1	1
x	-1000	0	0	0	0	0	0
-	-10000	-1000	-1000	-1000	-1000	-1000	-1000	-10000

This criterion maximizes the match for the verb, so that alignments tend to be established around an axis constituted by the verb. On the other hand, the two kinds of noun phrases (nominal and pronominal) are considered to be more similar to each other than to prepositional phrases, and vice versa. This favours matches between the two kinds of noun phrases and disfavours matches between less similar kinds of chunks. Gap insertion, expressed in this similarity criterion as the score assigned to the match between each symbol and the gap, is assigned a very low score, in order to force correspondences between chunks.
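
To see how these scores shape an alignment, the sketch below reads the lower-triangular matrix into a symmetric lookup table and aligns two sequences with a plain pairwise global alignment (Needleman-Wunsch). This is a stand-in chosen for illustration: it is not Alphamalig's multiple alignment procedure, and the function names are hypothetical. Sequences are written in lower case to match the matrix alphabet.

```python
ALPHABET = "vnpadrx-"
MATRIX_TEXT = """\
v 10000
n -1000 1000
p -1000 1000 1000
a -1000 1 1 1000
d -1000 1 1 1000 1000
r -1000 1 1 1 1 1
x -1000 0 0 0 0 0 0
- -10000 -1000 -1000 -1000 -1000 -1000 -1000 -10000
"""


def parse_lower_triangular(text, alphabet):
    """Return a symmetric {(symbol, symbol): score} table."""
    scores = {}
    for row in text.strip().splitlines():
        symbol, *values = row.split()
        for other, value in zip(alphabet, values):
            scores[(symbol, other)] = scores[(other, symbol)] = int(value)
    return scores


def align_pair(a, b, scores):
    """Pairwise global alignment; a gap costs score(symbol, '-')."""
    best = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        best[i][0] = best[i - 1][0] + scores[(a[i - 1], "-")]
    for j in range(1, len(b) + 1):
        best[0][j] = best[0][j - 1] + scores[("-", b[j - 1])]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + scores[(a[i - 1], b[j - 1])],  # match/mismatch
                best[i - 1][j] + scores[(a[i - 1], "-")],           # gap in b
                best[i][j - 1] + scores[("-", b[j - 1])],           # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j, out_a, out_b = len(a), len(b), [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + scores[(a[i - 1], b[j - 1])]:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and best[i][j] == best[i - 1][j] + scores[(a[i - 1], "-")]:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best[-1][-1]


SCORES = parse_lower_triangular(MATRIX_TEXT, ALPHABET)
print(align_pair("nvd", "vd", SCORES))
```

For "nvd" and "vd" the two verbs end up in the same column and the gap is placed opposite the noun phrase, which is the verb-centred behaviour described above.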

Sets of comparable sequences were aligned, and the following profile of the alignment was obtained:

--Vx-

This profile is not very informative, because the predominant symbol is the gap, which conveys no information. The other two predominant symbols are the verb and X, the symbol assigned to the categories that are not under study. These two symbols are not very informative either. The verb is uninformative because the high similarity score assigned to it imposes a very strong bias on its alignment, which then becomes quite predictable. X, in turn, is a very frequent symbol, so the probability that it is representative of a profile is very high.

In order to obtain more informative alignments, the X symbol was removed from the sequences to align, and also from the similarity criterion. Then, comparable sequences were aligned, and the following profile of the alignment was obtained:

--nVd-

This profile is much more informative, as it reflects the fact that, for a given set of clauses containing the verb venir, a recurrent pattern is the occurrence of a noun phrase before the verb and a prepositional phrase introduced by the preposition "de" (from) after the verb.
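
The effect of removing X can be illustrated with a small sketch. It strips X from the nine encoded sequences and then takes the most frequent symbol in each column of the sequences of length 3, which need no gaps to line up. Reading a profile as a column-wise majority is an assumption made for illustration and may differ from how Alphamalig derives its profiles. The function names are hypothetical.

```python
from collections import Counter

SEQUENCES = ["NXPVD", "VD", "NVD", "NVD", "CXNVAN", "NVA", "V", "RRVN", "NVD"]


def strip_symbol(sequences, symbol="X"):
    """Remove every occurrence of symbol from each sequence."""
    return [seq.replace(symbol, "") for seq in sequences]


def majority_profile(aligned_rows):
    """Most frequent symbol per column of equal-length aligned rows."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_rows)
    )


without_x = strip_symbol(SEQUENCES)
same_length = [seq for seq in without_x if len(seq) == 3]
print(majority_profile(same_length))
```

On this subset the majority symbols are a noun phrase, the verb, and a phrase introduced by "de", in line with the pattern described above.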
