Example application of Multiple Sequence Alignment (MSA) to linguistic phenomena
Here we provide an example of how the patterns of occurrence of the Spanish verb "venir" (to come) can be modelled via MSA.
- First, a set of clauses whose main verb is venir is
extracted from a newspaper corpus in Spanish (an approximate English translation is provided).
- El malestar en la policía local se viene arrastrando desde hace meses.
   The unrest in the local police has been dragging on (comes) for months.
- viene de lejos.
   it comes from long ago.
- Los niños vienen de Marruecos.
   Children come from Morocco.
- La mayor parte de los ingresos del Comité vienen de los derechos de televisión.
   Most of the income of the Committee comes from television rights.
- Si en septiembre algún ministro viene a visitar la zona del Besòs,
   If any minister comes to visit the Besos zone in September,
- Los ancianos no vienen a ligar.
   The elderly do not come to flirt.
- viene
   comes
- Sólo después de la paz viene la gloria.
   Only after peace, glory comes.
- Esta costumbre viene del centro de Europa.
   This tradition comes from the center of Europe.
- Then, the clauses are analyzed so that sequences of words are transformed into sequences of chunks. As an example, this is the analysis of the first of the above clauses:
- [ word=El|malestar ] [ pos=sn ] [ lema=malestar ] [ gen=s ] [ num=m ] (The uneasiness)
- [ word=en|la|policía|local ] [ pos=grup-sp ] [ anchor=en ] [ lema=policía ] (in the local police)
- [ word=se ] [ pos=morfema-verbal ] [ lema=se ] (verbal morpheme)
- [ word=viene|arrastrando ] [ pos=grup-verb ] [ lema=arrastrar ] [ gen=s ] [ pers=3 ] (has been dragging on (comes))
- [ word=desde|hace|meses ] [ pos=grup-sp ] [ anchor=desde ] [ lema=mes ] (for months)
- [ word=punt ] [ pos=Fp ] [ lema=punt ] (full stop)
- Sequences of chunks are transformed into sequences of letters, in
the format required by Alphamalig, the MSAligner we employ.
Each chunk is assigned a letter, but the correspondence between the
kinds of chunks and the letters that represent them does not need to
be one-to-one. Different mappings are useful to study
different aspects of the same input. For example, if one wants to
study the distribution of noun phrases, chunks can be translated to
letters as follows:
- V --> verb "venir" (to come)
- N --> noun phrases with a nominal head
- P --> noun phrases with a pronominal head
- A --> noun phrases with a nominal head and an adjective
- B --> noun phrases with a pronominal head and an adjective
- X --> the rest of possible phrases
If, on the contrary, it is the distribution of prepositional phrases
that is to be studied, the following equivalences could be used:
- V --> verb "venir" (to come)
- A --> prepositional phrases beginning with the prepositions a or hacia (to)
- D --> prepositional phrases beginning with the prepositions de or desde (from)
- P --> prepositional phrases beginning with the prepositions por or para (for)
- L --> prepositional phrases beginning with the prepositions ante, bajo, en, entre, sobre or tras (locative or temporal prepositions: before, under, in, between, over, behind/after)
- C --> prepositional phrases beginning with the prepositions con or contra (with or against)
- X --> the rest of possible phrases
In this example, the equivalences used were the following:
- V --> verb "venir" (to come)
- A --> prepositional phrases beginning with the prepositions a or hacia (to)
- D --> prepositional phrases beginning with the prepositions de or desde (from)
- N --> noun phrases with a nominal head
- P --> noun phrases with a pronominal head
- R --> adverbial phrases
- X --> the rest of possible phrases
This resulted in the following transformation of the above clauses:
- NXPVD
- VD
- NVD
- NVD
- CXNVAN
- NVA
- V
- RRVN
- NVD
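The translation from chunks to letters can be sketched in Python as follows. This is only an illustration: the letter scheme is the one listed above, but the classification rules are assumptions (the `head` attribute, the `sadv` tag, the list of forms of venir, the treatment of `morfema-verbal` as P and the dropping of punctuation are hypothetical, inferred at most from the encoding NXPVD of the first clause).

```python
# Letter scheme from the text; the rules that decide which chunk gets which
# letter are assumptions made for illustration, not the authors' procedure.
TO_PREPS = {"a", "hacia"}      # -> A
FROM_PREPS = {"de", "desde"}   # -> D
VENIR_FORMS = {"venir", "viene", "vienen", "vino", "vinieron"}  # hypothetical, incomplete


def chunk_to_letter(chunk):
    """Map one chunk (a dict of its attributes) to a letter, or None to drop it."""
    pos = chunk["pos"]
    words = set(chunk["word"].split("|"))
    if pos == "grup-verb" and VENIR_FORMS & words:
        return "V"
    if pos == "grup-sp":
        if chunk.get("anchor") in TO_PREPS:
            return "A"
        if chunk.get("anchor") in FROM_PREPS:
            return "D"
        return "X"
    if pos == "sn":
        # 'head' is a hypothetical attribute; the analysis shown above has no such field.
        return "P" if chunk.get("head") == "pron" else "N"
    if pos == "morfema-verbal":
        return "P"  # assumption inferred from the encoding NXPVD of the first clause
    if pos == "sadv":  # hypothetical tag for adverbial phrases
        return "R"
    if pos.startswith("F"):  # punctuation (e.g. Fp) is left out: assumption
        return None
    return "X"


def encode(chunks):
    """Concatenate the letters of all chunks that are not dropped."""
    letters = (chunk_to_letter(c) for c in chunks)
    return "".join(letter for letter in letters if letter is not None)


first_clause = [
    {"word": "El|malestar", "pos": "sn", "lema": "malestar"},
    {"word": "en|la|policía|local", "pos": "grup-sp", "anchor": "en", "lema": "policía"},
    {"word": "se", "pos": "morfema-verbal", "lema": "se"},
    {"word": "viene|arrastrando", "pos": "grup-verb", "lema": "arrastrar"},
    {"word": "desde|hace|meses", "pos": "grup-sp", "anchor": "desde", "lema": "mes"},
    {"word": "punt", "pos": "Fp", "lema": "punt"},
]
print(encode(first_clause))  # NXPVD
```

Switching to one of the other letter schemes only requires changing the sets of prepositions and the returned letters.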
- Next, comparable sequences were grouped. The comparability
criterion was length: two sequences were considered to be comparable
when they had a similar length.
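A minimal sketch of such a grouping is given below. The text does not state how similar two lengths must be, so the `tolerance` parameter (maximum difference with the shortest sequence of a group) and the greedy strategy are assumptions.

```python
def group_by_length(sequences, tolerance=1):
    """Greedily group sequences whose length differs by at most `tolerance`
    from the shortest sequence of the group (hypothetical criterion)."""
    groups = []
    for seq in sorted(sequences, key=len):
        if groups and len(seq) - len(groups[-1][0]) <= tolerance:
            groups[-1].append(seq)
        else:
            groups.append([seq])
    return groups


sequences = ["NXPVD", "VD", "NVD", "NVD", "CXNVAN", "NVA", "V", "RRVN", "NVD"]
for group in group_by_length(sequences):
    print(group)
# ['V', 'VD']
# ['NVD', 'NVD', 'NVA', 'NVD', 'RRVN']
# ['NXPVD', 'CXNVAN']
```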
- The following similarity criterion was established to determine the score
assigned to match, mismatch and gap insertion for every symbol of the
chosen alphabet:
8
vnpadrx-

|   | v | n | p | a | d | r | x | - |
|---|---|---|---|---|---|---|---|---|
| v | 10000 | | | | | | | |
| n | -1000 | 1000 | | | | | | |
| p | -1000 | 1000 | 1000 | | | | | |
| a | -1000 | 1 | 1 | 1000 | | | | |
| d | -1000 | 1 | 1 | 1000 | 1000 | | | |
| r | -1000 | 1 | 1 | 1 | 1 | 1 | | |
| x | -1000 | 0 | 0 | 0 | 0 | 0 | 0 | |
| - | -10000 | -1000 | -1000 | -1000 | -1000 | -1000 | -1000 | -10000 |
This criterion maximizes the match score for the verb, so that alignments tend to be built around the verb as an axis. In addition, the two kinds of noun phrases (nominal and pronominal) are considered to be more similar to each other than to prepositional phrases, and vice versa. This favours matches between the two kinds of noun phrases and disfavours matches between less similar kinds of chunks. Gap insertion, expressed in this similarity criterion as the score assigned to the match between each symbol and the gap, is assigned a very low score in order to force correspondences between chunks.
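To make the criterion easier to inspect, the lower-triangular matrix can be stored and queried symmetrically, as in the following sketch. The names `ALPHABET`, `LOWER_TRIANGLE` and `score` are illustrative and are not part of Alphamalig.

```python
ALPHABET = "vnpadrx-"

# Row i holds the scores of symbol i against symbols 0..i, as in the table.
LOWER_TRIANGLE = [
    [10000],
    [-1000, 1000],
    [-1000, 1000, 1000],
    [-1000, 1, 1, 1000],
    [-1000, 1, 1, 1000, 1000],
    [-1000, 1, 1, 1, 1, 1],
    [-1000, 0, 0, 0, 0, 0, 0],
    [-10000, -1000, -1000, -1000, -1000, -1000, -1000, -10000],
]


def score(a, b):
    """Similarity of two symbols; the lookup is symmetric and ignores case."""
    i, j = ALPHABET.index(a.lower()), ALPHABET.index(b.lower())
    if i < j:
        i, j = j, i
    return LOWER_TRIANGLE[i][j]


print(score("V", "V"))  # 10000: matching the verb dominates every other score
print(score("N", "P"))  # 1000: nominal and pronominal noun phrases match well
print(score("N", "D"))  # 1: noun phrase against prepositional phrase
print(score("N", "-"))  # -1000: aligning a chunk with a gap is penalised
```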
- Sets of comparable sequences were aligned, and the following
profile of the alignment was obtained:
--Vx-
- This profile is not very informative, because the gap is the
predominant symbol and it conveys no information. The other two
predominant symbols are the verb and X, the symbol assigned to the
categories that are not under study. These two symbols are not
very informative either. The verb is uninformative because the
high similarity score assigned to it imposes a very strong bias on
its alignment, which therefore becomes quite predictable. X, in turn,
is a very frequent symbol, so the probability that it ends up
in a profile is very high.
- In order to obtain more informative alignments, the X symbol was
removed from the sequences to align, and also from the similarity
criterion. Then, comparable sequences were aligned, and the following
profile of the alignment was obtained:
--nVd-
- This profile is much more informative, as it reflects the fact that, for a given set of clauses containing the verb venir, a recurrent pattern is the occurrence of a noun phrase before the verb and a prepositional phrase introduced by the preposition "de" (from) after the verb.
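How a profile summarises an alignment can be illustrated with the sketch below. The text does not describe how Alphamalig derives its profiles, so the rule used here is an assumption: each column is represented by its most frequent symbol, written in upper case when all rows agree and in lower case otherwise. The alignment itself is hand-built and hypothetical; it places the nine X-free sequences in a single alignment for illustration and is not Alphamalig output.

```python
from collections import Counter


def profile(aligned):
    """Most frequent symbol per column (assumed rule); upper case if unanimous."""
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    out = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        if symbol == "-":
            out.append("-")
        elif count == len(aligned):
            out.append(symbol.upper())
        else:
            out.append(symbol.lower())
    return "".join(out)


# Hypothetical, hand-made alignment of the sequences after removing X.
hypothetical_alignment = [
    "-NPVD-",
    "---VD-",
    "--NVD-",
    "--NVD-",
    "-CNVAN",
    "--NVA-",
    "---V--",
    "-RRVN-",
    "--NVD-",
]
print(profile(hypothetical_alignment))  # --nVd-
```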