ARABIC MORPHOLOGY ANALYSIS

Background

The Perl skills I developed for corpus compilation, lexical statistics and concordancing eventually allowed me to write my own morphological parser. I had participated in the design of morphological parsers in the early 1990s at Alpnet and at the CRL at New Mexico State University, and I drew on this experience when designing my own system. My goal was to design a system in which the morphotactics and morphophonemic rules were built directly into the lexicon itself. Therefore, instead of compiling numerous lexicons of prefix and suffix morphemes (for which I would have to specify concatenation rules), I compiled a single lexicon of all prefixes, including all valid concatenations, and a corresponding unified lexicon of suffixes. The morphophonemic rules were treated simply as orthographic variation and addressed by means of additional dictionary entries.

Although an Arabic parser with "unvocalized" lexicon entries can be built and would be fully functional, I include short vowels and diacritics in my lexicons because without them the lexicon is extremely difficult to maintain and the analysis output is difficult to interpret.

I was asked by several Prague linguists to add lemma IDs to each entry, and I was encouraged by Mark Liberman of the LDC to build the lemma ID string from the actual citation form and a unique number. Since complying with these requests I have found the lemma ID string to be quite useful, for example when reading the analyses of irregular "Form I" verbs or when quickly locating lexicon entries.

My primary goal in building a morphological parser was lexical tagging or identification (for use in lexicography, especially lemmatization) rather than morphological analysis per se. In my approach I use word stems, rather than root and pattern morphemes, to identify lexical items. (Information on root and pattern morphemes could be added to each stem entry if desired.) Here is the morphology analysis output of the first 30 words in my Aug. 2002 wordlist (30-words.txt - 11k), using a version of the lexicon dated Nov. 1, 2002. (Here's a link to my Transliteration table.)

Arabic Morphology Analysis Algorithm

My Arabic morphology analysis is based on these assumptions:

words are composed of three elements: prefix, stem, and suffix
the prefix can have 0-4 characters
the stem can have 1 or more characters (no upper limit)
the suffix can have 0-6 characters

Given these rules, an Arabic word can be segmented as follows (using wbAlErbyp as an example):

Prefix Stem Suffix
  wbAlErbyp  
  wbAlErby p
  wbAlErb yp
  wbAlEr byp
  wbAlE rbyp
  wbAl Erbyp
  wbA lErbyp
w bAlErbyp  
w bAlErby p
w bAlErb yp
w bAlEr byp
w bAlE rbyp
w bAl Erbyp
w bA lErbyp
wb AlErbyp  
wb AlErby p
wb AlErb yp
wb AlEr byp
wb AlE rbyp
wb Al Erbyp
wb A lErbyp
wbA lErbyp  
wbA lErby p
wbA lErb yp
wbA lEr byp
wbA lE rbyp
wbA l Erbyp
wbAl Erbyp  
wbAl Erby p
wbAl Erb yp
wbAl Er byp
wbAl E rbyp
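
The enumeration above can be sketched in a few lines of Python. This is only an illustration of the length rules stated earlier (prefix 0-4, stem at least 1, suffix 0-6); the function name `segment` and its parameters are my own choices for the sketch, not names from the parser itself.

```python
def segment(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split allowed by the length rules."""
    n = len(word)
    # The stem needs at least one character, so the prefix can use at most n - 1.
    for p in range(min(max_prefix, n - 1) + 1):
        # Likewise the suffix can use at most what is left after prefix and stem.
        for s in range(min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


segmentations = list(segment("wbAlErbyp"))
for prefix, stem, suffix in segmentations:
    print(repr(prefix), repr(stem), repr(suffix))
```

For wbAlErbyp this yields the same 32 rows as the table, in the same order, with the empty prefix or suffix shown as an empty string.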

Arabic dictionary look-up consists of asking, for each segmentation:

does the prefix exist in the lexicon of prefixes?
if so, does the stem exist in the lexicon of stems?
if so, does the suffix exist in the lexicon of suffixes?
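
The three questions map naturally onto three dictionary look-ups that stop at the first miss. The sketch below uses toy Python dictionaries holding a handful of entries quoted in this section; the names `PREFIXES`, `STEMS`, `SUFFIXES` and `lookup` are illustrative, and the null-suffix category `Suff-0` is an assumption made for the example. Each key maps to a list because one look-up string can have several entries.

```python
from itertools import product

# Toy lexicons: look-up string -> list of (vocalized form, morphological category).
PREFIXES = {
    "": [("", "Pref-0")],
    "wbAl": [("wabiAl", "NPref-BiAl")],
}
STEMS = {
    "Erby": [("Earabiy~", "N/ap"), ("Earabiy~", "N-ap")],
    "Erb": [("Earab", "N")],
}
SUFFIXES = {
    "": [("", "Suff-0")],  # hypothetical null-suffix entry, assumed for this sketch
    "p": [("ap", "NSuff-ap")],
}


def lookup(prefix, stem, suffix):
    """Return all (prefix, stem, suffix) entry triples, or [] if any part is missing."""
    if prefix not in PREFIXES:
        return []
    if stem not in STEMS:
        return []
    if suffix not in SUFFIXES:
        return []
    # Several entries may share a look-up string, so return every combination.
    return list(product(PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]))


print(lookup("wbAl", "Erby", "p"))
print(lookup("wbA", "lErby", "p"))
```

With these toy entries, only the segmentation wbAl + Erby + p of wbAlErbyp survives the look-up stage; whether its entries are actually compatible is a separate question.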

Note that the dictionary of prefixes contains not only the individual prefixes (wa-, fa-, li-, Al-, bi-, etc.) but all valid concatenations of these as well (waAl-, biAl-, wabiAl-, etc.). The same applies to the dictionary of suffixes (-ap, -At, -Ani, -athu, -Athum, -Anihi, -tumuwhA, etc.).

Here are some sample entries from the dictionary of prefixes:

wl wali NPref-Li and + for/to <pos>wa/CONJ+li/PREP+</pos>
ll lil NPref-Li to/for + the <pos>li/PREP+Al/DET+</pos>
wll walil NPref-Lil and + to/for + the <pos>wa/CONJ+li/PREP+Al/DET+</pos>
wbAl wabiAl NPref-BiAl and + with/by the <pos>wa/CONJ+bi/PREP+Al/DET+</pos>

The first column contains the actual (unvocalized) string that we look up, whereas the second column has the vocalized version of the same string. The third column has the morphological category (whose function is explained further below). The fourth column has the corresponding English gloss, followed by part-of-speech information for the constituent morphemes enclosed in <pos> tags.
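
As an illustration of this four-column layout, here is a small Python sketch that splits one entry into its parts. It assumes the columns are separated by tab characters (the samples above do not show the separator, and the gloss itself contains spaces); the function name `parse_entry` and the dictionary keys are hypothetical.

```python
import re

# Matches the <pos>...</pos> annotation together with the whitespace around it.
POS_TAG = re.compile(r"\s*<pos>(.*?)</pos>\s*")


def parse_entry(line):
    """Split one tab-separated lexicon line into its four columns."""
    # Raises ValueError if the line has fewer than four columns.
    key, vocalized, category, gloss_field = line.rstrip("\n").split("\t", 3)
    match = POS_TAG.search(gloss_field)
    return {
        "lookup": key,
        "vocalized": vocalized,
        "category": category,
        "gloss": POS_TAG.sub("", gloss_field).strip(),
        "pos": match.group(1) if match else None,  # some entries carry no <pos> tag
    }


entry = parse_entry(
    "wbAl\twabiAl\tNPref-BiAl\tand + with/by the <pos>wa/CONJ+bi/PREP+Al/DET+</pos>"
)
print(entry)
```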

Here are some sample entries from the dictionary of stems (lines beginning with ";; " contain the lemma ID string):

;; Earabiy~_1
Erby Earabiy~ N/ap Arab <pos>Earabiy~/NOUN</pos>
Erb Earab N Arabs <pos>Earab/NOUN</pos>
Erby Earabiy~ N/ap Arab <pos>Earabiy~/ADJ</pos>
Erb Earab N Arab <pos>Earab/ADJ</pos>
;; Earabiy~_2
Erby Earabiy~ N-ap Arabic;Arab <pos>Earabiy~/ADJ</pos>
;; Earabiy~_3
Erby Earabiy~ N0 Arabi <pos>Earabiy~/NOUN_PROP</pos>
;; Earabiy~ap_1
Erby Earabiy~ NapAt Arabic (language) <pos>Earabiy~/NOUN</pos>
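
Because the lemma ID appears on its own ";; " line, a loader has to remember the most recent ID and attach it to each stem entry that follows. The Python sketch below shows one way to do that; it again assumes tab-separated columns, and `load_stems` is a hypothetical name rather than part of the parser.

```python
from collections import defaultdict


def load_stems(lines):
    """Group stem entries by look-up string, tagging each with its lemma ID."""
    stems = defaultdict(list)
    lemma_id = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(";; "):
            # A lemma ID line applies to all entries until the next ";; " line.
            lemma_id = line[3:].strip()
            continue
        if not line.strip():
            continue
        key, vocalized, category, gloss = line.split("\t", 3)
        stems[key].append((lemma_id, vocalized, category, gloss))
    return stems


sample = [
    ";; Earabiy~_1",
    "Erby\tEarabiy~\tN/ap\tArab <pos>Earabiy~/NOUN</pos>",
    "Erb\tEarab\tN\tArabs <pos>Earab/NOUN</pos>",
    ";; Earabiy~_2",
    "Erby\tEarabiy~\tN-ap\tArabic;Arab <pos>Earabiy~/ADJ</pos>",
]
stems = load_stems(sample)
print(stems["Erby"])
```

Keeping a list per look-up string matters here: the unvocalized string Erby belongs to more than one lemma, and every one of them must be offered as a candidate analysis.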

And here are some sample entries from the dictionary of suffixes:

p ap NSuff-ap [fem.sg.] <pos>+ap/NSUFF_FEM_SG</pos>
Ak Aka NSuff-Ah your two <pos>+A/NSUFF_MASC_DU_NOM+ka/POSS_PRON_2MS</pos>
Ak Aki NSuff-Ah your two <pos>+A/NSUFF_MASC_DU_NOM+ki/POSS_PRON_2FS</pos>
tmwny otumuwniy PVSuff-th you [masc.pl.] <verb> me <pos>+tum/PVSUFF_SUBJ:2MP+niy/PVSUFF_DO:1S</pos>

If all three word elements (prefix, stem, suffix) are found in their respective lexicons, we then use their respective morphological categories (the string in column 3) to determine whether they are compatible. We ask:

is the morphological category of the prefix compatible with the morphological category of the stem? (i.e., is the combination pair found in the list of compatible prefix-stem morphological categories?)

if so, is the morphological category of the prefix compatible with the morphological category of the suffix? (i.e., is the combination found in the list of compatible prefix-suffix morphological categories?)

if so, is the morphological category of the stem compatible with the morphological category of the suffix? (i.e., is the combination found in the list of compatible stem-suffix morphological categories?)

If the answer to the last question is “yes” then the morphological analysis is valid.
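
The three checks amount to membership tests against three sets of category pairs. The following Python sketch assumes each compatibility table is a list of lines holding two category names separated by whitespace (the form in which pairs are quoted in the worked example in this section); `load_pairs` and `is_compatible` are hypothetical names, and the tables here contain only those quoted pairs.

```python
def load_pairs(lines):
    """Turn lines such as "Pref-0 PV" into a set of (left, right) category pairs."""
    return {tuple(line.split()) for line in lines if line.strip()}


# Toy tables; the real ones list every permitted combination.
PREFIX_STEM = load_pairs(["Pref-0 PV", "Pref-Wa Ndu"])
PREFIX_SUFFIX = load_pairs(["Pref-0 PVSuff-ah", "Pref-Wa NSuff-h"])
STEM_SUFFIX = load_pairs(["PV PVSuff-ah", "Ndu NSuff-h"])


def is_compatible(prefix_cat, stem_cat, suffix_cat):
    """Apply the three pairwise checks in order, stopping at the first failure."""
    if (prefix_cat, stem_cat) not in PREFIX_STEM:
        return False
    if (prefix_cat, suffix_cat) not in PREFIX_SUFFIX:
        return False
    return (stem_cat, suffix_cat) in STEM_SUFFIX


print(is_compatible("Pref-0", "PV", "PVSuff-ah"))
print(is_compatible("Pref-Wa", "PV", "PVSuff-ah"))
```

Since these toy tables are deliberately incomplete, a False result here says nothing about whether the full tables would reject the same combination.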

Example:

INPUT STRING: وصفه
LOOK-UP WORD: wSfh
SOLUTION 1: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): + describe/characterize + he/it <verb> it/him
SOLUTION 2: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): + prescribe/give a prescription to + he/it <verb> it/him
SOLUTION 3: (waSofh) [waSof_1] waSof/NOUN+hu/POSS_PRON_3MS
(GLOSS): + description/portrayal/characterization + its/his
SOLUTION 4: (waSofh) [waSof_2] waSof/NOUN+hu/POSS_PRON_3MS
(GLOSS): + characteristic + its/his
SOLUTION 5: (waSaf~ahu) [Saf~-u_1] wa/CONJ+Saf~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): and + arrange/classify + he/it <verb> it/him
SOLUTION 6: (waSaf~h) [Saf~_1] wa/CONJ+Saf~/NOUN+hu/POSS_PRON_3MS
(GLOSS): and + line/row/class + its/his

Solution #1 was found to be valid because:

1. All 3 components (null)+wSf+h exist in their respective lexicons (note that there is a literal entry for the null prefix):

(null) (null) Pref-0 (null)
wSf waSaf PV describe;characterize
h ahu PVSuff-ah he/it <verb> it/him <pos>+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS</pos>

2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:

"Pref-0 PV" (listed in the table of compatible prefix-stem morphological categories)
"PV PVSuff-ah" (listed in the table of compatible stem-suffix morphological categories)
"Pref-0 PVSuff-ah" (listed in the table of compatible prefix-suffix morphological categories)

Solution #6 was found to be valid because:

1. All 3 components w+Sf+h exist in their respective lexicons:

w wa Pref-Wa and <pos>wa/CONJ+</pos>
Sf Saf~ Ndu line;row;class
h h NSuff-h its/his <pos>+hu/POSS_PRON_3MS</pos>

2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:

"Pref-Wa Ndu" (listed in the table of compatible prefix-stem morphological categories)
"Ndu NSuff-h" (listed in the table of compatible stem-suffix morphological categories)
"Pref-Wa NSuff-h" (listed in the table of compatible prefix-suffix morphological categories)

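Putting the steps together, the following Python sketch runs segmentation, look-up and compatibility checking over toy data limited to the entries and category pairs quoted in the two explanations above. It is a simplified illustration rather than the parser itself: the name `analyze` is hypothetical, and a solution is reduced to its vocalized form.

```python
# Toy lexicons: look-up string -> list of (vocalized form, morphological category).
PREFIXES = {"": [("", "Pref-0")], "w": [("wa", "Pref-Wa")]}
STEMS = {"wSf": [("waSaf", "PV")], "Sf": [("Saf~", "Ndu")]}
SUFFIXES = {"h": [("ahu", "PVSuff-ah"), ("h", "NSuff-h")]}

# Toy compatibility tables holding only the pairs quoted above.
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def analyze(word, max_prefix=4, max_suffix=6):
    """Return the vocalized form of every valid prefix+stem+suffix analysis."""
    solutions = []
    n = len(word)
    for p in range(min(max_prefix, n - 1) + 1):
        for s in range(min(max_suffix, n - p - 1) + 1):
            prefix, stem, suffix = word[:p], word[p:n - s], word[n - s:]
            # A missing key yields an empty list, so the inner loops are skipped.
            for pre_voc, pre_cat in PREFIXES.get(prefix, []):
                for stem_voc, stem_cat in STEMS.get(stem, []):
                    for suf_voc, suf_cat in SUFFIXES.get(suffix, []):
                        if ((pre_cat, stem_cat) in PREFIX_STEM
                                and (pre_cat, suf_cat) in PREFIX_SUFFIX
                                and (stem_cat, suf_cat) in STEM_SUFFIX):
                            solutions.append(pre_voc + stem_voc + suf_voc)
    return solutions


solutions = analyze("wSfh")
print(solutions)  # ['waSafahu', 'waSaf~h']
```

The two results correspond to the vocalized forms of Solution #1 and Solution #6. The suffix string h has both a verbal and a nominal entry, and it is the compatibility tables, not the look-up, that pair each one with the right stem.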


Copyright © 2002 QAMUS LLC