ARABIC MORPHOLOGY ANALYSIS
Background
The Perl skills I developed for corpus compilation, lexical statistics and concordancing eventually allowed me to write my own morphological parser. I had participated in the design of morphological parsers in the early 1990s at Alpnet and the CRL at New Mexico State University, and I used this experience when designing my own system. My goal was to design a system in which the morphotactics and morphophonemic rules were built directly into the lexicon itself. Therefore, instead of compiling numerous lexicons of prefix and suffix morphemes (for which I would have to specify concatenation rules), I compiled a single lexicon of all prefixes, including all valid concatenations, and I also compiled a corresponding unified lexicon for suffixes. The morphophonemic rules were treated simply as orthographic variation and addressed by means of additional dictionary entries. Although an Arabic parser with "unvocalized" lexicon entries can be built and would be fully functional, I include short vowels and diacritics in my lexicons because without them the lexicon is extremely difficult to maintain and the analysis output is also difficult to interpret. I was asked by several Prague linguists to add lemma IDs to each entry, and I was encouraged by Mark Liberman of the LDC to build the lemma ID string by using the actual citation form and a unique number. Since complying with these requests I have found the lemma ID string to be quite useful, for example when reading the analyses of irregular "Form I" verbs, or for quickly locating lexicon entries.
My primary goal in building a morphological parser was lexical tagging or identification (for use in lexicography, especially lemmatization) rather than morphological analysis per se. In my approach I use word stems, rather than root and pattern morphemes, to identify lexical items. (The information on root and pattern morphemes could be added to each stem entry if this were desired.) Here is the morphology analysis output of the first 30 words in my Aug. 2002 wordlist (30-words.txt - 11k), using a version of the lexicon dated Nov. 1, 2002. (Here's a link to my Transliteration table.)
Arabic Morphology Analysis Algorithm
My Arabic morphology analysis is based on these assumptions:
words are composed of three elements: prefix, stem, and suffix
the prefix can have 0-4 characters
the stem can have 1 or more characters
the suffix can have 0-6 characters
Given these rules, an Arabic word can be segmented as follows (using wbAlErbyp as an example):
Prefix | Stem | Suffix |
(null) | wbAlErbyp | (null) |
(null) | wbAlErby | p |
(null) | wbAlErb | yp |
(null) | wbAlEr | byp |
(null) | wbAlE | rbyp |
(null) | wbAl | Erbyp |
(null) | wbA | lErbyp |
w | bAlErbyp | (null) |
w | bAlErby | p |
w | bAlErb | yp |
w | bAlEr | byp |
w | bAlE | rbyp |
w | bAl | Erbyp |
w | bA | lErbyp |
wb | AlErbyp | (null) |
wb | AlErby | p |
wb | AlErb | yp |
wb | AlEr | byp |
wb | AlE | rbyp |
wb | Al | Erbyp |
wb | A | lErbyp |
wbA | lErbyp | (null) |
wbA | lErby | p |
wbA | lErb | yp |
wbA | lEr | byp |
wbA | lE | rbyp |
wbA | l | Erbyp |
wbAl | Erbyp | (null) |
wbAl | Erby | p |
wbAl | Erb | yp |
wbAl | Er | byp |
wbAl | E | rbyp |
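The enumeration above can be sketched in a few lines of Python. This is only an illustration of the three length rules, not the original Perl code; the function name `segmentations` and its parameter names are made up for this example.

```python
def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split allowed by the length rules."""
    n = len(word)
    # The prefix may take 0-4 characters but must leave at least one for the stem.
    for p in range(min(max_prefix, n - 1) + 1):
        # The suffix may take 0-6 characters, again leaving at least one for the stem.
        for s in range(min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


splits = list(segmentations("wbAlErbyp"))
print(len(splits))  # 32, one per row of the table
```

The splits come out in the same order as the table rows: all suffix lengths for the empty prefix first, then for w, wb, wbA and wbAl.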
Arabic dictionary look-up consists of asking, for each segmentation:
does the prefix exist in the lexicon of prefixes?
if so, does the stem exist in the lexicon of stems?
if so, does the suffix exist in the lexicon of suffixes?
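As a rough sketch, the three questions amount to three dictionary look-ups that stop at the first miss. The miniature lexicons below are hypothetical stand-ins that reuse a few strings and category names from this page; the real lexicons are far larger, and the name `look_up` is an assumption made for this example.

```python
# Hypothetical miniature lexicons: look-up string -> list of morphological categories.
PREFIXES = {"": ["Pref-0"], "w": ["Pref-Wa"], "wbAl": ["NPref-BiAl"]}
STEMS = {"Erby": ["N/ap", "N-ap", "N0", "NapAt"], "Erb": ["N"]}
SUFFIXES = {"p": ["NSuff-ap"]}


def look_up(prefix, stem, suffix):
    """Return the entries found for each element, or None at the first miss."""
    prefix_entries = PREFIXES.get(prefix)
    if not prefix_entries:
        return None  # unknown prefix: the stem lexicon is never consulted
    stem_entries = STEMS.get(stem)
    if not stem_entries:
        return None  # unknown stem: the suffix lexicon is never consulted
    suffix_entries = SUFFIXES.get(suffix)
    if not suffix_entries:
        return None
    return prefix_entries, stem_entries, suffix_entries


print(look_up("wbAl", "Erby", "p") is not None)  # True
print(look_up("wb", "AlErby", "p"))              # None
```

Each value is a list because one look-up string can have several entries, as the stem Erby does in the sample stem entries.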
Note that the dictionary of prefixes contains not only the individual prefixes (wa-, fa-, li-, Al-, bi-, etc.) but all valid concatenations of these as well (waAl-, biAl-, wabiAl-, etc.). The same applies to the dictionary of suffixes (-ap, -At, -Ani, -athu, -Athum, -Anihi, -tumuwhA, etc.).
Here are some sample entries from the dictionary of prefixes:
wl | wali | NPref-Li | and + for/to <pos>wa/CONJ+li/PREP+</pos> |
ll | lil | NPref-Li | to/for + the <pos>li/PREP+Al/DET+</pos> |
wll | walil | NPref-Lil | and + to/for + the <pos>wa/CONJ+li/PREP+Al/DET+</pos> |
wbAl | wabiAl | NPref-BiAl | and + with/by the <pos>wa/CONJ+bi/PREP+Al/DET+</pos> |
The first column contains the actual string that we look up, whereas the second column has the vocalized version of the same string. The third column has the morphological category (whose function is explained further below). The fourth column has the corresponding English glosses and contains part-of-speech information for the constituent morphemes.
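To make the four-column layout concrete, here is a minimal sketch that splits one entry into its fields and separates the English gloss from the part-of-speech string. It assumes the pipe-delimited form shown on this page (the delimiter in the actual lexicon files may differ), and the names `Entry` and `parse_entry` are invented for the example.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    lookup: str     # column 1: the string that is looked up
    vocalized: str  # column 2: the vocalized form
    category: str   # column 3: the morphological category
    gloss: str      # column 4: the English gloss, without the <pos> part
    pos: str        # column 4: the contents of <pos>...</pos>, if any


POS_RE = re.compile(r"<pos>(.*?)</pos>")


def parse_entry(line):
    """Split one pipe-delimited lexicon line into an Entry."""
    lookup, vocalized, category, gloss_pos = [f.strip() for f in line.split("|")[:4]]
    match = POS_RE.search(gloss_pos)
    pos = match.group(1) if match else ""
    gloss = POS_RE.sub("", gloss_pos).strip()
    return Entry(lookup, vocalized, category, gloss, pos)


entry = parse_entry(
    "wbAl | wabiAl | NPref-BiAl | and + with/by the <pos>wa/CONJ+bi/PREP+Al/DET+</pos> |"
)
print(entry.category)  # NPref-BiAl
print(entry.pos)       # wa/CONJ+bi/PREP+Al/DET+
```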
Here are some sample entries from the dictionary of stems (lines beginning with ";; " contain the lemma ID string):
;; Earabiy~_1
Erby | Earabiy~ | N/ap | Arab <pos>Earabiy~/NOUN</pos> |
Erb | Earab | N | Arabs <pos>Earab/NOUN</pos> |
Erby | Earabiy~ | N/ap | Arab <pos>Earabiy~/ADJ</pos> |
Erb | Earab | N | Arab <pos>Earab/ADJ</pos> |
;; Earabiy~_2
Erby | Earabiy~ | N-ap | Arabic;Arab <pos>Earabiy~/ADJ</pos> |
;; Earabiy~_3
Erby | Earabiy~ | N0 | Arabi <pos>Earabiy~/NOUN_PROP</pos> |
;; Earabiy~ap_1
Erby | Earabiy~ | NapAt | Arabic (language) <pos>Earabiy~/NOUN</pos> |
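One way to read such a stem list is to remember the most recent ";; " line and attach its lemma ID to every entry that follows. The sketch below does this for a shortened copy of the sample; the function name `load_stems` and the tuple layout are assumptions made for illustration, not a description of the original code.

```python
from collections import defaultdict

SAMPLE = """\
;; Earabiy~_1
Erby | Earabiy~ | N/ap | Arab <pos>Earabiy~/NOUN</pos> |
Erb | Earab | N | Arabs <pos>Earab/NOUN</pos> |
;; Earabiy~_2
Erby | Earabiy~ | N-ap | Arabic;Arab <pos>Earabiy~/ADJ</pos> |
"""


def load_stems(text):
    """Map each look-up string to its (lemma_id, vocalized, category, gloss) entries."""
    stems = defaultdict(list)
    lemma_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(";; "):
            # A lemma ID line applies to all entries until the next one.
            lemma_id = line[3:].split("|")[0].strip()
            continue
        lookup, vocalized, category, gloss = [f.strip() for f in line.split("|")[:4]]
        stems[lookup].append((lemma_id, vocalized, category, gloss))
    return stems


stems = load_stems(SAMPLE)
print(len(stems["Erby"]))   # 2
print(stems["Erby"][1][0])  # Earabiy~_2
```

Keeping a list per look-up string matters here because the same unvocalized stem (Erby) belongs to several lemmas with different morphological categories.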
And here are some sample entries from the dictionary of suffixes:
p | ap | NSuff-ap | [fem.sg.] <pos>+ap/NSUFF_FEM_SG</pos> |
Ak | Aka | NSuff-Ah | your two <pos>+A/NSUFF_MASC_DU_NOM+ka/POSS_PRON_2MS</pos> |
Ak | Aki | NSuff-Ah | your two <pos>+A/NSUFF_MASC_DU_NOM+ki/POSS_PRON_2FS</pos> |
tmwny | otumuwniy | PVSuff-th | you [masc.pl.] <verb> me <pos>+tum/PVSUFF_SUBJ:2MP+niy/PVSUFF_DO:1S</pos> |
If all three word elements (prefix, stem, suffix) are found in their respective lexicons, we then use their respective morphological categories (the string in column 3) to determine whether they are compatible. We ask:
is the morphological category of the prefix compatible with the morphological category of the stem? (i.e., is the combination pair found in the list of compatible prefix-stem morphological categories?)
if so, is the morphological category of the prefix compatible with the morphological category of the suffix? (i.e., is the combination found in the list of compatible prefix-suffix morphological categories?)
if so, is the morphological category of the stem compatible with the morphological category of the suffix? (i.e., is the combination found in the list of compatible stem-suffix morphological categories?)
If the answer to all three questions is yes, then the morphological analysis is valid.
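The three compatibility questions can be sketched as membership tests against three sets of category pairs. The pairs below are the ones quoted in the worked example further down; the real tables are much larger, and the names `PREFIX_STEM`, `PREFIX_SUFFIX`, `STEM_SUFFIX` and `is_compatible` are invented for this illustration.

```python
# Category pairs taken from the worked example on this page (a tiny subset).
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def is_compatible(prefix_cat, stem_cat, suffix_cat):
    """Apply the three checks in order; `and` stops at the first failure."""
    return (
        (prefix_cat, stem_cat) in PREFIX_STEM
        and (prefix_cat, suffix_cat) in PREFIX_SUFFIX
        and (stem_cat, suffix_cat) in STEM_SUFFIX
    )


print(is_compatible("Pref-0", "PV", "PVSuff-ah"))  # True
print(is_compatible("Pref-Wa", "PV", "NSuff-h"))   # False
```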
Example:
INPUT STRING: وصفه
LOOK-UP WORD: wSfh
SOLUTION 1: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): + describe/characterize + he/it <verb> it/him
SOLUTION 2: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): + prescribe/give a prescription to + he/it <verb> it/him
SOLUTION 3: (waSofh) [waSof_1] waSof/NOUN+hu/POSS_PRON_3MS
(GLOSS): + description/portrayal/characterization + its/his
SOLUTION 4: (waSofh) [waSof_2] waSof/NOUN+hu/POSS_PRON_3MS
(GLOSS): + characteristic + its/his
SOLUTION 5: (waSaf~ahu) [Saf~-u_1] wa/CONJ+Saf~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): and + arrange/classify + he/it <verb> it/him
SOLUTION 6: (waSaf~h) [Saf~_1] wa/CONJ+Saf~/NOUN+hu/POSS_PRON_3MS
(GLOSS): and + line/row/class + its/his
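The whole procedure (segment, look up, check compatibility) can be sketched end to end. The toy lexicons and tables below contain only the entries and category pairs cited for Solutions #1 and #6 in the explanation that follows, so the sketch finds just those two vocalizations rather than all six; every name in it is an assumption made for the example.

```python
from itertools import product

# Look-up string -> list of (vocalized form, morphological category).
PREFIXES = {"": [("", "Pref-0")], "w": [("wa", "Pref-Wa")]}
STEMS = {"wSf": [("waSaf", "PV")], "Sf": [("Saf~", "Ndu")]}
SUFFIXES = {"h": [("ahu", "PVSuff-ah"), ("h", "NSuff-h")]}

PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def analyze(word):
    """Return the vocalized form of every valid prefix+stem+suffix analysis."""
    solutions = []
    n = len(word)
    for p in range(min(4, n - 1) + 1):          # prefix: 0-4 characters
        for s in range(min(6, n - p - 1) + 1):  # suffix: 0-6 characters
            pre, stem, suf = word[:p], word[p:n - s], word[n - s:]
            # An element missing from its lexicon yields no candidates at all.
            candidates = product(
                PREFIXES.get(pre, []), STEMS.get(stem, []), SUFFIXES.get(suf, [])
            )
            for (pre_voc, pre_cat), (stem_voc, stem_cat), (suf_voc, suf_cat) in candidates:
                if (
                    (pre_cat, stem_cat) in PREFIX_STEM
                    and (pre_cat, suf_cat) in PREFIX_SUFFIX
                    and (stem_cat, suf_cat) in STEM_SUFFIX
                ):
                    solutions.append(pre_voc + stem_voc + suf_voc)
    return solutions


print(analyze("wSfh"))  # ['waSafahu', 'waSaf~h']
```

Note that the look-up string h has two suffix entries; the compatibility tables are what pair the verbal suffix with the verb stem and the nominal suffix with the noun stem.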
Solution #1 was found to be valid because:
1. All 3 components, (null)+wSf+h, exist in their respective lexicons (note that there is a literal entry for the null prefix):
(null) | (null) | Pref-0 | (null) |
wSf | waSaf | PV | describe;characterize |
h | ahu | PVSuff-ah | he/it <verb> it/him <pos>+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS</pos> |
2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:
"Pref-0 PV" (listed in the table of compatible prefix-stem morphological categories)
"PV PVSuff-ah" (listed in the table of compatible stem-suffix morphological categories)
"Pref-0 PVSuff-ah" (listed in the table of compatible prefix-suffix morphological categories)
Solution #6 was found to be valid because:
1. All 3 components, w+Sf+h, exist in their respective lexicons:
w | wa | Pref-Wa | and <pos>wa/CONJ+</pos> |
Sf | Saf~ | Ndu | line;row;class |
h | h | NSuff-h | its/his <pos>+hu/POSS_PRON_3MS</pos> |
2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:
"Pref-Wa Ndu" (listed in the table of compatible prefix-stem morphological categories)
"Ndu NSuff-h" (listed in the table of compatible stem-suffix morphological categories)
"Pref-Wa NSuff-h" (listed in the table of compatible prefix-suffix morphological categories)
Copyright © 2002 QAMUS LLC