ARABIC WORD FREQUENCY COUNTS

When tokenizing a text for purposes of generating a word frequency count, I define an Arabic word as:

one or more consecutive Arabic characters
[\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED]
including Persian characters
[\x81\x8D\x8E\x90]
short vowels and diacritics
[\xF0-\xF3\xF5\xF6\xF8\xFA]
and the lengthening character
[\xDC]

Note: All hex values are those of the Arabic Windows (1256) code page.

When tokenizing Arabic input it's a good idea to make a preliminary pass to detect and fix punctuation anomalies, such as the Arabic character ra' (\xD1) used as a numeric comma or "decimal separator" (U+066B), and the Arabic lengthening character (\xDC) used as an em dash or numeric hyphen. Numbers are sometimes encoded visually instead of logically, and the digit zero occasionally functions as a period (full stop).
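Part of such a preliminary pass might be sketched as follows in Python. This is only an illustration, not the script used for the counts on this page: the function name fix_punctuation is hypothetical, the rule that a ra' or a lengthening character flanked by digits is really punctuation is an assumption, and so is the choice of "." and "-" as replacements. Visually encoded numbers, the em dash use of the lengthening character, and zero used as a full stop are not handled.

```python
import re

# Byte values are those of the Arabic Windows (1256) code page.
# ra' (\xD1) flanked by ASCII digits is assumed to be a decimal separator.
RA_AS_SEPARATOR = re.compile(rb"(?<=[0-9])\xD1(?=[0-9])")
# The lengthening character (\xDC) flanked by digits is assumed to be a numeric hyphen.
TATWEEL_AS_HYPHEN = re.compile(rb"(?<=[0-9])\xDC(?=[0-9])")


def fix_punctuation(data: bytes) -> bytes:
    """Replace digit-flanked ra' and lengthening characters with ASCII punctuation."""
    data = RA_AS_SEPARATOR.sub(b".", data)    # "." is an assumed replacement
    data = TATWEEL_AS_HYPHEN.sub(b"-", data)  # "-" is an assumed replacement
    return data
```

Because both rules require a digit on each side, an ordinary word containing ra' or the lengthening character passes through unchanged.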

After tokenizing according to the above criteria, I remove all short vowels, diacritics, and the lengthening character, and count the remainder as a word. Null strings are discarded.
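The tokenizing and counting steps can be sketched in Python as shown here. The byte classes are copied from the definition above; the names TOKEN, STRIP and count_words are hypothetical, and the code is a minimal illustration under those assumptions rather than the program that produced the figures on this page.

```python
import re
from collections import Counter

# Character classes from the definition above (Arabic Windows 1256 byte values).
LETTERS = rb"\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED"
PERSIAN = rb"\x81\x8D\x8E\x90"
MARKS = rb"\xF0-\xF3\xF5\xF6\xF8\xFA"  # short vowels and diacritics
TATWEEL = rb"\xDC"                      # lengthening character

# A token is one or more consecutive bytes from any of the four classes.
TOKEN = re.compile(b"[" + LETTERS + PERSIAN + MARKS + TATWEEL + b"]+")
# Bytes that are removed from each token before it is counted.
STRIP = re.compile(b"[" + MARKS + TATWEEL + b"]")


def count_words(data: bytes) -> Counter:
    """Count words in Windows-1256 encoded text; keys are decoded to str."""
    counts = Counter()
    for match in TOKEN.finditer(data):
        word = STRIP.sub(b"", match.group())
        if word:  # null strings are discarded
            counts[word.decode("cp1256")] += 1
    return counts
```

A token made up only of lengthening characters or diacritics becomes an empty string after stripping, which is why the emptiness check is needed before counting.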

Here are the types/tokens figures from the last three word frequencies I have generated:

date types tokens
Feb. 1999 1,359,309 167,216,930
Feb. 2001 2,578,709 589,184,483
Aug. 2002 3,509,499 1,141,563,654

The table below shows the top 30 words and their frequencies from my three frequency counts. Note that starting with my Feb. 2001 count I added a "file count" figure and began to use it—instead of the frequency count—as the primary sort key. On Nov. 13, 2002 I wrote a script to get the Google frequency: it's an interesting statistic, especially if you look at the web as the Mother of All Corpora.

Columns, left to right: Feb. 1999 (rank, word, frequency); Feb. 2001 (rank, word, frequency, file count); Aug. 2002 (rank, word, frequency, file count); Nov. 2002 (Google)
1 في 5,645,218 1 من 13,624,732 1,144,319 1 من 26,533,543 2,511,236 6,600,000
2 من 3,871,153 2 في 18,817,693 1,128,546 2 في 36,615,810 2,422,564 7,540,000
3 على 2,310,879 3 على 7,508,546 915,406 3 على 14,173,880 1,996,755 4,320,000
4 ان 2,219,600 4 عن 3,669,877 776,762 4 عن 6,717,942 1,684,706 2,660,000
5 الى 1,516,247 5 ان 6,518,011 748,416 5 ان 12,214,896 1,637,296 1,450,000
6 التي 1,072,702 6 التي 3,530,111 733,062 6 التي 6,963,708 1,625,725 1,710,000
7 عن 933,872 7 الى 4,228,886 666,316 7 الى 7,861,905 1,424,032 1,200,000
8 الذي 727,170 8 الذي 2,441,785 645,854 8 الذي 4,712,703 1,406,014 1,370,000
9 مع 673,928 9 مع 2,307,276 618,865 9 مع 4,597,178 1,401,634 1,500,000
10 هذا 664,751 10 هذا 2,262,268 562,428 10 هذا 4,234,060 1,216,084 1,730,000
11 هذه 621,972 11 هذه 2,056,768 533,077 11 هذه 3,936,525 1,167,126 1,440,000
12 ما 614,348 12 ما 2,134,871 516,655 12 بعد 2,728,220 1,119,473 1,240,000
13 لا 596,737 13 بين 1,557,175 498,405 13 بين 3,016,512 1,103,814 1,230,000
14 بين 471,859 14 بعد 1,355,002 498,352 14 ما 3,981,028 1,102,480 2,100,000
15 أن 444,508 15 لا 2,053,785 448,718 15 خلال 2,285,299 983,953 624,000
16 ذلك 390,446 16 خلال 1,128,034 433,457 16 كما 2,236,668 952,957 947,000
17 كان 385,909 17 كما 1,148,968 429,963 17 لا 3,780,512 949,315 2,000,000
18 بعد 383,454 18 كل 1,264,232 415,445 18 أن 4,960,500 944,445 2,320,000
19 كل 372,917 19 ذلك 1,336,611 414,018 19 كل 2,373,084 876,683 1,100,000
20 لم 347,762 20 أن 2,222,026 395,224 20 ذلك 2,432,523 874,247 1,190,000
21 إلى 336,817 21 كان 1,316,091 392,905 21 كان 2,417,022 828,361 1,170,000
22 بن 330,130 22 لم 1,165,657 378,694 22 لم 2,211,351 817,968 1,060,000
23 كما 316,837 23 حيث 886,973 349,694 23 وقال 1,998,242 805,975 531,000
24 او 300,602 24 العام 904,412 348,480 24 قد 1,750,386 795,319 828,000
25 هو 299,244 25 قبل 774,256 348,359 25 رئيس 1,850,654 781,413 507,000
26 خلال 297,653 26 رئيس 942,069 346,054 26 العام 1,847,613 781,323 579,000
27 العربية 289,300 27 وفي 921,810 344,874 27 حيث 1,756,690 777,851 628,000
28 انه 269,280 28 وقال 992,961 344,420 28 قبل 1,537,048 767,380 790,000
29 العام 268,549 29 قد 879,497 343,512 29 اليوم 1,426,832 765,200 840,000
30 الله 267,092 30 وقد 863,727 343,123 30 وفي 1,675,304 713,037 652,000
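The effect of sorting on file count rather than frequency shows up in the first two Aug. 2002 rows, where من outranks في despite its lower frequency. A small Python sketch of the two orderings follows, using five rows from the Aug. 2002 columns. The function names are hypothetical, and using frequency as the tie-breaking secondary key is an assumption, since the text only names the primary key.

```python
# (word, frequency, file count) taken from the Aug. 2002 columns of the table.
ROWS = [
    ("في", 36615810, 2422564),
    ("من", 26533543, 2511236),
    ("على", 14173880, 1996755),
    ("ما", 3981028, 1102480),
    ("بعد", 2728220, 1119473),
]


def rank_by_file_count(rows):
    """Primary key: file count, descending. Secondary key (assumed): frequency."""
    return sorted(rows, key=lambda row: (-row[2], -row[1]))


def rank_by_frequency(rows):
    """The ordering used for the Feb. 1999 count: frequency, descending."""
    return sorted(rows, key=lambda row: -row[1])
```

With these rows, rank_by_file_count puts من before في and بعد before ما, matching their relative order in the Aug. 2002 columns, while rank_by_frequency reverses both pairs.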

The above is not a lemmatized list. Although some word forms are easily merged (e.g., الى ‎=‎ إلى), most word forms require contextual analysis to be disambiguated (e.g. ان ‎=‎ أَنْ or أَنَّ or إِنْ or إِنَّ or آن).

Before retrieving citations from my corpus I find it useful to go to the wordlist first and use it to test the regular expression that I will later use for searching the corpus. The wordlist also allows me to see the total number of hits that I will get when generating a concordance and thus anticipate the file size of the concordance. Some word forms are unambiguous and lend themselves to fairly simple regular expressions when searching for them in the wordlist. For example, the regular expression /[A><]stqlAb/ produced:

30 forms (Total Freq: 570 = 1 every 2,002,743 words) (Here is a page with my Transliteration).

word rank freq filecnt
الاستقلاب 203,353 156 118
الاستقلابية 236,180 115 88
استقلاب 247,925 103 80
والاستقلاب 338,618 46 43
استقلابية 402,948 33 30
الاستقلابي 527,621 24 17
والاستقلابية 600,354 18 13
واستقلاب 670,143 11 11
لاستقلاب 738,984 9 9
باستقلاب 878,371 7 6
استقلابي 888,965 6 6
الإستقلاب 950,488 7 5
استقلابا 1,257,608 3 3
استقلابه 1,257,609 3 3
استقلابها 1,257,610 3 3
الأستقلاب 1,260,732 3 3
الأستقلابية 1,260,733 3 3
إستقلاب 1,514,774 2 2
استقلابات 1,526,307 2 2
استقلابيا 1,526,308 2 2
الاستقلابات 1,537,658 2 2
بالاستقلاب 1,596,397 2 2
كاستقلاب 1,724,178 2 2
للاستقلاب 1,763,413 2 2
الإستقلابية 2,135,762 1 1
الاستقلابنقص 2,147,334 1 1
وأمراضالاستقلاب 3,166,051 1 1
واستقلابا 3,185,260 1 1
واستقلابية 3,185,261 1 1
والاستقلابات 3,200,045 1 1
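A wordlist search of this kind can be sketched in Python. The sketch assumes that the transliteration is the widely published Buckwalter letter mapping (the transliteration page remains the authority), that the wordlist is held as (word, frequency) pairs, and that the "1 every N words" figure is the total token count divided by the total frequency and truncated. All function names are hypothetical.

```python
import re

# Assumed mapping: U+0621..U+063A and U+0640..U+064A to Buckwalter letters.
LATIN = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "_fqklmnhwYy"
CODEPOINTS = list(range(0x0621, 0x063B)) + list(range(0x0640, 0x064B))
TRANSLIT = dict(zip(CODEPOINTS, LATIN))

TOTAL_TOKENS = 1141563654  # Aug. 2002 token count from the table above


def transliterate(word: str) -> str:
    """Convert Arabic letters to their assumed Latin equivalents."""
    return word.translate(TRANSLIT)


def one_every(total_tokens: int, freq: int) -> int:
    """Number of running words per occurrence, truncated (assumed rounding)."""
    return total_tokens // freq


def search_wordlist(pattern: str, wordlist):
    """Return the matching (word, freq) pairs and their summed frequency."""
    regex = re.compile(pattern)
    hits = [(w, f) for w, f in wordlist if regex.search(transliterate(w))]
    return hits, sum(f for _, f in hits)


# Four entries from the list above plus one unrelated word from the top-30 table.
SAMPLE = [
    ("الاستقلاب", 156),
    ("استقلاب", 103),
    ("الإستقلاب", 7),
    ("الأستقلاب", 3),
    ("في", 36615810),
]
hits, total = search_wordlist(r"[A><]stqlAb", SAMPLE)
```

Run on the full list, the summed frequency of 570 gives one_every(TOTAL_TOKENS, 570) = 2,002,743, which agrees with the figure reported above.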

Other lemmas require more complex regular expressions, such as the following, for the noun stem fltAn:

/^[wf]?([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn/ (Note: this includes the colloquial prefix hAl-)

29 forms (Total Freq: 1,728 = 1 every 660,627 words) (Here is a page with my Transliteration).

word rank freq filecnt
الفلتان 78,320 779 661
فلتان 101,135 469 426
والفلتان 192,043 144 132
وفلتان 290,316 60 59
للفلتان 298,073 56 56
فلتانا 351,632 41 40
فلتانة 367,836 49 36
بالفلتان 482,666 21 21
لفلتان 495,940 20 20
الفلتانة 514,485 22 18
بفلتان 662,372 11 11
فلتانه 665,759 11 11
فلتانها 737,270 9 9
الفلتاني 892,673 6 6
وفلتانة 931,416 6 6
والفلتانة 1,080,908 5 4
فالفلتان 1,335,814 3 3
فلتانين 1,710,064 2 2
والفلتانين 1,838,284 2 2
وبفلتان 1,853,312 2 2
وفلتانا 1,881,266 2 2
الفلتانية 2,262,027 1 1
الفلتانين 2,262,028 1 1
فلتانات 2,809,431 1 1
فلتاني 2,809,432 1 1

هالفلتان 3,138,700 1 1
وفلتانها 3,358,429 1 1
وفلتانهم 3,358,430 1 1
وللفلتان 3,395,987 1 1
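To see which prefix alternative accounts for each form, the expression for fltAn can be given capturing groups. In the Python sketch below, the group around [wf]? and the trailing (.*)$ are additions made purely for inspection and are not part of the original expression; the function name segment is hypothetical, and the input is assumed to be already transliterated.

```python
import re

# The expression for fltAn, with the conjunction wrapped in a group and a
# trailing (.*)$ added (both are assumptions made for illustration).
FLTAN = re.compile(r"^([wf]?)([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn(.*)$")


def segment(translit: str):
    """Split a transliterated form into (conjunction, prefix, suffix), or None."""
    match = FLTAN.match(translit)
    if match is None:
        return None
    return match.groups()


for form in ("wAlfltAn", "hAlfltAn", "wllfltAn", "wfltAnhm"):
    print(form, segment(form))
```

Since the original expression is not anchored at the end, any suffix after the stem is accepted, which is what the added third group makes visible.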

The following regular expressions extract all the inflected forms of the preposition fy (both versions produce the same output):

/^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$/
/^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$/

48 forms (Total Freq: 42,810,237 = 1 every 26 words) (Here is a page with my Transliteration).

word rank freq filecnt
في 2 36,615,810 2,422,564
وفي 30 1,675,304 713,037
فيها 36 1,222,801 632,440
فيه 50 1,030,089 545,319
فى 132 1,894,110 321,671
ففي 920 122,836 95,251
فيهم 3,081 46,072 36,049
وفى 3,103 56,930 35,850
وفيه 5,674 46,993 20,247
فيهما 5,837 24,842 19,709
وفيها 7,115 19,925 16,027
فينا 7,938 18,509 14,195
فيك 11,933 12,639 8,909
ففى 25,202 3,981 3,560
ففيه 29,518 3,705 2,874
فيكم 30,851 4,485 2,712
ففيها 34,545 3,087 2,317
فيهن 46,295 2,632 1,518
وفيهم 52,220 1,792 1,266
وفينا 72,453 955 750
وفيهما 101,248 581 425
وفيك 120,370 357 314
فىها 160,503 399 185
فىه 168,479 340 169
ففيهما 170,194 204 166
ففيهم 186,930 158 139
وفيكم 190,969 213 133
فيكن 218,127 130 103
وفيهن 240,648 99 85
فيكما 291,082 83 58
ففيك 360,477 39 38
ففينا 456,896 29 23
ففيهن 562,437 18 15
فىنا 606,172 14 13
فىهم 687,798 11 10
وفىها 689,782 11 10
وفيكن 747,527 9 9
وفىه 769,046 9 8
فىهما 817,607 8 7
فىك 869,451 9 6
ففيكم 953,608 7 5
ففىه 1,340,653 3 3
ففىها 1,340,654 3 3
فىكم 1,715,068 2 2
ففيكن 2,805,347 1 1
فىهن 2,826,754 1 1
وفىهما 3,359,666 1 1
وفىهن 3,359,667 1 1
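The statement that both expressions for fy produce the same output can be checked by brute force. The Python sketch below generates every combination of optional conjunction, stem, and suffix of up to three letters drawn from the letters that occur in the suffix alternatives, and compares what each expression accepts. The names LONG, SHORT, candidates and accepted are hypothetical, and the check covers only the generated strings.

```python
import re
from itertools import product

LONG = re.compile(r"^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$")
SHORT = re.compile(r"^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$")


def candidates(max_suffix_len: int = 3):
    """Yield conjunction + stem + every suffix over 'hAkmn' up to the given length."""
    for conjunction in ("", "w", "f"):
        for stem in ("fy", "fY"):
            for length in range(max_suffix_len + 1):
                for letters in product("hAkmn", repeat=length):
                    yield conjunction + stem + "".join(letters)


def accepted(regex):
    """Return the set of candidate strings that the expression matches."""
    return {s for s in candidates() if regex.match(s)}


same = accepted(LONG) == accepted(SHORT)
```

Both expressions are anchored at each end and their longest suffix has three letters, so no string with a longer suffix can match either of them.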

Some contamination from other lemmas does occur. Some of the homographs, such as وفي, are easy to spot, but others are quite unexpected. For example, فيكن is also the Armenian proper name Vigen (I thank my Arabic-L colleagues for pointing this out), and فينا is also the Belgian firm FINA and a rare spelling of Vienna. Unexpected contamination usually shows up when a concordance is generated (see CONCORDANCING). I can anticipate some of the extraneous lemmas by running the wordlist through my morphological parser (see MORPHOLOGY ANALYSIS) before generating the concordance.



Copyright © 2002 QAMUS LLC