ARABIC WORD FREQUENCY COUNTS
When tokenizing a text for purposes of generating a word frequency count, I define an Arabic word as:
one or more consecutive Arabic characters [\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED]
including Persian characters [\x81\x8D\x8E\x90]
short vowels and diacritics [\xF0-\xF3\xF5\xF6\xF8\xFA]
and the lengthening character [\xDC]
Note: All hex values are those of the Arabic Windows (1256) code page.
When tokenizing Arabic input it's a good idea to make a preliminary pass to detect and fix punctuation anomalies, such as the Arabic character ra' (\xD1) used as a numeric comma or "decimal separator" (U+066B), and the Arabic lengthening character (\xDC) used as an em dash or numeric hyphen. Numbers are sometimes encoded visually instead of logically, and the digit zero occasionally functions as a period (full stop).
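As an illustration only, a preliminary pass for two of the anomalies just described might look like the following Python sketch. It works on raw Windows-1256 bytes. The function name and the choice of replacement characters (an ASCII comma and an ASCII hyphen) are assumptions made for the example, and the sketch does not attempt to handle visually encoded numbers or zero used as a period.

```python
import re

# ra' (\xD1) sitting directly between two ASCII digits
RA_BETWEEN_DIGITS = re.compile(rb"(?<=[0-9])\xD1(?=[0-9])")
# one or more lengthening characters (\xDC) directly between two ASCII digits
TATWEEL_BETWEEN_DIGITS = re.compile(rb"(?<=[0-9])\xDC+(?=[0-9])")

def fix_numeric_punctuation(data: bytes) -> bytes:
    """Replace ra' and tatweel used as numeric punctuation.

    Assumption: "," and "-" are acceptable replacements; a real
    pipeline might prefer other characters.
    """
    data = RA_BETWEEN_DIGITS.sub(b",", data)
    return TATWEEL_BETWEEN_DIGITS.sub(b"-", data)
```

Because both patterns require a digit on each side, ra' and the lengthening character inside ordinary words are left untouched.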
After tokenizing according to the above criteria I remove all short vowels, diacritics and the lengthening character, and count the remainder as a word. Null strings are discarded.
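The tokenizing and counting steps described above can be sketched in Python as follows. This is a minimal illustration, not the script used to produce the counts on this page: the byte classes are copied from the definition above, while the function and variable names are hypothetical.

```python
import re
from collections import Counter

# Byte classes from the word definition (Windows-1256 code page).
LETTERS = rb"\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC\xED"
PERSIAN = rb"\x81\x8D\x8E\x90"
MARKS = rb"\xF0-\xF3\xF5\xF6\xF8\xFA"   # short vowels and diacritics
TATWEEL = rb"\xDC"                      # lengthening character

# A token is one or more consecutive bytes from any of the four classes.
TOKEN_RE = re.compile(b"[" + LETTERS + PERSIAN + MARKS + TATWEEL + b"]+")
# After tokenizing, vowels, diacritics and the lengthening character go.
STRIP_RE = re.compile(b"[" + MARKS + TATWEEL + b"]")

def count_words(data: bytes) -> Counter:
    """Count words in Windows-1256 encoded text."""
    counts = Counter()
    for match in TOKEN_RE.finditer(data):
        word = STRIP_RE.sub(b"", match.group())
        if word:  # discard null strings, e.g. a token that was only tatweel
            counts[word.decode("cp1256")] += 1
    return counts
```

With such a counter, the number of types is `len(counts)` and the number of tokens is `sum(counts.values())`.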
Here are the types/tokens figures from the last three word frequencies I have generated:
date | types | tokens |
Feb. 1999 | 1,359,309 | 167,216,930 |
Feb. 2001 | 2,578,709 | 589,184,483 |
Aug. 2002 | 3,509,499 | 1,141,563,654 |
The table below shows the top 30 words and their frequencies from my three frequency counts. Note that starting with my Feb. 2001 count I added a "file count" figure and began to use it, instead of the frequency count, as the primary sort key. On Nov. 13, 2002 I wrote a script to get the Google frequency: it's an interesting statistic, especially if you look at the web as the Mother of All Corpora.
Feb. 1999 | | | Feb. 2001 | | | | Aug. 2002 | | | | Nov. 2002 |
rank | word | frequency | rank | word | frequency | file count | rank | word | frequency | file count | Google frequency |
1 | في | 5,645,218 | 1 | من | 13,624,732 | 1,144,319 | 1 | من | 26,533,543 | 2,511,236 | 6,600,000 |
2 | من | 3,871,153 | 2 | في | 18,817,693 | 1,128,546 | 2 | في | 36,615,810 | 2,422,564 | 7,540,000 |
3 | على | 2,310,879 | 3 | على | 7,508,546 | 915,406 | 3 | على | 14,173,880 | 1,996,755 | 4,320,000 |
4 | ان | 2,219,600 | 4 | عن | 3,669,877 | 776,762 | 4 | عن | 6,717,942 | 1,684,706 | 2,660,000 |
5 | الى | 1,516,247 | 5 | ان | 6,518,011 | 748,416 | 5 | ان | 12,214,896 | 1,637,296 | 1,450,000 |
6 | التي | 1,072,702 | 6 | التي | 3,530,111 | 733,062 | 6 | التي | 6,963,708 | 1,625,725 | 1,710,000 |
7 | عن | 933,872 | 7 | الى | 4,228,886 | 666,316 | 7 | الى | 7,861,905 | 1,424,032 | 1,200,000 |
8 | الذي | 727,170 | 8 | الذي | 2,441,785 | 645,854 | 8 | الذي | 4,712,703 | 1,406,014 | 1,370,000 |
9 | مع | 673,928 | 9 | مع | 2,307,276 | 618,865 | 9 | مع | 4,597,178 | 1,401,634 | 1,500,000 |
10 | هذا | 664,751 | 10 | هذا | 2,262,268 | 562,428 | 10 | هذا | 4,234,060 | 1,216,084 | 1,730,000 |
11 | هذه | 621,972 | 11 | هذه | 2,056,768 | 533,077 | 11 | هذه | 3,936,525 | 1,167,126 | 1,440,000 |
12 | ما | 614,348 | 12 | ما | 2,134,871 | 516,655 | 12 | بعد | 2,728,220 | 1,119,473 | 1,240,000 |
13 | لا | 596,737 | 13 | بين | 1,557,175 | 498,405 | 13 | بين | 3,016,512 | 1,103,814 | 1,230,000 |
14 | بين | 471,859 | 14 | بعد | 1,355,002 | 498,352 | 14 | ما | 3,981,028 | 1,102,480 | 2,100,000 |
15 | أن | 444,508 | 15 | لا | 2,053,785 | 448,718 | 15 | خلال | 2,285,299 | 983,953 | 624,000 |
16 | ذلك | 390,446 | 16 | خلال | 1,128,034 | 433,457 | 16 | كما | 2,236,668 | 952,957 | 947,000 |
17 | كان | 385,909 | 17 | كما | 1,148,968 | 429,963 | 17 | لا | 3,780,512 | 949,315 | 2,000,000 |
18 | بعد | 383,454 | 18 | كل | 1,264,232 | 415,445 | 18 | أن | 4,960,500 | 944,445 | 2,320,000 |
19 | كل | 372,917 | 19 | ذلك | 1,336,611 | 414,018 | 19 | كل | 2,373,084 | 876,683 | 1,100,000 |
20 | لم | 347,762 | 20 | أن | 2,222,026 | 395,224 | 20 | ذلك | 2,432,523 | 874,247 | 1,190,000 |
21 | إلى | 336,817 | 21 | كان | 1,316,091 | 392,905 | 21 | كان | 2,417,022 | 828,361 | 1,170,000 |
22 | بن | 330,130 | 22 | لم | 1,165,657 | 378,694 | 22 | لم | 2,211,351 | 817,968 | 1,060,000 |
23 | كما | 316,837 | 23 | حيث | 886,973 | 349,694 | 23 | وقال | 1,998,242 | 805,975 | 531,000 |
24 | او | 300,602 | 24 | العام | 904,412 | 348,480 | 24 | قد | 1,750,386 | 795,319 | 828,000 |
25 | هو | 299,244 | 25 | قبل | 774,256 | 348,359 | 25 | رئيس | 1,850,654 | 781,413 | 507,000 |
26 | خلال | 297,653 | 26 | رئيس | 942,069 | 346,054 | 26 | العام | 1,847,613 | 781,323 | 579,000 |
27 | العربية | 289,300 | 27 | وفي | 921,810 | 344,874 | 27 | حيث | 1,756,690 | 777,851 | 628,000 |
28 | انه | 269,280 | 28 | وقال | 992,961 | 344,420 | 28 | قبل | 1,537,048 | 767,380 | 790,000 |
29 | العام | 268,549 | 29 | قد | 879,497 | 343,512 | 29 | اليوم | 1,426,832 | 765,200 | 840,000 |
30 | الله | 267,092 | 30 | وقد | 863,727 | 343,123 | 30 | وفي | 1,675,304 | 713,037 | 652,000 |
The above is not a lemmatized list. Although some word forms are easily merged (e.g., الى = إلى), most word forms require contextual analysis to be disambiguated (e.g. ان = أَنْ or أَنَّ or إِنْ or إِنَّ or آن).
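For readers who want to reproduce the "file count" ranking used in the table, here is a minimal Python sketch. It assumes each document has already been tokenized into a list of words. The function name is hypothetical, and the tie-breaking order (frequency, then the word itself) is an assumption of the example rather than a documented rule.

```python
from collections import Counter

def rank_words(documents):
    """Rank words by file count, with frequency as the secondary key.

    `documents` is an iterable of token lists, one list per file.
    """
    freq = Counter()        # total occurrences across all files
    file_count = Counter()  # number of files containing the word
    for tokens in documents:
        freq.update(tokens)
        file_count.update(set(tokens))  # each file counts once per word
    ranked = sorted(freq, key=lambda w: (-file_count[w], -freq[w], w))
    return ranked, freq, file_count
```

A word that is very frequent in a handful of files ranks below a word that appears in many files, which is why من outranks في in the later counts despite its lower frequency.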
Before retrieving citations from my corpus I find it useful to go to the wordlist first and use it to test the regular expression that I will later use for searching the corpus. The wordlist also allows me to see the total number of hits that I will get when generating a concordance and thus anticipate the file size of the concordance. Some word forms are unambiguous and lend themselves to fairly simple regular expressions when searching for them in the wordlist. For example, the regular expression /[A><]stqlAb/ produced:
30 forms (Total Freq: 570 = 1 every 2,002,743 words) (Here is a page with my Transliteration).
word | rank | freq | filecnt |
الاستقلاب | 203,353 | 156 | 118 |
الاستقلابية | 236,180 | 115 | 88 |
استقلاب | 247,925 | 103 | 80 |
والاستقلاب | 338,618 | 46 | 43 |
استقلابية | 402,948 | 33 | 30 |
الاستقلابي | 527,621 | 24 | 17 |
والاستقلابية | 600,354 | 18 | 13 |
واستقلاب | 670,143 | 11 | 11 |
لاستقلاب | 738,984 | 9 | 9 |
باستقلاب | 878,371 | 7 | 6 |
استقلابي | 888,965 | 6 | 6 |
الإستقلاب | 950,488 | 7 | 5 |
استقلابا | 1,257,608 | 3 | 3 |
استقلابه | 1,257,609 | 3 | 3 |
استقلابها | 1,257,610 | 3 | 3 |
الأستقلاب | 1,260,732 | 3 | 3 |
الأستقلابية | 1,260,733 | 3 | 3 |
إستقلاب | 1,514,774 | 2 | 2 |
استقلابات | 1,526,307 | 2 | 2 |
استقلابيا | 1,526,308 | 2 | 2 |
الاستقلابات | 1,537,658 | 2 | 2 |
بالاستقلاب | 1,596,397 | 2 | 2 |
كاستقلاب | 1,724,178 | 2 | 2 |
للاستقلاب | 1,763,413 | 2 | 2 |
الإستقلابية | 2,135,762 | 1 | 1 |
الاستقلابنقص | 2,147,334 | 1 | 1 |
وأمراضالاستقلاب | 3,166,051 | 1 | 1 |
واستقلابا | 3,185,260 | 1 | 1 |
واستقلابية | 3,185,261 | 1 | 1 |
والاستقلابات | 3,200,045 | 1 | 1 |
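A wordlist search of this kind can be sketched in Python as below. The sketch assumes the wordlist is available as (word, frequency) pairs in Arabic script and that the transliteration is the standard Buckwalter mapping for the bare letters; the function name and data layout are hypothetical.

```python
import re

# Assumed mapping: standard Buckwalter scheme for the bare letters
# U+0621..U+063A (hamza through ghain) and U+0641..U+064A (fa' through ya').
ARABIC = "".join(chr(c) for c in [*range(0x0621, 0x063B), *range(0x0641, 0x064B)])
LATIN = "'|>&<}AbptvjHxd*rzs$SDTZEgfqklmnhwYy"
TO_TRANSLIT = str.maketrans(ARABIC, LATIN)

def search_wordlist(pattern, wordlist, corpus_tokens):
    """Return the matching rows, their total frequency, and the
    'one every N words' figure (integer division)."""
    rx = re.compile(pattern)
    hits = [(word, freq) for word, freq in wordlist
            if rx.search(word.translate(TO_TRANSLIT))]
    total = sum(freq for _, freq in hits)
    every = corpus_tokens // total if total else None
    return hits, total, every
```

With the Aug. 2002 token total of 1,141,563,654 and a total frequency of 570, the integer division gives 2,002,743, which agrees with the figure reported above.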
Other lemmas require more complex regular expressions, such as the following, for the noun stem fltAn:
/^[wf]?([blk]?|[bk]?Al|[blk]?hAl|ll)fltAn/ (Note: this includes the colloquial prefix hAl-)
29 forms (Total Freq: 1,728 = 1 every 660,627 words) (Here is a page with my Transliteration).
word | rank | freq | filecnt |
الفلتان | 78,320 | 779 | 661 |
فلتان | 101,135 | 469 | 426 |
والفلتان | 192,043 | 144 | 132 |
وفلتان | 290,316 | 60 | 59 |
للفلتان | 298,073 | 56 | 56 |
فلتانا | 351,632 | 41 | 40 |
فلتانة | 367,836 | 49 | 36 |
بالفلتان | 482,666 | 21 | 21 |
لفلتان | 495,940 | 20 | 20 |
الفلتانة | 514,485 | 22 | 18 |
بفلتان | 662,372 | 11 | 11 |
فلتانه | 665,759 | 11 | 11 |
فلتانها | 737,270 | 9 | 9 |
الفلتاني | 892,673 | 6 | 6 |
وفلتانة | 931,416 | 6 | 6 |
والفلتانة | 1,080,908 | 5 | 4 |
فالفلتان | 1,335,814 | 3 | 3 |
فلتانين | 1,710,064 | 2 | 2 |
والفلتانين | 1,838,284 | 2 | 2 |
وبفلتان | 1,853,312 | 2 | 2 |
وفلتانا | 1,881,266 | 2 | 2 |
الفلتانية | 2,262,027 | 1 | 1 |
الفلتانين | 2,262,028 | 1 | 1 |
فلتانات | 2,809,431 | 1 | 1 |
فلتاني | 2,809,432 | 1 | 1 |
هالفلتان | 3,138,700 | 1 | 1 |
وفلتانها | 3,358,429 | 1 | 1 |
وفلتانهم | 3,358,430 | 1 | 1 |
وللفلتان | 3,395,987 | 1 | 1 |
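To make the structure of the fltAn expression easier to follow, here is the same pattern written out in Python's verbose regex mode, one alternative per line. The pattern itself is unchanged; the comments and the helper name are additions for illustration, and the input is assumed to be already transliterated.

```python
import re

FLTAN_RE = re.compile(r"""
    ^ [wf]?            # optional conjunction wa- or fa-
    (   [blk]?         # optional preposition bi-, li- or ka-, no article
      | [bk]?Al        # article Al-, optionally preceded by bi- or ka-
      | [blk]?hAl      # colloquial hAl-, optionally after a preposition
      | ll             # li- plus article, written ll-
    )
    fltAn              # the stem; no $ anchor, so suffixed forms match too
""", re.VERBOSE)

def is_fltan_form(translit_word: str) -> bool:
    """True if the transliterated word matches the fltAn pattern."""
    return FLTAN_RE.search(translit_word) is not None
```

Because the expression is anchored only at the start, suffixed forms such as fltAnA and fltAnhm are accepted, which accounts for those rows in the table above.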
The following regular expressions extract all the inflected forms of the preposition fy (both versions produce the same output):
/^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$/
/^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$/
48 forms (Total Freq: 42,810,237 = 1 every 26 words) (Here is a page with my Transliteration).
word | rank | freq | filecnt |
في | 2 | 36,615,810 | 2,422,564 |
وفي | 30 | 1,675,304 | 713,037 |
فيها | 36 | 1,222,801 | 632,440 |
فيه | 50 | 1,030,089 | 545,319 |
فى | 132 | 1,894,110 | 321,671 |
ففي | 920 | 122,836 | 95,251 |
فيهم | 3,081 | 46,072 | 36,049 |
وفى | 3,103 | 56,930 | 35,850 |
وفيه | 5,674 | 46,993 | 20,247 |
فيهما | 5,837 | 24,842 | 19,709 |
وفيها | 7,115 | 19,925 | 16,027 |
فينا | 7,938 | 18,509 | 14,195 |
فيك | 11,933 | 12,639 | 8,909 |
ففى | 25,202 | 3,981 | 3,560 |
ففيه | 29,518 | 3,705 | 2,874 |
فيكم | 30,851 | 4,485 | 2,712 |
ففيها | 34,545 | 3,087 | 2,317 |
فيهن | 46,295 | 2,632 | 1,518 |
وفيهم | 52,220 | 1,792 | 1,266 |
وفينا | 72,453 | 955 | 750 |
وفيهما | 101,248 | 581 | 425 |
وفيك | 120,370 | 357 | 314 |
فىها | 160,503 | 399 | 185 |
فىه | 168,479 | 340 | 169 |
ففيهما | 170,194 | 204 | 166 |
ففيهم | 186,930 | 158 | 139 |
وفيكم | 190,969 | 213 | 133 |
فيكن | 218,127 | 130 | 103 |
وفيهن | 240,648 | 99 | 85 |
فيكما | 291,082 | 83 | 58 |
ففيك | 360,477 | 39 | 38 |
ففينا | 456,896 | 29 | 23 |
ففيهن | 562,437 | 18 | 15 |
فىنا | 606,172 | 14 | 13 |
فىهم | 687,798 | 11 | 10 |
وفىها | 689,782 | 11 | 10 |
وفيكن | 747,527 | 9 | 9 |
وفىه | 769,046 | 9 | 8 |
فىهما | 817,607 | 8 | 7 |
فىك | 869,451 | 9 | 6 |
ففيكم | 953,608 | 7 | 5 |
ففىه | 1,340,653 | 3 | 3 |
ففىها | 1,340,654 | 3 | 3 |
فىكم | 1,715,068 | 2 | 2 |
ففيكن | 2,805,347 | 1 | 1 |
فىهن | 2,826,754 | 1 | 1 |
وفىهما | 3,359,666 | 1 | 1 |
وفىهن | 3,359,667 | 1 | 1 |
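The claim that the two fy expressions produce the same output can be spot-checked by enumeration. The Python sketch below generates every candidate made of an optional prefix, the stem, and a suffix of up to three characters drawn from the letters that occur in the suffix alternatives, then compares what each expression accepts. It is a bounded check under those assumptions, not a proof, and the helper name is hypothetical.

```python
import re
from itertools import product

RX_LONG = re.compile(r"^[wf]?f[Yy](|h|hA|hmA|hm|hn|k|kmA|km|kn|nA)$")
RX_SHORT = re.compile(r"^[wf]?f[Yy]([hk]?|hA|[hk]mA?|[hk]n|nA)$")

def accepted(rx, alphabet="hkmnA", max_len=3):
    """Set of candidate strings that `rx` matches."""
    prefixes = ["", "w", "f"]
    stems = ["fY", "fy"]
    suffixes = ["".join(p)
                for n in range(max_len + 1)
                for p in product(alphabet, repeat=n)]
    return {pre + stem + suf
            for pre in prefixes for stem in stems for suf in suffixes
            if rx.match(pre + stem + suf)}

same = accepted(RX_LONG) == accepted(RX_SHORT)
```

Both expressions accept the same 66 candidate strings (3 prefixes, 2 spellings of the stem, 11 suffixes including the empty one), of which 48 are attested in the wordlist above.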
Some contamination from other lemmas does occur. Some of the homographs, such as وفي, are easy to spot, but others are quite unexpected. For example, فيكن is also the Armenian proper name Vigen (I thank my Arabic-L colleagues for pointing this out), and فينا is also the Belgian firm FINA and a rare spelling of Vienna. Unexpected contamination usually shows up when a concordance is generated (see CONCORDANCING). I can anticipate some of the extraneous lemmas by running the wordlist through my morphological parser (see MORPHOLOGY ANALYSIS) before generating the concordance.
Copyright © 2002 QAMUS LLC