Arabic Corpus Compilation

The first task in Arabic lexicography is corpus compilation.

The following is a list of publicly available Arabic corpora, sorted chronologically. We have added some notes based on our experience in processing these corpora.

2013 - KSUCCA Corpus https://sourceforge.net/projects/ksucca-corpus/ The "King Saud University Corpus of Classical Arabic (KSUCCA) is a pioneering 50 million tokens annotated corpus of Classical Arabic texts from the period of pre-Islamic era until the fourth Hijri century (equivalent to the period from the seventh until early eleventh century CE), which is the period of pure classical Arabic." In processing this corpus we counted 47,856,885 tokens and 852,943 types, distributed over 410 separate documents.

2015 - KACST Corpus https://corpus.kacst.edu.sa/ "The KACST [King Abdulaziz City for Science and Technology] Arabic corpus comprises more than 700 million words from the pre-Islamic era to the present day (a period covering more than 1,500 years), collected from 10 diverse mediums [critically edited manuscripts, books, newspapers, magazines, refereed journals, university theses, school curricula, official publications, news agencies, the Internet]. Each text has been further classified more specifically into domains and topics. The KACST Arabic corpus is freely available to explore on the Internet [..] using a variety of tools." The corpus can be searched online, with results displayed in concordance format (up to 5,000 hits), as well as word frequency distribution charts. Also available are word frequency lists (in .xlsx format) of the top 30,000 word forms for each of the ten different domains, including a master word frequency list for the whole corpus. According to the website, the corpus has 1,182,515,633 tokens and 9,006,585 types, distributed over 1,323,185 texts.

2016 - ArabicWeb16 https://sites.google.com/view/arabicweb16 "the largest Arabic Web dataset (150M pages) that is publicly available!" After extensive cleanup we were left with 42,117,088 pages (documents) and 21,755,992,812 words (Arabic tokens). There is a lot of informal Arabic in this data, and dialect function words have a very high relative frequency. For example, بس ranks 102nd (943 per million), اللي ranks 147th (701 per million), and وين ranks 881st (110 per million).
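The per-million figures quoted for ArabicWeb16 are relative frequencies: a word's count divided by the total number of tokens, multiplied by one million. The following is a minimal Python sketch of that arithmetic and of rank assignment. The function names and the toy counts are illustrative assumptions, not the tooling or data behind the figures above.

```python
from collections import Counter


def per_million(count: int, total_tokens: int) -> float:
    """Relative frequency of a word, expressed per million tokens."""
    return count / total_tokens * 1_000_000


def ranked_frequencies(counts: Counter) -> list:
    """Return (rank, word, count, per-million) rows, most frequent first."""
    total = sum(counts.values())
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, per_million(count, total)))
    return rows


# Toy counts, invented for illustration only (not taken from ArabicWeb16).
toy_counts = Counter({"في": 6, "من": 3, "بس": 1})
table = ranked_frequencies(toy_counts)
```

Words with equal counts receive consecutive ranks in the order Counter.most_common returns them; a real frequency list may prefer a different tie-breaking rule.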

2018 - Arab-Acquis https://camel.abudhabi.nyu.edu/arabacquis/ "dataset for evaluating machine translation between 22 European languages and Arabic" - over 600,000 words.

2018 - Arabic Speech Corpus http://en.arabicspeechcorpus.com/ "developed as part of PhD work carried out by Nawar Halabi at the University of Southampton." The text portion of the corpus runs to c. 60,000 words.

2020 - Corpus of Global Language Use http://www.earthlings.io/ "This project focuses on the global distribution of 464 languages, including [..] Arabic." The available Arabic data also includes text obtained from websites on servers in countries outside the Middle East and North Africa (e.g., alwatan.kuwait.tt) (http://www.earthlings.io/download_cglu.html). We counted 718,455,144 Arabic tokens distributed over 3,294,075 documents. (We did not compute type counts for the full corpus.)

2020 - Habibi Corpus https://www.lancaster.ac.uk/staff/elhaj/corpora.html "Habibi is the first freely available Arabic Song Lyrics corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects for singers from 18 different Arab countries. The lyrics are segmented into more than 500,000 sentences (song verses) with more than 3.5 million words." After some cleanup to remove redundancy we counted 3,171,840 Arabic tokens and 224,043 types. Among the words in the top 30 (start reading from the word قلبي): قلبي اللي انت حبيبي بس مش الدنيا الحب عمري ليه عيني حبك.
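One simple way to remove redundancy of this kind is to drop documents whose text is identical after whitespace normalization. The sketch below is an assumption about what such a cleanup step might look like, not a description of the procedure actually applied to the Habibi data; the function names are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def drop_duplicates(documents):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        # Hash the normalized text so the set stays small for large corpora.
        key = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Exact-match deduplication will miss near-duplicates (for example, the same lyrics with a different spelling of one word), which would need a fuzzier comparison.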

2021 - Abu El Khair Corpus http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus "Abu El-Khair Corpus is an Arabic text corpus that includes more than five million newspaper articles [..] over a billion and a half words in total [.. and] about three million unique words." The source data (10 files from 10 newspapers representing 8 countries) spans 14 years, and the corpus files are datestamped 2016.

Corpora can also be obtained by data-mining websites directly. The following is a partial listing of Arabic newspapers that have maintained a steady, sizable output over the years:

Akhbar al-Khalij (Bahrain) akhbar-alkhaleej.com

Akhbar al-Yawm (Cairo) akhbarelyom.com

al-Ahram (Cairo) ahram.org.eg

al-Anba' (Kuwait) alanba.com.kw

al-'Arabiyah (Riyadh) alarabiya.net

al-Bayan (UAE) albayan.ae

al-Dustur (Amman) addustour.com

al-Nahar (Beirut) annahar.com

al-Raya (Qatar) raya.com

al-Riyadh (Riyadh) alriyadh.com

al-Safir (Beirut) assafirarabi.com

al-Sharq al-Awsat (London) aawsat.com

al-Watan (Qatar) al-watan.com
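Once pages from sites such as these have been downloaded, the Arabic text has to be separated from the HTML markup. The following Python sketch uses only the standard library and works on an HTML string that is already in memory; fetching the pages is left out. The class and function names are illustrative, and the rule "keep any text chunk containing at least one character from the Unicode Arabic block" is a simplifying assumption.

```python
import re
from html.parser import HTMLParser

# Any character in the basic Arabic block (letters, digits, punctuation).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def arabic_text_from_html(html: str) -> str:
    """Return the text chunks of a page that contain Arabic characters."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    kept = [chunk for chunk in parser.chunks if ARABIC_CHAR.search(chunk)]
    return "\n".join(kept)


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 'كلمة';</script></head>"
    "<body><p>مرحبا بالعالم</p><p>Hello</p></body></html>"
)
extracted = arabic_text_from_html(sample)
```

Real newspaper pages also carry menus, captions, and advertisements in Arabic, so a production pipeline would normally restrict extraction to the article body of each site.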

Once a corpus is compiled, the next task is to assess its size in terms of types and tokens (see WORD FREQUENCY COUNTS).
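As a rough illustration of what such a size assessment involves, the Python sketch below counts tokens (running words), types (distinct word forms), and documents. It assumes that a token is a maximal run of Arabic letters after diacritics and tatweel have been stripped; this tokenization rule and the function names are assumptions made for the example, and other definitions will yield different counts.

```python
import re
from collections import Counter

# Short vowels and related marks (U+064B-U+0652), dagger alif (U+0670),
# and tatweel (U+0640) are removed before tokenizing.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Arabic letters: hamza through ghain, then feh through yeh.
TOKEN = re.compile(r"[\u0621-\u063A\u0641-\u064A]+")


def tokenize(text: str) -> list:
    """Split text into Arabic word tokens, ignoring diacritics."""
    return TOKEN.findall(DIACRITICS.sub("", text))


def corpus_size(documents):
    """Return (token count, type count, document count) for an iterable of texts."""
    counts = Counter()
    n_docs = 0
    for doc in documents:
        counts.update(tokenize(doc))
        n_docs += 1
    return sum(counts.values()), len(counts), n_docs


docs = ["قلبي قلبي حبيبي", "الحب في قلبي"]
tokens, types, n_docs = corpus_size(docs)
```

Because the documents are consumed one at a time, the same function can be fed a generator that reads files from disk, so the whole corpus never has to be held in memory.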