TArC. Un corpus d'arabish tunisien

Gugliotta, Elisa; Dinarelli, Marco

TArC: Incrementally and Semi-Automatically Collecting a Tunisian arabish Corpus

This article describes the collection process of the first morpho-syntactically annotated Tunisian arabish Corpus (TArC). Arabish is a spontaneous coding of Arabic Dialects (AD) in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking social media users to facilitate communication on digital devices. Arabish differs for each Arabic dialect, and each arabish code-system is under-resourced. In the last few years, NLP attention to AD has increased considerably. TArC will thus be a useful resource for different types of analysis, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses of the corpus. To give a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and its encoding in Tunisian arabish.

TArC. Un corpus d'arabish tunisien / Gugliotta, E.; Dinarelli, M. - 2:(2020), pp. 232-240. (Traitement Automatique des Langues Naturelles (TALN, 27e édition)).