Camel_tools, a Python Toolkit for Arabic NLP

hajar ibararhi
8 min read · Oct 23, 2020


How to use camel_tools for Arabic Natural Language Processing and Sentiment Analysis

In this article, I’ll explain step by step how I used camel_tools to analyze sentiment in Arabic comments from a Moroccan online newspaper. During my research I didn’t come across anyone who had used camel_tools and then written about it, so I figured, why not make my first article about my experience with this toolkit?

CAMeL Tools is a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis.

The dataset that we’ll be working with gathers Arabic comments from news articles on 5 different topics, along with their scores, which increase or decrease with the number of likes and dislikes each comment gets. The comments are also translated into English and stored in a separate column. The dataset also has a topic column, which would be useful for topic classification, but our main goal here is to use camel_tools (unfortunately, this toolkit doesn’t have a module for topic classification). If you want to download the dataset, use the link below:

https://um6p-datascience.s3.eu-west-3.amazonaws.com/datasets/hespress/hespress_comments_en.csv

Problem Definition :

So, we have comments, and we want to predict whether each one carries a positive, negative or neutral sentiment using the camel_tools toolkit. To evaluate the sentiment analyzer provided by this toolkit, we need a “reference”. Here’s the plan: first, predict the sentiment category of each English comment with the well-known sentiment analyzer TextBlob; then do the same with the camel_tools sentiment analyzer on the Arabic comments; and finally evaluate the camel_tools model against the TextBlob predictions. (So basically, the TextBlob predictions serve as our reference for evaluating the model.)

Fortunately for us, camel_tools provides a pre-trained sentiment analysis model that we can use to predict the sentiment of given sentences. Unfortunately for us, CAMeL Tools is still in a pre-release stage and the pip package is out of date, so we have to install it from source and install its data folder. You can do so by following the installation instructions that can be found at the following link:

Exploratory Data Analysis

We start by importing all the libraries we’re going to use later on.

#utilities
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#text processing & sentiment analysis
import re
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from afinn import Afinn
import unicodedata as ud
import camel_tools as ct
from nltk.stem.isri import ISRIStemmer
from ar_wordcloud import ArabicWordCloud
import time
#model
from textblob import TextBlob
from camel_tools.sentiment import SentimentAnalyzer
from sklearn.metrics import classification_report, accuracy_score

We then load the dataset, and take a look at it.

df = pd.read_csv("https://um6p-datascience.s3.eu-west-3.amazonaws.com/datasets/hespress/hespress_comments_en.csv")
df = df.drop('status_code', axis=1)
df.sample(5)
5 sample rows from the dataset

As we can see, the dataset has one column with the Arabic comments and another with their English translation. Luckily, the dataset doesn’t have any missing values, so we’ll start by preprocessing the English text to use it for sentiment analysis later.

We put the English and Arabic text in lists:

text_ar, text_en = list(df['comment']), list(df['en'])

Preprocessing :

English comments :

Cleaning our text data gives us more insight and reduces noise. So we define our English text preprocessing function, in which we convert the text to lower case, remove all non-alphanumeric characters and stopwords, and lemmatize every word longer than one letter.

def preprocess_en(textdata):
    processedText = []

    # Create the stopword set.
    stopwordlist = set(stopwords.words('english'))
    # Create the lemmatizer.
    wordLemm = WordNetLemmatizer()

    # Regex pattern matching every character that is not a letter or a digit.
    alphaPattern = "[^a-zA-Z0-9]"

    for comment in textdata:
        comment = comment.lower()

        # Replace all non-alphanumeric characters with a space.
        comment = re.sub(alphaPattern, " ", comment)
        commentwords = ''
        for word in comment.split():
            # Skip stopwords and one-letter words.
            if word not in stopwordlist:
                if len(word)>1:
                    # Lemmatize the word.
                    word = wordLemm.lemmatize(word)
                    commentwords += (word+' ')

        processedText.append(commentwords)

    return processedText

We run our function on the English text:

t = time.time()
processedtext_en = preprocess_en(text_en)
print(f'Text Preprocessing complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')

We then plot a WordCloud for the English words contained in the comments:

plt.figure(figsize = (16,16))
wc_en = WordCloud(max_words = 1000 , width = 1600 , height = 800,
                  collocations=False).generate(" ".join(processedtext_en))
plt.imshow(wc_en)
WordCloud of English comments

We then compute the sentiment score of each processed comment using TextBlob, map the scores to categories (positive, negative, neutral), and put the results in a new column, ‘sentiment’.

t = time.time()
sentiment_scores_tb = [round(TextBlob(comment).sentiment.polarity, 3) for comment in processedtext_en]
print(f'Sentiment prediction complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')
sentiment_category_tb = ['positive' if score > 0
else 'negative' if score < 0
else 'neutral'
for score in sentiment_scores_tb]
df['sentiment'] = sentiment_category_tb

We take a look at the distribution of this variable :

‘sentiment’ distribution
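
The code behind this distribution is not shown above, so here is a minimal sketch of how the counts and shares per category could be computed. The helper name `label_distribution` and the toy labels are my own assumptions for illustration; they stand in for the values of df['sentiment'].

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, share)} for an iterable of sentiment labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the labels from most to least frequent.
    return {label: (n, n / total) for label, n in counts.most_common()}

# Toy labels (hypothetical), standing in for the values of df['sentiment'].
toy_labels = ['positive', 'negative', 'neutral', 'positive', 'positive', 'negative']

for label, (n, share) in label_distribution(toy_labels).items():
    print(f"{label:<9} {n:>3}  {share:.1%}")
```

On the real DataFrame, df['sentiment'].value_counts(normalize=True) should give the same shares, and sns.countplot(x='sentiment', data=df) is one way to draw the bar chart.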

This sentiment variable will help us evaluate the camel_tools SentimentAnalyzer later.

Arabic comments :

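In our process of cleaning text, since we don’t have any missing values, our main task will be to remove stopwords and all characters that are not numbers or Arabic letters, and to reduce the words to a simpler form. Strictly speaking, the function defined further down uses the suffix-stripping step of NLTK’s ISRIStemmer, so this is light stemming rather than true lemmatization.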

We start by loading a text file containing all the Arabic stopwords :

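with open('arabic_stopwords.txt', 'r', encoding='utf-8') as file:
    # Split into a set of words so that membership tests match whole words.
    # Note: this name shadows the nltk `stopwords` import used earlier.
    stopwords = set(file.read().split())
#print(stopwords)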
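
One detail worth checking when loading a stopword file: file.read() returns a single string, and testing `word in some_string` is a substring test, while testing against a set of tokens matches whole words only. The sketch below illustrates the difference. The helper `load_stopwords` and the four-word list are hypothetical, and I am assuming the file holds whitespace-separated words.

```python
def load_stopwords(text):
    """Split raw file contents into a set of whole-word stopwords."""
    return set(text.split())

# Hypothetical file contents: one stopword per line.
raw = "من\nفي\nعلى\nهذا\n"
stopword_set = load_stopwords(raw)

# A fragment of the stopword 'هذا' that is not itself in this toy list.
word = "ذا"

print(word in raw)           # substring test on the raw string
print(word in stopword_set)  # whole-word test on the set
```

With the raw string, any word that happens to appear inside a stopword would be dropped too, which is why the set version is usually the safer choice.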

The stopword file can be found at the following link:

And we define our preprocessing function :

# Remove stopwords and all characters that are not letters or numbers,
# then strip common suffixes from the remaining words.
def preprocess_ar(text):
    processedText = []

    # Create the ISRI stemmer.
    st = ISRIStemmer()

    for t in text:
        # Keep letters (category 'Lo'), decimal digits ('Nd') and spaces.
        t = ''.join(c for c in t if ud.category(c) == 'Lo' or ud.category(c) == 'Nd' or c == ' ')
        commentwords = ''
        for word in t.split():
            # Checking if the word is a stopword.
            if word not in stopwords :
                if len(word)>1:
                    # Strip length-three and length-two suffixes.
                    word = st.suf32(word)
                    commentwords += (word+' ')
        processedText.append(commentwords)

    return processedText
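
To make the character filter in this function more concrete, here is a small standalone sketch of that one step. The helper name `keep_letters_and_digits` and the sample string are my own, for illustration only. Note that the filter works on Unicode categories, not on scripts: Latin letters are dropped because they are 'Lu' or 'Ll', but letters from other caseless scripts would also count as 'Lo' and be kept.

```python
import unicodedata as ud

def keep_letters_and_digits(text):
    """Keep characters in Unicode category 'Lo' or 'Nd', plus plain spaces."""
    return ''.join(c for c in text if ud.category(c) in ('Lo', 'Nd') or c == ' ')

# Hypothetical sample mixing Arabic words, digits, punctuation and Latin letters.
sample = "المغرب 2020! Hello, كورونا؟"
cleaned = keep_letters_and_digits(sample)

# Punctuation and Latin letters are gone; Arabic letters and digits remain.
print(cleaned.split())

# Diacritics (category 'Mn') are removed as well.
print(keep_letters_and_digits("مَرحبا"))
```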

We run our preprocessing function on the raw Arabic comments, in order to keep only numbers and Arabic words, get rid of stopwords and stem words of more than one letter.

t = time.time()
processedtext_ar = preprocess_ar(text_ar)
print(f'Text Preprocessing complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')

We then plot WordClouds for arabic comments :

awc = ArabicWordCloud(background_color="white")
plt.figure(figsize = (16,16))
wc_ar = awc.from_text(u''.join(processedtext_ar))
plt.imshow(wc_ar)
WordCloud of arabic comments

We can see that the most frequent words include ‘المغرب’ (Morocco), ‘المغاربة’ (Moroccans), ‘كورونا’ (Corona, as in the coronavirus) and ‘الدولة’ (the state).

Creating and Evaluating the SentimentAnalyzer Model

Since we’re going to be working with camel_tools.sentiment, a module that contains a sentiment analyzer component and a pre-trained model, we won’t need to train a model ourselves. And since the analyzer predicts sentiment directly from sentences, we won’t need to vectorize our text data either.

So let’s take a look at what this model does exactly, using the simple code below:

from camel_tools.sentiment import SentimentAnalyzer

sa = SentimentAnalyzer.pretrained()

# Predict the sentiment of multiple sentences
sentences = [
    'أنا بخير',
    'أنا لست بخير'
]
sentiments = sa.predict(sentences)
print(sentiments)
SentimentAnalyzer model predictions

So the model returns ‘positive’, ‘neutral’ or ‘negative’ depending on the sentiment contained in the sentence (for non-Arabic speakers, the first sentence ‘أنا بخير’ means ‘I’m fine’ and the second, ‘أنا لست بخير’, means ‘I am not fine’). For such simple sentences, the model’s predictions are correct. Now let’s try it on our comments.

First, we load the pre-trained camel_tools SentimentAnalyzer, and then we predict the sentiment of our preprocessed Arabic comments. Since running it on the full dataset took too long and kept crashing, I decided to make predictions only on the first 10000 comments:

sa = SentimentAnalyzer.pretrained()
# Predict values for the arabic text
t = time.time()
pred = sa.predict(processedtext_ar[:10000])
print(f'Prediction complete.')
print(f'Time Taken: {round(time.time()-t)} seconds')
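
For long runs like this one, a possible alternative is to feed the comments to the analyzer in smaller batches and report progress along the way. This is only a sketch: the helper `predict_in_batches`, its `batch_size` default and the stand-in predictor are my own assumptions, and I have not verified that batching avoids the crashes mentioned above.

```python
import time

def predict_in_batches(predict_fn, texts, batch_size=500):
    """Call predict_fn on consecutive slices of texts and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    predictions = []
    start_time = time.time()
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        predictions.extend(predict_fn(batch))
        done = min(start + batch_size, len(texts))
        print(f"{done}/{len(texts)} done after {round(time.time() - start_time)} seconds")
    return predictions

# Stand-in predictor (hypothetical) so the sketch runs without camel_tools installed.
def fake_predict(batch):
    return ['neutral' for _ in batch]

preds = predict_in_batches(fake_predict, ['a', 'b', 'c', 'd', 'e'], batch_size=2)
print(preds)
```

With the real analyzer, the call would look like predict_in_batches(sa.predict, processedtext_ar[:10000]), since sa.predict accepts a list of sentences. A side benefit is that partial results survive if one batch fails, provided you save them as you go.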

It took 29 minutes and 30 seconds for the camel_tools model to make predictions on 10000 comments, whereas TextBlob made predictions on all 68433 comments in 35 seconds. This shows that speed isn’t really the camel_tools SentimentAnalyzer’s strength.
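
To put these timings on the same scale, here is a quick back-of-the-envelope calculation of comments processed per second, using only the numbers reported above. The helper `throughput` is just a name I picked for the division.

```python
def throughput(n_items, seconds):
    """Items processed per second."""
    return n_items / seconds

camel_rate = throughput(10_000, 29 * 60 + 30)  # 10000 comments in 29 min 30 s
textblob_rate = throughput(68_433, 35)         # 68433 comments in 35 s

print(f"camel_tools: {camel_rate:.2f} comments/s")
print(f"TextBlob:    {textblob_rate:.0f} comments/s")
print(f"ratio:       {textblob_rate / camel_rate:.0f}x")
```

Keep in mind this is a rough comparison: the two tools ran on different texts (Arabic versus English) and work in very different ways, so the ratio is only an order of magnitude.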

Now let’s look at how it did compared to our sentiment reference generated with TextBlob. We compute the classification report and accuracy of our model :

# Print the evaluation metrics for the dataset.
print(classification_report(df.sentiment[:10000], pred))
print(f"Accuracy score : {accuracy_score(df.sentiment[:10000], pred)}")

The camel_tools analyzer doesn’t do very well when we compare its results to the TextBlob predictions: the accuracy is lower than 50%, meaning the two models disagree on the classification of more than half of the comments. Keep in mind that this measures agreement with TextBlob, not correctness against human labels. These low scores can be the consequence of many factors, such as the translation of the comments; many research articles tackle the question of how translation alters sentiment.
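
To see where the two models disagree, and not just how often, one option is to count each (TextBlob label, camel_tools label) pair. The sketch below does this with the standard library only. The helper `agreement_table` and the toy labels are hypothetical; they stand in for df.sentiment[:10000] and pred.

```python
from collections import Counter

def agreement_table(reference, predicted):
    """Count (reference, predicted) label pairs and the overall agreement rate."""
    n = len(reference)
    if n != len(predicted):
        raise ValueError("both label lists must have the same length")
    pairs = Counter(zip(reference, predicted))
    agree = sum(count for (ref, pred), count in pairs.items() if ref == pred)
    rate = agree / n if n else 0.0
    return pairs, rate

# Toy labels (hypothetical), standing in for df.sentiment[:10000] and pred.
textblob_labels = ['positive', 'neutral', 'negative', 'positive', 'neutral']
camel_labels = ['positive', 'negative', 'negative', 'neutral', 'neutral']

pairs, rate = agreement_table(textblob_labels, camel_labels)
print(f"agreement: {rate:.0%}")
for (ref, pred), count in sorted(pairs.items()):
    print(f"TextBlob={ref:<8} camel_tools={pred:<8} {count}")
```

The agreement rate here is the same quantity as accuracy_score when TextBlob is treated as the reference. scikit-learn’s confusion_matrix gives the same pair counts in matrix form.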

Conclusion :

We have seen how to use the camel_tools SentimentAnalyzer to predict the sentiment of Arabic comments, and we evaluated it against a very popular lexicon-based library: TextBlob. But to evaluate this SentimentAnalyzer more thoroughly, we would want to use manually annotated evaluation datasets.

Thank you for reading and for making it this far. I hope this article sparks your interest in exploring CAMeL Tools further. I am a learner too, which is why I am sharing the knowledge I gained, and you can always reach out to me on LinkedIn.

hajar ibararhi

Engineering student, Data Science enthusiast, with great skills in Data Analytics and Machine Learning