Abstract. Nowadays, the analysis of opinions in online
media such as newspapers, social media, forums, blogs, and product review sites
plays a key role in everyday life. In this context, opinion mining is one of the
fastest growing research areas in natural language processing; it aims to
extract and organize users' opinions. Machine learning techniques are
a powerful instrument for analyzing and correctly understanding text data. In this paper we present a thorough experimental
evaluation of machine learning algorithms used for opinion mining in the Albanian
language. The experimental results are interpreted with respect to various
evaluation criteria for the different algorithms, showing interesting features
of the performance of each algorithm.
1 Introduction
An opinion is an unproven judgement or view that someone holds
about something. It is not necessarily based on facts or knowledge, but it can be
useful in different situations. According to A1, an opinion is a quintuple
(ei, aij, sijkl, hk, tl), where ei is the name of an entity, aij is an aspect
of ei, sijkl is the sentiment on aspect aij of entity ei, hk is the opinion
holder, and tl is the time when the opinion is expressed by hk. The sentiment
scale depends on the application: it can be binary (positive or negative),
ternary (positive, negative, or neutral), or numeric, for example the 1-10
ratings used for hotels and films.
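The quintuple can be pictured as a small record type. The following is a minimal sketch, not a structure taken from A1 or from our corpus: the class name, the field names, and the example values are all hypothetical and serve only to illustrate the five components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Illustrative container for the quintuple (ei, aij, sijkl, hk, tl)."""
    entity: str     # ei: name of the entity
    aspect: str     # aij: aspect of the entity
    sentiment: str  # sijkl: sentiment on that aspect (scale depends on the application)
    holder: str     # hk: opinion holder
    time: str       # tl: time when the opinion was expressed


# Hypothetical example values, not drawn from the paper's corpus.
example = Opinion(
    entity="Hotel X",
    aspect="breakfast",
    sentiment="positive",
    holder="reviewer_17",
    time="2016-05-03",
)
```

Document-level classification, as studied in this paper, only predicts the sentiment field for a whole document; the other fields matter for finer-grained analysis.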
Nowadays there is a huge amount of reviews and opinions
in online and social media, and analyzing them is important for future
decision making. Recently, business and academia have therefore focused on
finding the best way to analyze this huge amount of online opinions using
machine learning techniques.
Opinion mining is an ongoing field of text mining and
natural language processing that aims to identify and extract subjective
information from people's opinions. It is also known as sentiment analysis and
involves the development of a system to collect, analyze, summarize, or
categorize opinions based on different criteria. Following the definition of
an opinion in A1, the aim of opinion mining is to evaluate the quintuple of
the opinion. Opinion mining can be performed at three levels: document level,
sentence level, and fine-grained level.
In this paper we focus on document-level opinion
classification through a thorough experimental evaluation of algorithms for
opinion mining in Albanian. For this purpose we created a text corpus of opinions in
the Albanian language collected from well-known Albanian online newspapers. The
corpus contains opinions on five different subjects, and for each
subject the opinions are labeled as positive or negative.
First, we cleaned the dataset with a text preprocessing step consisting
of stop-word removal and stemming. Then we tested the
algorithms: we evaluated the performance of 53 different classification
algorithms using the Weka software.
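The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in our experiments: the stop-word set is a tiny hand-picked sample of Albanian function words, and the suffix list is a hypothetical placeholder rather than a real Albanian stemmer.

```python
import re

# Assumption: a tiny illustrative sample of Albanian function words,
# not the stop-word list used in the paper.
STOP_WORDS = {"dhe", "në", "të", "e", "i", "është", "për", "me", "një"}

# Assumption: hypothetical suffixes for a toy suffix-stripping stemmer.
SUFFIXES = ("ve", "it", "in")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]
```

The resulting token lists would then be turned into feature vectors before being passed to the classifiers.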
The rest of this paper is organized as follows: Section 2 presents related work on
opinion mining and sentiment analysis categorization; Section 3 presents
the methodology for opinion categorization; Section 4 presents and analyzes the
experiments; and Section 7 concludes and outlines future work.
2 Background and related work
In A1 the author gives an in-depth,
comprehensive study of all research fields of opinion mining. Over the last
decade, opinion mining has been one of the fastest growing research areas in natural
language processing. Companies and researchers are increasingly focused on
opinion mining and use its results in their daily work, for example in market
prediction or in collecting student feedback on a given lesson.
As opinion mining is a classification task, the main
machine learning approaches to text classification (supervised,
semi-supervised, and unsupervised) can be applied to it. Supervised learning
uses a labeled dataset to build the classification model, whereas in unsupervised
learning the dataset is unlabeled, which makes it less suited to this task.
The work in A2 evaluates the performance of
supervised machine learning techniques (Naïve Bayes, maximum entropy, and SVM) on
an English Twitter dataset, combined with semantic orientation based on WordNet
for extracting synonyms of the content. From the results of the performed
experiments we can conclude that the Naïve Bayes algorithm with unigram
features outperforms the other two algorithms and that the use of WordNet improves
the results.
To address the problem of unlabeled datasets and to use
unsupervised methods in opinion mining, the work in A3 proposed using a
clustering algorithm. In A4 the author used an unsupervised machine learning
technique, spectral clustering, to cluster an English Twitter dataset into
positive and negative opinions. The proposed system has four main steps:
cleaning and normalizing the dataset; applying a test dataset to cluster
the dataset using an unsupervised machine learning algorithm; applying k-means
to normalize the generated matrix; and finally applying a hierarchical
clustering algorithm to merge the clusters into one. The experimental results
indicate that the unsupervised machine learning technique outperforms the supervised
ones.
Opinion mining can be performed at three levels: document
level, sentence level, and aspect-based level. At the document level, the
whole document is classified as a positive or negative opinion. The documents
can come from one domain or from several domains, and from one language or from
several languages. At the sentence level, each sentence is classified as a positive or negative opinion.
And in aspect level opinion… The focus of our work is to evaluate the
performance of classification algorithms at the document level, where each
document is classified as a positive or negative opinion.
One of the first studies on opinion mining classification was
carried out in 2, which performed an experimental evaluation of three machine
learning algorithms (Naïve Bayes, maximum entropy classification, and support
vector machines) on a movie-review corpus labeled with positive and negative
opinions, using features based on unigrams and bigrams. They concluded that
the algorithms do not perform as well as in topic-based categorization and that
many challenges remain to be addressed.
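To make the unigram and bigram setting concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing over unigram and bigram features. It is an illustration only, not the implementation used in 2 or in Weka; the function and class names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_features(tokens, use_bigrams=True):
    """Return unigram features, optionally followed by bigram features."""
    feats = list(tokens)
    if use_bigrams:
        feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # feature counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            feats = ngram_features(tokens)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for feat in ngram_features(tokens):
                if feat in self.vocab:               # skip unseen features
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data for illustration only.
train_docs = [["good", "film"], ["great", "acting"],
              ["bad", "film"], ["boring", "plot"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict(tokens)` on a tokenized document returns the class with the highest log-probability. Features unseen in training are ignored, which is one common design choice among several.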