Nowadays, the analysis of opinions in online media such as newspapers, social media, forums, blogs, and product review sites plays a key role in everyday life. In this context, opinion mining, which aims to extract and organize opinions from users, is one of the fastest growing research areas in natural language processing. Machine learning techniques are a powerful instrument for analyzing and correctly understanding text data. In this paper we present a thorough experimental evaluation of machine learning algorithms used for opinion mining in the Albanian language. The experimental results are interpreted with respect to various evaluation criteria for the different algorithms, showing interesting features of the performance of each algorithm.
1 Introduction
An opinion is an unproven judgement or view that someone holds about something. It is not necessarily based on facts or knowledge, but it can be useful in different situations. According to A1, an opinion is a quintuple $(e_i, a_{ij}, s_{ijkl}, h_k, t_l)$, where $e_i$ is the name of an entity, $a_{ij}$ is an aspect of $e_i$, $s_{ijkl}$ is the sentiment on aspect $a_{ij}$ of entity $e_i$, $h_k$ is the opinion holder, and $t_l$ is the time when the opinion is expressed by $h_k$.
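As an illustration, the quintuple maps naturally onto a small record type. The following Python sketch is only one possible representation: the class name, the field names, and the example values are assumptions made for illustration and are not taken from A1 or from the corpus.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Opinion:
    """One opinion, stored as the quintuple (entity, aspect, sentiment, holder, time)."""
    entity: str      # name of the entity the opinion is about
    aspect: str      # aspect of that entity
    sentiment: str   # sentiment on that aspect, e.g. "positive" or "negative"
    holder: str      # who holds the opinion
    time: datetime   # when the opinion was expressed


# Hypothetical example values, invented for illustration only.
review = Opinion(
    entity="hotel_x",
    aspect="breakfast",
    sentiment="positive",
    holder="guest_17",
    time=datetime(2016, 5, 3),
)
```

The sentiment field is kept as a plain string here so that the same record can hold a binary label, a three-way label, or a numeric rating written as text.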
The sentiment level varies with the application: it can be binary (positive or negative), three-valued (positive, negative, or neutral), or a numeric score, for example the 1-10 points used in hotel and film ratings.
Nowadays there is a huge number of reviews and opinions in online and social media, and analyzing them is important for future decision making. As a result, business and academia have recently focused on finding the best way to analyze this huge volume of online opinions using machine learning techniques.
Opinion mining is an ongoing field of text mining and natural language processing that aims to identify and extract subjective information in people's opinions. It is also known as sentiment analysis and involves the development of a system to collect, analyze, summarize, or categorize opinions based on different criteria. Following the definition of an opinion in A1, the aim of opinion mining is to evaluate the quintuple of the opinion. Opinion mining can be performed at three levels: document level, sentence level, and fine-grained level. In this paper we focus on document-level opinion classification through a thorough experimental evaluation of algorithms for opinion mining in Albanian. For this purpose we created a text corpus of opinions in the Albanian language collected from well-known Albanian online newspapers. The corpus contains opinions on five different subjects, and for each subject the opinions are labeled as positive or negative. First, we cleaned the dataset with a text preprocessing step consisting of stop-word removal and stemming.
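The preprocessing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocessing actually used for the corpus: the stop-word list contains only a handful of common Albanian function words, and the suffix list is a toy rule set that stands in for a real Albanian stemmer. The function names are hypothetical.

```python
import re

# Toy stop-word list: a few common Albanian function words, for illustration only.
STOP_WORDS = {"dhe", "në", "të", "e", "i", "për", "me", "që", "një", "është"}

# Toy suffix list for naive suffix stripping; NOT a real Albanian stemmer.
SUFFIXES = ("ave", "eve", "it", "in", "et", "ve")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("Filmit dhe librave")
```

Stop words are removed before stemming so that the stop-word lookup is done on the original surface forms rather than on truncated stems.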
We then test the algorithms, evaluating the performance of 53 different classification algorithms using the Weka software.
The rest of this paper is organized as follows: Section 2 presents related work on opinion mining and sentiment analysis categorization; Section 3 presents the methodology for opinion categorization; Section 4 presents and analyzes the experiments; and Section 7 gives conclusions and future work.
2 Background and related work
In A1 the author gives an in-depth, comprehensive study of all research fields of opinion mining.
Over the last decade, opinion mining has been one of the fastest growing research areas in natural language processing. Companies and researchers are increasingly focused on opinion mining and use its results in their daily work, for example in market prediction or in collecting student feedback on a given lesson.
Because opinion mining is a classification task, the main machine learning approaches to text classification (supervised, semi-supervised, and unsupervised) can be applied to it. Supervised learning uses a labeled dataset to build the classification model, whereas unsupervised learning works with an unlabeled dataset, which makes it less appropriate for opinion mining. The work in A2 evaluates the performance of supervised machine learning techniques (Naïve Bayes, maximum entropy, and SVM) on an English Twitter dataset, combined with WordNet-based semantic orientation for extracting synonyms of the content.
The results of the experiments performed there indicate that the Naïve Bayes algorithm with unigram features outperforms the other two algorithms and that the use of WordNet improves accuracy.
To address the problem of unlabeled datasets and to use unsupervised methods in opinion mining, the work in A3 proposed using a clustering algorithm. In A4 the author used an unsupervised machine learning technique, spectral clustering, to cluster an English Twitter dataset into positive and negative opinions.
The proposed system has four main steps: cleaning and normalizing the dataset; applying a test dataset to cluster the dataset using an unsupervised machine learning algorithm; applying k-means to normalize the generated matrix; and finally applying a hierarchical clustering algorithm to merge the clusters into one. The experimental results reported in that work indicate that the unsupervised machine learning technique outperforms supervised machine learning.
Opinion mining can be performed at three levels: document level, sentence level, and aspect-based level. At the document level, the whole document is judged to be a positive or negative opinion. The documents can come from a single domain or from several domains, and from a single language or from several languages. At the sentence level, a sentence is judged to be a positive or negative opinion.
And in aspect level opinion… The focus of our work is to evaluate the performance of classification algorithms at the document level, where each document is classified as a positive or negative opinion.
One of the first studies of opinion mining classification was carried out by 2, which performed an experimental evaluation of three machine learning algorithms (Naïve Bayes, maximum entropy classification, and support vector machines) on a movie-review corpus labeled as positive and negative opinions, focusing on features based on unigrams and bigrams. They concluded that the algorithms do not perform as well as in topic-based categorization and that many challenges remain to be addressed.
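To make the idea of unigram and bigram features concrete, the sketch below builds both feature types from a token list and feeds them to a small multinomial Naïve Bayes classifier. It is an illustration under stated assumptions rather than the setup used in the cited study: the add-one (Laplace) smoothing, the class and function names, and the four toy training documents are all choices made here for the example.

```python
import math
from collections import Counter, defaultdict


def ngram_features(tokens, use_bigrams=True):
    """Return unigram features, plus adjacent-pair bigram features if requested."""
    features = list(tokens)
    if use_bigrams:
        features += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.feature_counts[label].update(ngram_features(tokens))
        self.vocabulary = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for feature in ngram_features(tokens):
                if feature in self.vocabulary:  # skip features unseen in training
                    score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
train_docs = [
    "a great and moving film".split(),
    "great acting and a great story".split(),
    "a boring and predictable film".split(),
    "boring story and weak acting".split(),
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("a great story".split())
```

Scores are accumulated in log space to avoid numerical underflow when a document contains many features.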