Jamie Maguire

Software Architect / Consultant / Developer

Analytics and Big Data, Machine Learning, Sentiment Analysis

The difficulties of sentiment analysis and what are the solutions?

In my last post I introduced Bayes’ rule and its relationship with sentiment analysis. In this post I’ll talk about some of the difficulties of applying sentiment analysis and what we can do to try to improve accuracy.

Sentiment analysis can be applied to many areas, but deciding whether a statement is positive or negative can be difficult. Text is mainly categorised into two types: facts and opinions.

Facts are objective statements about entities, events and their properties.

Lui argues that opinions are completely subjective and describe people’s sentiments, appraisals or general feelings towards entities and their properties (Lui, 2010).

Human language can be complex for machine-based learning systems to interpret, and opinions can be expressed with sarcasm or irony. The order of words can add even more confusion.

Take the following example (Frank):

“I currently use the Nikon D90 and love it, but not as much as the Canon 40D/50D. I chose the D90 for the video feature. My mistake.”

In the next example, the author is conveying sarcasm, which can be hard for classifiers to process (Frank):

“After a whole 5 hours away from work, I get to go back again, I’m so lucky!”

For a classifier to process data and provide more accurate results, it must be trained. This can be achieved by collecting training data. Various sources can be used; one popular approach is to use a corpus of movie reviews labelled as positive or negative. The algorithm is then applied; the best accuracy reported for this approach is approximately 82.9% (Read).

Improve accuracy by processing neutral sentiment

A paper published by Koppel discusses how the majority of research into sentiment analysis ignores “neutral” sentiment. The paper goes on to discuss why it is crucial to identify neutral polarity for a number of reasons; in particular, learning from positive and negative examples alone will not result in accurate classification.

Koppel also reinforces the point that “In almost all actual polarity problems, including sentiment analysis, there are, however, at least three categories that must be distinguished: positive, negative and neutral.”

An article written by Kitsuregaw also reinforces the importance of classifying and training with neutral phrases. It reports that, during error analysis, the majority of errors were actually related to neutral phrases: out of 48 incorrectly classified phrases, 37 were neutral. The article attributes the error rate to not using a “neutral corpus” when training the classifier.

Taking the papers by Koppel and Kitsuregaw into account, it would seem sensible to factor neutral polarity / sentiment into any classification process.
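To make the idea of training with a neutral class a little more concrete, here is a minimal sketch of a Naive Bayes classifier with three labels instead of two. The training sentences, the function names `train` and `classify`, and the whitespace tokenisation are all my own assumptions for illustration; this is not the method used in either paper, and a real classifier would need far more data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-made training data: (text, label) pairs.
# Note the third label, "neutral", alongside positive and negative.
TRAINING_DATA = [
    ("i love this camera", "positive"),
    ("great video quality", "positive"),
    ("i hate this camera", "negative"),
    ("terrible battery life", "negative"),
    ("the camera ships with a strap", "neutral"),
    ("the battery is charged over usb", "neutral"),
]


def train(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior under Bayes' rule."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
print(classify("i love the video quality", model))
print(classify("the strap is charged over usb", model))
```

Without the neutral examples, the second sentence would be forced into either the positive or the negative bucket, which is the kind of error both papers describe.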


It doesn’t end there, and there is more that can be done. Often you will find strings of text that add no value to the classification process, which brings us on to “Stop Words”.

Stop Words

Stop words, in computing terms, are words which are filtered out prior to or after processing of natural language data and text (Wikipedia). Unfortunately, there is no single definitive list of stop words, and any group of words can be chosen as stop words for sentiment analysis. They are sometimes known as “noise words”.

Search engines, for example, do not record common stop words in order to save disk space or to speed up searches (Sullivan), i.e. search engines “stop” looking at them.

A common list of stop words could contain something like the following:

a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your

Notice they don’t express much in the way of emotion?

If you’re trying to perform sentiment analysis, you’ll want to remove words like these from your classification process.
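As a rough illustration of what removing stop words looks like in practice, here is a small Python sketch. The `STOP_WORDS` set is just a shortened subset of the list above, and `remove_stop_words` is a name I have made up for this example; splitting on whitespace and stripping punctuation is a simplifying assumption rather than proper tokenisation.

```python
import string

# Illustrative subset of the stop word list above (an assumption for
# this sketch); swap in whichever list suits your problem domain.
STOP_WORDS = {
    "a", "about", "am", "an", "and", "are", "as", "at", "be", "but",
    "by", "for", "from", "i", "in", "is", "it", "of", "on", "the",
    "this", "to", "was", "with",
}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace and drop stop words."""
    tokens = []
    for raw_token in text.lower().split():
        # Strip leading/trailing punctuation such as commas and full stops.
        token = raw_token.strip(string.punctuation)
        if token and token not in stop_words:
            tokens.append(token)
    return tokens


print(remove_stop_words("I am happy with the battery life of this camera."))
# → ['happy', 'battery', 'life', 'camera']
```

One thing to watch: the longer list above also contains words such as “not” and “like”, which can carry or reverse sentiment, so it may be worth checking whether removing them helps or hurts accuracy on your own data.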

Pre-processing 

Collecting data is only one part of the challenge. Once data is obtained and stop words have been removed, it must be further cleansed. In an effort to produce more accurate results, data is generally “pre-processed”. The approach to pre-processing will vary depending on your problem domain.

If you were trying to pre-process Tweets from Twitter, for example, you may want to replace all hyperlinks with a stop word such as URL.

Why?

Well, most of the URLs on Twitter are shortened, so by instructing your classifier to treat them as a stop word you save yourself some processing time. You may also adopt the same approach with usernames to further cleanse each tweet.
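Here is one way the tweet clean-up described above might be sketched in Python using regular expressions. The patterns are deliberately simple and are my own assumptions (they will not catch every possible link or handle), and the `USERNAME` placeholder and the `preprocess_tweet` function name are hypothetical choices for this example.

```python
import re

# Simplified patterns (assumptions for this sketch, not exhaustive).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
USERNAME_PATTERN = re.compile(r"@\w+")


def preprocess_tweet(tweet):
    """Replace links and @usernames with placeholder tokens."""
    # Replace URLs first so an "@" inside a link is not treated as a username.
    tweet = URL_PATTERN.sub("URL", tweet)
    tweet = USERNAME_PATTERN.sub("USERNAME", tweet)
    # Collapse any repeated whitespace left behind.
    return " ".join(tweet.split())


print(preprocess_tweet("@jamie  loving the new camera http://t.co/abc123"))
# → USERNAME loving the new camera URL
```

The placeholder tokens can then be added to your stop word list so that the classifier ignores them.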

Finally

If you’ve identified suitable stop words and cleansed the data sufficiently, you’re one step closer to being able to perform sentiment analysis against your data!

The final thing you have to do is to TRAIN your classifier.  I’ll cover that in my next post though and discuss what options you have.

As always, if you have any comments, suggestions or other insights, then please drop me a message!


4 Comments

  1. Does your site have a contact page? I’m having trouble locating
    it but, I’d like to shoot you an email. I’ve got some creative ideas for
    your blog you might be interested in hearing. Either way, great site and I look forward to seeing
    it develop over time.

    • jamie_maguire

      Hi Jason thanks for taking the time to read and leaving a comment, I’ve been thinking of setting up a blog for years and have never got round to it. Getting comments from people like yourself is great for motivation!

      I’m trying to post at least one article a week on here and have quite a few ideas for posts related to web development BPM, workflow etc. I’ve got content from other side projects I’ve worked on that I’m sure I could adapt to blog posts which some people may find useful.

      Before I put out the email address, yours was flagged as SPAM, is that right?

  2. I’ve been surfing online more than three hours today, yet I never
    found any interesting article like yours. It’s pretty
    worth enough for me. Personally, if all web owners and
    bloggers made good content as you did, the internet will be a lot more useful than ever
    before.

  3. jamie_maguire

    Hi, some but managing to prevent this so far. What about you?
