Sentiment Analysis in the Cloud with Google Cloud Natural Language, AWS Comprehend, & IBM Watson
A Tutorial for Testing and Boosting Performance with RStudio
The Old World of Sentiment Analysis
Off-the-shelf sentiment analysis tools have been knocking around for a while. The first time we used one of these services was back in 2011, as part of monitoring social media on behalf of clients. We wanted to measure the emotional quality of posts and not just the quantity of interactions. The results — which relied on the analysis of individual words rather than complete expressions — were disappointing, to the degree that we avoided situations that might require automated sentiment analysis as a solution.
A Great Leap Forward
In the last 18 months, we’ve seen a major leap forward in the quality, accessibility, and cost of cloud-based sentiment analysis services. Sophisticated (and expensive) proprietary tools and free software packages such as RSentiment have given way to the rapid expansion of cloud-based services with open APIs and relatively low costs. Google’s Cloud Natural Language has been around since February 2016, built upon the same machine learning models that power other Google services such as the recently controversial Google Duplex. Amazon’s AWS Comprehend was launched in November 2017, and IBM Watson’s Natural Language Understanding (NLU) became available as a stand-alone service in early 2017.
These new tools offer the promise of improved performance and flexibility by tapping into machine learning platforms built on massive data sets. Traditional dictionary-based analysis, focused on individual word meaning, has been replaced by algorithms that come closer to understanding sentence and paragraph structure as well as complicated linguistic concepts such as irony and sarcasm.
So how do the big cloud sentiment services compare out of the box?
Comparing Google, Amazon, and IBM Sentiment Analysis
To assess how far sentiment analysis has developed, we compared the three services head-to-head. The easiest way to do this (and useful for other purposes) is to use each service’s API to send the text to be analyzed and receive back the results.
As you may gather from our other articles, we’re big advocates of the statistical programming language R and so we looked to develop the analysis within that language. While there isn’t a native R API for any of the services, cURL is available meaning the httr R package can be used to connect to the API.
Fortunately, as is often the case, we didn’t have to work too hard with the APIs: some kind souls had already written R packages for two of the three sentiment analysis services. Namely:
aws.comprehend for AWS Comprehend
googleLanguageR for Google Cloud Natural Language
For IBM Watson, we wrote a simple function using the httr package to access the API via a GET call.
The Testing Data
There are plenty of sentiment analysis training data sets available for free. “Training” means that each chunk of text has been pre-categorised and verified by a human. Here, rather than training a machine learning model, we’re using that labelled data to test the accuracy of pre-existing models.
For our tests, we used a popular Twitter comment data set consisting of 498 tweets, each categorized by topic and by sentiment (positive, negative, or neutral).
As is usual with tweets, the grammar, spelling, and diction are highly colloquial and therefore difficult to analyze. To establish a benchmark, we ran the tweets through RSentiment, a freely available R package. RSentiment correctly predicted 56% of the test data sentiments.
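As a sketch, the benchmark calculation looks like this, assuming the tweets sit in a data frame `tweets` with a `text` column and a human-labelled `sentiment` column (both names are ours, not from the dataset):

```r
# Benchmark sketch: score the tweets with RSentiment and compare
# the predictions with the human labels
library(RSentiment)

pred <- calculate_sentiment(tweets$text)$sentiment

# RSentiment returns labels such as "Positive"/"Negative"/"Neutral",
# so normalise case before comparing with the training labels
accuracy <- mean(tolower(pred) == tolower(tweets$sentiment))
accuracy
```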
Connecting Amazon, Google, and IBM Sentiment Analysis Services
The aws.comprehend package is very easy to use; the only gap is the documentation for authentication. Simply set the authentication variables in the global R environment:
Sys.setenv("AWS_ACCESS_KEY_ID" = [your access key ID], "AWS_SECRET_ACCESS_KEY" = [your secret access key])
For best practice, the authentication details should come from a user with access to this service and not the root user.
The workhorse function is then:
detect_sentiment("text to detect sentiment")
and it’s easy enough to use:
… to return a list object that can be extracted with a simple for-loop.
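For instance, a sketch of scoring every tweet, again assuming a `tweets` data frame with a `text` column (the `Sentiment` column name reflects the aws.comprehend return value at the time of writing and may differ by version):

```r
library(aws.comprehend)

# Score each tweet individually, then stack the per-call results
results <- lapply(tweets$text, detect_sentiment)
aws_results <- do.call(rbind, results)

# The predicted label: POSITIVE / NEGATIVE / NEUTRAL / MIXED
aws_sentiment <- aws_results$Sentiment
```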
Google Cloud Natural Language
The googleLanguageR package is just as easy. Download the credentials file from your Google Cloud account and put that in a folder within your locally set working directory.
Then, immediately after loading the library, specify the location of the creds file:
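For example, using the package’s `gl_auth()` function (the key path below is illustrative; point it at the JSON file you downloaded):

```r
library(googleLanguageR)

# Authenticate with the service-account key downloaded from Google Cloud
gl_auth("credentials/gcp-service-account.json")
```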
The main function is:
gl_nlp("text to detect sentiment")
Again, you can use lapply to run the function through all the tweets.
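A sketch of that loop, converting the document-level score into a label (the zero cut-off is our own convention, not Google’s):

```r
library(googleLanguageR)

results <- lapply(tweets$text, gl_nlp)

# gl_nlp() returns a list whose documentSentiment element holds the
# overall score; map it to a label for comparison with the other services
google_sentiment <- sapply(results, function(res) {
  score <- res$documentSentiment$score
  if (score > 0) "POSITIVE" else if (score < 0) "NEGATIVE" else "NEUTRAL"
})
```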
IBM Watson Natural Language Understanding
There are a number of options for text analysis within IBM Watson. For simple sentiment analysis, the Natural Language Understanding Cloud Foundry service provides a specific API endpoint, which returns a positive, negative, or neutral result.
With no working pre-existing package we wrote a simple API call using httr.
Authentication is set at the service level, meaning the details for the sentiment analysis service are unique to that service.
# Set authentication details and options as variables
sent_URL <- "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze"
sent_username <- "xxxxxxx-yyyyyy"
sent_password <- "abcdef"
features <- "sentiment,keywords"
version <- "2018-05-15"
text <- "I really like this dummy text"

# API call
library(httr)
sentiment_test <- content(GET(sent_URL,
                              authenticate(sent_username, sent_password),
                              query = list(version = version, text = text, features = features),
                              add_headers(Accept = "application/json"),
                              verbose()))
Again, this will produce a list that can be looped through.
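One way to do that is to wrap the call above in a helper of our own (`get_ibm_sentiment` is our name, not a package function) and pull out the document-level label, which NLU nests under `sentiment$document`:

```r
library(httr)

# Helper of our own devising around the GET call shown above;
# sent_URL, sent_username, sent_password, version and features
# are the variables set in the previous snippet
get_ibm_sentiment <- function(text) {
  res <- content(GET(sent_URL,
                     authenticate(sent_username, sent_password),
                     query = list(version = version, text = text,
                                  features = features),
                     add_headers(Accept = "application/json")))
  res$sentiment$document$label
}

ibm_sentiment <- sapply(tweets$text, get_ibm_sentiment)
```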
Sentiment Services Outputs
The three services differ considerably in the information they return:
AWS Comprehend returns one of four possible labels: positive, negative, neutral, or mixed.
IBM Watson gives more detail, with a breakdown of the keywords and a sentiment score on a scale that indicates the strength of the sentiment in the given direction.
Google Cloud Natural Language gives the richest return. It deconstructs the text into sentences, entities, and tokens and attempts to categorise each unit’s case, mood & tense.
For this test, we were just looking at sentiment analysis accuracy, and calculated the number of times each service correctly predicted sentiment against the training data set. AWS returns ‘mixed’ sentiment in some instances. Given this was not a response in the original training dataset, nor from the other two services, we filtered out those occurrences, giving a new total of 491 observations versus the original 498.
A quick summary of the predictive accuracy for each service:
AWS Comprehend: 65%
Google Cloud Natural Language: 67%
IBM Watson: 68%
For this dataset, IBM was the winner getting 68% correct. This is one dataset, and so each service may perform better or worse on different types of text, for instance those that are less colloquial.
Combining Services to Increase Accuracy
What if we were to combine the services? Is there sufficient overlap between the services’ results such that if any two agree there is an increase in accuracy?
We did precisely this, the rule being that if two services agreed, we took that as the overall sentiment. If none agreed, we marked the tweet as ‘NEUTRAL’, since that could be considered the midpoint of sentiment.
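The two-out-of-three rule can be sketched as a small helper (`majority_vote` is our own name):

```r
# Two-out-of-three rule: keep any label at least two services agree on,
# otherwise fall back to "NEUTRAL"
majority_vote <- function(a, b, c) {
  votes <- table(toupper(c(a, b, c)))
  if (max(votes) >= 2) names(which.max(votes)) else "NEUTRAL"
}

majority_vote("POSITIVE", "POSITIVE", "NEGATIVE")  # "POSITIVE"
majority_vote("POSITIVE", "NEGATIVE", "NEUTRAL")   # "NEUTRAL"
```

Applying this across the three services’ label vectors with `mapply()` then gives the combined prediction for each tweet.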
Combining the services increased the overall accuracy by 5 percentage points, a relative improvement of 7.4% on the best single service.
In 26 cases (5.3%), the results were undecided, meaning that all 3 services disagreed.
Sentiment analysis would seem to have come a long way, but it is far from perfect. Using the APIs and a programming language such as R allows for better analysis of the results. We’d advise creating sample test datasets to see which of the services is most accurate for different types of text.
Moreover, there would seem to be value in using more than one service to gain higher accuracy. Improving the accuracy of just one of the services (without knowing which) will increase the combined accuracy at a faster rate than relying on a single service.
And, importantly, the costs for each service are low enough to justify the extra certainty gained from using multiple cloud-based services.