Sentiment Analysis in the Cloud with Microsoft Azure


Could we improve the accuracy of sentiment analysis by adding Microsoft Azure to our aggregation model?

Earlier this year we wrote an article comparing three popular cloud sentiment analysis services (Google Cloud Natural Language, AWS Comprehend and IBM Watson) to see how accurate they were against a pre-classified set of tweets. We also combined the services by taking a majority decision between them.

We had an inquiry from someone we know at Microsoft asking how Azure Text Analytics compared with the other three services. There's a nice reference here on how to use the httr package for the API calls to Azure.
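For readers who want to try the call themselves, here is a minimal sketch of a sentiment request with httr. It is not the code from the reference above. The endpoint path, the Ocp-Apim-Subscription-Key header and the 0-1 score field reflect version 2.0 of the Text Analytics REST API as we understand it; the region, the helper names and the 0.4/0.6 cut-offs for turning a score into a label are illustrative assumptions, so check the current Azure documentation before relying on any of them.

```r
# Build the request body: one document per text, with ids "1", "2", ...
build_sentiment_body <- function(texts, language = "en") {
  docs <- lapply(seq_along(texts), function(i) {
    list(language = language, id = as.character(i), text = texts[[i]])
  })
  list(documents = docs)
}

# Turn a 0-1 score into a label. The cut-offs are assumptions, not Azure's.
score_to_label <- function(score, neg_below = 0.4, pos_above = 0.6) {
  ifelse(score < neg_below, "negative",
         ifelse(score > pos_above, "positive", "neutral"))
}

# Send the texts to the (assumed) v2.0 sentiment endpoint and tidy the reply.
azure_sentiment <- function(texts, key, region = "westus") {
  url <- sprintf(
    "https://%s.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment",
    region
  )
  resp <- httr::POST(
    url,
    httr::add_headers(`Ocp-Apim-Subscription-Key` = key),
    body = build_sentiment_body(texts),
    encode = "json"
  )
  httr::stop_for_status(resp)
  parsed <- httr::content(resp, as = "parsed")

  # Match scores back to the inputs by id; texts with no score stay NA.
  ids <- vapply(parsed$documents, function(d) as.integer(d$id), integer(1))
  scores <- rep(NA_real_, length(texts))
  scores[ids] <- vapply(parsed$documents,
                        function(d) as.numeric(d$score), numeric(1))

  data.frame(text = texts, score = scores, label = score_to_label(scores),
             stringsAsFactors = FALSE)
}
```

Calling azure_sentiment(c("I love this", "This is awful"), key = "your-key") would then return one row per text, provided the key and region are valid for your own Azure resource.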

We obliged, and Azure Text Analytics achieved an accuracy of 72% on the same data set as before. This is the highest score of the four individual services, although not quite as high as the 73% achieved by combining Google Cloud Natural Language, AWS Comprehend and IBM Watson.

But we had a new challenge when adding Azure to the mix: if we now had 4 services we couldn't easily combine the scores to derive a simple majority given there could be 2-2 ties. How best to pick a result from mixed information from the four services?
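To make the tie problem concrete, here is a small sketch of a majority vote in R. The function name and the example votes are made up for illustration and are not taken from our earlier code.

```r
# Return the most common label among the votes, or NA when there is a tie.
majority_vote <- function(votes) {
  counts <- table(votes)
  top <- names(counts)[as.vector(counts) == max(counts)]
  if (length(top) == 1) top else NA_character_
}

# Hypothetical votes for a single tweet
three <- c(google = "positive", aws = "negative", ibm = "positive")
four <- c(three, azure = "negative")

majority_vote(three)  # "positive"
majority_vote(four)   # NA, because the vote is split 2-2
```

With three services a tie only happens when all three disagree; with four, any 2-2 split leaves the vote undecided, so some other rule is needed to pick a result.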

After trying a few methods, we decided to apply a machine learning package on the combined results of the four sentiment analysis services. The results of the initial sentiment analysis queries looked like this:

[RStudio screen capture]

The R caret package provides a consistent wrapper around many different machine learning packages, which allowed us to try a few different ones and compare the results. For those new to machine learning, the model needs to be 'trained' with a sample of the data before it can be let loose on the rest. For most applications of machine learning, a majority of the data set would be used for training, typically 70-80% or more. Given we had a different scenario here and wished to measure the accuracy over as much of the data as possible, we used only 110 of the 491 tweets as the training set. Beyond 110 we saw very little improvement in the accuracy.

The final code was pretty simple to implement as shown below.  We first broke up the main data frame, sub_combined, into test and training sets and then used the train function against the training set to create a fit object. The predict function then uses the fit object with the remainder of data in the test data frame and returns the results.

In essence we're using machine learning to do something like regression, but with a categorical (text) outcome and categorical predictors, which is more commonly called classification.

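# Load caret, which provides train() and confusionMatrix()
library(caret)

# Set function: fit method x on the training rows, then evaluate on the rest
ml_fun <- function(x){
  comb_train <- sub_combined[1:110,]
  comb_test <- sub_combined[111:491,]
  comb_fit <- train(act_sent_text ~ ., 
                    data = comb_train, 
                    method = x)
  comb_pred <- predict(comb_fit, comb_test)
  # Note: confusionMatrix() expects act_sent_text to be a factor
  confusionMatrix(comb_pred, comb_test$act_sent_text)
}

# Run function with random forest (caret's "ranger" method)
ml_fun("ranger")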

The ranger package, an implementation of random forest, gave the best results of the ten or so methods we tried. The average accuracy over repeatedly randomized training sets was 78%, beating all the previous methods.
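The 78% figure is an average over repeated random train/test splits. The sketch below shows one way such an average could be computed. To keep it runnable in base R it uses toy data and a deliberately simple stand-in learner (a lookup table of the most common actual label for each combination of service outputs) rather than caret and ranger. The function names, the toy data frame, its service column names and the learner are all assumptions for illustration; this is not the code behind the results above, and the toy accuracy says nothing about the real services.

```r
# Stand-in learner: for each combination of service labels seen in training,
# predict the most common actual label; unseen combinations get the overall mode.
lookup_fit_predict <- function(train, test, outcome = "act_sent_text") {
  predictors <- setdiff(names(train), outcome)
  key <- function(df) do.call(paste, c(df[predictors], sep = "|"))
  mode_of <- function(x) names(which.max(table(x)))
  lookup <- vapply(split(train[[outcome]], key(train)), mode_of, character(1))
  pred <- unname(lookup[key(test)])
  pred[is.na(pred)] <- mode_of(train[[outcome]])
  pred
}

# Average test accuracy over `reps` random splits with `n_train` training rows.
mean_split_accuracy <- function(data, fit_predict, n_train, reps = 20,
                                outcome = "act_sent_text") {
  accs <- vapply(seq_len(reps), function(i) {
    train_idx <- sample(nrow(data), n_train)
    train <- data[train_idx, , drop = FALSE]
    test <- data[-train_idx, , drop = FALSE]
    pred <- fit_predict(train, test)
    mean(as.character(pred) == as.character(test[[outcome]]))
  }, numeric(1))
  mean(accs)
}

# Toy data only: four noisy "services" that each agree with the truth
# some of the time. Column names are hypothetical.
set.seed(1)
labels <- c("negative", "neutral", "positive")
n <- 491
actual <- sample(labels, n, replace = TRUE)
noisy <- function(truth, p_correct) {
  ifelse(runif(length(truth)) < p_correct, truth,
         sample(labels, length(truth), replace = TRUE))
}
toy <- data.frame(
  google = noisy(actual, 0.6), aws = noisy(actual, 0.6),
  ibm = noisy(actual, 0.6), azure = noisy(actual, 0.6),
  act_sent_text = actual, stringsAsFactors = FALSE
)

acc <- mean_split_accuracy(toy, lookup_fit_predict, n_train = 110)
print(acc)
```

Because the learner is passed in as a function, a caret-based one could be swapped in where caret and ranger are installed, for example function(tr, te) predict(caret::train(act_sent_text ~ ., data = tr, method = "ranger"), te).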


In conclusion, as with our original article and method, combining the services can give a significant accuracy premium over any single service. The machine learning method provided an especially satisfying boost over our previous tests.