ACRE Machine Learning

 

ACRE ML models use the Nearest Word Cloud machine learning algorithm, which assigns each document to the label value whose Trained Word Cloud (TWC) is most similar to the word cloud of the document.  A higher Confidence value indicates greater similarity between the document's word cloud and a Trained Word Cloud.

In the example below, the New Document Word Cloud is compared against the Trained Word Clouds for Positive and Negative.  The label with the highest Confidence value is assigned.  If no Confidence value exceeds the Threshold, the document is instead assigned the Default value of Neutral.

 ML diagram
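
The comparison described above can be sketched in a few lines of Python.  This is only an illustration, not the ACRE implementation: the source does not say how similarity is measured, so cosine similarity over raw term counts is assumed here, and the names word_cloud, cosine_similarity and classify, as well as the threshold value of 0.2, are hypothetical.

```python
import math
from collections import Counter

def word_cloud(text):
    """Build a word cloud as a term -> count mapping (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Assumed similarity measure; the actual ACRE Confidence formula is not documented here."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(doc_cloud, trained_clouds, threshold=0.2, default="Neutral"):
    """Return (label, confidence) for the most similar Trained Word Cloud.

    Falls back to the default label when no confidence exceeds the threshold.
    """
    best_label, best_conf = default, 0.0
    for label, twc in trained_clouds.items():
        conf = cosine_similarity(doc_cloud, twc)
        if conf > best_conf:
            best_label, best_conf = label, conf
    if best_conf <= threshold:
        return default, best_conf
    return best_label, best_conf

# Hypothetical Trained Word Clouds for a two-label Sentiment model
trained = {
    "Positive": word_cloud("great excellent love great"),
    "Negative": word_cloud("terrible awful hate"),
}
label, confidence = classify(word_cloud("I love this great product"), trained)
```

With these toy word clouds, the new document shares "love" and "great" with the Positive cloud and nothing with the Negative cloud, so Positive is selected.  A document that shares no terms with either cloud falls back to Neutral.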

The Train command creates the Trained Word Cloud values for each label. 

How do you train a model?  There are 3 ways:

(1)   User provides unlabeled sample documents for one label value at a time

a.      Example: user loads a set of "Positive" document examples to train a Sentiment ML model.  Next, the user loads a set of "Negative" documents.  And so on, through each label value.

(2)   User provides a pre-labeled document set

a.      Example: user provides a spreadsheet where every row contains text to be analyzed in one column and the correct label for that text in another.

(3)   ML model is trained using outcomes from another label model

a.      Example: user has executed a Rules model for Sentiment that labeled 50% of the documents.  The user then enables ML Extension, which trains a new ML model from the Rules results and uses it to label the remaining documents.
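
As a rough sketch of what training might involve, the Python below builds one word cloud per label from (text, label) pairs, as in method (2).  It assumes a Trained Word Cloud is simply the aggregated term counts of all training documents for that label, which the source does not confirm.  The function name train is hypothetical.  Under this assumption, methods (1) and (3) would differ only in where the pairs come from: a batch of documents loaded for one label, or the output of another label model.

```python
from collections import Counter, defaultdict

def train(labeled_docs):
    """Aggregate term counts per label from (text, label) pairs.

    Assumption: a Trained Word Cloud is the summed term counts of its
    training documents. The real ACRE Train command may differ.
    """
    clouds = defaultdict(Counter)
    for text, label in labeled_docs:
        clouds[label].update(text.lower().split())
    return dict(clouds)

# Rows as they might be read from a pre-labeled spreadsheet (hypothetical data)
rows = [
    ("great service", "Positive"),
    ("great food", "Positive"),
    ("awful service", "Negative"),
]
model = train(rows)
```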

After an ML model is trained, it can be executed on a set of document term vectors.  The trained model can be saved, exported and deployed for execution anywhere. 

The Nearest Word Cloud (NWC) algorithm was chosen as the ACRE default based on the following properties:

a)      Intuitive and Transparent

Trained Word Clouds and Document Word Clouds are easily viewed at any time during the analysis, which provides a visual method for modelers to understand why each particular label was assigned.   This visual verification, in combination with word frequency table comparisons, can guide the modeler in determining whether additional rules or training are needed to improve results.

b)      Scalable

Both model training and execution tasks are scalable using cloud computing services.   Since Nearest Word Cloud is a linear-time algorithm (O(n)), execution time grows only linearly with the number of input documents.  Training and execution jobs can be decomposed in a straightforward manner for multi-processor scheduling.
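
One way to picture the decomposition is shown below, again assuming that a Trained Word Cloud is a sum of term counts.  Under that assumption, each partition of the training set can be processed independently and the partial results merged by addition.  The functions train_partition, merge_clouds and train_decomposed are hypothetical names, and the partitions run sequentially here for simplicity; in practice each could be handed to a separate worker.

```python
from collections import Counter
from functools import reduce

def train_partition(labeled_docs):
    """Build per-label term counts for one partition of (text, label) pairs."""
    clouds = {}
    for text, label in labeled_docs:
        clouds.setdefault(label, Counter()).update(text.lower().split())
    return clouds

def merge_clouds(left, right):
    """Combine two partial results by adding term counts label by label."""
    merged = {label: Counter(cloud) for label, cloud in left.items()}
    for label, cloud in right.items():
        merged.setdefault(label, Counter()).update(cloud)
    return merged

def train_decomposed(labeled_docs, partitions=4):
    """Split the training set, train each chunk independently, then merge."""
    chunks = [labeled_docs[i::partitions] for i in range(partitions)]
    partials = [train_partition(chunk) for chunk in chunks]  # each chunk could run on its own worker
    return reduce(merge_clouds, partials, {})

docs = [
    ("great service", "Positive"),
    ("great food", "Positive"),
    ("awful service", "Negative"),
    ("terrible food", "Negative"),
    ("excellent staff", "Positive"),
]
decomposed = train_decomposed(docs, partitions=3)
```

Because merging is plain addition, the decomposed result is identical to training on the whole set in a single pass.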

c)      Accurate

Previous academic studies (example) have shown that Nearest Word Cloud provides higher accuracy than other linear-time text classification algorithms, such as Bayesian analysis, k-nearest-neighbors and C4.5.

To further test ACRE algorithm accuracy, we have executed it on the RCV1 corpus, in which Reuters editors manually categorized hundreds of thousands of news articles using a category tree with 55 second-level labels.  After training on 1000 articles, an ACRE ML model was executed on 500 new articles, selecting the single best label for each new article.   The ACRE label selection was correct (matched a label selected by a Reuters editor) for more than 70% of the new articles without model tuning.

A critical factor in ML model accuracy is the choice of the vocabulary of analysis, that is, the set of words and phrases that are used in calculating Confidence values.   Users can significantly reduce noise in the ML calculations and increase accuracy by narrowing this set down to only those words and phrases that should be significant in document comparisons.

ACRE offers useful NLP tools for managing this vocabulary of analysis.  Users can also fine-tune ML model performance by changing Term Weights to either boost or diminish the influence of particular terms (words or bi-grams) in the calculation of Confidence values.
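
The effect of a restricted vocabulary and of Term Weights can be illustrated with the sketch below.  It is not the ACRE API: extract_terms and weighted_cloud are hypothetical helpers, and multiplying each occurrence by its weight (defaulting to 1.0) is an assumed way of applying Term Weights.

```python
def extract_terms(text):
    """Return the words and bi-grams of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def weighted_cloud(text, vocabulary, term_weights=None):
    """Build a word cloud restricted to the vocabulary of analysis.

    Terms outside the vocabulary are ignored. Each occurrence of a kept
    term contributes its weight (assumed default 1.0) to the cloud.
    """
    term_weights = term_weights or {}
    cloud = {}
    for term in extract_terms(text):
        if term in vocabulary:
            cloud[term] = cloud.get(term, 0.0) + term_weights.get(term, 1.0)
    return cloud

# Hypothetical vocabulary and weights: boost the bi-gram "not good"
vocabulary = {"service", "good", "not good"}
weights = {"not good": 3.0}
cloud = weighted_cloud("The service was not good", vocabulary, weights)
```

In this example the words "the", "was" and "not" are dropped as noise, while the boosted bi-gram "not good" carries three times the influence of the single word "good" in any later similarity calculation.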