Executing a label model on a set of documents causes ACRE to assign one or more label values from that model's Category Tree (CT) to each document. Typically, the associated Category Tree label values must be defined before the model is executed. In contrast, Pattern Models create their CT dynamically as they execute.
Each Label Model execution creates metadata (assigned label values) that provides information to the user about the documents. This metadata is presented differently depending on the specific ACRE product:
For ACRE v1.3, each label model execution generates an AJO (ACRE Job Output) file from which a user can extract the following:
o A Label Table listing every document and its assigned label(s).
o A Summary Table showing total assigned counts for all label values in the CT.
o A Term Frequency Table and associated Word Cloud for all documents, and for every label node in the CT.
For other products, each label model execution stores metadata and modifies the following user displays:
o A column, titled with the Label Model name, is added to the Data View. This column contains the assigned label value for each document in the View. The user can control whether this column is immediately visible or kept hidden.
o A new item, titled with the Label Model name, is added to the Model Explorer model list. Expanding this item reveals all nodes in the Category Tree associated with this model, including total counts for each label value and clickable icons that open the Term Frequency Table, Word Cloud, and Actions for each label value.
For example, the red boxes in the screenshot below show the new elements added when the Sentiment model is executed.
How does a label model select a label value? There are several model types available, each of which can be configured in many ways:
1. Binning Models use numeric data associated with a document to map the document onto a label value defined by a number range.
Example use case: File Size Binning: Each document is labeled as Small if its Size < 100 Kbytes, Medium for Size between 100 Kbytes and 1 Mbyte, and Large for Size > 1 Mbyte.
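The File Size Binning example can be sketched in a few lines of Python. This is an illustration only, not ACRE's implementation: the function name `bin_file_size` is hypothetical, decimal units (1 Kbyte = 1,000 bytes) are assumed, and the boundary values of exactly 100 Kbytes and 1 Mbyte are assumed to fall in Medium, since the text does not say which bin owns them.

```python
KB = 1_000        # assumption: decimal kilobytes; the text does not specify
MB = 1_000 * KB

def bin_file_size(size_bytes: int) -> str:
    """Map a file size in bytes onto the Small / Medium / Large label values."""
    if size_bytes < 100 * KB:
        return "Small"
    if size_bytes <= 1 * MB:   # assumption: both boundaries belong to Medium
        return "Medium"
    return "Large"

labels = [bin_file_size(size) for size in (2_048, 500_000, 5_000_000)]
print(labels)  # ['Small', 'Medium', 'Large']
```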
2. Pattern Models extract label values from the text data by matching regular expression patterns within the text. Regular expressions provide extended search capabilities that go well beyond what is offered by standard search queries.
Example use case: Data Loss Prevention: Company A executes ACRE label models that detect all social security numbers, credit card numbers, project numbers, and other protected patterns in a document set. Tagged documents can then be quarantined or forwarded for further inspection.
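As a rough sketch of how pattern matching can produce label values directly from the text, the Python below scans a document with two regular expressions. The pattern names, the expressions themselves, and the function `extract_pattern_labels` are hypothetical and deliberately simplistic; they are not ACRE's patterns, and production detection of protected data would need stricter validation.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CreditCard": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def extract_pattern_labels(text: str) -> dict:
    """Return the matched strings per pattern; the matches act as label values."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found

doc = "Employee SSN 123-45-6789, card 4111 1111 1111 1111 on file."
print(extract_pattern_labels(doc))
# {'SSN': ['123-45-6789'], 'CreditCard': ['4111 1111 1111 1111']}
```

Because the label values are whatever the patterns match, the set of values is not known until the documents have been scanned, which is consistent with Pattern Models building their Category Tree as they execute.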
3. Rule Models assign label values from a Category Tree based on a set of Rules (keywords and regular expression patterns) associated with each label value. When a rule pattern is matched, the corresponding label is assigned. If multiple rules are matched, then either all matched labels can be saved (if Multi_Values is true) or the single best label can be selected (if Multi_Values is false), based on a scoring system that chooses the highest summed weight over all matching rules.
Example use case: Drug Name Tagging: A health services IT organization uses ACRE to tag each incoming document with a standardized name for the drug discussed in the document. By defining a label for each standard drug name and then creating Rules with lists of alternate drug names, abbreviations, and foreign language equivalents, the model will add a label with the correct standard drug name to each document.
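The rule scoring described above (sum the weights of all matching rules per label, then keep either every matched label or only the highest-scoring one) might look like the following Python sketch. The rule list, the weights, and the function `apply_rules` are invented for illustration, and it is assumed that each rule contributes its weight once per document regardless of how many times it matches; ACRE's actual scoring details may differ.

```python
import re
from collections import defaultdict

# Hypothetical rules: (label, regular expression, weight).
RULES = [
    ("Acetaminophen", r"\b(acetaminophen|paracetamol|tylenol)\b", 1.0),
    ("Acetaminophen", r"\bAPAP\b", 0.5),
    ("Ibuprofen", r"\b(ibuprofen|advil|motrin)\b", 1.0),
]

def apply_rules(text: str, multi_values: bool) -> list:
    """Sum rule weights per label, then return all labels or only the best one."""
    scores = defaultdict(float)
    for label, pattern, weight in RULES:
        if re.search(pattern, text, flags=re.IGNORECASE):
            scores[label] += weight   # assumption: one contribution per rule
    if not scores:
        return []                     # no rule matched, so no label is assigned
    if multi_values:
        return sorted(scores)         # keep every matched label
    return [max(scores, key=scores.get)]  # keep the highest summed weight

doc = "Patient took Tylenol (APAP) and later Advil."
print(apply_rules(doc, multi_values=True))   # ['Acetaminophen', 'Ibuprofen']
print(apply_rules(doc, multi_values=False))  # ['Acetaminophen']
```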
4. Machine Learning Models assign label values from a Category Tree based on the similarity of the document word cloud to a trained word cloud associated with each label value. Users create and modify the trained word clouds by training the model with example documents for each label value.
Example use case: Sentiment Emulation: A marketing organization has a spreadsheet with a thousand tweets that have previously been tagged for Sentiment by a social media aggregator. They also have thousands of other tweets that are untagged. They upload the tagged spreadsheet to the ACRE Service, select the Sentiment column and click 'Train'. This creates a machine learning model (with trained word clouds corresponding to 'Positive' and 'Negative') that can now be used to tag all other tweets for Sentiment.
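To make the idea of comparing a document word cloud to trained word clouds concrete, here is a minimal Python sketch. It assumes a word cloud is a plain term-frequency count and that similarity is cosine similarity; the text does not state which representation or measure ACRE uses, and the helper names `word_cloud`, `train`, and `predict` are hypothetical.

```python
import math
import re
from collections import Counter

def word_cloud(text: str) -> Counter:
    """Term-frequency count of the lowercase words in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency counts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(examples) -> dict:
    """Merge the word clouds of all example documents for each label value."""
    clouds = {}
    for text, label in examples:
        clouds.setdefault(label, Counter()).update(word_cloud(text))
    return clouds

def predict(clouds: dict, text: str) -> str:
    """Assign the label whose trained word cloud is most similar to the document."""
    doc = word_cloud(text)
    return max(clouds, key=lambda label: cosine(doc, clouds[label]))

tagged = [
    ("love this phone, great battery", "Positive"),
    ("great service, love it", "Positive"),
    ("terrible battery, hate this phone", "Negative"),
    ("awful service, hate it", "Negative"),
]
clouds = train(tagged)
print(predict(clouds, "I love the great camera"))  # Positive
print(predict(clouds, "hate the awful screen"))    # Negative
```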
5. Combined Models assign label values from a Category Tree using both rules and machine learning, providing significant modeling and performance advantages over using either method alone. Vertical Data, LLC, believes that these combined model designs are unique and has filed USPTO patent application #14676500 covering these categorization methods. Two ACRE Combined Models are:
a. ML Extension models extend rule-based label assignments to documents that do not match any rules, by matching new document word clouds to trained word clouds corresponding to the rule matches. This can solve the problem of diminishing returns often seen in rules-only decision systems as additional rules are added.
Example use case: Language Labeling: A political organization has collected 4000 survey responses in a mix of English, French, and Spanish. To split the responses by language, they defined a model with 3 labels (English, French, Spanish), chose 5 common words from each language, and created Rules from them. Executing this as a Rule Model labeled only about 30% of the responses (that is, the responses that actually contained at least one of the 15 words). Enabling ML Extension correctly labeled 99.5% of the responses.
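The ML Extension idea can be sketched as a two-pass process: label what the rules can reach, build a word cloud per label from those rule matches, then label the remaining documents by similarity to those clouds. The Python below is a toy illustration under stated assumptions: the keyword lists, the cosine similarity measure, and the function names are hypothetical, and the tiny document set is made up rather than taken from the survey described above.

```python
import math
import re
from collections import Counter

# Hypothetical rule keywords: 5 common words per language.
RULES = {
    "English": {"the", "and", "is", "of", "but"},
    "French": {"le", "et", "les", "une", "mais"},
    "Spanish": {"el", "y", "los", "una", "pero"},
}

def tokens(text):
    return re.findall(r"[^\W\d_]+", text.lower())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rule_label(text):
    """Return the label with the most keyword hits, or None if no rule matches."""
    words = set(tokens(text))
    hits = {label: len(words & kws) for label, kws in RULES.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def label_with_ml_extension(docs):
    labels = [rule_label(d) for d in docs]            # pass 1: rules only
    clouds = {}
    for doc, label in zip(docs, labels):              # train on rule matches
        if label is not None:
            clouds.setdefault(label, Counter()).update(tokens(doc))
    for i, doc in enumerate(docs):                    # pass 2: extend by similarity
        if labels[i] is None and clouds:
            cloud = Counter(tokens(doc))
            best = max(clouds, key=lambda lab: cosine(cloud, clouds[lab]))
            if cosine(cloud, clouds[best]) > 0:
                labels[i] = best
    return labels

docs = [
    "the survey is long but useful",
    "le sondage est long mais utile",
    "el sondeo es largo pero claro",
    "survey useful",       # contains no rule keyword
    "sondage utile",       # contains no rule keyword
    "sondeo claro",        # contains no rule keyword
]
print([rule_label(d) for d in docs])
# ['English', 'French', 'Spanish', None, None, None]
print(label_with_ml_extension(docs))
# ['English', 'French', 'Spanish', 'English', 'French', 'Spanish']
```

The last three documents contain none of the rule keywords, so the rules alone leave them unlabeled; they are picked up only because they share other words with documents the rules did label.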
b. Rule-Seeded Machine Learning models are trained directly from rule matches. These trained models are then executed in ML mode on new documents. This can be used to quickly prototype machine learning models, and can also rapidly split documents into groups that are defined based on the occurrence of rule keywords and the other words that typically co-occur with these keywords. This eliminates the need to manually select training documents.
Example use case: Request Clustering by Keyword: A customer service organization wants to divide helpdesk requests into service problems and hardware problems. They define a model with 2 labels (Service and Hardware) and then create rules with a few common keywords for each request type. Executing this model in Rule-Seeded Machine Learning mode now routes incoming requests automatically to one of these different service groups for processing.
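A Rule-Seeded Machine Learning flow differs from ML Extension in that the rules are used only to pick training documents; new documents are then labeled purely by similarity. The following Python sketch illustrates this with made-up helpdesk text. The seed keywords, the cosine similarity measure, and the names `train_from_rules` and `route` are assumptions for illustration, not ACRE's implementation.

```python
import math
import re
from collections import Counter

# Hypothetical seed rules: a few keywords per request type.
SEED_RULES = {
    "Service": {"outage", "billing"},
    "Hardware": {"laptop", "printer"},
}

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train_from_rules(docs):
    """Use rule matches as training documents: no manual selection needed."""
    clouds = {}
    for doc in docs:
        words = tokens(doc)
        for label, keywords in SEED_RULES.items():
            if keywords & set(words):
                clouds.setdefault(label, Counter()).update(words)
    return clouds

def route(clouds, text):
    """ML mode: pick the label whose trained word cloud is most similar."""
    cloud = Counter(tokens(text))
    return max(clouds, key=lambda label: cosine(cloud, clouds[label]))

history = [
    "billing portal shows an outage and my invoice is missing",
    "laptop screen flickers and the battery drains",
    "printer jams and the toner is low",
    "outage in my account, invoice not sent",
]
clouds = train_from_rules(history)
print(route(clouds, "my invoice looks wrong"))         # Service
print(route(clouds, "battery and toner replacement"))  # Hardware
```

Neither new request contains a seed keyword; both are routed through words that co-occurred with the keywords in the training documents.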
Users are given a rich set of advanced parameters that can be modified to fine-tune model performance. All options have default values that work well in most cases.