ACRE v1.3 User’s Guide


1      What is ACRE?

The Auto-Categorization and Retrieval Engine (ACRE) v1.3 is a text analytics platform that reads a set of text inputs and allows a user to:

·         Automatically Categorize (tag or label) text using both rules and machine learning

·         Monitor text for keywords and patterns that require immediate action

·         Cluster or Group text items that have similar content

·         Filter text to eliminate unrelated items or spam

·         Gain insights into customer concerns, preferences, sentiment, and other issues within surveys and social media.

·         Enhance business intelligence with actionable results that can be used to drive other processes

·         Summarize text and topics using label counts and word frequency tables

·         Visualize text and topics using word clouds

ACRE utilizes a variety of artificial intelligence and text analytics techniques - including natural language processing (NLP), pattern matching and machine learning - to make its labeling decisions. 


1     What is ACRE?

1.1      ACRE Label Models

1.2      ACRE File Types

1.3      Command Overview

1.4      Label Model Types

2     An Example – Motel5 Survey

2.1      Defining ACRE Label Models

2.2      Executing 4 ACRE Label Models on Motel5 Data

2.3      The Results

2.3.1       The Label Table

2.3.2       Summary Tables

2.3.3       Word Clouds

2.3.4       Term Count Tables

3     ACRE Model File Contents

3.1      Pattern Model Contents

3.2      Category Model Contents

3.2.1       The Category Tree

3.2.2       Rules

3.2.3       Trained Word Clouds

3.2.4       Term Weights

4     ACRE Model Execution

4.1      Pattern Model Execution

4.2      Category Model Execution

4.2.1       Rules-Only Evaluation

4.2.2       Machine-Learning-Only (ML-Only) Evaluation

4.2.3       Dual-Mode Evaluation

4.3      Execution Results

4.3.1       The report command

5     Creating and Modifying ACRE Models

5.1      Creating a Pattern Model

5.2      Creating a Category Model

5.2.1       CT Definition File format

5.3      Setting, Viewing and Updating Rules

5.3.1       Rules Definition File Formats

5.4      Viewing and Updating Term Weights

5.5      Training ACRE Models

5.6      Tuning ACRE Models

5.7      Saving ACRE Models

6     Analysis Scenarios

6.1      Pattern/Rules-Only

6.2      Rules with ML Fill-in

6.3      Classic Machine Learning

6.4      Machine Learning with Seeding Rules



1.1         ACRE Label Models

The tasks above can all be accomplished by executing ACRE Label Models (ALMs).  An ALM categorizes text documents, which can be text files, XML files, or cells in a spreadsheet.   The categorization process selects one or more label values for each document.

Executing an ACRE Label Model on a set of documents provides the following outputs:

·         A Label Table (.csv), containing the label value(s) selected for each document

·         A Label Summary (.csv), summarizing total label counts for the document set.

·         Term Counts (.csv) and a Term Cloud (.jpg) showing counts and relative frequencies of terms (words or stems) for the document set.

·         For each label value, Term Counts (.csv) and a Term Cloud (.jpg) showing counts and relative frequencies of all terms (words or stems) in documents assigned that value.

Any number of ALMs can be executed on each document set, so the same data can be analyzed from many different angles.   Each model execution provides additional metadata that can add value to downstream data analytics processes.


1.2         ACRE File Types

In addition to common file types, such as text (.txt), XML (.xml), comma-separated values (.csv) and JPEG (.jpg), used for user inputs and results, ACRE generates two proprietary file types:

·         ACRE Label Model file (.alm) – stores all parameters for one ALM Model. 

·         ACRE Job Output file (.ajo) – stores all results from an ALM job execution on a particular document set.

These two file types utilize proprietary data formats and should only be viewed or modified using ACRE commands.

On startup, ACRE also reads an ACRE Options File (.txt), which provides values for an extensive list of user-configurable execution parameters.  All available option parameters are listed in the ACRE Options Reference.  The default name for the Options File is acre_options.txt.  This name can be changed using the “-i” option on any command.


1.3         Command Overview

ACRE v1.3 is a Python software package executed from the command line.  ACRE commands are accessed through a Python module, so all commands begin with python followed by one of the 7 command words summarized below.  Complete command descriptions are available in the ACRE Command Reference.

1.      python execute  <model>  [ <options> ]

Executes ALM <model>.alm on a set of documents.   A label table (.csv), a log file (.txt) and AJO results (.ajo) are generated.

2.      python report  <model>  [ <options> ]

From an AJO file, generates Label Table (.csv), Label Summary (.csv), Term Counts (.csv) or Term Cloud (.jpg) results.

3.      python create   <model>  [ <options> ]

Creates a new ALM and stores it in <model>.alm.

4.      python modify  <model>  [ <options> ]

Modifies existing <model>.alm file. 

5.      python view   <model>   [ <options> ]

Generates CSV file containing the Rules, Pattern or Term Weights stored in <model>.alm.

6.      python train   <model>   [ <options> ]

Trains the ALM model in <model>.alm for machine learning analysis. 

7.      python documents [ <… options …> ]

Generates AJO file containing Term Counts (.csv) and Word Cloud (.jpg) for a set of documents without using any ACRE model. 

ACRE can be installed and run on any system (Windows, Mac, Linux, etc.) that supports Python 2.7.


1.4         Label Model Types

There are 2 types of ACRE Label Models, each of which determines a label value for each document.  The model types are:

1)      Pattern Extraction Models

A Pattern Extraction Model (or Pattern Model) selects label values from the text of the documents themselves.  Label values are set to pattern matches within each document using regular expressions.    

2)      Category Label Models

A Category Label Model (or Category Model) selects label values from a predetermined hierarchical set of values called a Category Tree (CT).  Label values are assigned from nodes of this tree.  There are no limits to the number of levels or nodes in a CT.

Figure 1: Category Tree for Sentiment Model

Figure 1 shows the category tree for a Sentiment label model.  When executed, this model will assign a label value of Positive, Negative or Neutral to each document.


2      An Example – Motel5 Survey

Let’s demonstrate ACRE functions and outputs with an example.  Assume that Motel5 is a hotel/motel chain that uses a web application where guests can submit unstructured text comments about their stay.  Motel5 has collected 452 guest comments in file hotel_survey.csv – shown below.   The Motel5 management would like to gain actionable insights from this customer data without requiring staff to manually scan and categorize each comment.  

Figure 2: Original Motel5 customer survey data


2.1         Defining ACRE Label Models

After initial data analysis, Motel5 management chooses 4 useful labels for the survey comments:

1.      A Topic label.  From past experience, the management knows that typical survey responses focus on one of the following topics: the hotel room, the bath/shower, hotel service quality, hotel location or some other topic. They create a Category Model with label values {Room, Bath, Service, Location, Other}. 

2.      A Sentiment label.  They create a Category Model with label values {Positive, Negative, Neutral}.

3.      To gain insights into how customers are describing their rooms, they would like to capture the word before “room” for all comments containing that word.  They create a Pattern Model.

4.      They would like to capture “alarm words” in the comments.  These are words indicating a requirement for rapid response, such as “cancel”, “sue”, “fight”, and others. They create a Pattern Model.

These models are created and stored into 4 ALM files, using the ACRE create, modify and train commands as described below.


2.2         Executing 4 ACRE Label Models on Motel5 Data

Figure 3 shows the Command Window (left) and Options File (right) used in the execution of these 4 models.   The user first verifies that the 4 ALM model files are in the current folder with dir *.alm.   The user then runs one execute command for each model.   Note that all ACRE keywords (such as “execute”) can be abbreviated to any unique prefix (“e”).   The “-a” option flag indicates that results should be appended to the input data file, which is hotel_survey.csv.
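
The unique-prefix rule for keywords can be illustrated with a short sketch.  This is not ACRE source code: the function name resolve_keyword is hypothetical, and only the seven command words from section 1.3 are taken from this guide.

```python
# Hypothetical sketch of unique-prefix keyword matching; not ACRE source code.
COMMANDS = ["execute", "report", "create", "modify", "view", "train", "documents"]

def resolve_keyword(prefix, keywords=COMMANDS):
    """Return the single keyword that starts with `prefix`, or raise ValueError."""
    matches = [k for k in keywords if k.startswith(prefix.lower())]
    if len(matches) != 1:
        raise ValueError("ambiguous or unknown keyword: %r" % prefix)
    return matches[0]

print(resolve_keyword("e"))    # execute
print(resolve_keyword("rep"))  # report
```

Because the seven command words all begin with different letters, a single letter is enough to identify each of them under this rule.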

Figure 3: Executing the 4 Label Models

The Options File shows some of the user-configurable options variables used by ACRE commands.  These specify the input file name, the columns where the text and document identifiers are found, a Drop List file that contains words that will be ignored during analysis, and that no stemming will be done.   Note that each “#” character starts a comment that extends to the end of the line and is ignored by the options parser.
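
The Options File syntax described above (NAME = VALUE settings with “#” comments) can be read with a few lines of Python.  The sketch below is an illustration only, not the ACRE options parser; parse_options is a hypothetical name, the sample values are made up, and it assumes that “#” never appears inside a value.

```python
# Hypothetical sketch of reading an options file; not the ACRE parser.
def parse_options(text):
    """Parse 'NAME = VALUE' lines, ignoring '#' comments and blank lines."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop everything after '#'
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[name.strip().upper()] = value.strip()
    return options

sample = """
# example settings (values are illustrative)
DOCUMENTS = hotel_survey.csv   # input file
MULTI_VALUE = YES
"""
print(parse_options(sample))
```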


2.3         The Results

2.3.1      The Label Table

Figure 4: Label Table Results

Figure 4 shows the Label Table resulting from the execution of the 4 models on the Motel5 survey data.  Four new columns have been appended to the survey data, each containing the results from executing one ACRE Label Model.  Each new column contains the model name at the top and the assigned label values for each of the 452 survey comments.

Some survey comments are assigned multiple label values (such as “Room & Location”).  This means that the Topic model was run in “multi-value mode” (specified by adding “MULTI_VALUE = YES” to the Options File), which allows multiple label values to be assigned to each document.   If the user had specified “single-value mode” (by adding “MULTI_VALUE = NO” to the Options File), ACRE would select only the single best label value for each document.

2.3.2      Summary Tables

Figure 5: Summary Tables for Motel5 Model execution results

Figure 5 shows the Summary Tables for 3 of the ACRE Label Models executed on the Motel5 data.  Totals for the Topic model (611) are higher than the number of comments (452) due to some multi-value assignments.  Each of these tables is generated by the command

python report <model> -r SUMMARY

where <model> is the model name.

If the REPORT_TYPE variable is set in the Options File (for example, REPORT_TYPE = SUMMARY), then the command is simply python report <model>.


If the same option is specified both on the command line and in the Options File, then the command-line value takes precedence.
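
To see why the Topic total can exceed the number of comments, consider how label counts add up when one Label Table cell holds several values.  The sketch below is a simplified illustration, not the ACRE report code; the summarize function and the sample column are hypothetical.

```python
from collections import Counter

def summarize(labels, separator="&"):
    """Count each label value; a multi-value cell adds one count per value."""
    counts = Counter()
    for cell in labels:
        for value in cell.split(separator):
            counts[value.strip()] += 1
    return counts

# Five made-up Label Table cells, two of them multi-valued.
column = ["Room", "Room & Location", "Service", "Other", "Bath & Room"]
counts = summarize(column)
print(counts["Room"], sum(counts.values()), len(column))  # 3 7 5
```

Five documents produce seven label counts here, in the same way that 452 comments produce 611 Topic counts in Figure 5.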

2.3.3      Word Clouds

A word cloud (.jpg) containing all terms (words or stems) in all documents assigned to a given label value in a model can be generated using the command:

python report <model> -r CLOUD -n <label value>

2.3.4      Term Count Tables

A term count table (.csv) containing all terms (words or stems) in all documents assigned to a given label value in a model can be generated using the command:

python report <model> -r TERMS -n <label value>


3      ACRE Model File Contents

For each Label Model type, the model file contents (.alm file) are described here.

3.1         Pattern Model Contents

A Pattern Model ALM file stores a regular expression string, called the Pattern RE.  Within the Pattern RE, the first subexpression within parentheses is the Capture RE, which will become the label value. 

Figure 6: Contents of a Pattern Model file

Figure 6 shows an example of a Pattern RE that captures the word occurring before “room”.   Further details about ACRE regular expression implementation can be found in the ACRE Command Reference.


3.2         Category Model Contents

The contents of a Category Model file are represented in this figure and described below.

Figure 7: Contents of a Category ALM File

3.2.1      The Category Tree

The Category Tree (CT) stores the possible label values.   CTs can have more than one level, as shown in Figure 8, and there are no limits to breadth or height.   CT structure and values may be chosen ad-hoc by a user to satisfy a particular analytic objective, or they may be copied from pre-existing organizational schema, such as organizational charts or directory trees.

Figure 8: A Multi-Level Category Tree

Reports can be generated showing information at any node in the tree.   Interior node reports provide aggregated totals and word clouds for all nodes within the subtree below the selected node.   Any node can also be designated as the Default label value.
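
The aggregation performed for interior nodes can be pictured as a sum over a subtree.  The following sketch is an assumption about how such totals could be computed, not a description of the AJO format; the tree, the counts and the function name subtree_total are all hypothetical.

```python
# Hypothetical two-level tree: each node maps to its parent (None = top level).
PARENT = {"Room": None, "Bed": "Room", "Noise": "Room", "Service": None}
# Hypothetical number of documents assigned directly to each node.
COUNTS = {"Room": 2, "Bed": 5, "Noise": 3, "Service": 4}

def subtree_total(node, parent=PARENT, counts=COUNTS):
    """Sum the counts for `node` and for every node below it."""
    total = counts.get(node, 0)
    for child, par in parent.items():
        if par == node:
            total += subtree_total(child, parent, counts)
    return total

print(subtree_total("Room"))     # 10
print(subtree_total("Service"))  # 4
```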


3.2.2      Rules

Each node in the CT can have any number of associated Rules stored in the ALM file.  Each Rule consists of (1) a Rule pattern (a keyword, phrase or other regular expression) and (2) a Rule Weight, which is set to 1 by default.

3.2.3      Trained Word Clouds

If this ALM has been trained via the train command, then each leaf node will also have an associated Trained Word Cloud (TWC).   The contents of the TWC are set and updated by the user with the train command, described in section 5.5.

3.2.4      Term Weights

Each term (word or stem) used in Machine Learning analysis has an associated numerical Term Weight, set to 1.0 by default.


4      ACRE Model Execution

Here we describe how the model contents are used to execute each type of ACRE model on a set of documents.

4.1         Pattern Model Execution

When a Pattern Model is executed, it attempts to match the Pattern RE on each document.   When a match occurs, the Capture RE portion of the matched text is extracted as the label value for that document. 

For example, using the Pattern RE in Figure 6, if the text contains “spare room”, then this would match the Pattern RE and trigger a pattern extraction.  The portion that matches the Capture RE – “spare” – is extracted and stored as the label value for this text.
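
The extraction step can be reproduced with Python's standard re module.  This is a minimal sketch, assuming the Pattern RE given for the before_room model in section 5.1; extract_label is a hypothetical name, and this guide does not state how ACRE treats letter case or documents with several matches, so the sketch simply takes the first match.

```python
import re

PATTERN_RE = r"([A-Za-z]+) room"   # Pattern RE from section 5.1

def extract_label(text, pattern=PATTERN_RE):
    """Return the first parenthesized capture of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(extract_label("We slept in the spare room."))  # spare
print(extract_label("Great breakfast."))             # None
```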

As a Pattern Model executes, it dynamically generates a Category Tree containing all matches found.  This Category Tree can be saved into a CT Definition file by executing the report command with report_type of LABEL_TREE, which can then be used as input for other models.


4.2         Category Model Execution

When a Category Model is executed on a set of documents, the evaluation method is determined by the EVAL_TYPE options variable.

4.2.1      Rules-Only Evaluation


Each document is checked against all Rule patterns.  If MULTI_VALUE = YES, the document is assigned label values for all Rule matches.   Multiple labels are listed separated by “&” in Label Table reports.

If MULTI_VALUE = NO, the single best Rule match is determined by adding Rule Weights of all matched Rules for each label value.   The label value with the highest total weight sum is selected as the single label value for the document.

If no Rules match, then the Default Label Value is selected.  If no Default value has been specified, then the assigned value is “None”.

Trained Word Clouds are not used in Rules-Only evaluation.
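
The Rules-Only selection logic above can be sketched as follows.  The rules, the function name rules_only, the case-insensitive matching, counting each matched Rule once, and the alphabetical tie-break are all assumptions made for illustration; they are not taken from the ACRE implementation.

```python
import re

# Hypothetical rules: (label value, rule pattern, rule weight)
RULES = [
    ("Room", r"\bbed\b", 1.0),
    ("Room", r"\broom\b", 1.0),
    ("Location", r"\bdowntown\b", 1.0),
]

def rules_only(text, rules=RULES, multi_value=True, default=None):
    """Label `text` from rule matches, following section 4.2.1."""
    totals = {}
    for label, pattern, weight in rules:
        if re.search(pattern, text, re.IGNORECASE):
            totals[label] = totals.get(label, 0.0) + weight
    if not totals:                       # no Rule matched
        return default if default is not None else "None"
    if multi_value:                      # every matched label value
        return " & ".join(sorted(totals))
    # Single best label: highest weight sum (ties broken alphabetically).
    return max(sorted(totals), key=totals.get)

text = "The room was downtown and the bed was comfortable."
print(rules_only(text))                     # Location & Room
print(rules_only(text, multi_value=False))  # Room
print(rules_only("No complaints."))         # None
```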

4.2.2      Machine-Learning-Only (ML-Only) Evaluation 


Each document is categorized using Nearest Word Cloud classification, which assigns each document to the label value whose Trained Word Cloud (TWC) is most similar to the word cloud of the document.  

For each document, a Confidence value is calculated for each TWC, with higher Confidence values corresponding to greater similarity between the TWC and the word cloud of the document. 

The THRESHOLD option variable stores the minimum Confidence value that will be considered a usable result.  If no Confidence value is above THRESHOLD, then the Default label value (or “None”) is selected. 

If MULTI_VALUE=YES, then all label values with Confidence greater than THRESHOLD are selected.  If MULTI_VALUE=NO, then the single label value with the highest Confidence value greater than THRESHOLD is selected. 
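
Given a Confidence value for each label, the THRESHOLD and MULTI_VALUE rules above reduce to a small selection step.  The sketch below assumes the Confidence values have already been computed; select_labels and the sample numbers are hypothetical.

```python
def select_labels(confidences, threshold, multi_value=False, default=None):
    """Pick label values from a {label: confidence} mapping."""
    usable = {lab: c for lab, c in confidences.items() if c > threshold}
    if not usable:                       # nothing exceeds THRESHOLD
        return [default if default is not None else "None"]
    if multi_value:                      # all usable labels, best first
        return sorted(usable, key=usable.get, reverse=True)
    return [max(usable, key=usable.get)]

conf = {"Positive": 0.62, "Negative": 0.35}
print(select_labels(conf, threshold=0.30))                     # ['Positive']
print(select_labels(conf, threshold=0.30, multi_value=True))   # ['Positive', 'Negative']
print(select_labels(conf, threshold=0.70, default="Neutral"))  # ['Neutral']
```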


Figure 9: Label Selection by Nearest Word Cloud classification

Figure 9 shows an example of the use of Nearest Word Cloud classification to select a Sentiment label value for a tweet.  The tweet’s word cloud is compared against Trained Word Clouds for Positive and Negative label values.  The tweet will be assigned to the label value with the greatest Confidence value that exceeds THRESHOLD (assuming MULTI_VALUE=NO).  If no Confidence value exceeds THRESHOLD, then the document will be assigned the Default value of Neutral.

ACRE v1.3 offers three algorithms for calculating Confidence (word cloud similarity), based on the SIMILARITY option variable:

1)      SIMILARITY = 1:  Cosine Similarity (default)

2)      SIMILARITY = 2:  Minimum Euclidean Distance

3)      SIMILARITY = 3:  Sum of Absolute Errors

Term Weights can be modified by users to boost (weight > 1.0) or diminish (weight < 1.0) the influence of specific terms in the calculation of Confidence values.  Setting a Term Weight to 0 will eliminate a term from all Confidence calculations. 
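
For readers who want to see what the three measures compute, here is a sketch over word clouds stored as term-to-frequency dictionaries.  It uses the textbook definitions; how ACRE applies Term Weights, and how it converts a distance (options 2 and 3) into a Confidence value where higher means more similar, is not specified in this guide, so multiplying each frequency by its Term Weight is an assumption and the function names are hypothetical.

```python
import math

def cosine_similarity(doc, twc, weights=None):
    """Cosine similarity of two {term: frequency} clouds.
    Each frequency is multiplied by its Term Weight (1.0 if unspecified)."""
    weights = weights or {}
    terms = set(doc) | set(twc)
    a = [doc.get(t, 0.0) * weights.get(t, 1.0) for t in terms]
    b = [twc.get(t, 0.0) * weights.get(t, 1.0) for t in terms]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean_distance(doc, twc):
    """Straight-line distance between the two frequency vectors."""
    terms = set(doc) | set(twc)
    return math.sqrt(sum((doc.get(t, 0.0) - twc.get(t, 0.0)) ** 2 for t in terms))

def sum_abs_errors(doc, twc):
    """Sum of absolute differences between the two frequency vectors."""
    terms = set(doc) | set(twc)
    return sum(abs(doc.get(t, 0.0) - twc.get(t, 0.0)) for t in terms)

doc = {"clean": 1, "quiet": 1}
twc = {"clean": 2, "quiet": 2, "friendly": 1}
print(round(cosine_similarity(doc, twc), 3))                   # 0.943
print(round(cosine_similarity(doc, twc, {"friendly": 0}), 3))  # 1.0
print(sum_abs_errors(doc, twc))                                # 3.0
```

Setting the weight of “friendly” to 0 removes that term from the calculation, which raises the cosine similarity from about 0.943 to 1.0 in this example.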

There are also additional machine learning tuning parameters available, which are covered in the Options Reference.

Rules are not used in ML-Only evaluation.

4.2.3      Dual-Mode Evaluation


In this mode (EVAL_TYPE = BOTH), both the Rules and the Trained Word Clouds are used in label value selection.

For each document processed:

(1)   If at least one Rule matches, then Rules-Only evaluation is done

(2)   If no Rules match, then ML-Only evaluation is done

In this mode, documents with rule matches are labeled identically to Rules-Only evaluation, while the remaining documents (which would be marked with the Default value in Rules-Only evaluation) are categorized via ML-Only evaluation.
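
The per-document decision in Dual-Mode evaluation amounts to a simple fallback.  In the sketch below, the two evaluators are passed in as functions; dual_mode and the toy stand-ins are hypothetical and only illustrate the control flow.

```python
def dual_mode(text, rule_labeler, ml_labeler):
    """Use the rule result when any Rule matches; otherwise fall back to ML."""
    rule_result = rule_labeler(text)     # expected to return None on no match
    if rule_result is not None:
        return rule_result
    return ml_labeler(text)

# Toy stand-ins for the two evaluators (hypothetical).
def toy_rules(text):
    return "Room" if "room" in text.lower() else None

def toy_ml(text):
    return "Service"

print(dual_mode("Lovely room", toy_rules, toy_ml))      # Room
print(dual_mode("Staff were kind", toy_rules, toy_ml))  # Service
```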


4.3         Execution Results

Running the execute command generates three outputs: Label Table output, an AJO file, and a LOG file.

·         Label Table output

If the “-a” (append) option is specified, then a single column is added to the input file (specified by “-d” option or DOCUMENTS option variable) and no output file is created. 

If “-a” is not specified, then execute generates an output file (.csv) containing a list of input documents and label value results.  The output file name can be set using the “-o” command-line option or the OUTPUT_FILE options variable.

·         AJO file

An ACRE Job Output (.ajo) file is created, which contains all execution results, including the label table, the summary table, and all term frequencies for each node in the Category Tree.  These results can be extracted from the AJO file using the report command, as described below.  By default, the report command uses the last AJO file generated.  Other AJO files can be specified with the “-j” option or the AJO_FILE options variable.

·         LOG file

The LOG file is a text file that lists execution date and time as well as all important options values used.  The file has one line for each document processed, listing the document name, number of terms and results.  The name of the LOG file is set with the “-l” option or the LOG_FILE options variable.

4.3.1      The report command

The report command extracts any of the results stored in the AJO file (label table, summary table, and term frequencies for each Category Tree node) into CSV and JPG files, which allows users to derive multiple reports from a single ACRE model execution.

python report <model> -r <report_type> [ -n <node> -j <AJO_file> -o <output_file> ]


·         <model> is an existing ACRE model

·         <report_type> is one of the following, which can be abbreviated to first letter and/or specified with options variable REPORT_TYPE:

o   DOCUMENTS  - Creates <output_file>.csv containing Label Table results -  columns are “File Name”, “DID”, <model>.   Additional columns will be present if INCLUDE_TEXT and/or ADD_COLS are defined.

o   SUMMARY - Creates <output_file>.csv containing summary counts for each label value  -  columns are “Label”, “Count”, “Percent”.        

o   TERMS - Reports on term frequencies for all documents assigned to or below CT node <node>.  Creates <output_file>.csv with columns “Term”, “Frequency”.

o   CLOUD - Creates <output_file>.jpg with JPEG image of the word cloud consisting of all terms for all documents assigned to or below CT node <node>.

o   LABEL_TREE - Creates <output_file>.csv in CT Definition format, containing all unique label values processed.   This CT Definition file can then be used to create other label models.

·         <node> is the name of a node in the Category Tree of <model>.  Can also be specified with options variable REPORT_NODE.  If unspecified, the root of the tree is used.

·         <AJO_file> is the AJO file name. Can also be specified with options variable AJO_FILE.  If unspecified then the last AJO file created is used.

·         <output_file> is the output file name. Can also be specified with options variable OUTPUT_FILE.  If unspecified then a dynamically created file name is used.



5      Creating and Modifying ACRE Models

ACRE v1.3 provides the ability to create models and manage their contents with the create, modify, view and train commands.

5.1         Creating a Pattern Model

Pattern Models are created using the “-p <Pattern RE>” option on the create command.  For example:

python create before_room -p "([A-Za-z]+) room"

This command creates a new Pattern Model with Pattern RE = “([A-Za-z]+) room” and stores the model into a file named before_room.alm.


5.2         Creating a Category Model

Category Models are created using the “-c <CT definition file>” option on the create command.  For example:

python create topic -c topic-ct.csv

This command creates a new Category Model using the Category Tree structure defined in file topic-ct.csv, and stores the resulting model in topic.alm.   File topic-ct.csv must follow the CT Definition File format described below.  A Category Tree stored in a model file cannot be modified after the model is created.

5.2.1      CT Definition File format

A CT Definition File is a CSV file with 4 columns: Node, Default, Description and Parent.

Each line after the header defines one category tree node.  Node names can contain only letters, numbers and underscores [A-Za-z0-9_].  One leaf node can be marked “Y” in the Default column to define the Default label value.  For nodes below the first tree level, the Parent column specifies the name of the node directly above them.
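
As an illustration of the four-column layout, the sketch below reads a small CT Definition with Python's csv module (written for Python 3).  The sample rows are made up, load_category_tree is a hypothetical name, and the exact header spelling and the use of an empty Parent cell for top-level nodes are assumptions based on the description above.

```python
import csv
import io

# Made-up CT Definition content using the four columns described above.
SAMPLE = """Node,Default,Description,Parent
Room,,Comments about the room,
Bed,,Comments about the bed,Room
Other,Y,Anything else,
"""

def load_category_tree(csv_text):
    """Return (children, default) from CT Definition text.
    `children` maps each parent name ('' for the top level) to its child nodes."""
    children, default = {}, None
    for row in csv.DictReader(io.StringIO(csv_text)):
        children.setdefault(row["Parent"].strip(), []).append(row["Node"])
        if row["Default"].strip().upper() == "Y":
            default = row["Node"]
    return children, default

children, default = load_category_tree(SAMPLE)
print(children)  # {'': ['Room', 'Other'], 'Room': ['Bed']}
print(default)   # Other
```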

Figure 10: CT Definition Files

Figure 10 shows two CT Definition files, with the corresponding Category Tree shown above each.


5.3         Setting, Viewing and Updating Rules

Rule values are set in an ACRE Model file using the “-r” option with the modify command.  For example:

python modify topic -r topic-rules.csv

This modify command erases any previously existing rules and replaces them with the rules defined in topic-rules.csv.  This file must follow one of the Rules Definition File formats described in section 5.3.1 below, and it must contain all rules for the model.

The rules stored in a model can be viewed using the “-v RULES” option (or setting VIEW_TYPE = RULES) with the view command.  For example:

python view topic -v RULES -o my-rules.csv

This will write all Rules stored in model file topic.alm into the file my-rules.csv using the Row-Based Rules Definition format defined below.

Existing rules can be easily edited by (a) extracting them using a view command, (b) editing the resulting Rules Definition file, and (c) reloading the revised rules using a modify command.  For example:

a)      python view topic -v RULES -o my-rules.csv

b)      Edit my-rules.csv (using Excel, for example)

c)      python modify topic -r my-rules.csv

This allows incremental Rule changes and/or modifications to Rule Weights to be easily accomplished.

5.3.1      Rules Definition File Formats

ACRE v1.3 accepts two formats for Rules Definition files.  Either of these formats can be used to create Rules Definition files that will define or update Rules stored in an ACRE Model file.

The simplest format is the Column-Based Rules Definition format shown on the left in Figure 11.  With this format, Rule Patterns for each label value are arranged in a single column with the label value name at the top.  The advantage of this format is that it makes it easy to enter lists of rule patterns.  The disadvantage of this format is that Rule Weights are not specified and are assumed to all be equal to 1.0.

The other format is the Row-Based Rules Definition format shown on the right in Figure 11.  With this format, the first line must be the column headers “Label Value”, “Rule Pattern” and “Weight”.  Then, on each row, the label value, rule pattern and rule weight for one rule is specified.   This format allows the user to enter rule weights, but requires a bit more typing.
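
Since the Column-Based format cannot carry weights, a user who starts with it and later wants to set weights may need to convert to the Row-Based format.  The sketch below shows one way to do that conversion outside ACRE (written for Python 3); the sample content and the name column_to_row_rules are hypothetical, and the layout is inferred from the descriptions above rather than from Figure 11.

```python
import csv
import io

# Made-up Column-Based rules: label value names on the first line,
# rule patterns listed below each name.
COLUMN_BASED = """Room,Location
bed,downtown
pillow,airport
carpet,
"""

def column_to_row_rules(csv_text, weight=1.0):
    """Convert column-based rules to (label value, rule pattern, weight) rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    labels, body = rows[0], rows[1:]
    result = []
    for col, label in enumerate(labels):
        for row in body:
            if col < len(row) and row[col].strip():   # skip empty cells
                result.append((label.strip(), row[col].strip(), weight))
    return result

for rule in column_to_row_rules(COLUMN_BASED):
    print(rule)
```

Each resulting tuple corresponds to one row under the “Label Value”, “Rule Pattern” and “Weight” headers, with the weight set to the default of 1.0.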

Figure 11: Rules Definition Files


5.4         Viewing and Updating Term Weights

Term Weights are numerical values associated with each unique term (word or stem) in any Trained Word Clouds in a Category Model.  

Term Weights stored in a model can be viewed using the “-v WEIGHTS” option (or setting VIEW_TYPE = WEIGHTS) on the view command.  For example:

python view topic -v WEIGHTS -o my-term-weights.csv

This will write file my-term-weights.csv with columns “Term” and “Weight” containing all unique terms across all Trained Word Clouds in the topic ALM model file and their corresponding weights.  

Existing Term Weights can be easily edited by (a) extracting them using a view command, (b) editing the resulting Term Weights CSV file, and (c) reloading the revised Term Weights using a modify command.  For example:

a)      python view topic -v WEIGHTS -o my-weights.csv

b)      Edit my-weights.csv (using Excel, for example)

c)      python modify topic -w my-weights.csv

This allows changes to Term Weights to be easily accomplished.

Note that you must populate the Trained Word Clouds using one or more train commands before you will be able to view any Term Weights.

5.5         Training ACRE Models

ACRE v1.3 uses the Nearest Word Cloud classifier for Machine Learning analysis.  This is a supervised classifier, so the user must initialize and update the Trained Word Clouds via training, using the train command.  This command takes as input a set of training documents and a reference label value for each training document.  When the train command is executed, the term frequencies for each training document are averaged into the Trained Word Cloud of the corresponding reference label value.   Each training document is called an exemplar for its corresponding reference label value.

ACRE allows the reference labels to be specified in several ways, determined by the value of the REF_LABEL options variable (defined in Options File):

·         REF_LABEL = CSV <col>

This option is valid only if DOC_FORMAT = CSV and specifies that the reference label values can be found in the input CSV file column <col>, where <col> is the Excel column identifier (A, B, C, etc.) for the reference label column.   For example: 

REF_LABEL = CSV D                - specifies that the reference label values are found in the 4th column (column “D”) of the input CSV file.

·         REF_LABEL = XML

This option is valid only if DOC_FORMAT = XML and specifies that the reference label values are marked with a specific XML tag within each input XML document.  The default ACRE v1.3 implementation follows the RCV1 XML format, which specifies this XML tag as <category:topics:2.0>.

·         REF_LABEL = LABEL <label value>

This option specifies that the reference label for all input documents is <label value>.  For example:

 REF_LABEL = LABEL Location          - specifies that the reference label for all inputs is “Location” 

The <label value> must be the name of a leaf node within the model CT.

·         REF_LABEL = RULES

This option specifies that the reference labels for all inputs are determined by Rule matches.   As the training documents are read, the model Rules are checked against each.  For each Rule match, the label value associated with that Rule is trained using the document.  This can be used to implement powerful combined-model analysis scenarios, such as “Rules with ML Fill-in” and “Machine Learning with Seeding Rules” described in section 6 below.
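
For the REF_LABEL = CSV <col> form, the Excel column identifier maps to a column position as in the sketch below.  The function name column_index is hypothetical, and whether ACRE accepts multi-letter identifiers such as “AA” is not stated in this guide.

```python
def column_index(letters):
    """Convert an Excel-style column identifier to a zero-based index."""
    index = 0
    for ch in letters.strip().upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

print(column_index("A"), column_index("D"), column_index("AA"))  # 0 3 26
```

Column “D” maps to index 3, that is, the 4th column, matching the example above.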

If the train command is run multiple times, then ACRE will cumulatively average each new set of exemplars into the Trained Word Clouds.  If the user wants to start fresh (forget previous training), then the “-z” option specifies that all TWCs should be cleared (zeroed out) before this training is done.
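
Cumulative averaging of exemplars can be sketched as a running mean of term counts.  This is only an illustration of the idea: the class TrainedWordCloud is hypothetical, it uses raw whitespace-separated word counts with no Drop List or stemming, and the exact averaging formula used by ACRE is not given in this guide.

```python
from collections import Counter

class TrainedWordCloud(object):
    """Running average of term counts over all exemplars seen so far."""

    def __init__(self):
        self.freq = {}   # term -> average count per exemplar
        self.n = 0       # number of exemplars averaged in

    def train(self, text):
        """Average one more exemplar into the cloud."""
        counts = Counter(text.lower().split())
        self.n += 1
        for term in set(self.freq) | set(counts):
            old = self.freq.get(term, 0.0)
            self.freq[term] = old + (counts.get(term, 0) - old) / float(self.n)

    def clear(self):
        """Forget previous training (the effect described for the -z option)."""
        self.freq, self.n = {}, 0

twc = TrainedWordCloud()
twc.train("clean room")
twc.train("clean quiet room room")
print(twc.freq["clean"], twc.freq["room"], twc.freq["quiet"])  # 1.0 1.5 0.5
```

Because the update is incremental, training on two batches in sequence gives the same averages as training on all exemplars at once.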


5.6         Tuning ACRE Models

Once an ACRE model has been created and executed, it may be tuned (that is, iteratively modified based on previous results) to improve its performance.   Because ACRE models can be modified and re-executed quickly, initial prototype models can be improved through a series of simple modifications.  Any of the following can be adjusted to tune a model:

·         Modifying the Drop List

·         Setting the Threshold

·         Choosing Rule Patterns and Weights

·         Modifying Term Weights

·         Modifying other Machine Learning Parameters


5.7         Saving ACRE Models

Once an ACRE Model has been created, tested and tuned, the user may wish to save the resulting model so that it can be executed at a later time or on another ACRE system.

Saving an ACRE Model requires saving a copy of three files: the ALM model file (.alm), the ACRE Options file (acre_options.txt, for example) and the Drop List file (english-stop.txt, for example).   If copies of these three files are moved to any other system where ACRE is installed, the model will produce identical results when executed there.


6      Analysis Scenarios

There are many strategies that can be used in the analysis of text data using ACRE.   In practice, we have found each of the following four Analysis Scenarios to be highly effective for specific text modeling cases:

6.1         Pattern/Rules-Only


Many useful text analytics models can be constructed using only Patterns and Rules.  The advantage of these models is that they operate deterministically: results are fully predictable and can be traced and audited.  In contrast, machine learning model outputs vary depending on the set of training documents used.

This means that Pattern/Rules-Only models are often the best choice when results must be highly predictable, when specific regulatory or compliance requirements must be met, or when users want to trace any result back to the specific rule that caused it.


6.2         Rules with ML Fill-in


In some cases, Rules-Only analysis may assign a large number of Default label values because many of the inputs may not match any rules.  In these situations it may be quite difficult to specify enough rules to ensure that a significant portion of the inputs will be covered.

ACRE resolves this problem with the BOTH evaluation type, which assigns all rule matches exactly as in Rules-Only analysis, but then uses Nearest Word Cloud machine learning to fill in label values for documents that do not match any rules.   This can assign label values to a much larger percentage of inputs (depending on THRESHOLD).  Each document that matches no rules is placed under the label value whose Trained Word Cloud is most similar to the document’s own word content.

This analysis is achieved via a 2-step process:

1.      Train the ACRE Model using input documents and REF_LABEL = RULES.  (train command)

2.      Execute the ACRE Model on the same documents with EVAL_TYPE = BOTH  (execute command)


6.3         Classic Machine Learning


In Classic Machine Learning analysis, an expert scans through one set of documents to select exemplars for each label value.  These exemplars are then used to train the ML model.  Then new data can be categorized using this trained model.   ACRE supports this analysis mode with several flexible means to specify the exemplars:

§  The expert can scan through a CSV of inputs, marking the correct (reference) label value for each input in a reference column.  The resulting marked CSV can then be used with the train command using REF_LABEL = CSV.

§  The expert can scan through many inputs, saving exemplars for each label value into separate files or folders.   Each file or folder can then be used as input to the train command with REF_LABEL = LABEL <x>, where <x> is the label value for these exemplars.

This analysis is achieved via a 2-step process:

1.      Train the ACRE Model using training documents and REF_LABEL = CSV or LABEL.  (train command)

2.      Execute the ACRE Model on analysis documents with EVAL_TYPE = ML_ONLY  (execute command)


6.4         Machine Learning with Seeding Rules


In this analysis mode, which is unique to ACRE, the modeler selects Rule Patterns with the sole objective of using the Rules to select the ML exemplars.   The Rule Pattern matches are used to “seed” the ML analysis by capturing a training set for each label value quickly and with little effort.  In a Rules-Only analysis, the modeler hopes that a large percentage of the inputs will match Rules.  When constructing Seeding Rules, in contrast, the modeler needs only enough Rule Patterns to capture typical exemplars for each label value.

This analysis is achieved via a 2-step process:

1.      Train the ACRE Model using input documents and REF_LABEL = RULES.  (train command)

2.      Execute the ACRE Model on the same input documents with EVAL_TYPE = ML_ONLY  (execute command)
