ACRE v1.3 Options Reference


On startup, ACRE reads an Options File and sets the values of the variables below.  Any variables not included in this file are set to their default values.

The Options File name can be specified with “-i <options-file-name>” on the command line.  Otherwise, the default name is acre_options.txt in the folder containing

For options variables with enumerated values, only a unique prefix (usually just the first letter) of the value is required in the Options File.
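The prefix rule can be illustrated with a short Python sketch. This is not ACRE's own code: the function name resolve_enum is hypothetical, and case-insensitive matching is an assumption that the text does not confirm.

```python
def resolve_enum(value, choices):
    """Return the single choice that starts with `value`.

    Assumption: matching is case-insensitive and `choices` are upper case.
    """
    wanted = value.strip().upper()
    # An exact match wins even if it is also a prefix of a longer choice.
    if wanted in choices:
        return wanted
    matches = [c for c in choices if c.startswith(wanted)]
    if len(matches) != 1:
        raise ValueError(f"{value!r} is not a unique prefix of {choices}")
    return matches[0]


# STEM_MODE values as listed in this reference.
STEM_MODES = ["NONE", "SNOWBALL", "WORDNET"]
stem_mode = resolve_enum("S", STEM_MODES)
print(stem_mode)
```

A prefix shared by two values (for example "C" against a hypothetical pair CSV and COUNT) raises an error in this sketch, which is why a longer prefix is sometimes needed.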

Some options variables have corresponding command-line options.  If both are specified, the command-line option always takes precedence in determining the value.

A hash (#) character in the Options File starts a comment that extends to the end of the line.
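The rules above (defaults, "NAME = value" lines, # comments, command-line precedence) can be sketched in Python as follows. The function names and the sample file are hypothetical, and the sketch is deliberately naive: it would also cut a "#" that appears inside a quoted value such as TERM_REGEX, and how ACRE itself handles that case is not stated here.

```python
def parse_options(text, defaults):
    """Read NAME = value lines; anything after '#' is treated as a comment."""
    options = dict(defaults)  # unspecified variables keep their defaults
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[name.strip()] = value.strip()
    return options


def apply_overrides(options, cli_values):
    """Command-line values (when given) take precedence over the file."""
    merged = dict(options)
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


sample_file = """\
# survey run
GO_LIST = YES          # treat TERM_FILE as a Go List
OUTPUT_FILE = results
"""
defaults = {"GO_LIST": "NO", "STEM_MODE": "NONE", "MAX_COUNT": "0"}

file_options = parse_options(sample_file, defaults)
# Simulates "-o final" on the command line; DOCUMENTS (-d) was not given.
final_options = apply_overrides(file_options, {"OUTPUT_FILE": "final", "DOCUMENTS": None})
print(final_options)
```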

ACRE Options Variables

1       Input/Output Variables

1.1         ACRE_FOLDER

1.2         DOCUMENTS

1.3         DID_ASSIGN

1.4         OUTPUT_FILE

1.5         CT_FILE

1.6         LOG_FILE

1.7         GO_LIST and TERM_FILE

1.8         TERM_REGEX

1.9         STEM_MODE

1.10      INCLUDE_TEXT

1.11      INCLUDE_NOMATCH

1.12      Word Cloud Variables

1.12.1        CLOUD_TERMS

1.12.2        WC_WIDTH

1.12.3        WC_HEIGHT

1.12.4        WC_BG

1.13      ADD_COLS

2       Command Function Variables

2.1         EVAL_TYPE

2.2         REPORT_TYPE

2.3         REPORT_NODE

2.4         VIEW_TYPE

3       Model Processing Variables

3.1         MULTI_VALUE

3.2         MATCH_CASE

3.3         MAX_COUNT

4       Machine Learning Variables

4.1         REF_LABEL

4.2         SIMILARITY

4.3         THRESHOLD

4.4         WEIGHT_TYPE

4.5         MAX_TERMS


1      Input/Output Variables

1.1     ACRE_FOLDER

ACRE_FOLDER is the folder where ACRE input and output files are stored.  All output files are written to ACRE_FOLDER.  Input files are searched for first in ACRE_FOLDER and then in the current folder.

Example:  ACRE_FOLDER = tests\survey2

Default: current folder 

Used by: create, documents, execute, report, train, view
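The input search order described above can be sketched in Python. The helper name find_input is hypothetical, and returning None when the file is found in neither folder is an assumption.

```python
from pathlib import Path


def find_input(name, acre_folder="."):
    """Look for `name` in ACRE_FOLDER first, then in the current folder."""
    for folder in (Path(acre_folder), Path.cwd()):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None  # assumption: the caller reports the missing file
```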

1.2     DOCUMENTS

DOCUMENTS specifies the location of the input documents.  For CSV inputs, this is the name of the CSV file.  For other input types, this is the name of the input folder.

Examples:  DOCUMENTS = data\my_folder

                  DOCUMENTS = data\survey_data.csv

Command Line Option:  -d <documents>

Used by: documents, execute, train



DOC_FORMAT specifies the format of the input documents.

·         DOC_FORMAT = TEXT           – documents are text files (Default)

·         DOC_FORMAT = XML            – documents are XML/HTML files

·         DOC_FORMAT = CSV <col>   – documents are cells in one column of a CSV file

o   <col> is the column letter (“A”, “B”, …, “AA”, etc.) of the text column.

o   Example:  DOC_FORMAT = CSV B

Used by: documents, execute
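As an illustration of the CSV column-letter convention, the following Python sketch converts a letter such as "B" or "AA" to a zero-based index and pulls that column out of CSV text. The function names are hypothetical, and whether ACRE expects or skips a header row is not stated in this reference, so none is assumed.

```python
import csv
import io


def column_index(col):
    """Convert a spreadsheet-style column letter to a zero-based index."""
    letters = col.strip().upper()
    if not letters:
        raise ValueError("empty column letter")
    index = 0
    for ch in letters:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column letter: {col!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def read_documents(csv_text, col):
    """Return the cells of column `col`, one document per row."""
    idx = column_index(col)
    rows = csv.reader(io.StringIO(csv_text))
    return [row[idx] for row in rows if len(row) > idx]


docs = read_documents("1,good service\n2,too slow\n", "B")
print(docs)
```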


1.3     DID_ASSIGN

DID_ASSIGN determines how Document Identifier (DID) values are assigned.

·         DID_ASSIGN = SEQ                 – assign DID values sequentially, starting with 1. (Default)

·         DID_ASSIGN = XML                – DID is determined by the XML tag “<itemid>”

·         DID_ASSIGN = CSV <col>      – DID is in CSV column <col>

Used by: documents, execute

1.4     OUTPUT_FILE

OUTPUT_FILE specifies the name of the output file for an execute or report command.  The output is a CSV file for execute and a CSV or JPG file for report.

Example:  OUTPUT_FILE = results

Command Line Option:  -o <output_file>

Used by: execute, report


1.5     CT_FILE

CT_FILE specifies the name of a Category Tree Definition File. 

Example:  CT_FILE = my_model_ct.csv

Command Line Option:  -c <CT_definition_file>

Used by: create


1.6     LOG_FILE

LOG_FILE specifies the name for log files generated by execute and train.   Sequence numbers will be appended to avoid overwriting previous log files.

Example:  LOG_FILE = survey_logs

Command Line Option:  -l <log_file>

Used by: execute, train


1.7     GO_LIST and TERM_FILE

Together, these variables specify Go-List / Stop-List behavior.  TERM_FILE specifies the name of a text file containing one term per line.  GO_LIST is a Boolean variable with possible values {YES, NO}.

·         If GO_LIST = NO, then the TERM_FILE list is treated as a Drop List, meaning that these terms will not be included in term counts, word clouds and ML analysis.

·         If GO_LIST = YES, then the TERM_FILE list is treated as a Go List, meaning that only these terms will be included in term counts, word clouds and ML analysis.

Examples:  TERM_FILE = data\my-drop.txt

                  TERM_FILE = NONE

                  GO_LIST = YES

Default:  TERM_FILE = “english-stop.txt”

Default:  GO_LIST = NO

Used by: documents, execute, train
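The two list modes can be sketched in Python as a simple filter. This is an illustrative sketch only: the function name is hypothetical, and exact (case-sensitive) term comparison is an assumption.

```python
def filter_terms(terms, term_list, go_list=False):
    """Apply a TERM_FILE list as a Go List (keep only) or Drop List (remove)."""
    listed = set(term_list)
    if go_list:
        # GO_LIST = YES: only listed terms survive.
        return [t for t in terms if t in listed]
    # GO_LIST = NO: listed terms are dropped.
    return [t for t in terms if t not in listed]


terms = ["the", "service", "was", "slow"]
term_list = ["the", "was"]
kept_as_drop_list = filter_terms(terms, term_list, go_list=False)
kept_as_go_list = filter_terms(terms, term_list, go_list=True)
print(kept_as_drop_list, kept_as_go_list)
```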


1.8      TERM_REGEX

TERM_REGEX specifies a regular expression used to define the boundaries of a “term” when parsing the file.  The parser treats each contiguous run of characters matching TERM_REGEX as a “term”.  Characters that do not match TERM_REGEX act as separators that delimit terms.

Example:  TERM_REGEX = “[A-Za-z]+”                - terms contain alphabetic characters only.

Default:  TERM_REGEX = “(?:\.?\w|[@#’-])+”    - terms contain alpha-numerics, underscore (\w) and special characters [.@#’-].

Note: Any grouping parentheses used within the TERM_REGEX expression must be of the form “(?: … )” – that is, the opening parenthesis is represented by 3 characters “(?:”.   Failure to do this will lead to parsing and data capture errors.

Used by: documents, execute, train
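The following Python sketch shows how a pattern like the default splits text into terms, and why capturing parentheses cause trouble. It assumes a Python-style regex engine and uses re.findall, which is not confirmed as ACRE's parsing method. The straight apostrophe in the pattern is also an assumption standing in for the typographic one printed above.

```python
import re

# Default pattern from the text (straight apostrophe assumed).
DEFAULT_TERM_REGEX = r"(?:\.?\w|[@#'-])+"


def tokenize(text, term_regex=DEFAULT_TERM_REGEX):
    """Return every contiguous run of characters matching term_regex."""
    return re.findall(term_regex, text)


text = "Don't e-mail @acme about v1.3!"
terms = tokenize(text)
print(terms)

# With a capturing group "( ... )" instead of "(?: ... )", re.findall
# returns the group's last capture rather than the whole term.
broken = re.findall(r"(\.?\w|[@#'-])+", text)
print(broken)
```

In this sketch the non-capturing form yields whole terms, while the capturing form yields only fragments, which is one way the parsing errors mentioned in the note can arise.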


1.9     STEM_MODE

STEM_MODE determines what stemming, if any, will be done on input text.

·         STEM_MODE = NONE            - No stemming.   (Default)

·         STEM_MODE = SNOWBALL              - Stem with NLTK Snowball algorithm (Porter2)

·         STEM_MODE = WORDNET    - Uses WordNet lemma/stem form for each term

Used by: documents, execute, train

1.10  INCLUDE_TEXT

INCLUDE_TEXT determines whether the text of each document is saved in the ACRE Job Output (AJO) file and included in the execute output file and in report output for DOCUMENTS reports.

·         INCLUDE_TEXT = YES             - Include the text. 

·         INCLUDE_TEXT = NO              - Do not include the text.  (Default)

Used by: execute, report

1.11  KEEP_NOMATCH

KEEP_NOMATCH determines whether documents that were not assigned to any label will be displayed in output reports.

·         KEEP_NOMATCH = YES        - Include all documents in reports.  (Default)

·         KEEP_NOMATCH = NO         - Include only documents that were assigned a non-default label value.

Used by: execute, report

1.12  Word Cloud Variables

1.12.1 CLOUD_TERMS

CLOUD_TERMS = <n> sets the maximum number of terms displayed in word cloud output, where <n> is an integer.  Only the <n> most frequently occurring terms are displayed.

Example:  CLOUD_TERMS = 35

Default value: CLOUD_TERMS = 50

Used by: view, report

1.12.2 WC_WIDTH

WC_WIDTH = <n> is an integer value specifying the width of word cloud output, in pixels.  

Example:  WC_WIDTH = 600

Default value: WC_WIDTH = 400

Used by: view, report

1.12.3 WC_HEIGHT

WC_HEIGHT = <n> is an integer value specifying the height of word cloud output, in pixels.  

Example:  WC_HEIGHT = 300

Default value: WC_HEIGHT = 200

Used by: view, report

1.12.4 WC_BG

WC_BG = <color> specifies the background color used for word clouds.   The value of <color> can be any common color name, such as white, black, green, red, blue, etc.

Example:  WC_BG = green

Default value: WC_BG = black

Used by: view, report

1.13  ADD_COLS

ADD_COLS specifies additional columns to be added to the Label Table output file by execute and to DOCUMENTS reports by report.  Multiple ADD_COLS values can be specified, separated by spaces, for example: ADD_COLS  =  TYPE  CONFIDENCE

·         ADD_COLS = NONE       -  No extra columns are added.  (default)

·         ADD_COLS = TYPE       -  An additional column titled “Result Type” is added.  Row values are “rule”, “pattern”, “ml” or “none” according to whether the label value was selected by a Rule, a Pattern, Machine Learning, or none of these.

·         ADD_COLS = MATCHES    -  An additional column titled “Matches” is added, which, for Pattern or Rule results, contains the matched values from the document.  Multiple matches are separated by commas.

·         ADD_COLS = COUNT      -  An additional column titled “Count” is added, which, for Pattern or Rule results, contains the number of matches occurring in the document.

·         ADD_COLS = SCORE      -  An additional column titled “Score” is added, which contains the following:

o   For Pattern results, the number of pattern matches in the document (same results as Count).

o   For Rule results, the sum of Rule Weights for all matched Rules for the selected label.

o   For ML results, the Confidence value calculated for the selected label.

·         ADD_COLS = ARRAY      -  Additional “Score” columns titled “Score <label value>” are added, one for each possible label value.  ARRAY cannot be used with report.

Used by: execute, report


2      Command Function Variables

These variables enumerate specific functions within the execute, view and report commands.

2.1     EVAL_TYPE

EVAL_TYPE determines what type of Category model evaluation is performed by an execute command.

·         EVAL_TYPE = RULES_ONLY  - Choose label values using Rules only

·         EVAL_TYPE = ML_ONLY         - Choose label values using Machine Learning only

·         EVAL_TYPE = BOTH or FILLIN            - Choose label values using both Rules and Machine Learning (default)

Used by: execute

2.2     REPORT_TYPE

REPORT_TYPE determines what type of report is generated by the report command.


o   Summary Report – Reports summary counts for label assignments.  Creates <output_file>.csv with columns Label, Count and Percent.   

·         REPORT_TYPE = DOCUMENTS  (default)

o   Documents Report – Reports the label assigned to each document.  Creates <output_file>.csv with columns Name, DID, <model>.

·         REPORT_TYPE = TERMS

o   Terms Report – Reports term counts for all documents at or below the CT node specified by REPORT_NODE.  Creates <output_file>.csv with columns Term, Count, Documents.

·         REPORT_TYPE = CLOUD

o   Word Cloud – Generates <output_file>.jpg with a JPEG image of the word cloud, using term counts for all documents at or below the CT node specified by REPORT_NODE.

Command Line Option:  -r <report_type>

Used by: report

2.3     REPORT_NODE

REPORT_NODE = <CT node name> determines which node of the Category Tree (CT) is the subject of a report generated by the report command.

Example:  REPORT_NODE = Positive

Default:  Reports a summary of all nodes

Command Line Option: -n <report_node>

Used by: report


2.4     VIEW_TYPE

VIEW_TYPE determines what type of view is generated for the <model> specified in a view command.

·         VIEW_TYPE = LABEL_TREE (default)

o   Label Tree View –  Creates <output_file>.csv in CT Definition File format with information about the Category Tree stored in <model>.          

·         VIEW_TYPE = RULES 

o   Rules View –  Creates <output_file>.csv in Row-Based Rules Definition File format with information about all Rules stored in <model>.    


o   Pattern View – Outputs the Pattern RE string for a Pattern Model, <model>.


o   Weights View –  Creates <output_file>.csv with 2 columns: Term, Weight with information from the Term Weights stored in <model>.          

·         VIEW_TYPE = TERMS

o   Terms View – Reports term frequencies for the Trained Word Cloud at the CT node specified by REPORT_NODE.  Creates <output_file>.csv with columns Term, Frequency.

·         VIEW_TYPE = CLOUD

o   Word Cloud View – Creates <output_file>.jpg with a JPEG image of the Trained Word Cloud at the CT node specified by REPORT_NODE.

Command Line Option:  -v <view_type>

Used by: view


3      Model Processing Variables

3.1     MULTI_VALUE

MULTI_VALUE determines whether Rule and/or Machine Learning evaluation can select more than one label value.

·         MULTI_VALUE = YES - each document can have multiple selected label values

·         MULTI_VALUE = NO - each document can have only one selected label value.  (Default)

Used by: execute


3.2     MATCH_CASE

MATCH_CASE determines whether regular expression pattern matches for Rules and Patterns are case sensitive or case insensitive.

·         MATCH_CASE = YES   - pattern matches are case sensitive. (Default)

·         MATCH_CASE = NO   - pattern matches are case insensitive.

Used by: train, execute


3.3     MAX_COUNT

MAX_COUNT = <n>, for integer value <n>, sets the maximum term count value used in reports and Confidence value calculations.

·         If MAX_COUNT > 0 and a term appears more than MAX_COUNT times within a document, then the Count value for that term will be set to MAX_COUNT rather than the actual (higher) count value.   

·         If MAX_COUNT = 0, then no maximum is used.

Example:  MAX_COUNT = 1         (All term counts > 1 are set to 1)

Default value: MAX_COUNT = 0   (No maximum)

Used by: execute
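The capping rule can be sketched in Python as follows. The function name is hypothetical and the sketch only illustrates the rule as stated above, not ACRE's implementation.

```python
from collections import Counter


def capped_counts(terms, max_count=0):
    """Count terms in one document, capping each count at max_count if > 0."""
    counts = Counter(terms)
    if max_count > 0:
        return {term: min(count, max_count) for term, count in counts.items()}
    return dict(counts)  # MAX_COUNT = 0: no maximum


doc_terms = ["slow", "slow", "slow", "service"]
uncapped = capped_counts(doc_terms)               # default MAX_COUNT = 0
capped = capped_counts(doc_terms, max_count=1)    # MAX_COUNT = 1
print(uncapped, capped)
```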


4      Machine Learning Variables

4.1     REF_LABEL

REF_LABEL determines how reference labels are determined for the train command.  Reference labels are the label values whose Trained Word Clouds are updated.

·         REF_LABEL = CSV <col>         – Reference labels are in CSV column <col>.

·         REF_LABEL = QREL <file>       – Reference label / DID pairs are listed in <file> using the TREC QRELS format.

·         REF_LABEL = XML                  – Reference labels are specified in XML tag “<category:topics:2.0>”, as per RCV1 standard.

·         REF_LABEL = LABEL <val>      – Label value <val> is the reference label for all inputs.

·         REF_LABEL = RULES               – Rule matches determine reference labels.

Used by: train

4.2     SIMILARITY

SIMILARITY determines which word cloud similarity measure will be used to calculate Confidence values during ML evaluation.

·         SIMILARITY = 1           -  Cosine Similarity  (default)

·         SIMILARITY = 2           -  Euclidean Distance

·         SIMILARITY = 3           -  Sum of Absolute Errors

Used by: execute
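For reference, the three measures have the following textbook definitions, sketched here in Python over term-weight dictionaries. The function names are hypothetical, and how ACRE turns the two distance measures into a Confidence value is not described in this reference, so only the raw measures are shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Square root of the summed squared differences over all terms."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def sum_abs_errors(a, b):
    """Sum of absolute differences over all terms."""
    terms = set(a) | set(b)
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in terms)


doc = {"slow": 3.0}
cloud = {"service": 4.0}
print(cosine_similarity(doc, cloud), euclidean_distance(doc, cloud), sum_abs_errors(doc, cloud))
```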

4.3     THRESHOLD

THRESHOLD = <x>, for floating point value <x>, sets the minimum Confidence value that will be considered significant.  If the Confidence values for all label values are below THRESHOLD, then none are selected and the default label value is assigned.

Example:  THRESHOLD = 0.2

Default value: THRESHOLD = 0

Used by: execute
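A minimal Python sketch of the threshold rule, assuming single-value selection (MULTI_VALUE = NO) and that the highest qualifying Confidence wins. The function name and the default label name "NONE" are hypothetical.

```python
def select_label(confidences, threshold=0.0, default_label="NONE"):
    """Pick the label with the highest Confidence at or above threshold."""
    qualifying = {label: c for label, c in confidences.items() if c >= threshold}
    if not qualifying:
        # Every Confidence value is below THRESHOLD.
        return default_label
    return max(qualifying, key=qualifying.get)


scores = {"Positive": 0.15, "Negative": 0.05}
chosen_default_threshold = select_label(scores)            # THRESHOLD = 0
chosen_strict_threshold = select_label(scores, 0.2)        # THRESHOLD = 0.2
print(chosen_default_threshold, chosen_strict_threshold)
```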


4.4      WEIGHT_TYPE

WEIGHT_TYPE determines the initial weight that each term is given in Confidence calculations before user-defined Term Weights are multiplied in.  Possible values are:

·         WEIGHT_TYPE = Count                         Initial weight = term count in document

·         WEIGHT_TYPE = Log_Count               Initial weight = log2(term count in document + 1)

·         WEIGHT_TYPE = Binary                        Initial weight = 1 if term count > 0 and 0 otherwise

·         WEIGHT_TYPE = Density                      Initial weight = (term count) / (number of terms in document)

Example:  WEIGHT_TYPE = Count

Default value: WEIGHT_TYPE = Density

Used by: train, execute
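The four formulas above can be written out in Python as a quick check of the arithmetic. The function name and argument names are hypothetical, and returning 0 for an empty document under Density is an assumption.

```python
import math


def initial_weight(weight_type, term_count, doc_length):
    """Initial term weight before user-defined Term Weights are applied."""
    if weight_type == "Count":
        return float(term_count)
    if weight_type == "Log_Count":
        return math.log2(term_count + 1)
    if weight_type == "Binary":
        return 1.0 if term_count > 0 else 0.0
    if weight_type == "Density":
        # Assumption: an empty document gives weight 0 rather than an error.
        return term_count / doc_length if doc_length else 0.0
    raise ValueError(f"unknown WEIGHT_TYPE: {weight_type!r}")


# A term that occurs 3 times in a 12-term document.
for weight_type in ("Count", "Log_Count", "Binary", "Density"):
    print(weight_type, initial_weight(weight_type, 3, 12))
```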


4.5     MAX_TERMS

MAX_TERMS = <n>, for integer value <n>, determines the maximum number of terms used in Confidence value calculations.  

·         If MAX_TERMS > 0 then only the MAX_TERMS highest frequency terms are used and all others are ignored.

·         If MAX_TERMS = 0 then all terms are used.

Example:  MAX_TERMS = 20

Default value: MAX_TERMS = 0

Used by: execute
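The MAX_TERMS rule can be sketched in Python as a top-n selection over term frequencies. The function name is hypothetical. Which frequencies ACRE ranks by, and how ties are broken, are not stated in this reference, so the sketch simply keeps the first terms returned by Counter.most_common.

```python
from collections import Counter


def top_terms(frequencies, max_terms=0):
    """Keep only the max_terms highest-frequency terms; 0 keeps all terms."""
    if max_terms > 0:
        return dict(Counter(frequencies).most_common(max_terms))
    return dict(frequencies)


frequencies = {"service": 5, "slow": 3, "staff": 1}
limited = top_terms(frequencies, max_terms=2)
unlimited = top_terms(frequencies)   # default MAX_TERMS = 0
print(limited, unlimited)
```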
