ACRE v1.3 Command Reference

 

This document specifies formats and options for ACRE v1.3 commands.  ACRE v1.3 is executed in a Python 2.7 environment using module acre.py.  There are 7 commands (create, modify, view, train, documents, execute, report), and one of these command words must be the first argument after acre.py.  Command words can be shortened to any unique prefix, including a single letter.  The first time acre.py is executed, environment setup adds some execution delay.
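The prefix rule for command words can be illustrated with a short Python sketch.  The function name resolve_command is hypothetical and this is not the code used inside acre.py; it also assumes command words are matched in lower case.

```python
# The 7 command words listed in this reference.
COMMANDS = ["create", "modify", "view", "train", "documents", "execute", "report"]

def resolve_command(word):
    """Return the full command word for a unique prefix, or raise ValueError."""
    matches = [c for c in COMMANDS if c.startswith(word)]
    if len(matches) != 1:
        raise ValueError("unknown or ambiguous command: %r" % word)
    return matches[0]

print(resolve_command("c"))    # create
print(resolve_command("doc"))  # documents
```

Because all 7 command words start with different letters, every single-letter abbreviation is unique.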

Some of the command line options specified here have equivalent Options Variables that can be specified in the Options File.  If the same option is specified in both places, the command line value is used.  If the command line option is omitted, the Options Variable is used.  If both are omitted, the default value is used.  See the ACRE v1.3 Options Reference for more information.
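The precedence order described above (command line, then Options File, then default) can be sketched as follows.  The function resolve_option is a hypothetical helper for illustration, with None standing for "not specified"; it does not describe how acre.py is implemented.

```python
def resolve_option(cli_value, options_file_value, default):
    """Pick the effective value: command line first, then Options File, then default."""
    if cli_value is not None:
        return cli_value
    if options_file_value is not None:
        return options_file_value
    return default

# Example using the report type option (-r / REPORT_TYPE, default DOCUMENTS).
print(resolve_option("SUMMARY", "TERMS", "DOCUMENTS"))  # SUMMARY
print(resolve_option(None, "TERMS", "DOCUMENTS"))       # TERMS
print(resolve_option(None, None, "DOCUMENTS"))          # DOCUMENTS
```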

Contents

1     Global options

2     ACRE Commands

2.1      create

2.2      modify

2.3      view

2.4      train

2.5      documents

2.6      execute

2.7      report

3     User Input File Formats

3.1      Category Tree Definition Files

3.2      Rules Definition Files

3.3      Term Weights Files

4     Regular Expressions

4.1      Category Model REs

4.2      Pattern Model REs

 

1      Global options

The following options can be specified on any command:

·         -i <options_file> - Specifies the name of the Options File.  If omitted, the Options File name defaults to acre_options.txt.

·         -o <output file> - Specifies name of the Output File.   Extension may be omitted and ACRE will add it as appropriate.

o   Options Variable: OUTPUT_FILE = <output file>

o   Default: <model>-<command>.<ext>

·         -l <log_file> - Specifies name of the Log File. 

o   Options Variable: LOG_FILE = <log_file>

o   Default: acre-log<N>.txt

2      ACRE Commands

2.1     create

Format:   python acre.py create <model>  <… options …>

Purpose: Creates a new ACRE model.

Positional Arguments:  

<model> = ACRE Label Model name

Options:

·         -c <CT-file>               - Create Category Model with CT specified in <CT-file>.

·         -r <rules-file>            - Define Rules as specified in <rules-file>

·         -p <pattern-string>   - Create Pattern Model with pattern <pattern-string>

Either -c or -p is required, but not both.  -r can only be used with -c.
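The option constraints for create can be expressed as a small check.  This is a minimal sketch with a hypothetical function name and arguments, where None means the option was not given; the actual error handling in acre.py may differ.

```python
def check_create_options(ct_file=None, rules_file=None, pattern=None):
    """Return a list of error messages for a 'create' option combination."""
    errors = []
    # Exactly one of -c and -p must be present.
    if (ct_file is None) == (pattern is None):
        errors.append("exactly one of -c or -p is required")
    # -r is only meaningful for Category Models.
    if rules_file is not None and ct_file is None:
        errors.append("-r can only be used with -c")
    return errors

print(check_create_options(ct_file="my_tree.csv", rules_file="my_rules.csv"))  # []
print(check_create_options(pattern="get ([A-Za-z]+) here"))                    # []
print(check_create_options(pattern="x", rules_file="my_rules.csv"))
```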

Examples:

      python acre.py create cat_model -c my_tree.csv -r my_rules.csv

      python acre.py c pat_model -p "get ([A-Za-z]+) here"

Output Files:  

·         ACRE Label Model file (.alm)

2.2     modify

Format:   python acre.py modify <model>  <… options …>

Purpose: Modifies an existing ACRE model.

Positional Arguments:  

<model> = ACRE Label Model name  (<model>.alm file must exist)

Options:

·         -r <rules-file>            - Replace Rules with those specified in <rules-file>

·         -p <pattern-string>   - Replace Pattern with <pattern-string>

·         -w <weights-file>      - Replace Term Weights with those specified in <weights-file>

Examples:

      python acre.py modify cat_model -r my_rules.csv -w my_weights.csv

      python acre.py m pat_model -p "go ([A-Za-z]+)!"

Output Files:  

·         Existing ACRE Label Model file (.alm) is modified.

2.3     view

Format:   python acre.py view <model>  <… options …>

Purpose: Outputs information from an existing ACRE model.

Positional Arguments:  

<model> = ACRE Label Model name  (<model>.alm file must exist)

Options:

·         -v <view-type>          - Generate output according to <view-type> value:

o   LABEL_TREE - Creates <output_file>.csv in CT Definition File format with information about the Category Tree stored in <model>.    

o   RULES  - Creates <output_file>.csv in Row-Based Rules Definition File format with information about all Rules stored in <model>.    

o   PATTERN - Outputs the Pattern RE string to console output.

o   WEIGHTS - Creates <output_file>.csv with 2 columns: Term, Weight with information from the Term Weights stored in <model>.     

o   TERMS - Reports term frequencies for the Trained Word Cloud at CT node specified by <CT-node-name>.  Creates <output_file>.csv with columns Term, Frequency.

o   CLOUD - Creates <output_file>.jpg with JPEG image of the Trained Word Cloud at CT node specified by <CT-node-name>

o   Equivalent Option Variable: VIEW_TYPE

·         -n <CT-node-name> - Specifies the node in CT tree to be used in TERMS or CLOUD.

o   Equivalent Option Variable: REPORT_NODE

Examples:

      python acre.py view cat_model -v RULES -o rules-out.csv

      python acre.py v pat_model -v PATTERN

      python acre.py v cat_model -v C -n Label5 -o Label5-cloud.jpg

Output Files:  

·         Generates output file specified by -o or OUTPUT_FILE

2.4     train

Format:   python acre.py train <model>  <… options …>

Purpose: Trains an existing ACRE model, which updates one or more Trained Word Clouds

Positional Arguments:  

<model> = ACRE Label Model name  (<model>.alm file must exist)

Options:

·         -d <document/folder> - specifies input folder or CSV file

o   Equivalent Options Variable: DOCUMENTS = <document/folder>

·         -r  - Train by Rule Match.  Each rule match defines a reference label.

o   Equivalent Options Variable setting: REF_LABEL = RULES

·         -z  - (Zero clouds)  All Trained Word Clouds will be cleared before training.

If “-r” is not specified, then reference label values will depend on the value of the REF_LABEL options variable (see ACRE Options Reference)

Examples:

      python acre.py train cat_model -d survey_input.csv -r

      python acre.py t cat_model -d data\doc-folder

Output Files:  

·         Updates Trained Word Clouds in <model>.alm

·         Generates log file specified by -l or LOG_FILE

2.5     documents

Format:   python acre.py documents <… options …>

Purpose: Reads a document set, applies drop list and stemming, generates AJO file containing the term frequencies and word cloud for the document set.   No ACRE model is used.

Options:

·         -d <document/folder> - specifies input folder or CSV file

o   Equivalent Options Variable: DOCUMENTS = <document or folder>

·         -j <ajo_file> - Specifies name of the output AJO File. 

o   Equivalent Options Variable: AJO_FILE = <ajo-file>

Examples:

      python acre.py documents -d survey_input.csv -j my_doc_terms

      python acre.py d -d data\doc-folder

Output Files:  

·         Generates AJO file specified by -j or AJO_FILE

2.6     execute

Format:   python acre.py execute <model> <… options …>

Purpose: Reads a document set, executes <model> on the document set using evaluation type specified by EVAL_TYPE, generates (a) output file containing Label Table, (b) ACRE Job Output (AJO) file with all execution results, (c) log file documenting execution.

Positional Arguments:  

<model> = ACRE Label Model name  (<model>.alm file must exist)

Options:

·         -a        - (append) instead of creating the Output File, write the Label Table values into a new column of the input CSV file.

·         -t         - (test) compare each selected label value against a reference value specified by REF_LABEL.  Report on the percentage match between these values.

·         -d <document/folder> - specifies input folder or CSV file

o   Equivalent Options Variable: DOCUMENTS = <document/folder>

·         -j <ajo_file> - Specifies name of the AJO File. 

o   Equivalent Options Variable: AJO_FILE = <ajo-file>
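The percentage match reported by the test option can be pictured with the following sketch.  The function percent_match and the sample labels are hypothetical; this reference does not specify how ACRE counts or rounds the result, so treat this only as one plausible reading.

```python
def percent_match(selected, reference):
    """Percentage of documents whose selected label equals the reference label."""
    if len(selected) != len(reference):
        raise ValueError("label lists must have the same length")
    if not selected:
        return 0.0
    hits = sum(1 for s, r in zip(selected, reference) if s == r)
    return 100.0 * hits / len(selected)

# Illustrative labels only: 3 of the 4 documents agree.
selected = ["Positive", "Negative", "Positive", "Neutral"]
reference = ["Positive", "Negative", "Negative", "Neutral"]
print(percent_match(selected, reference))  # 75.0
```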

Examples:

      python acre.py execute cat_model -d survey_input.csv -j my_doc_terms -a

      python acre.py e pat_model -d data\doc-folder -t

Output Files:  

·         Generates AJO file specified by -j or AJO_FILE

·         Generates log file specified by -l or LOG_FILE

·         Generates output file specified by -o or OUTPUT_FILE - if “-a” is not specified.

2.7     report

Format:   python acre.py report <… options …>

Purpose: Reads an AJO file and generates a report corresponding to REPORT_TYPE.

Options:

·         -r <report-type>        - Generate report according to <report-type> value:

o   DOCUMENTS  - Creates <output_file>.csv containing Label Table results -  columns are “File Name”, “DID”, <model>.   Additional columns will be present if INCLUDE_TEXT and/or ADD_COLS are defined.

o   SUMMARY - Creates <output_file>.csv containing summary counts for each label value  -  columns are “Label”, “Count”, “Percent”.      

o   TERMS - Reports on term frequencies for all documents assigned to or below CT node <report-node-name>.  Creates <output_file>.csv with columns Term, Frequency.

o   CLOUD - Creates <output_file>.jpg with JPEG image of the word cloud consisting of all terms for all documents assigned to or below CT node <report-node-name>.  

o   LABEL_TREE - Creates <output_file>.csv in CT Definition format, containing all unique label values processed.   This Category Tree can then be used with other label models.

o   Equivalent Options Variable: REPORT_TYPE = <report-type>

o   Default value: DOCUMENTS

·         -n <report-node>  - Specifies the node in CT tree to be used in TERMS or CLOUD.

o   Equivalent Options Variable: REPORT_NODE = <report-node>

o   Default value: the root node of the CT tree.

·         -j <ajo_file> - Specifies name of the input AJO File. 

o   Equivalent Options Variable: AJO_FILE = <ajo-file>

o   Default value: the last AJO file created.
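As an illustration of the SUMMARY report columns (“Label”, “Count”, “Percent”), the sketch below computes the same three values from a list of label values.  The function summarize and the sample labels are hypothetical, and the row order and rounding used by ACRE are not stated in this reference.

```python
from collections import Counter

def summarize(labels):
    """Return (label, count, percent) rows, most frequent label first."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, n, 100.0 * n / total) for label, n in counts.most_common()]

# Illustrative label values for four documents.
for row in summarize(["Positive", "Negative", "Positive", "Positive"]):
    print(row)
```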

Examples:

      python acre.py report -r SUMMARY

      python acre.py r  -j my_doc_terms -r D

      python acre.py r  -r C -n Positive

Output Files:  

·         Generates output file specified by -o or OUTPUT_FILE

3      User Input File Formats

Here we summarize file formats for each type of user-generated file that may be needed to provide Category Model information to an ACRE Command.   Pattern Models do not require any special user input files.

ACRE software assumes that the first line of every CSV file will be a header line.

3.1     Category Tree Definition Files

A CT Definition File is a CSV file with 4 columns: Node, Default, Description and Parent.

Each line after the header defines one category tree node.  Node names can contain only letters, numbers and underscores [A-Za-z0-9_].  One leaf node can be marked “Y” in the Default column to define the Default label value.  For nodes below the first tree level, the Parent column value specifies the name of the node above them.

Figure 1: CT Definition Files

Figure 1 shows two CT Definition files, with the corresponding Category Tree shown above each.
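The CT Definition File rules can be checked with a short script before running create.  The sketch below is written for Python 3 and uses made-up file contents (not those of Figure 1).  It assumes that first-level nodes leave the Parent column empty, and check_ct_definition is a hypothetical helper, not part of ACRE.

```python
import csv
import io
import re

NODE_NAME = re.compile(r"^[A-Za-z0-9_]+$")  # letters, numbers and underscore

# Hypothetical CT Definition File contents for illustration.
SAMPLE = """Node,Default,Description,Parent
Positive,,Favorable comments,
Negative,,Unfavorable comments,
Price,,Complaints about price,Negative
Other,Y,Everything else,
"""

def check_ct_definition(text):
    """Return a list of problems found in CT Definition File text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = set(row["Node"] for row in rows)
    parents = set(row["Parent"] for row in rows if row["Parent"])
    problems = []
    for row in rows:
        if not NODE_NAME.match(row["Node"]):
            problems.append("bad node name: %s" % row["Node"])
        if row["Parent"] and row["Parent"] not in names:
            problems.append("unknown parent: %s" % row["Parent"])
    defaults = [row["Node"] for row in rows if row["Default"] == "Y"]
    if len(defaults) > 1:
        problems.append("more than one Default node")
    for name in defaults:
        if name in parents:  # a node that has children is not a leaf
            problems.append("Default node is not a leaf: %s" % name)
    return problems

print(check_ct_definition(SAMPLE))  # []
```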

3.2     Rules Definition Files

The simplest format  for defining Rules is the Column-Based Rules Definition format shown on the left in Figure 2.  With this format, Rule Patterns for each label value are arranged in a single column with the label value name at the top.  The advantage of this format is that it makes it easy to enter lists of rule patterns.  The disadvantage of this format is that Rule Weights are not specified and are assumed to all be equal to 1.0.

The other format is the Row-Based Rules Definition format shown on the right in Figure 2.  With this format, the first line must be the column headers “Label Value”, “Rule Pattern” and “Weight”.  Then, on each row, the label value, rule pattern and rule weight for one rule is specified.   This format allows the user to enter rule weights, but requires a bit more typing.

Figure 2: Rules Definition Files
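The relationship between the two formats can be shown by converting a Column-Based file into Row-Based rows, with every weight set to 1.0 as described above.  This is a Python 3 sketch with made-up rule patterns (not those of Figure 2).  It assumes that shorter columns are padded with empty cells, and column_to_row_rules is a hypothetical helper.

```python
import csv
import io

# Hypothetical Column-Based Rules Definition contents for illustration.
COLUMN_BASED = """Positive,Negative
good,bad
great,terrible
,awful
"""

def column_to_row_rules(text):
    """Convert column-based rules text to row-based rows with weight 1.0."""
    rows = list(csv.reader(io.StringIO(text)))
    labels, body = rows[0], rows[1:]
    out = [["Label Value", "Rule Pattern", "Weight"]]
    for col, label in enumerate(labels):
        for row in body:
            # Shorter columns leave empty cells; skip them.
            if col < len(row) and row[col]:
                out.append([label, row[col], "1.0"])
    return out

for line in column_to_row_rules(COLUMN_BASED):
    print(line)
```

The resulting rows could then be written out with csv.writer and edited to assign weights other than 1.0.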

3.3     Term Weights Files

Term Weights files are CSV files containing two columns: Term and Weight.  The Weight value will be a floating point value.

4      Regular Expressions

ACRE v1.3 uses Python 2.7 Regular Expressions.  More information about Regular Expressions in general can be found in the Python 2.7 documentation for the re module.

4.1     Category Model REs

The Rule Patterns stored in ACRE Category Models are Python 2.7 regular expressions that are evaluated during model execution.

In addition, the MATCH_CASE options variable can be used to specify whether letter case (upper or lower) is significant in Rule Pattern matches.

·         MATCH_CASE = YES (default)

o   Letter case is significant, so Rule Pattern “park” would not match document text “Park”, for example.

·         MATCH_CASE = NO  (this can be specified in Options File)

o   Letter case is not significant, so Rule Pattern “park” would match document text “PaRk”, for example.
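The effect of MATCH_CASE can be reproduced with the standard re module.  The sketch below assumes that a Rule Pattern may match anywhere in the document text (re.search) and that MATCH_CASE = NO corresponds to the re.IGNORECASE flag; neither detail is confirmed by this reference, and rule_matches is a hypothetical helper.

```python
import re

def rule_matches(pattern, text, match_case=True):
    """Return True if the Rule Pattern is found anywhere in the text."""
    flags = 0 if match_case else re.IGNORECASE
    return re.search(pattern, text, flags) is not None

print(rule_matches("park", "Park"))                    # False
print(rule_matches("park", "PaRk", match_case=False))  # True
```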

4.2     Pattern Model REs

The Pattern Regular Expressions stored in ACRE Pattern Models are evaluated according to the Python 2.7 RE specifications, with the addition that the matched text for the first parenthesized subexpression (the Capture RE) is captured as the label value.  For example:

·         IF Pattern RE = “My name is ([A-Za-z]+)”

o   IF Input text contains “My name is Greg” THEN the label value = “Greg”. 

If the user wants to match literal parentheses in the input text, then this can be specified by preceding each parenthesis with a backslash.  For example:

·         IF Pattern RE = “My \([A-Za-z]+\) is ([A-Za-z]+)”

o   IF Input text contains “My name is Greg” THEN this is NOT a match. 

o   IF Input text contains “My (name) is Greg” THEN the label value = “Greg”

If there are multiple sets of capturing parentheses, then the first set determines the label value.  For example:

·         IF Pattern RE = “My ([A-Za-z]+) is ([A-Za-z]+)”

o   IF Input text contains “My name is Greg” THEN the label value = “name”.

Finally, if the user wants to use non-capturing parentheses within the regular expression, then the user should replace “(” with “(?:”.   For example:

·         IF Pattern RE = “My (?:[A-Za-z]+) is ([A-Za-z]+)”

o   IF Input text contains “My name is Greg” THEN the label value = “Greg”.
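The examples in this section can be tried directly with the re module.  The sketch below assumes the pattern may match anywhere in the input text (re.search), and pattern_label is a hypothetical helper rather than the function used by ACRE.

```python
import re

def pattern_label(pattern, text):
    """Return the text captured by the first group, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(pattern_label(r"My name is ([A-Za-z]+)", "My name is Greg"))             # Greg
print(pattern_label(r"My \([A-Za-z]+\) is ([A-Za-z]+)", "My name is Greg"))    # None
print(pattern_label(r"My \([A-Za-z]+\) is ([A-Za-z]+)", "My (name) is Greg"))  # Greg
print(pattern_label(r"My ([A-Za-z]+) is ([A-Za-z]+)", "My name is Greg"))      # name
print(pattern_label(r"My (?:[A-Za-z]+) is ([A-Za-z]+)", "My name is Greg"))    # Greg
```

Note that this helper expects the pattern to contain at least one capturing group; a pattern with none would raise IndexError at match.group(1).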