ACRE Architecture



Users initiate User Jobs, which execute in the background. Each Job can run locally or on cloud computing services, which allows the system to scale. There are four job types:

o   Load

Performs ETL (extract, transform, load) tasks: brings external data onto the server, then parses and converts it. May split text into smaller units (phrases, sentences, etc.).

Inputs:  An external data source and one Data Source Definition (DSD) entry, which defines load parameters.

Outputs: Extracted Text items, which store data fields for each document. 
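
As a rough illustration of the Load job, the sketch below reads delimited text, keeps the fields named in a definition record, and optionally splits the text field into sentences. The names DataSourceDefinition, ExtractedText, and load_csv are hypothetical, as are the DSD fields shown; the actual ACRE load parameters and storage format are not specified here.

```python
import csv
import io
import re
from dataclasses import dataclass, field


@dataclass
class DataSourceDefinition:
    """Hypothetical DSD entry: which column holds the text, which fields to keep."""
    text_field: str
    keep_fields: list = field(default_factory=list)
    split_sentences: bool = False


@dataclass
class ExtractedText:
    """Hypothetical Extracted Text item: stored fields for one document or unit."""
    doc_id: str
    text: str
    fields: dict


def load_csv(stream, dsd):
    """Parse CSV rows into ExtractedText items according to the DSD."""
    items = []
    for row_num, row in enumerate(csv.DictReader(stream), start=1):
        text = row[dsd.text_field].strip()
        kept = {name: row[name] for name in dsd.keep_fields}
        if dsd.split_sentences:
            # Naive split: break after ., ! or ? when followed by whitespace.
            units = [u for u in re.split(r"(?<=[.!?])\s+", text) if u]
        else:
            units = [text]
        for unit_num, unit in enumerate(units, start=1):
            items.append(ExtractedText(f"{row_num}.{unit_num}", unit, kept))
    return items


source = io.StringIO("id,body\n7,Pump failed. Replaced the seal.\n")
dsd = DataSourceDefinition(text_field="body", keep_fields=["id"], split_sentences=True)
for item in load_csv(source, dsd):
    print(item.doc_id, item.text, item.fields)
```

A production loader would likely use a proper sentence tokenizer instead of the regular expression; the regex only keeps the sketch dependency-free.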

o   Natural Language Processing (NLP) 

ACRE uses the Natural Language Toolkit (NLTK) as well as custom code. The NLP functions are:

+        Stemming - Convert words to their root form. 

+        Stop List - List of words to be eliminated from analysis.

+        Go List - List of words to be exclusively used in analysis.

+        Part-of-Speech Identification - Marks words with their part of speech (e.g., noun, verb, adverb, adjective).

+        Named Entity Identification - Marks named entities. 

+        Min/Max Counts - Minimum / maximum word counts to be used in analysis.

+        Synonym Lists - List of alternate words/abbreviations/spellings equivalent to a given keyword.  

+        Bi-Grams - Includes 2-word phrases in default analysis.

Inputs:  Extracted Text items and an NLP Profile, which specifies the set of NLP functions to apply.

Outputs: A Term Vector for each document, containing a list of entries. Each entry holds a term (possibly modified by stemming), a count (number of occurrences), a weight (configurable), and tags (POS / NE).
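
To make the Term Vector output more concrete, here is a minimal sketch under stated assumptions. It is pure Python: toy_stem is a crude stand-in for a real stemmer such as those in NLTK, the weight is a fixed placeholder, and POS / NE tags are left empty. NLPProfile, TermEntry, and build_term_vector are hypothetical names, not ACRE's actual interfaces, and only a subset of the NLP functions listed above is covered.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class NLPProfile:
    """Hypothetical profile: which NLP functions to apply."""
    stem: bool = True
    stop_list: frozenset = frozenset()
    bigrams: bool = False
    min_count: int = 1
    max_count: int = 10**9  # effectively no upper bound


@dataclass
class TermEntry:
    term: str
    count: int
    weight: float = 1.0                       # placeholder weight
    tags: list = field(default_factory=list)  # POS / NE tags not computed here


def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_term_vector(text, profile):
    """Tokenize, apply the profile's functions, and return TermEntry items."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in profile.stop_list]
    if profile.stem:
        tokens = [toy_stem(t) for t in tokens]
    counts = Counter(tokens)
    if profile.bigrams:
        # Bi-grams are formed after stop-word removal, so they may span a removed word.
        counts.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    return [
        TermEntry(term, n)
        for term, n in sorted(counts.items())
        if profile.min_count <= n <= profile.max_count
    ]


profile = NLPProfile(stop_list=frozenset({"the", "a"}), bigrams=True)
for entry in build_term_vector("The pump failed; the pumps leaked.", profile):
    print(entry.term, entry.count)
```

Here "pump" and "pumps" collapse to a single term with count 2, which is the effect stemming is meant to have on the vector.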

o   Train Model

Trains a Machine Learning model by updating its stored Trained Word Cloud (TWC).  

Inputs:  A label model and a set of term vectors for the training documents, with a reference label for each.

Outputs: The model's updated Trained Word Cloud.
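
The internal structure of the Trained Word Cloud is not described here, so the following is only a sketch under a loud assumption: the TWC is modelled as a mapping from label to term to accumulated weight, and training adds count times weight for each term of each labelled document. The function name train and the tuple layout of the term vector are hypothetical.

```python
from collections import defaultdict


def train(twc, training_docs):
    """Update a Trained Word Cloud in place and return it.

    ASSUMPTION: the TWC is modelled as {label: {term: accumulated weight}};
    the real structure is not specified in this document.
    training_docs: iterable of (term_vector, label) pairs, where term_vector
    is a list of (term, count, weight) tuples.
    """
    for term_vector, label in training_docs:
        for term, count, weight in term_vector:
            twc[label][term] += count * weight
    return twc


twc = defaultdict(lambda: defaultdict(float))
train(twc, [
    ([("pump", 2, 1.0), ("leak", 1, 1.0)], "mechanical"),
    ([("invoice", 3, 1.0)], "billing"),
])
# A later Train Model job adds to the same stored cloud instead of replacing it.
train(twc, [([("pump", 1, 1.0)], "mechanical")])
print(dict(twc["mechanical"]))
```

Updating in place mirrors the description above, in which training updates the model's stored cloud.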

o   Execute Model

Executes a label model. 

Inputs:  One label model and one document set of Extracted Text items or Term Vectors.

Outputs: (a) one or more labels assigned to each document; (b) summary label counts calculated for each node of the Category Tree (CT); (c) term frequency tables and word clouds for each document and for every node in the CT.
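
Outputs (a) and (b) can be sketched as follows, again under assumptions that are not taken from ACRE itself: a document's score for a label is the sum of count times trained weight over shared terms, every label scoring above a threshold is assigned, and the Category Tree is represented as a child-to-parent mapping. The names label_document, rollup_counts, and parent_of are hypothetical.

```python
from collections import Counter


def label_document(term_vector, twc, threshold=0.0):
    """Return the labels whose score exceeds the threshold, best first.

    ASSUMPTION: score = sum(count * trained weight) over shared terms.
    term_vector: list of (term, count) pairs; twc: {label: {term: weight}}.
    """
    scores = {
        label: sum(count * cloud.get(term, 0.0) for term, count in term_vector)
        for label, cloud in twc.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, score in ranked if score > threshold]


def rollup_counts(doc_labels, parent_of):
    """Count labelled documents per Category Tree node, rolling up to ancestors."""
    counts = Counter()
    for labels in doc_labels:
        nodes = set()
        for label in labels:
            node = label
            while node is not None:
                nodes.add(node)
                node = parent_of.get(node)
        counts.update(nodes)  # each document counts at most once per node
    return counts


twc = {"pump": {"pump": 3.0, "seal": 1.0}, "billing": {"invoice": 2.0}}
parent_of = {"pump": "mechanical", "mechanical": None, "billing": None}
docs = [[("pump", 2), ("seal", 1)], [("invoice", 1)], [("pump", 1), ("invoice", 1)]]
doc_labels = [label_document(tv, twc) for tv in docs]
print(doc_labels)
print(rollup_counts(doc_labels, parent_of))
```

The third document receives two labels, which matches the "one or more labels" wording of output (a); how ACRE actually scores documents and selects labels may differ.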