Gartner defines dark data as the information assets that organizations collect, process and store during regular business activities, but generally fail to use for other purposes (such as analytics, business relationships and direct monetizing).

Like dark matter in physics, dark data often comprises the majority of a company's information assets. Organizations often retain dark data for compliance purposes only.  However, storing, maintaining and securing dark data typically incurs more expense and potential risk than it adds value.

Much of the dark data lurking in corporate repositories is dark text: unstructured content that is not centrally managed. Manually analyzing and classifying it is impractical at this scale, and the volume keeps growing.

What is needed is an automated solution that efficiently summarizes dark text content, evaluates its sensitivity, and auto-tags it for proper classification, retention, and subsequent life cycle management and disposition.

Vertical Data has an effective solution to this problem.   The Dark Text Analyzer (DTA) shines a light on dark text with easy-to-use tools that:

 

Summarize Files by Properties and Word Clouds

As it loads files, the DTA maps them into configurable data bins for each file property, including size, creation date and last-modified date. For spreadsheet files, the numbers of rows and columns are also binned. The user can immediately view summary graphs, charts and crosstabs of this data to gain an understanding of the file set. For example, clicking the graph icon for a property displays a summary graph of its bins.
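
To make the binning idea concrete, here is a minimal sketch of mapping file sizes into bins and counting the files in each one. The bin edges, labels and function names are assumptions chosen for illustration; they are not DTA's actual configuration or API.

```python
from bisect import bisect_right

# Hypothetical bin edges in bytes (decimal units). DTA's real bins are
# configurable; these values are only an example.
SIZE_EDGES = [10_000, 1_000_000, 100_000_000]
SIZE_LABELS = ["under 10 KB", "10 KB to 1 MB", "1 MB to 100 MB", "100 MB and over"]


def size_bin(num_bytes):
    """Return the label of the bin that contains num_bytes."""
    # bisect_right counts how many edges are <= num_bytes, which is the bin index.
    return SIZE_LABELS[bisect_right(SIZE_EDGES, num_bytes)]


def bin_counts(sizes):
    """Count how many file sizes fall into each bin (the data behind a summary graph)."""
    counts = {label: 0 for label in SIZE_LABELS}
    for num_bytes in sizes:
        counts[size_bin(num_bytes)] += 1
    return counts
```

The same approach would extend to dates or spreadsheet row counts by swapping in a different list of edges.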

 

Word clouds and word frequency tables are available with a single click for any file, folder, folder tree, property bin, or other label value.
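
A word frequency table is the data that sits behind a word cloud. The sketch below shows one simple way to build such a table; the tokenizer, the stop-word list and the function name are illustrative assumptions, not a description of how DTA computes its tables.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real tool would use a much longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def word_frequencies(text, top_n=10):
    """Return the top_n (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)
```

Running the same function over the concatenated text of a folder or label group would give the table for that group.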

Scan and Certify Text for Sensitivity

DTA can scan files against any pattern or keyword list. Example patterns and lists are shown below; additional content is under development, and custom models are easy to define.

Patterns: Social Security Number (pre-2011 validated), Credit Cards (Luhn validated), Prices, E-mail Address, SKUs, Zip Codes

Lists: Weapons, Cities/Countries, Adult, Angry, Financial Terms, Legal Terms
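
For readers unfamiliar with what "Luhn validated" means for the credit card pattern, the sketch below pairs a loose candidate regex with the standard Luhn checksum, so that digit runs which merely look like card numbers are rejected. The regex and function names are assumptions for illustration; DTA's own pattern definitions are not shown in this document.

```python
import re

# Hypothetical candidate pattern: 13 to 19 digits, optionally separated
# by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number):
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return candidate card numbers in text that pass the Luhn check."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group())]
```

The checksum filters out most random digit strings, which reduces false positives compared with a regex alone.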

 

 

 

Documents in scan match groups can be inspected further, quarantined or auto-deleted.   Document sets that have been certified to be free of the scan criteria are logged.  All metadata results are saved and can be exported for further analysis.
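
The scan-and-certify flow described above can be sketched as follows: each document is checked against a pattern and a keyword list, matching documents are collected into match groups, and documents with no hits are reported as clean. All names here are hypothetical. The Social Security Number check uses only the basic structural rules (no 000, 666 or 9xx area, no 00 group, no 0000 serial); the stricter pre-2011 area-number checks mentioned in the pattern list are omitted, and the keyword list is a tiny stand-in.

```python
import re

SSN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

# Hypothetical keyword list standing in for a full "Legal Terms" list.
KEYWORD_LISTS = {"legal_terms": {"plaintiff", "subpoena", "indemnify"}}


def ssn_plausible(area, group, serial):
    """Apply basic structural SSN rules (not the full pre-2011 validation)."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def scan(text):
    """Return the set of scan criteria that text matches."""
    hits = set()
    if any(ssn_plausible(*m.groups()) for m in SSN.finditer(text)):
        hits.add("ssn")
    words = set(re.findall(r"[a-z]+", text.lower()))
    for name, terms in KEYWORD_LISTS.items():
        if words & terms:
            hits.add(name)
    return hits


def scan_corpus(docs):
    """docs maps name -> text. Return (match_groups, clean_names)."""
    match_groups, clean = {}, []
    for name, text in docs.items():
        hits = scan(text)
        if not hits:
            clean.append(name)  # candidate for certification as free of the criteria
        for hit in hits:
            match_groups.setdefault(hit, []).append(name)
    return match_groups, clean
```

In this sketch the caller decides what to do with each match group, such as inspecting, quarantining or deleting its documents.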

Search Documents

Users can quickly find specific documents or subsets within a large corpus using search. DTA provides customized searches, and results can be filtered by any labeled group.

Find Similar searches look for documents whose contents (word clouds) are most similar to a set of one or more reference documents.  Results are displayed with the closest match (highest Confidence) at the top.
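
One common way to compare documents by their word content is cosine similarity over word counts. The sketch below ranks a corpus against a set of reference documents on that basis. It is an assumption about how such a search could work, not a description of DTA's actual similarity measure or of how its Confidence score is computed.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors, from 0.0 to 1.0."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_similar(references, corpus):
    """Rank corpus documents (name -> text) by similarity to the reference texts."""
    ref = Counter()
    for text in references:
        ref.update(word_counts(text))
    scored = [(name, cosine(ref, word_counts(text))) for name, text in corpus.items()]
    # Closest match first, mirroring the "highest Confidence at the top" display.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```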

 

Group Documents

DTA creates labeled document groups, where documents with matching labels are in the same group.   Group counts and word clouds can provide insight into the structure of the data.   Documents can be restricted to a single group, or can join multiple groups.

Rules models provide the simplest document grouping: they group together documents that contain the same keywords, match the same patterns or match the same search queries.
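
As a rough illustration of rule-based grouping, the sketch below defines each rule as a keyword or regex test and assigns every document to the groups whose rules it satisfies. The rule names, the helper functions and the single_group switch are hypothetical; they only mirror the behavior described in this section.

```python
import re


def keyword_rule(*words):
    """Build a rule that matches if any of the given words appears in the text."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    return lambda text: bool(pattern.search(text))


def pattern_rule(regex):
    """Build a rule that matches if the regex is found in the text."""
    compiled = re.compile(regex)
    return lambda text: bool(compiled.search(text))


# Hypothetical rules model: label -> rule.
RULES = {
    "invoice": keyword_rule("invoice", "remittance"),
    "has_email": pattern_rule(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def apply_rules(docs, rules, single_group=False):
    """Group docs (name -> text) by matching rules. Return (groups, ungrouped)."""
    groups = {label: [] for label in rules}
    ungrouped = []
    for name, text in docs.items():
        labels = [label for label, rule in rules.items() if rule(text)]
        if single_group:
            labels = labels[:1]  # keep only the first matching group
        if not labels:
            ungrouped.append(name)
        for label in labels:
            groups[label].append(name)
    return groups, ungrouped
```

Documents that satisfy no rule end up in the ungrouped list.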

ML Extensions

If a significant number of documents remain ungrouped after a Rules model has been executed, the Machine Learning Extension feature can be enabled. It automatically trains a new machine learning model from the currently grouped documents, then uses this model to add further documents with similar word clouds to each group.
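
The idea of training on already-grouped documents and then extending the groups can be sketched with a small multinomial Naive Bayes classifier over word counts. This is a stand-in technique chosen for illustration; the document does not say which model DTA trains, and all function names here are hypothetical. The sketch also assigns every ungrouped document to its best-scoring group, whereas a practical tool would likely leave low-confidence documents ungrouped.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def train(grouped):
    """grouped maps label -> non-empty list of texts. Return a Naive Bayes model."""
    counts = {label: Counter() for label in grouped}
    for label, texts in grouped.items():
        for text in texts:
            counts[label].update(tokens(text))
    vocab = set().union(*counts.values())
    total_docs = sum(len(texts) for texts in grouped.values())
    priors = {label: math.log(len(texts) / total_docs) for label, texts in grouped.items()}
    return {"counts": counts, "vocab": vocab, "priors": priors}


def predict(model, text):
    """Return the label with the highest log-probability for text."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["counts"].items():
        total = sum(counts.values())
        score = model["priors"][label]
        for word in tokens(text):
            if word in model["vocab"]:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def extend_groups(grouped, ungrouped):
    """Train on grouped texts, then add each ungrouped text to its predicted group."""
    model = train(grouped)
    extended = {label: list(texts) for label, texts in grouped.items()}
    for text in ungrouped:
        extended[predict(model, text)].append(text)
    return extended
```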

Any ACRE label model can be defined, tested, validated and deployed as a part of the DTA analysis.   Documents with known risks can be used to train machine learning models to evaluate their similarity to other documents.  The possibilities for generating additional high-value metadata are endless!


The Sensitivity Model

Vertical Data is pleased to be working with Dennis Devlin, Co-Founder, CISO and CPO of SAVANTURE, to develop and train a complete Sensitivity Model to quantify dark text sensitivity using results from scans, rules and machine learning. 

 

Questions?

To learn more, contact us today!