Piranha - Big Data Analytics
Oak Ridge National Laboratory's Computational Data Analytics Group has worked for over 12 years to create text analytics systems that quickly discover meaningful information in raw data. These capabilities focus on six key areas, emphasizing high performance over very large sets of raw documents.
Collecting and Extracting: Collecting millions of documents from databases, the Internet, social media, and hard drives; extracting text from hundreds of file formats; and translating this information into multiple languages.
Storing and Indexing: Storing and indexing millions of documents in search servers, distributed file systems (MapReduce), relational databases, and file systems.
Recommending: Filtering the full content of millions of documents to recommend the most valuable and relevant information based on a user’s own information, selections, or interactions with information.
Categorizing: Grouping items based on the full content of documents using supervised and semi-supervised machine learning methods and targeted search lists.
Clustering: Creating a hierarchical group of documents based on similarity using unsupervised learning methods on the full content of each document.
Visualizing: Showing hierarchies, groups, and relationships among documents to help the user quickly understand their value and see new connections.
This work has resulted in eight issued patents (7,072,883; 7,315,858; 7,693,903; 7,805,446; 7,937,389; 8,473,314; 8,825,710; 9,256,649) and one pending patent, several commercial licenses (including Pro2Serve and TextOre), a spin-off company (Global Security Information Analysts LLC, GSIA), an R&D 100 Award, and scores of peer-reviewed research publications.
Case study of Piranha's Text Mining Capabilities
In large cases, millions of files must be manually processed to discover potential crimes and threats. To solve this problem, a typical customer reviews several options:
Option 1: Use a search engine or document management technology to build a case. Drawback: each keyword of interest returns thousands of hits that must be manually processed.
Option 2: Use visual analysis tools such as Palantir or Analyst's Notebook. Drawback: the documents must be manually processed and tagged before the tool can be used, which significantly limits the number of documents that can be processed.
Option 3: Use Piranha to sift through and analyze the documents. Piranha works on hundreds of raw data formats and can process data extremely fast on typical computers.
For a recent customer, millions of files were loaded overnight into a desktop version of Piranha. The next day, using the customer's list of 1200 keywords, Piranha’s initial filter recommended one thousand documents. Piranha returned documents that contain sets of infrequently occurring keywords, which are often valuable to the customer.
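The idea of favoring documents that contain infrequently occurring keywords can be illustrated with a small sketch. This is not Piranha's actual filter, which the text does not describe in detail; it assumes plain substring matching and an inverse-document-frequency weight as a stand-in for keyword rarity. The function names rare_keyword_scores and recommend are hypothetical.

```python
import math
from collections import Counter


def rare_keyword_scores(documents, keywords):
    """Score each document by the summed rarity weight of the keywords it contains.

    Assumption: rarity is modeled with an inverse-document-frequency weight,
    and a keyword "occurs" if it appears as a case-insensitive substring.
    """
    n_docs = len(documents)
    lowered = [doc.lower() for doc in documents]

    # Document frequency: in how many documents does each keyword appear?
    doc_freq = Counter()
    for text in lowered:
        for kw in keywords:
            if kw.lower() in text:
                doc_freq[kw] += 1

    # Keywords found in few documents get a larger weight than common ones.
    weight = {kw: math.log(n_docs / doc_freq[kw]) for kw in keywords if doc_freq[kw]}

    return [
        sum(w for kw, w in weight.items() if kw.lower() in text)
        for text in lowered
    ]


def recommend(documents, keywords, top_n):
    """Return the indices of the top_n highest-scoring documents."""
    scores = rare_keyword_scores(documents, keywords)
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]
```

Calling recommend(documents, keywords, 1000) on a list of document strings would mirror the first filtering step in shape, though a production system would need tokenization and indexing to handle millions of files.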
Next, the 1200 keywords were grouped into 86 topics. For example, the keywords:
John Doe, President of Doe and Sons Manufacturing of Springfield, Iowa; Jane Doe, Vice President of Doe and Sons Manufacturing; John Doe, Jr., Chief Technology Officer of Doe and Sons Manufacturing
would be contained in the topic John Doe. Piranha’s second filter used these topics to find the closest matches to individual topics, further reducing the number of documents to 50. These two filtering steps took about 4 hours.
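A topic-based second filter can be sketched as follows. The scoring rule here (the fraction of a topic's keywords found in a document) is an assumption chosen for illustration, not the measure Piranha uses, and the names topic_match, best_topic, and second_filter are hypothetical.

```python
def topic_match(text, topic_keywords):
    """Fraction of a topic's keywords that appear in the text (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(1 for kw in topic_keywords if kw.lower() in lowered)
    return hits / len(topic_keywords)


def best_topic(text, topics):
    """Return (topic_name, score) for the topic that matches the text most closely."""
    return max(
        ((name, topic_match(text, kws)) for name, kws in topics.items()),
        key=lambda pair: pair[1],
    )


def second_filter(documents, topics, keep):
    """Keep the indices of the `keep` documents with the strongest single-topic match."""
    scored = [(best_topic(doc, topics)[1], i) for i, doc in enumerate(documents)]
    # Highest score first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:keep]]
```

With topics given as a dictionary such as {"John Doe": ["John Doe", "Jane Doe", "Doe and Sons Manufacturing"]}, second_filter(documents, topics, 50) would reduce a candidate set to the 50 closest matches under this assumed scoring rule.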
Piranha was then used to cluster these 50 documents by converting the documents into vectors and comparing the vectors to produce a hierarchy of similar documents. This hierarchy and document set were presented to the customer the following day.
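The clustering step described above (documents to vectors, vectors compared, hierarchy produced) can be sketched in a few lines. The specific choices below are assumptions made for illustration, since the text does not say which Piranha uses: raw term-frequency vectors, cosine similarity, and agglomerative clustering with average linkage. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a document into a sparse term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(documents):
    """Agglomerative clustering with average linkage.

    Returns a nested tuple of document indices, or None for an empty input.
    """
    if not documents:
        return None
    vectors = [vectorize(d) for d in documents]
    # Each cluster is (tree, member_indices); start with one cluster per document.
    clusters = [(i, [i]) for i in range(len(documents))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[x][1] for j in clusters[y][1]]
                sim = sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        _, x, y = best
        merged = ((clusters[x][0], clusters[y][0]), clusters[x][1] + clusters[y][1])
        # Delete y first (y > x) so that index x is still valid.
        del clusters[y]
        del clusters[x]
        clusters.append(merged)
    return clusters[0][0]
```

This brute-force version recomputes pairwise similarities on every merge, which is acceptable for a set of about 50 documents but would not scale to the millions of files handled in the earlier filtering steps.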
Piranha Finds Actionable Intelligence
The case agent was amazed by the results. In a day's time, Piranha was able to discover the main points of the case. The agents then used Piranha over the next three days to discover several previously unknown pieces of actionable intelligence, including:
- New suspects
- An active shell company
- The target’s organizational details
Piranha was able to quickly and effectively find a valuable set of documents that provided a rich set of productive leads for further investigation. Piranha is being used on additional cases for other agencies.