Oak Ridge National Laboratory's Computational Data Analytics Group has worked
for over 12 years on text analytics systems that quickly discover meaningful
information in raw data. These capabilities focus on six key areas,
emphasizing high performance over very large sets of raw documents.
Collecting and Extracting:
Collecting millions of documents from databases, the Internet, social media, and
hard drives; extracting text from hundreds of file formats; and translating
this information into multiple languages.
Storing and Indexing:
Storing and indexing millions of documents in search servers, distributed file
systems (MapReduce), relational databases, and file systems.
Filtering the full content of millions of documents to recommend the most
valuable and relevant information based on a user’s own information, or user
selections, or a user’s interactions with information.
items based on the full content of documents using supervised and
semi-supervised machine learning methods and targeted search lists.
a hierarchical group of documents based on similarity using unsupervised
learning methods on the full content of each document.
hierarchies, groups, and relationships among documents that helps the user
quickly understand their value, and to see new connections.
This work has resulted in four issued patents (7,072,883; 7,315,858;
7,693,903; 7,805,446) and four pending patents, several commercial licenses
(including Pro2Serve and TextOre), a spin-off company (Global Security
Information Analysts LLC, GSIA), an R&D 100 Award, and scores of
peer-reviewed research publications.
Case study of Piranha's Text Mining Capabilities
In large cases, millions of files must be manually processed to discover
potential crimes and threats. To solve this problem, a typical customer reviews
the following options:
Option 1: Use a search engine or document management technology to build a
case. Drawback: Each keyword of interest returns thousands of hits, all of
which must be manually processed.
Option 2: Use visual analysis tools such as Palantir or Analyst Notebook.
Drawback: The documents must be manually processed/tagged before the tool can
be used which significantly limits the number of documents that can be
Option 3: Use Piranha to sift through and analyze the documents. Piranha
works on hundreds of raw data formats, and can process data extremely fast, on
For a recent customer, millions of files were loaded overnight into a
desktop version of Piranha. The next day, using the customer's 1200-keyword
list, Piranha’s initial filter recommended one thousand documents. The filter
favors documents that contain sets of infrequently occurring keywords, which
are often the most valuable to the customer.
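
As a rough illustration of this kind of first-pass filter, the sketch below scores each document by the keywords it contains and weights infrequently occurring keywords more heavily, using an inverse-document-frequency weight. The weighting scheme, the function names, and the sample data are assumptions made for illustration; the text above does not describe how Piranha actually scores documents.

```python
import math
from collections import Counter


def rare_keyword_scores(documents, keywords):
    """Score each document by the rarity-weighted keywords it contains.

    Hypothetical sketch: a keyword found in few documents gets a higher
    weight, log(N / document_frequency), than one found in many.
    Matching is a simple case-insensitive substring test.
    """
    keywords = [k.lower() for k in keywords]
    texts = [doc.lower() for doc in documents]

    # Document frequency: number of documents containing each keyword.
    doc_freq = Counter()
    for text in texts:
        for kw in keywords:
            if kw in text:
                doc_freq[kw] += 1

    n_docs = len(texts)
    scores = []
    for text in texts:
        score = sum(
            math.log(n_docs / doc_freq[kw]) for kw in keywords if kw in text
        )
        scores.append(score)
    return scores


def recommend(documents, keywords, top_n):
    """Return the indices of the top_n highest-scoring documents."""
    scores = rare_keyword_scores(documents, keywords)
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]


# Made-up sample data, for illustration only.
docs = [
    "Invoice from Doe and Sons Manufacturing",
    "Meeting notes: shell company transfer, Doe and Sons",
    "Lunch menu for Friday",
    "Doe and Sons quarterly newsletter",
]
top = recommend(docs, ["Doe and Sons", "shell company"], top_n=2)
```

In this toy data the second document ranks first because it is the only one that mentions the rarer keyword; a keyword that appears in every document contributes a weight of zero.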
Next, the 1200 keywords were grouped into 86 topics. For example, the keywords
John Doe, President of Doe and Sons Manufacturing of Springfield, Iowa;
Jane Doe, Vice President of Doe and Sons Manufacturing; and
John Doe, Jr., Chief Technology Officer, Doe and Sons Manufacturing
would be contained in the topic John Doe. Piranha’s second filter used these
topics to find the closest matches to individual topics, further reducing the
number of documents to 50. These two filtering steps took about 4 hours.
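
One way to picture the topic-based second filter is sketched below: keyword variants are grouped under a topic name, and each document is ranked by how completely it matches any single topic. The topic table, the matching rule (fraction of a topic's variants found), and all names are hypothetical; the actual matching method is not described here.

```python
from collections import defaultdict

# Hypothetical topic table: each topic name maps to its keyword variants.
TOPICS = {
    "John Doe": ["John Doe", "Jane Doe", "Doe and Sons Manufacturing"],
    "Shell Company": ["shell company", "holding company"],
}


def topic_hits(text, topics):
    """Count how many keyword variants of each topic occur in the text."""
    lowered = text.lower()
    hits = defaultdict(int)
    for topic, variants in topics.items():
        for variant in variants:
            if variant.lower() in lowered:
                hits[topic] += 1
    return dict(hits)


def best_topic_fraction(text, topics):
    """Largest fraction of any one topic's variants found in the text."""
    hits = topic_hits(text, topics)
    return max(
        (count / len(topics[topic]) for topic, count in hits.items()),
        default=0.0,
    )


def second_filter(documents, topics, keep):
    """Keep the `keep` documents that match some single topic most closely."""
    ranked = sorted(
        documents, key=lambda d: best_topic_fraction(d, topics), reverse=True
    )
    return ranked[:keep]
```

Ranking by the best single-topic match, rather than by total keyword count, favors documents that are strongly about one topic over documents that mention many topics in passing.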
Piranha was then used to cluster these 50 documents by converting the
documents into vectors and comparing the vectors to produce a hierarchy of
similar documents. This hierarchy and document set were presented to the
customer the following day.
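
The clustering step can be sketched in a few lines under simplifying assumptions: plain term-frequency vectors, cosine similarity, and average-link agglomerative merging. These specific choices and the function names are illustrative assumptions, not Piranha's actual method, which the description above leaves open.

```python
import math
from collections import Counter
from itertools import combinations


def to_vector(text):
    """Convert a document into a term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(documents):
    """Build a hierarchy of similar documents as nested tuples of indices."""
    if not documents:
        return None
    vectors = [to_vector(d) for d in documents]
    # Each cluster is a pair: (tree so far, list of member indices).
    clusters = [(i, [i]) for i in range(len(documents))]

    def similarity(c1, c2):
        # Average-link: mean pairwise similarity between the members.
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two most similar clusters into one subtree.
        a, b = max(combinations(clusters, 2), key=lambda p: similarity(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Made-up sample data, for illustration only.
sample_docs = [
    "wire transfer to shell company",
    "shell company wire transfer records",
    "lunch menu friday",
    "friday lunch schedule",
]
hierarchy = cluster(sample_docs)
```

For the sample data, the two financial documents merge first, then the two lunch documents, and the two groups join at the root. This pairwise approach is quadratic or worse in the number of documents, so it suits a small filtered set such as the 50 documents above rather than the full collection.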
Piranha Finds Actionable Intelligence
The case agent was amazed by the results. In a day's time, Piranha was able to
discover the main points of the case, and then Piranha was used by the agents
over the next three days to discover several previously unknown actionable
An active shell company
The target’s organizational details
Piranha was able to quickly and effectively find a valuable set of documents
that provided a rich set of productive leads for further investigation. Piranha
is being used on additional cases for other agencies.
for gathering and summarizing internet information
for gathering and summarizing internet information (2008)
for gathering and summarizing internet information (2010)
method for distributed clustering of textual information
reduction of dimensions of a document vector in a document search and retrieval
and system for determining precursors of health abnormalities from processing
And System To Discover And Recommend Interesting Documents
And System Of Filtering And Recommending Documents
Computing Method For Dynamically Scaling A Process Across Physical Machine
R. M. Patton, B. G. Beckerman, T. E. Potok, G.
Tourassi, "A Recommender System for Web-Based Discovery and Refinement of
Information Radiologists Seek", Radiological Society of North America
(RSNA), 2012 Annual Meeting, Nov. 2012, Chicago, IL, USA.
R. M. Patton, T. E. Potok, B. A. Worley,
"Discovery & Refinement of Scientific Information via a Recommender
System", The Second International Conference on Advanced Communications
and Computation, Oct. 2012, Venice, Italy.
Steed, Chad A. (ORNL), Symons, Christopher T.
(ORNL), DeNap, Frank (ORNL), Potok, Thomas E. (ORNL), “Guided Text Analysis
Using Adaptive Visual Analytics,” Paper in Conf. Proceedings (book, CD),
Visualization and Data Analysis 2012, Burlingame, California, January 23-25,
Patton, Robert M. (ORNL), McNair, Wade
(ORNL), Symons, Christopher T. (ORNL), Treadwell, Jim N. (ORNL), Potok, Thomas
E. (ORNL), “A Text Analysis Approach to Motivate Knowledge Sharing via
Microsoft SharePoint,” Paper in Conf. Proceedings (book, CD), 45th Hawaii
International Conference on System Sciences, Wailea, Hawaii, January 4, 2012.
Patton, Robert M. (ORNL), Rojas, Carlos C.
(ORNL), Beckerman, Barbara G. (ORNL), Potok, Thomas E. (ORNL), “A Computational
Framework for Search, Discovery, and Trending of Patient Health in Radiology
Reports,” Paper in Conf. Proceedings (book, CD), 1st IEEE Conference on
Healthcare Informatics, Imaging, and Systems Biology, San Jose, California,
Patton, Robert M. (ORNL), Beckerman, Barbara
G. (ORNL), Potok, Thomas E. (ORNL), Analysis and Classification of Mammography
Reports Using Maximum Variation Sampling, Stephen L. Smith and Stefano Cagnoni
(Eds.), Genetic and Evolutionary Computation: Medical Applications, pp.
113-131, Wiley Publishing, West Sussex, United Kingdom, January 2011.
Cui, Xiaohui (ORNL), Mueller, Frank (North
Carolina State University), Zhang, Yongpeng (ORNL), Potok, Thomas E. (ORNL),
“Data-Intensive Document Clustering on GPU Clusters,” Journal of Parallel and
Distributed Computing, December 2010.
Patton, Robert M. (ORNL), Beckerman, Barbara
G. (ORNL), Potok, Thomas E. (ORNL), Treadwell, Jim N. (ORNL), Genetic Algorithm
for Analysis of Abdominal Aortic Aneurysms in Radiology Reports, Paper in Conf.
Proceedings (book, CD), 2010 Genetic and Evolutionary Computation Conference,
Portland, Oregon, July 2010.
Cui, Xiaohui (ORNL), Potok, Thomas E. (ORNL),
Cavanagh, Joseph M. (ORNL), Parallel Latent Semantic Analysis using a Graphics
Processing Unit, Paper in Conf. Proceedings (book, CD), 2009 Genetic and
Evolutionary Computation Conference, July 2009.
Patton, Robert M. (ORNL), Potok, Thomas
E. (ORNL), Beckerman, Barbara G. (ORNL), Treadwell, Jim N. (ORNL), A Genetic Algorithm
for Learning Significant Phrase Patterns in Radiology Reports, Paper in Conf.
Proceedings (book, CD), Genetic and Evolutionary Computation Conference 2009,
Montreal, CAN, July 2009.
X. Cui, J. M. Beaver, J. St. Charles, T. E.
Potok, Dimensionality Reduction for High Dimensional Particle Swarm Clustering,
Proceedings of the IEEE Swarm Intelligence Symposium, September, 2008, St.
Patton, Robert M. (ORNL), Potok, Thomas
E. (ORNL), Identifying Event Impacts by Monitoring the News Media, Paper in Conf.
Proceedings (book, CD), 12th International Conference on Information
Visualization, London, UK, July 2008.
Patton, R.M., Cui, X., Jiao, Y., and Potok,
T.E. (2008). Evolutionary computing. Intelligent Data Analysis: Developing New
Methodologies through Pattern Discovery and Recovery, Idea Group Inc., Hershey,
X. Cui and T. E. Potok, A Particle Swarm
Social Model for Multi-Agent Based Insurgency Warfare Simulation, Proceedings
of the IEEE Eighth International Conference on Software Engineering, Artificial
Intelligence, Networking and Parallel/Distributed Computing, August, 2007,
J. W. Reed, T. E. Potok, and R. M. Patton,
"A multi-agent system for distributed cluster analysis," in
Proceedings of Third International Workshop on Software Engineering for
Large-Scale Multi-Agent Systems (SELMAS'04), W16L Workshop - 26th
International Conference on Software Engineering Edinburgh, Scotland, UK: IEE,
2004, pp. 152-5.
J. Reed, Y. Jiao, T. E. Potok, B. Klump, M.
Elmore, and A. R. Hurson, "TF-ICF: A New Term Weighting Scheme for
Clustering Dynamic Data Streams," in Proceedings of 5th International
Conference on Machine Learning and Applications (ICMLA'06). vol. 0 ORLANDO, FL,
2006, pp. 258-263.
P. Yan, Y. Jiao, A. R. Hurson, and T. E.
Potok, "Semantic-based information retrieval of biomedical data," in
Proceedings of the 2006 ACM symposium on Applied computing Dijon, France: ACM
T. E. Potok, M. T. Elmore, J. W. Reed, and N.
F. Samatova, "An ontology-based HTML to XML conversion using intelligent
agents," in Proceedings of the 35th Annual Hawaii International Conference
on System Sciences Big Island, HI, USA: IEEE Comput. Soc, 2002, pp. 1220-9.
R. M. Patton and T. E. Potok,
"Characterizing large text corpora using a maximum variation sampling
genetic algorithm," in Proceedings of the 8th annual conference on Genetic
and evolutionary computation Seattle, Washington, USA: ACM Press, 2006.
P. Palathingal, T. E. Potok, and R. M.
Patton, "Agent based approach for searching, mining and managing enormous
amounts of spatial image data," in Proceedings of the Eighteenth
International Florida Artificial Intelligence Research Society Conference, FLAIRS
2005 - Recent Advances in Artificial
Intelligence Clearwater Beach, FL, United States: American Association for
Artificial Intelligence, Menlo Park, CA 94025-3496, United States, 2005, pp.
M. T. Elmore, T. E. Potok, and F. T. Sheldon,
"Dynamic data fusion using an ontology-based software agent system,"
in Proceedings of 7th World Multiconference on Systemics, Cybernetics and
Informatics (SCI 2003) vol. 9 Orlando, FL, USA: IIIS, 2003, pp. 5-