UNLV Magazine

Summer 2004 | Vol. 12, No. 2

FEATURES

Finding Needles in the Haystack
With millions of pages to sift through, how do government classifiers make sure that slivers of private or sensitive information don't make their way into the wrong hands? With the help of software developed by UNLV engineers.
Imagine for a moment the pages upon pages of project reports, technical reports, and other documents produced every day by those working for the federal government. The vast majority of them belong in the public domain. And these days, that means posting them on the Internet.

But nestled within a report about the construction costs of a hazardous materials storage facility are details about the thickness of the steel beams supporting the building. Such information, if available to the public, could be used to carry out a terrorist strike.

"After 9-11, it was clear that certain kinds of information, even if it were currently unclassified, could be potentially useful to a terrorist," says Tom Nartker, a computer science professor and director of UNLV's Information Science Research Institute (ISRI). "Federal agencies needed automated systems that could quickly and efficiently identify this sensitive information within large quantities of documents planned for public dissemination."

Sorting Through the Stack

In 2002, the U.S. Department of Energy turned to ISRI to design software for that task, granting the UNLV group $2.2 million. The institute developed and installed a program on 300 computers just 14 months later. Called the Homeland Security Classifier (HSC), it has since reviewed more than 8 million documents to pinpoint sensitive information.

The software "reads" and sorts the electronic text documents by applying the same rules used by human classifiers. In the first pass, HSC identifies documents with no sensitive information. These can be immediately released and made available on Internet sites and in libraries.

In the documents processed so far, about 70 percent are marked as not needing manual review. "For these documents, the system has proven to be 100 percent precise, meaning that no sensitive information has slipped through," Nartker says.

The remaining 30 percent are tagged as potentially sensitive and forwarded to trained classifiers for manual review. Fewer than 1 percent of all documents are subsequently marked as "sensitive but unclassified," says Julie Borsack, ISRI project manager. These documents are not posted on the Internet.
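
The first pass described above can be pictured as a rule-based triage. What follows is a minimal sketch in Python, not ISRI's implementation: the article does not describe HSC's actual rules, so the patterns, the function names `triage` and `triage_all`, and the sample documents are all hypothetical.

```python
import re

# Hypothetical rules; the real classifier rules are not given in the article.
SENSITIVE_PATTERNS = [
    re.compile(r"\bsteel beams?\b", re.IGNORECASE),
    re.compile(r"\bstorage facility\b", re.IGNORECASE),
]


def triage(text):
    """Return 'release' if no rule matches, otherwise 'manual_review'."""
    for pattern in SENSITIVE_PATTERNS:
        if pattern.search(text):
            return "manual_review"
    return "release"


def triage_all(documents):
    """Split a {name: text} mapping into a release list and a review list."""
    released, review = [], []
    for name, text in documents.items():
        if triage(text) == "release":
            released.append(name)
        else:
            review.append(name)
    return released, review


if __name__ == "__main__":
    docs = {
        "budget.txt": "Quarterly travel budget for the regional office.",
        "site.txt": "The steel beams supporting the building are described here.",
    }
    released, review = triage_all(docs)
    print("released:", released)
    print("manual review:", review)
```

In this sketch, any document that matches no rule is released automatically, which corresponds to the roughly 70 percent of documents that never reach a human classifier. Everything else is queued for manual review.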

"While there's a tremendous need to make sure sensitive information is not readily available worldwide,there's just as much pressure to not be overly cautious," Borsack says. "Pass two of HSC includes three levels of manual review for each document before it is judged to be sensitive but unclassified."

In the second pass, the software highlights just the passages that contain potentially sensitive information so the human classifier can zero in on the appropriate passages. Only about 5 percent of the pages in the entire collection are actually reviewed by a classifier. "You simply couldn't start on page 1 and manually review all of these documents in a timely manner with absolute accuracy," Borsack says. "By automatically identifying the passages that need further evaluation, HSC has saved the government tremendous time and expense."
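
The passage-level highlighting in the second pass might look something like the sketch below. It is an illustration only: the single pattern, the `[[...]]` marker convention, and the function name `flag_passages` are assumptions made here, not details of HSC.

```python
import re

# Hypothetical pattern standing in for whatever rules a real system applies.
SENSITIVE = re.compile(r"\b(steel beams?|thickness)\b", re.IGNORECASE)


def flag_passages(text):
    """Return (paragraph_number, marked_text) for each paragraph with a match.

    Paragraphs are assumed to be separated by blank lines. Matched terms are
    wrapped in [[...]] so a reviewer can jump straight to them.
    """
    flagged = []
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        if SENSITIVE.search(paragraph):
            marked = SENSITIVE.sub(lambda m: "[[" + m.group(0) + "]]", paragraph)
            flagged.append((number, marked))
    return flagged


if __name__ == "__main__":
    report = (
        "Costs rose by ten percent.\n\n"
        "The thickness of the steel beams is listed.\n\n"
        "The schedule is unchanged."
    )
    for number, marked in flag_passages(report):
        print(number, marked)
```

Only the flagged paragraphs are returned, so a reviewer sees a small fraction of each document instead of reading it from page 1.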

The institute was able to deliver the Homeland Security Classifier quickly because it is based on software tools it had designed for other uses. "These have not been small projects," says computer science professor Kazem Taghva, associate director of the institute. "It has taken many years and several million dollars to get ISRI where it is today."

Launched in 1989, ISRI is well-established in the field of information-access technology. It has won more than $10 million in federal research and software development grants and employs 11 full-time programmers, document analysts, and support personnel. "We go straight from the research drawing board through to product development and delivery — that's a different paradigm for research groups on university campuses," Taghva says.

Kazem Taghva, Tom Nartker, and Julie Borsack of the UNLV Information Science Research Institute

Long-term Benefits for UNLV and Its Students

The finished products the institute delivers to its government clients also have commercial applications. The Homeland Security Classifier, for example, can be modified for a variety of classification tasks. It could be licensed to corporations that need to ensure they comply with the Privacy Act or with the Health Insurance Portability and Accountability Act.

"The benefit of the Internet — that it gives everybody access to information — can also be its drawback," Borsack says. "I think people are concerned about too much access to their personal information. Both individuals and businesses want some protection."

Nartker added, "Managing electronic information these days is no small task. At the most basic level this might mean sorting our e-mail to eliminate spam and boost productivity. At a higher level, it means ensuring that some information, such as Social Security numbers, remain private."

Among the products that ISRI is now patenting or licensing for commercial use is MANICURE. The Energy Department currently uses this program to convert printed text into electronic documents. Setting MANICURE apart from similar programs on the market are its quality-control features, which identify the accuracy of the conversion.

Another project that has both homeland security and commercial applications is ISRI's development of the Multilingual Search Technology (MLST). It indexes and retrieves documents in other languages. Taghva and a group of graduate students are applying the technology to search documents written in Farsi, which is closely related to the Dari spoken in Afghanistan, and Arabic, which is used in Iraq.

While software currently on the market does a good job of translating documents, searching for relevant information in other languages can be difficult. "The difficulty is not so much in translating documents from another language, but in finding the documents that relate to your search," Taghva says. "What good is a translation program if you don't know which documents to translate?"

MLST incorporates a browser and Web application so the user can type a search in English and find the documents in Arabic or Farsi. The documents can then be displayed in English without the font issues that plague other Middle Eastern language systems.

Should these products be commercially licensed, the university will reap a portion of the proceeds. "Our current contracts not only fund ISRI research, they support overhead for the university's research administration, the Engineering College, and the computer science department," Taghva says.

But the benefits go beyond the financial for the university. ISRI research has a direct impact on student learning. The institute employs more than 70 graduate and undergraduate students, whose resumes are enhanced by the hands-on programming and classification experience. "Employers look to our graduates because of the expertise they gained from our program," Taghva says. "Their experience at ISRI is not something they can get in a classroom."