Mar 20, 2007 4:05 PM

Homeland Data Tool Needs Privacy Help, Report Says

A DHS data-mining tool under development needs more privacy attention, according to a report by government investigators released Tuesday, but the report makes no mention of violations of federal privacy law, as a Washington Post story reported it would. While testing of the system resembles testing of the doomed Total Information Awareness system, the report […]

The system, known as Analysis, Dissemination, Visualization, Insight and Semantic Enhancement or ADVISE, is being tested by four government groups to see if the tool will help make sense of massive amounts of data by finding hidden connections and patterns in data sets. One Homeland Security official wants the tool to handle one billion pieces of structured information (think phone call or bioterror sensor information) and one million unstructured pieces of information (think blog post or email message) per hour.
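To put those hourly targets in perspective, here is a back-of-the-envelope conversion to per-second rates. It is plain arithmetic on the two figures quoted above; the variable and function names are made up for illustration and say nothing about how ADVISE is actually built.

```python
# Back-of-the-envelope rates implied by the per-hour targets quoted above.
# All names here are illustrative; none come from DHS or the GAO report.
SECONDS_PER_HOUR = 60 * 60

structured_per_hour = 1_000_000_000   # "one billion pieces of structured information"
unstructured_per_hour = 1_000_000     # "one million unstructured pieces of information"


def per_second(items_per_hour: int) -> float:
    """Convert an hourly volume into an average per-second rate."""
    return items_per_hour / SECONDS_PER_HOUR


structured_rate = per_second(structured_per_hour)
unstructured_rate = per_second(unstructured_per_hour)

print(f"structured:   {structured_rate:,.0f} items/second")
print(f"unstructured: {unstructured_rate:,.0f} items/second")
```

That works out to roughly 278,000 structured records and about 278 documents every second, sustained around the clock.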

Congress's investigative arm, the Government Accountability Office, which audited three of those tests at the behest of the House Appropriations Committee, found that the tool wasn't in use yet, but that privacy oversight should start now so that appropriate safeguards could be built into the tool during development, rather than being tacked on at the end. Report (.pdf).

Currently the system lacks tools to distinguish one Joe Smith from another and doesn't rate the strength of inferences the system makes between individuals, the report found. Combined with poor data quality, those limits mean innocent citizens could be incorrectly identified as part of a terrorist cell.

Regarding data quality risks, the ADVISE tool currently does not have the capability to distinguish among individuals with similar identifying information, nor does it have a mechanism to assess the accuracy of the relationships it uncovers. To address the risk of misidentification, software could be added to the tool to distinguish among individuals that have similar names, a process known as disambiguation. Disambiguation tools have been developed for other applications.
Additionally, although the ADVISE tool includes a feature that allows analysts to designate confidence levels for individual pieces of data, no mechanism has been developed to assess the confidence of relationships identified by the tool. While software specifically to determine data quality would be difficult to develop, other controls exist that could be readily used as part of a strategy for mitigating this risk. For example, anonymization could be used to minimize the exposure of personal data, and operational procedures could be developed to restrict the use of analytical results containing personal information that could have data quality concerns. To implement anonymization, the tool would need the software capability to handle anonymized data or have a built-in data anonymizer. DHS currently does not have plans to build anonymization into the ADVISE tool.
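The two gaps the report describes, telling one Joe Smith from another and rating the strength of a link, can be illustrated with a toy sketch. This is not how ADVISE works (the report says it lacks both features); the `Record` type, the `birth_year` field, and the "weakest record wins" rule are all assumptions picked to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record; the fields are assumptions, not ADVISE's schema."""
    name: str
    birth_year: int      # assumed second identifier used to tell people apart
    confidence: float    # analyst-assigned confidence in this record, 0.0 to 1.0


def same_person(a: Record, b: Record) -> bool:
    """Naive disambiguation: a matching name alone is not enough."""
    return a.name.lower() == b.name.lower() and a.birth_year == b.birth_year


def link_confidence(a: Record, b: Record) -> float:
    """One simple rule: a link is no more reliable than its weakest record."""
    return min(a.confidence, b.confidence)


traveler = Record("Joe Smith", 1970, confidence=0.9)
suspect = Record("Joe Smith", 1955, confidence=0.4)

print(same_person(traveler, suspect))      # the names match, the people do not
print(link_confidence(traveler, suspect))  # the link inherits the weaker score
```

Real disambiguation tools weigh many more attributes and tolerate misspellings, but even this crude version shows why a name match by itself should not tie someone to a terrorist cell.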

For its part, DHS says it will publish the required privacy evaluations when or if the program is used on real data.

According to DHS spokesman Christopher Kelly, ADVISE is currently being tested by:

The Biodefense Knowledge Center at Lawrence Livermore labs, which is using "8 open source databases" to analyze biological threats,
Another Lawrence Livermore group that was examining ways to detect possible attempts to develop and deploy weapons of mass effect, a project researchers "deemed worthy of further study." The test "focuses on the capabilities of foreign and domestic terrorist groups to develop and deploy weapons of mass effect by solely evaluating technologies (materials, technical skills needed, etc.) There is no information about U.S. persons in any of the data,"
The DHS Intelligence Analysis Office is using ADVISE to "sift through and pinpoint key documents analysts can use in day-to-day reporting," and
The Interagency Center for Applied Homeland Security Technology (a little known lab for testing data-mining projects) is running two tests: one that tests how well the system handles large amounts of data and another that tests whether the system can pinpoint bad guys from a fake world. This test uses "synthetic data fabricated for the purposes of the evaluation... [that] data contains no information about any real people. This dataset simulates a world in which agents (both terrorists and benign) conduct activities."

The latter tests may well be the ones that the Washington Post heard violated the Privacy Act, since they may be using legally collected data and then removing names (say, for instance, travel data collected by Customs and Border Protection and the US-VISIT program).

They are also the tests that sound the most like Total Information Awareness, a largely aborted Pentagon effort that strove to build a tool that would find evil-doers by sifting through nearly every conceivable database of Americans' lives. That tool too was tested using a fake world of data, though the research arm that led the effort has stonewalled any attempt to learn more about this.

Homeland Security's Science and Technology office insists that ADVISE is just a smarter sifter, not an information system.

"It's a set of information tools to aid human analysts of large amounts of already collected data," Kelly said.

The tool works by pulling in massive amounts of data from disparate databases. It then cleans the data, tries to understand unstructured data such as news articles, and looks for connections.

[Image: ADVISE system design diagram]

Analysts can then try to order the system to produce results for complex questions:

For example, the query might be "Identify any suspicious group of individuals that passed through customs at JFK in January 2004." The answer might include things like, "Fifteen men between the ages of 24 and 44, all employees of the same chemical processing plant and all traveling without their families, flew into JFK during the time period in question."

But remember that such inferences are hard, and as one government-funded paper about this program pointed out, the reliability of these tools is currently very low:

Prof. Andrew McCallum of the University of Massachusetts cited the performance of the state-of-the-art tools to be as follows.
- Named entity recognition (i.e., identifying a person, location, or organization) had a recall-precision rate of between 80-95%.
- Binary relationship extraction (i.e., determining Location 1 is contained in Location 2 or Person 1 is a member of Organization 1) had a recall-precision rate of between 60-80%, depending on the type of relationship.

Paper (.pdf)
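To see why those numbers matter, consider a rough sketch of how errors stack up when a single "Person 1 is a member of Organization 1" link depends on getting both names right and the relationship right. Two loud assumptions: the quoted "recall-precision" ranges are treated loosely as per-step success rates, and the steps are treated as independent. Neither assumption comes from the paper, so read the result as an illustration rather than a measurement.

```python
# Rough illustration of compounding error, under two assumptions that are NOT
# from the paper: (1) the quoted ranges are per-step success rates, and
# (2) the steps succeed or fail independently of one another.

ENTITY_RANGE = (0.80, 0.95)     # named entity recognition, as quoted above
RELATION_RANGE = (0.60, 0.80)   # binary relationship extraction, as quoted above


def link_success(entity_rate: float, relation_rate: float) -> float:
    """Chance that both entities and the relation between them are all correct."""
    return entity_rate * entity_rate * relation_rate


worst_case = link_success(ENTITY_RANGE[0], RELATION_RANGE[0])
best_case = link_success(ENTITY_RANGE[1], RELATION_RANGE[1])

print(f"worst case: {worst_case:.0%}")
print(f"best case:  {best_case:.0%}")
```

Under those assumptions a single extracted link is right somewhere between roughly 38 and 72 percent of the time, and a chain of several such links would fare worse.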

The House Appropriations Homeland Security Subcommittee is holding a hearing (that undoubtedly will include discussion of ADVISE) called "Balancing Privacy with Security Needs" Wednesday morning at 10:00 AM EST.

No word on whether the WaPo will be issuing a retraction.

Earlier post.

UPDATE: This post was updated shortly after posting to include a little more information about each of the 4 tests, an additional link, and a sentence on how much data the program was expected to handle.