A DHS data-mining tool under development needs more privacy attention, according to a report by government investigators released Tuesday, but the report makes no mention of violations of federal privacy law, as a Washington Post story reported it would. While testing of the system resembles testing of the doomed Total Information Awareness system, the report found the tool was not yet operational.
The system, known as Analysis, Dissemination, Visualization, Insight and Semantic Enhancement or ADVISE, is being tested by four government groups to see if the tool will help make sense of massive amounts of data by finding hidden connections and patterns in data sets. One Homeland Security official wants the tool to handle one billion pieces of structured information (think phone call or bioterror sensor information) and one million unstructured pieces of information (think blog post or email message) per hour.
Congress's investigative arm, the Government Accountability Office, which audited three of those tests at the behest of the House Appropriations Committee, found that the tool wasn't in use yet, but that privacy oversight should start now so that appropriate safeguards could be built into the tool during development, rather than being tacked on at the end. Report (.pdf).
Currently the system lacks tools to distinguish one Joe Smith from another and doesn't rate the strength of inferences the system makes between individuals, the report found. Combined with poor data quality, those limits mean innocent citizens could be incorrectly identified as part of a terrorist cell.
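To make the two missing pieces concrete, here is a minimal sketch of what rating the strength of a match between two records could look like. Everything in it is hypothetical: the `Record` fields, the `match_strength` function and the weights are illustrative assumptions, not anything taken from ADVISE or the GAO report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """A hypothetical person record; the fields are assumptions for illustration."""
    name: str
    birth_year: Optional[int]
    city: Optional[str]


def match_strength(a: Record, b: Record) -> float:
    """Return a rough 0..1 score that two records describe the same person."""
    if a.name.lower() != b.name.lower():
        return 0.0
    score = 0.2  # a shared name alone is weak evidence (arbitrary weight)
    if a.birth_year is not None and a.birth_year == b.birth_year:
        score += 0.4
    if a.city is not None and a.city == b.city:
        score += 0.4
    return round(score, 2)


full = Record("Joe Smith", 1970, "Newark")
same = Record("joe smith", 1970, "Newark")
sparse = Record("Joe Smith", None, None)

strong = match_strength(full, same)    # name, birth year and city all agree
weak = match_strength(full, sparse)    # only the name agrees
```

A system that treats the `weak` case the same as the `strong` one is the situation the report warns about: two different Joe Smiths merged on the name alone.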
For its part, DHS says it will publish the required privacy evaluations when or if the program is used on real data.
According to DHS spokesman Christopher Kelly, ADVISE is currently being tested by:
The Biodefense Knowledge Center at Lawrence Livermore labs, which is using "8 open source databases" to analyze biological threats,
Another Lawrence Livermore group that was examining ways to detect possible attempts to develop and deploy weapons of mass effect, a project researchers "deemed worthy of further study." The test "focuses on the capabilities of foreign and domestic terrorist groups to develop and deploy weapons of mass effect by solely evaluating technologies (materials, technical skills needed, etc.). There is no information about U.S. persons in any of the data,"
The DHS Intelligence Analysis Office is using ADVISE to "sift through and pinpoint key documents analysts can use in day-to-day reporting," and
The Interagency Center for Applied Homeland Security Technology (a little known lab for testing data-mining projects) is running two tests: one that tests how well the system handles large amounts of data and another that tests whether the system can pinpoint bad guys in a fake world. This test uses "synthetic data fabricated for the purposes of the evaluation... [that] data contains no information about any real people. This dataset simulates a world in which agents (both terrorists and benign) conduct activities."
The latter tests may well be the ones that the Washington Post heard violated the Privacy Act, since they may be using legally collected data and then removing names (say, for instance, travel data collected by Customs and Border Protection and the US-VISIT program).
They are also the tests that sound the most like Total Information Awareness, a largely aborted Pentagon effort that strove to build a tool that would find evil-doers by sifting through nearly every conceivable database of Americans' lives. That tool too was tested using a fake world of data, though the research arm that led the effort has stonewalled any attempt to learn more about this.
Homeland Security's Science and Technology office insists that ADVISE is just a smarter sifter, not an information system.
"It's a set of information tools to aid human analysts of large amounts of already collected data," Kelly said.
The tool works by pulling in massive amounts of data from disparate databases. It then cleans the data, tries to understand unstructured data such as news articles, and looks for connections.
Analysts can then ask the system to answer complex questions:
For example, the query might be "Identify any suspicious group of individuals that passed through customs at JFK in January 2004." The answer might be something like, "Fifteen men between the ages of 24 and 44, all employees of the same chemical processing plant and all traveling without their families, flew into JFK during the time period in question."
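A query like that boils down to filtering records and grouping them on a shared attribute. The sketch below shows the idea on made-up data; the `arrivals` records, the field names and the `groups_sharing_employer` function are all hypothetical and say nothing about how ADVISE itself is built.

```python
from collections import defaultdict
from datetime import date

# Fabricated arrival records; traveler IDs and employers are placeholders.
arrivals = [
    {"traveler": "T1", "airport": "JFK", "date": date(2004, 1, 5), "employer": "Plant A"},
    {"traveler": "T2", "airport": "JFK", "date": date(2004, 1, 5), "employer": "Plant A"},
    {"traveler": "T3", "airport": "JFK", "date": date(2004, 1, 19), "employer": "Plant A"},
    {"traveler": "T4", "airport": "JFK", "date": date(2004, 2, 2), "employer": "Plant A"},
    {"traveler": "T5", "airport": "JFK", "date": date(2004, 1, 7), "employer": "Plant B"},
    {"traveler": "T6", "airport": "LAX", "date": date(2004, 1, 9), "employer": "Plant A"},
]


def groups_sharing_employer(records, airport, start, end, min_size=3):
    """Group travelers who arrived at `airport` between `start` and `end`
    (inclusive) by employer, keeping only groups of at least `min_size`."""
    groups = defaultdict(set)
    for r in records:
        if r["airport"] == airport and start <= r["date"] <= end:
            groups[r["employer"]].add(r["traveler"])
    return {
        employer: sorted(members)
        for employer, members in groups.items()
        if len(members) >= min_size
    }


flagged = groups_sharing_employer(
    arrivals, "JFK", date(2004, 1, 1), date(2004, 1, 31)
)
```

Note what the grouping does not tell you: three coworkers flying to a trade conference would be flagged just the same, so the word "suspicious" in the query is doing work the data cannot.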
But remember that such inferences are hard, and as one government-funded paper about this program pointed out, the reliability of these tools is currently very low:
Paper (.pdf)
The House Appropriations Committee's Homeland Security Subcommittee is holding a hearing (that undoubtedly will include discussion of ADVISE) called "Balancing Privacy with Security Needs" Wednesday morning at 10:00 AM EST.
No word on whether the WaPo will be issuing a retraction.
Earlier post.
UPDATE: This post was updated shortly after posting to include a little more information about each of the 4 tests, an additional link, and a sentence on how much data the program was expected to handle.

