Judge Leon explains why the NSA uses everyone’s metadata

Freedom to Tinker 2013-12-20

There are many interesting things to discuss in Judge Leon’s opinion from yesterday, which found the NSA’s phone metadata program likely unconstitutional. In this post, I’ll focus on one bit of computer science in the judge’s ruling, and I’ll explain why the judge’s computer science argument is actually more powerful than he realized.

The judge found that the plaintiffs had standing to challenge the constitutionality of the NSA’s practices, based on the NSA’s use of plaintiffs’ data in processing queries. (He also found standing for other reasons.)

To reach that conclusion, the judge found that the NSA’s contact chaining analysis was necessarily using data about these specific plaintiffs. (The relevant part of the opinion starts at the bottom of page 38, and goes through page 41.)

The NSA’s contact chaining analysis uses a notion of distance based on “hops”. If A has talked to B in the last five years, then A and B are one hop apart. If A has talked to B in the last five years, and B in turn has talked to C in the last five years, then A and C are two hops apart. And so on. The NSA’s analysis starts with a “seed” phone number that has been approved as meeting a legally required level of suspicion. The analysis then extends up to three hops away from the seed number.
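To make the hop definition concrete, here is a minimal sketch of contact chaining as a breadth-first search. It is only an illustration of the idea described above, not a description of the NSA’s actual system: the function names, the `(caller, callee)` record format, and the sample numbers are all assumptions of mine.

```python
from collections import defaultdict, deque

def build_contacts(call_records):
    """Turn (caller, callee) pairs into an undirected contact map.

    The pair format is a hypothetical stand-in for real call records.
    """
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return contacts

def hop_distances(contacts, seed, max_hops=3):
    """Return {number: hops from seed} for every number within max_hops."""
    distances = {seed: 0}
    queue = deque([seed])
    while queue:
        number = queue.popleft()
        if distances[number] == max_hops:
            continue  # do not extend the chain past the hop limit
        for neighbor in contacts.get(number, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[number] + 1
                queue.append(neighbor)
    return distances

# A talked to B, B talked to C, and so on down the chain.
records = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
result = hop_distances(build_contacts(records), "A")
```

In this toy example B, C, and D come back at one, two, and three hops from the seed A, while E is four hops away and is left out.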

So how does the judge find that the NSA analysis necessarily uses the plaintiffs’ data? Here’s the key passage in the judge’s opinion:

The Government, however, describes the advantages of bulk collection in such a way as to convince me that plaintiffs’ metadata—indeed everyone’s metadata—is analyzed … whenever the Government runs a query using as the “seed” a phone number or identifier associated with a phone for which the NSA has not collected metadata (e.g., phones operating through foreign phone companies). According to the declaration submitted by NSA Director of Signals Intelligence Directorate (“SID”) Teresa H. Shea, the data collected as part of the Bulk Telephony Metadata Program—had it been in place at that time—would have allowed the NSA to determine that a September 11 hijacker living in the United States had contacted a known al Qaeda safe house in Yemen. Presumably, the NSA is not collecting metadata from whatever Yemeni telephone company was servicing that safehouse, which means that the metadata program remedies the investigative problem in Director Shea’s example only if the metadata can be queried to determine which callers in the United States had ever contacted or been contacted by the target Yemeni safehouse number. [The same point is reinforced elsewhere in the Shea declaration.] When the NSA runs such a query, its system must necessarily analyze metadata for every phone number in the database by comparing the foreign target number against all of the stored call records to determine which U.S. phones, if any, have interacted with the target number.

(pp. 39-40, emphasis in original, internal citations omitted)

The basic argument is that if the analysis needs to know whether Alice and Bob ever talked, then it must look at either Alice’s or Bob’s record. If Alice’s record is unavailable, then the only way to know whether Alice and Bob are connected is to look at Bob’s record.

(You might argue that instead of looking at Bob’s record, the analysis could instead look at some kind of precomputed index to find out the answer. But that doesn’t change anything, because the index-building process would still have to look at Bob’s record, otherwise the index couldn’t “know” whether Alice and Bob were connected. There’s no way to get the answer without looking at Bob’s record at some point.)

It follows that if you want a full list of people who talked to Alice, and you don’t have access to Alice’s record, then you have to look at every record in the database, to figure out whether that record is connected to Alice. If you skip even one record, then you can’t be sure that you have a complete list of Alice’s contacts.
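Here is a small sketch of that argument in code. It assumes a toy database that maps each covered phone number to the set of numbers it has exchanged calls with; the layout and the names (`contacts_of`, `alice_ext`, and so on) are hypothetical. The function returns not just the contacts but also the set of records it had to examine to find them.

```python
def contacts_of(records_by_number, target):
    """Return (contacts of target, set of records examined to find them)."""
    if target in records_by_number:
        # The target's own record is available, so one lookup is enough.
        return set(records_by_number[target]), {target}

    # No record for the target: every stored record has to be checked.
    contacts = set()
    examined = set()
    for number, record in records_by_number.items():
        examined.add(number)
        if target in record:
            contacts.add(number)
    return contacts, examined

# "alice_ext" stands for a number on a carrier that provides no data.
database = {
    "bob": {"alice_ext", "carol"},
    "carol": {"bob"},
    "dave": {"alice_ext"},
}
external_contacts, external_examined = contacts_of(database, "alice_ext")
internal_contacts, internal_examined = contacts_of(database, "bob")
```

For the external number, `external_examined` covers every record in the database, including Carol’s, even though Carol never talked to Alice. For Bob, whose own record is available, only that one record is touched.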

This result is actually more powerful than the judge seems to have realized. He applied this argument to the case where the seed number was external (i.e., from a carrier not providing data to the NSA). The same argument, that you must look at every record in the database to get an accurate result, applies not only to the case where the seed is an external number, but also to every case where an external number turns up among the numbers reached after one hop or two hops. In such a case, the analysis would have to look at every record in the database in order to extend the results to the next hop. (As above, you could instead use an index that was built by looking at every record.)
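The same toy model shows how an external number in the middle of a chain forces a full scan. This is again a sketch under my own assumptions (a dictionary from each covered number to its contact set, and hypothetical names throughout), not a claim about how the real system is built.

```python
def chain(records_by_number, seed, max_hops=3):
    """Return (numbers reached within max_hops, records examined on the way)."""
    reached = {seed}
    frontier = {seed}
    examined = set()
    for _ in range(max_hops):
        next_frontier = set()
        for number in frontier:
            if number in records_by_number:
                # Covered number: its own record lists its contacts.
                examined.add(number)
                found = set(records_by_number[number])
            else:
                # External number: scan every record for a mention of it.
                examined.update(records_by_number)
                found = {n for n, rec in records_by_number.items() if number in rec}
            next_frontier |= found - reached
        reached |= next_frontier
        frontier = next_frontier
    return reached, examined

# "ext" has no record of its own; "z" and "w" are unrelated to the seed.
database = {
    "seed": {"b"},
    "b": {"seed", "ext"},
    "c": {"ext"},
    "z": {"w"},
    "w": {"z"},
}
reached, examined = chain(database, "seed")
two_hop_reached, two_hop_examined = chain(database, "seed", max_hops=2)
```

Here the seed and its first-hop contact are both covered numbers, but the external number `ext` appears at two hops. Extending to the third hop examines every record, including those of `z` and `w`, who are not connected to the seed at all. Stopping at two hops touches only the records of `seed` and `b`.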

This case will come up very often. Using the judge’s very conservative calculation, there are at least 10,000 numbers within two hops of a typical seed. If even one of those 10,000 numbers is external, then the system will have to look at every record in the database to complete the three-hop analysis. It looks like this would usually be the case in practice. So the plaintiffs’ data, and your data as well, is not just used occasionally; it is probably used in almost every contact chaining calculation done by the NSA.
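A rough back-of-the-envelope calculation shows why one external number among 10,000 is so likely. The sketch below assumes 100 contacts per number (which yields the 10,000 figure at two hops) and treats each number as independently external with some small probability. Both the independence assumption and the example fraction of 0.1% are mine, chosen only for illustration.

```python
def numbers_within(hops, contacts_per_number=100):
    """Numbers reached at a given hop count, ignoring overlap between contacts."""
    return contacts_per_number ** hops

def chance_of_external(count, external_fraction):
    """Probability that at least one of `count` numbers is external,
    assuming each is external independently with the given probability."""
    return 1 - (1 - external_fraction) ** count

two_hop_count = numbers_within(2)
# Hypothetical: suppose only 1 number in 1,000 belongs to an uncovered carrier.
likelihood = chance_of_external(two_hop_count, 0.001)
```

Even with that very low hypothetical fraction, `likelihood` comes out above 99.99%, so under these assumptions nearly every three-hop query would end up touching every record.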