What Should We Do About Big Data Leaks?

I have a great fondness for government data, and the government has a great fondness for making more of it. Federal elections financial data, for example, with every contribution identified, connected to a name and address. Or the results of the census. I don’t know if you’ve ever had the experience of downloading census data but it’s pretty exciting. You can hold America on your hard drive! Meditate on the miracles of zip codes, the way the country is held together and addressable by arbitrary sets of digits.

You can download whole books, in PDF format, about the foreign policy of the Reagan Administration as it related to Russia. Negotiations over which door the Soviet ambassador would use to enter a building. Gigabytes and gigabytes of pure joy for the ephemeralist. The government is the greatest creator of ephemera ever.

Consider the Financial Crisis Inquiry Commission, or FCIC, created in 2009 to figure out exactly how the global economic pooch was screwed. The FCIC has made so much data, and has done an admirable job (caveats noted below) of arranging it. So much stuff. There are reams of treasure on a single FCIC web site, hosted at Stanford Law School: Hundreds of MP3 files, for example, with interviews with Jamie Dimon of JPMorgan Chase and Lloyd Blankfein of Goldman Sachs. I am desperate to find time to write some code that automatically extracts random audio snippets from each and puts them on top of a slow ambient drone with plenty of reverb, so that I can relax to the dulcet tones of the financial industry explaining away its failings. (There’s a Paul Krugman interview that I assume is more critical.)

The recordings are just the beginning. They’ve released so many documents, and with the documents, a finding aid that you can download in handy PDF format, which will tell you where to, well, find things, pointing to thousands of documents. That aid alone is 1,439 pages.

Look, it is excellent that this exists, in public, on the web. But it also presents a very contemporary problem: What is transparency in the age of massive database drops? The data is available, but locked in MP3s and PDFs and other documents; it’s not searchable in the way a web page is searchable, not easy to comment on or share.

Consider the WikiLeaks release of State Department cables. They were exhausting, there were so many of them, they were in all caps. Or the trove of data Edward Snowden gathered on a USB drive, or Chelsea Manning on CD. And the Ashley Madison leak, spread across database files and logs of credit card receipts. The massive and sprawling Sony leak, complete with whole email inboxes. And with the just-released Panama Papers, we see two exciting new developments: First, the consortium of media organizations that managed the leak actually came together and collectively, well, branded the papers, down to a hashtag (#panamapapers), informational website, etc. Second, the size of the leak itself (2.5 terabytes!) became a talking point, even though an exact description of what was contained within those terabytes was harder to understand. This, said the consortium of journalists that notably did not include The New York Times, The Washington Post, etc., is the big one. Stay tuned. And we are. But the fact remains: These artifacts are not accessible to any but the most assiduous amateur conspiracist; they’re the domain of professionals with the time and money to deal with them. Who else could be bothered?

If you watched the movie Spotlight, you saw journalists at work, pawing through reams of documents, going through, essentially, phone books. I am an inveterate downloader of such things. I love what they represent. And I’m also comfortable with many-gigabyte corpora spread across web sites. I know how to fetch data, how to consolidate it, and how to search it. I share this skill set with many data journalists, and these capacities have, in some ways, become the sole province of the media. Organs of journalism are among the only remaining cultural institutions that can fund investigations of this size and tease the data apart, identifying linkages and thus constructing informational webs that can, with great effort, be turned into narratives, yielding something like what we call “a story” or “the truth.”

Spotlight was set around 2001, and it features a lot of people looking at things on paper. The problem has changed greatly since then: The data is everywhere. The media has been forced into a new cultural role, that of the arbiter of the giant and semi-legal database. ProPublica, a nonprofit that does a great deal of data gathering and data journalism and then shares its findings with other media outlets, is one example; with other media organizations, it funded a project called DocumentCloud, which simplifies the process of searching through giant piles of PDFs (e.g., court records, or the results of Freedom of Information Act requests).

At some level the sheer boredom and drudgery of managing these large data leaks make them immune to casual interest; even the Ashley Madison leak, which I downloaded, was basically an opaque pile of data and really quite boring unless you had some motive to poke around.

If this is the age of the citizen journalist, or at least the citizen opinion columnist, it’s also the age of the data journalist, with the news media acting as product managers of data leaks, making the information usable, browsable, attractive. There is an uneasy partnership between leakers and the media, just as there is an uneasy partnership between the press and the government, which would like some credit for its efforts, thank you very much, and wouldn’t mind if you gave it some points for transparency while you’re at it.

Pause for a second. There’s a glut of data, but most of it comes to us in ugly formats. What would happen if the things released in the interest of transparency were released in actual transparent formats? By which I mean, not as a pile of unstructured documents, not even as pure data, but, well, as software? Put cost aside and imagine for a minute that the FCIC report was not delivered as web pages, PDFs, finding aids, and the like, but as a database filled with searchable, formatted text, including documents attributed to the individuals within, audio files transcribed, and so forth.

Now listen, if you work in this field I can hear your near-hysterical laughter: What I’m talking about is culturally impossible. I’m asking for people to hack on huge pools of data like they might hack on an app at a startup. It’s like asking someone very drunk to put the books back onto the library shelves.

Imagine the specifications that would need to be written, the meetings that would need to be held, the document entitled “Findings Release Format Specification 1.0” that would itself simply be a list of further modules that would need to be created. How do you deal with foreign languages? With right-to-left character systems? What exactly is the definition of a document? How do we indicate that something is a transcript, or an email, or a what-have-you?

I look at that FCIC data and see at least 300 hours of audio. At roughly a dollar a minute, that’s $18,000 worth of transcription. Those documents could be similarly turned into searchable text, as could any of the PDFs. We can do the same for emails. These tools exist and are open. If there are any faxes they can be OCRed. In this case we’ll assume it’s all in English. And we’ll aim for internal consistency. We’re talking gigabytes, not terabytes, of data, at least so far.

Chop it all up and put it into a database with full-text search. I’d use SQLite3. Its code is in the public domain and is so widely deployed as to be ubiquitous. It even runs on phones. Make a giant SQLite3 file. Then release that. You could put it on a peer-to-peer network like BitTorrent.
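To make that concrete, here is a minimal sketch of what "a database with full-text search" could look like, using Python's built-in sqlite3 module. Everything specific in it is an assumption for illustration: the table name docs, the columns title, kind, and body, the helper names build_corpus and search, and the three made-up sample documents. It also assumes your SQLite build includes the FTS5 full-text extension, falling back to the older FTS4 if it does not. It is not how the FCIC or anyone else actually packages data.

```python
import sqlite3


def build_corpus(path, documents):
    """Create a full-text-searchable SQLite database.

    `documents` is an iterable of (title, kind, body) tuples. The schema
    here is hypothetical, chosen only for this sketch.
    """
    conn = sqlite3.connect(path)
    try:
        # FTS5 is SQLite's current full-text search extension.
        conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, kind, body)")
    except sqlite3.OperationalError:
        # Some older builds lack FTS5; FTS4 accepts the same column list.
        conn.execute("CREATE VIRTUAL TABLE docs USING fts4(title, kind, body)")
    conn.executemany(
        "INSERT INTO docs (title, kind, body) VALUES (?, ?, ?)", documents
    )
    conn.commit()
    return conn


def search(conn, query):
    """Return (title, kind) for every document matching the query terms."""
    rows = conn.execute(
        "SELECT title, kind FROM docs WHERE docs MATCH ? ORDER BY rowid",
        (query,),
    )
    return rows.fetchall()


# Invented sample documents, standing in for transcripts, emails, and memos.
SAMPLE = [
    ("Interview transcript A", "transcript",
     "We discussed mortgage securities and leverage."),
    ("Email B", "email", "Lunch on Tuesday?"),
    ("Memo C", "memo", "Leverage ratios rose through 2007."),
]

if __name__ == "__main__":
    # ":memory:" keeps the demo off disk; a real release would use a file path.
    conn = build_corpus(":memory:", SAMPLE)
    for title, kind in search(conn, "leverage"):
        print(kind, "|", title)
```

Pass a file path instead of ":memory:" and the result is a single file that any SQLite client can open and query, which is the whole point of the exercise.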

What would that mean? It would mean that instead of pawing through a giant PDF that points out other documents, and then finding those documents, anyone with a few minutes of training could download the file, start up a database client program, and start searching through the documents. If they had basic skills as a web developer, they could make new and novel interfaces for that data.

They could even start exploring large data dumps right from their phones. Without the internet. I know this is not a source of joy for all. But there are some of us, a few at least, who would enjoy drifting off to sleep browsing charts and graphs, listening to Jamie Dimon explain himself, and thinking about the world as it was in 2008.

In the world of software you have to ship products that people can use. You gather feedback and iterate on it. Otherwise your product will be subsumed by its competition. This is why we are on version umpteen of Microsoft Word or Excel, and why there are such regular updates to the Facebook app on your phone. But the same norms and rules only barely exist for data. The media has been thrust into the role of data keepers, because only it has the time to unpack the schemata that define a given file and turn it into something usable and newsworthy.

You don’t need a web professional to make a book or make a magazine, you don’t need them to publish a web site or tidy up a picture. But you do need them to clean up data and make it easy to explore. And as the data dumps keep happening, our reliance on the media to make sense of them—legal or not, structured or not—will only increase.

I’ll be completely, well, transparent: I don’t think we’re ready for highly searchable, easily accessible leaks and data dumps. We are not a particularly measured society, and this sort of information actually rewards a sense of historical context and measured analysis. We like to validate assumptions, not explore corpora. But the data keeps falling off the back of the truck, or is released by some august governmental body, or sneaked out of the country on a mislabeled compact disc. A transparent society is one that makes data not just available but usable. What use is a window if you can’t stare through it?