2016-04-06

If this is the age of the citizen journalist, or at least the citizen opinion columnist, it’s also the age of the data journalist, with the news media acting as product managers of data leaks, making the information usable, browsable, attractive. There is an uneasy partnership between leakers and the media, just as there is an uneasy partnership between the press and the government, which would like some credit for its efforts, thank you very much, and wouldn’t mind if you gave it some points for transparency while you’re at it.

Pause for a second. There’s a glut of data, but most of it comes to us in ugly formats. What would happen if the things released in the interest of transparency were released in actual transparent formats? By which I mean, not as a pile of unstructured documents, not even as pure data, but, well, as software? Put cost aside and imagine for a minute that the FCIC report was not delivered as web pages, PDFs, finding aids, and the like, but as a database filled with searchable, formatted text, including documents attributed to the individuals within, audio files transcribed, and so forth.

Now listen, if you work in this field I can hear your near-hysterical laughter: What I’m talking about is culturally impossible. I’m asking for people to hack on huge pools of data like they might hack on an app at a startup. It’s like asking someone very drunk to put the books back onto the library shelves.

Imagine the specifications that would need to be written, the meetings that would need to be held, the document entitled “Findings Release Format Specification 1.0” that would itself simply be a list of further modules that would need to be created. How do you deal with foreign languages? With right-to-left character systems? What exactly is the definition of a document? How do we indicate that something is a transcript, or an email, or a what-have-you?

I look at that FCIC data and see at least 300 hours of audio. That’s $18,000 worth of transcription. Those documents could be similarly turned into searchable text, as could any of the PDFs. We can do the same for emails. These tools exist and are open. If there are any faxes they can be OCRed. In this case we’ll assume it’s all in English. And we’ll aim for internal consistency. We’re talking gigabytes, not terabytes, of data, at least so far.
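For anyone who wants to check the arithmetic, here is a minimal sketch of the transcription estimate. The 300 hours and the $18,000 come from the paragraph above; the per-minute rate is only what those two numbers imply, not a quoted price from any vendor.

```python
# Figures taken from the text above.
AUDIO_HOURS = 300
QUOTED_COST_USD = 18_000

# Derived values: total minutes of audio, and the rate the estimate implies.
audio_minutes = AUDIO_HOURS * 60
implied_rate_per_minute = QUOTED_COST_USD / audio_minutes

if __name__ == "__main__":
    print(f"{audio_minutes} minutes of audio")
    print(f"${implied_rate_per_minute:.2f} per minute implied by the estimate")
```

If the real rate were higher or lower, the total would scale with it; the point is only that the bill is in the tens of thousands, not the millions.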

Chop it all up and put it into a database with full-text search. I’d use SQLite3. Its code is in the public domain and is so widely deployed as to be ubiquitous. It even runs on phones. Make a giant SQLite3 file. Then release that. You could put it on a peer-to-peer network like BitTorrent.
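To make that concrete, here is a minimal sketch of the idea in Python, using the `sqlite3` module from the standard library. Everything specific in it is an assumption for illustration: the table name `documents`, the three columns, and the sample records are invented, and it assumes the SQLite build at hand was compiled with the FTS5 full-text extension (most recent ones are). A real release would need a far more careful schema.

```python
import sqlite3

# Hypothetical sample records standing in for transcripts, emails, and OCRed pages.
SAMPLE_DOCUMENTS = [
    ("transcript", "Interview 001", "We discussed the mortgage desk and its exposure."),
    ("email", "Re: quarterly risk", "Please send the updated risk numbers before Friday."),
    ("ocr", "Fax 17", "Summary of subprime mortgage holdings as of year end."),
]


def build_release(path):
    """Create a single SQLite file with a full-text index over the documents."""
    conn = sqlite3.connect(path)
    # Assumption: this SQLite build includes the FTS5 extension.
    conn.execute("CREATE VIRTUAL TABLE documents USING fts5(kind, title, body)")
    conn.executemany(
        "INSERT INTO documents (kind, title, body) VALUES (?, ?, ?)",
        SAMPLE_DOCUMENTS,
    )
    conn.commit()
    return conn


def search(conn, query):
    """Return (kind, title) for every document matching the full-text query."""
    rows = conn.execute(
        "SELECT kind, title FROM documents WHERE documents MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    # ":memory:" keeps the demo off disk; pass a filename to produce a real file.
    release = build_release(":memory:")
    for kind, title in search(release, "mortgage"):
        print(kind, title)
```

The whole release is one file, which is what makes it easy to hand around: whoever receives it gets the documents and the search index together.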

What would that mean? It would mean that instead of pawing through a giant PDF that points out other documents, and then finding those documents, anyone with a few minutes of training could download the file, start up a database client program, and start searching through the documents. If they had basic skills as a web developer, they could make new and novel interfaces for that data.
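From the reader's side, the experience might look something like the sketch below. It is illustrative only: the file name `release.sqlite3`, the `documents` table, and the sample text are hypothetical, and the demo writes its own tiny stand-in file so that it can run on its own. It again assumes an SQLite build with FTS5, whose `snippet()` function returns a short excerpt around each match.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_demo_release(path):
    """Write a tiny stand-in for the released file (hypothetical contents)."""
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE VIRTUAL TABLE documents USING fts5(kind, title, body)")
    conn.executemany(
        "INSERT INTO documents (kind, title, body) VALUES (?, ?, ?)",
        [
            ("transcript", "Hearing, day 1",
             "The witness described the liquidity crunch in detail."),
            ("email", "Fwd: numbers",
             "Attached are the revised figures for the board."),
        ],
    )
    conn.commit()
    conn.close()


def open_read_only(path):
    """Open the downloaded file in read-only mode so it cannot be altered."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def search_with_context(conn, query):
    """Return (title, excerpt) pairs, with matched words wrapped in brackets."""
    sql = (
        "SELECT title, snippet(documents, 2, '[', ']', '...', 8) "
        "FROM documents WHERE documents MATCH ? ORDER BY rank"
    )
    return conn.execute(sql, (query,)).fetchall()


def demo():
    """Build the stand-in file, reopen it read-only, and run one search."""
    with tempfile.TemporaryDirectory() as workdir:
        release = Path(workdir) / "release.sqlite3"
        make_demo_release(release)
        conn = open_read_only(release)
        try:
            return search_with_context(conn, "liquidity")
        finally:
            conn.close()


if __name__ == "__main__":
    for title, excerpt in demo():
        print(title, "|", excerpt)
```

Opening the file read-only is a design choice rather than a requirement: it means a curious reader can poke around without any risk of changing the record they downloaded.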

They could even start exploring large data dumps right from their phones. Without the internet. I know this is not a source of joy for all. But there are some of us, a few at least, who would enjoy drifting off to sleep browsing charts and graphs, listening to Jamie Dimon explain himself, and thinking about the world as it was in 2008.

In the world of software you have to ship products that people can use. You gather feedback and iterate on it. Otherwise your product will be subsumed by its competition. This is why we are on version umpteen of Microsoft Word or Excel, and why there are such regular updates to the Facebook app on your phone. But the same norms and rules only barely exist for data. The media has been thrust into the role of data keepers, because only it has the time to unpack the schemata that define a given file and turn it into something usable and newsworthy.

You don’t need a web professional to make a book or a magazine; you don’t need one to publish a web site or tidy up a picture. But you do need one to clean up data and make it easy to explore. And as the data dumps keep happening, legal or not, structured or not, our reliance on the media to make sense of them will only increase.

I’ll be completely, well, transparent: I don’t think we’re ready for highly searchable, easily accessible leaks and data dumps. We are not a particularly measured society, and this sort of information actually rewards a sense of historical context and measured analysis. We like to validate assumptions, not explore corpora. But the data keeps falling off the back of the truck, or is released by some august governmental body, or sneaked out of the country on a mislabeled compact disc. A transparent society is one that makes data not just available but usable. What use is a window if you can’t stare through it?