Facebook and Amazon Have Huge Islands of Data. How Can They Come Together?

Throughout my working life as a web programmer I have always had access to cheap or free databases. Databases are specialized software for creating, storing, managing, and analyzing data. Data is just stuff, or rather, structured stuff: The cells of a spreadsheet, the structure of a Word document, computer programs themselves—all data.

If you have a million customers you can’t just have a million files all over the place. You need to put them somewhere. Enter the database. The database is the unsung infrastructure of the world, the shared memory of every corporation, and the foundation of every major web site. And they are everywhere. Nearly every host-your-own-web-site package comes with access to a database called MySQL; just about every cell phone has SQLite3, a tiny, pocket-sized database, built in.

See that term, SQL? That’s the lingua franca of what’s called the relational database world. It stands for “Structured Query Language.” IBM makes DB2; Oracle has the Oracle Database, currently at version 12c. Then there’s Microsoft SQL Server, MySQL (open sourced and freely available, but owned by Oracle), PostgreSQL, and SQLite. These are just the more famous of the databases that “speak” SQL; what this means in practice is that were one to sit down before a command prompt of any of them and write “SELECT * FROM tracks WHERE musician = 'Katy Perry';” all the tracks by Katy Perry would be listed on the screen, assuming that a large database of tracks already exists where each track was associated with a musician’s full name.
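For readers who want to try that query, here is a minimal sketch using Python's built-in sqlite3 module. The tracks table, its two columns, and the three sample rows are assumptions invented for illustration; a real catalog would be far larger and richer.

```python
import sqlite3

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per track, with the musician's full name.
conn.execute("CREATE TABLE tracks (title TEXT, musician TEXT)")
conn.executemany(
    "INSERT INTO tracks (title, musician) VALUES (?, ?)",
    [
        ("Roar", "Katy Perry"),
        ("Firework", "Katy Perry"),
        ("Yesterday", "The Beatles"),
    ],
)

# The query from the text, with the name passed as a parameter and
# an ORDER BY added so the result order is predictable.
katy_tracks = conn.execute(
    "SELECT * FROM tracks WHERE musician = ? ORDER BY title",
    ("Katy Perry",),
).fetchall()

for title, musician in katy_tracks:
    print(title, "by", musician)
```

The same SELECT statement should work largely unchanged on the other SQL databases named above, which is the point of a lingua franca.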

Most online catalogues work along these lines—there are tables of, say, screwdrivers, and tables of customers, and there is another table that lists which customer bought which screwdrivers (they might purchase more than one); furthermore there are tables of shipping rates by screwdriver weight; tables of postal tariffs; tables of administrators who have the privileges necessary to update the screwdriver table (who might be separate from those who have the privileges necessary to update the postal tables).

All of these tables relate to one another; “screwdriver #123 was purchased by customer #456” is a fact. Where all this gets particularly interesting is wherever, by traversing the relations between tables (this is done, like anything, by commands issued in SQL), you can discover new things: Where in America do people use more Phillips than flat-head screwdrivers? Where do they buy the most screwdrivers? These are the merits of the “relational” database. Predictable, logical, capable of processing thousands of transactions with provable integrity, so that not one penny is lost: SQL is the perfect servant of capitalism. Who could need anything more?
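Traversing the relations between tables can be sketched in a few lines. The example below uses Python's built-in sqlite3 module; the table names, column names, and sample rows are all hypothetical, chosen only to mirror the screwdriver catalog described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: customers, screwdrivers, and a table relating them.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE screwdrivers (id INTEGER PRIMARY KEY, head TEXT);
CREATE TABLE purchases (
    customer_id INTEGER REFERENCES customers(id),
    screwdriver_id INTEGER REFERENCES screwdrivers(id)
);
INSERT INTO customers VALUES (456, 'OH'), (457, 'OH'), (458, 'TX');
INSERT INTO screwdrivers VALUES (123, 'phillips'), (124, 'flat');
-- "screwdriver #123 was purchased by customer #456" is the first row here.
INSERT INTO purchases VALUES (456, 123), (457, 123), (457, 124), (458, 124);
""")

# Join the three tables to count purchases by state and head type.
sales_by_state = conn.execute("""
    SELECT c.state, s.head, COUNT(*) AS purchases
    FROM purchases AS p
    JOIN customers AS c ON c.id = p.customer_id
    JOIN screwdrivers AS s ON s.id = p.screwdriver_id
    GROUP BY c.state, s.head
    ORDER BY c.state, s.head
""").fetchall()

for state, head, count in sales_by_state:
    print(state, head, count)
```

No single table knows where Phillips screwdrivers sell best; the answer only appears once the purchases table is followed out to both customers and screwdrivers.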

More than a decade ago, after spending many years building web sites powered by relational databases, I became enamored of graph databases. It was in the air. The idea was that we—the citizens of the internet—would work together and create a “Semantic Web,” a giant open “knowledge graph” of statements in the form of subject, predicate, object, a “triple.” If Paul hasFriend Jim, and Jim hasFriend Betsy, well, then, can’t we infer that Paul and Betsy have Jim in common? We can! We can say all sorts of things about Paul, Jim, and Betsy. We can say that Betsy is a professor at a college and then follow that link to a list of all the departments in the college, or say that Jim is a fan of the Beatles and follow that link to A Hard Day’s Night. In addition to publishing web pages with words and headlines, we’d all start publishing our data in this open, connectable format, and everything—everything, dammit—would be one giant connection of links.
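The inference about Paul and Betsy can be sketched with nothing more than a set of tuples. This is a toy illustration in Python, not a real triple store; the function names are made up, and treating hasFriend as a two-way relationship is an assumption the example makes for simplicity.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("Paul", "hasFriend", "Jim"),
    ("Jim", "hasFriend", "Betsy"),
    ("Jim", "isFanOf", "The Beatles"),
}


def friends_of(person):
    """Collect everyone linked to person by hasFriend, in either direction."""
    found = set()
    for subject, predicate, obj in triples:
        if predicate != "hasFriend":
            continue
        if subject == person:
            found.add(obj)
        elif obj == person:
            found.add(subject)
    return found


def friends_in_common(a, b):
    """Infer the friends two people share by intersecting their friend sets."""
    return friends_of(a) & friends_of(b)


common = friends_in_common("Paul", "Betsy")
print(common)
```

Nothing in the data says outright that Paul and Betsy share a friend; that fact falls out of following the links.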

This was a different approach to data than the relational model, triples instead of tables, with some different theoretical underpinnings. It necessitated new kinds of storage software and new kinds of databases. Begone, SQL databases! Hello, triple stores. There was even an effort called FOAF (which stands for “friend of a friend” and is pronounced “fofe”) that described the social network between individuals. It was totally decentralized and anyone could join it. All you had to do in order to get connected to this social network was find a web host and publish some FOAF RDF in XML format and then—wait, why are you checking Facebook?

Sigh. I once was in charge of the web site for a magazine (this was 2006), and I made the decision to migrate everything to the graph model. We had hundreds of thousands of pages of content, a whole archive, all connected by subjects. I made myself lord of the taxonomy. Instead of a relational database, I used a graph database—an experimental, alpha version. I revised my entire world into a huge set of hundreds of thousands of logical statements about pages, articles, issues, and subjects. Millions of spinning plates.

I was ambitious in inverse proportion to my talent as a programmer, using untested technologies, flattering myself that I understood how they worked. The resulting site was a joy to contemplate, and worked well enough for the readers (who experienced it as a bunch of web pages, unaware of the toil beneath)—and a bear to maintain. About once a month everything would break and I’d stay awake for a night, carefully feeding files into the graph-database loader that I’d custom-coded, like a stoker feeding coal into a firebox. The upshot was that individual web pages took minutes to generate, so I had to “cache” them for later delivery. Instead of ditching the whole framework and starting again, I created yet more programs to manage the caching process, delivering inefficiency on top of inefficiency. No one could update a blog post unless I edited a file on a server in Texas, so I jammed WordPress, a blogging engine built on top of MySQL, on top of the whole thing and jury-rigged a connection between it and my monstrosity.

This is a risk of working alone, without anyone to tell you you’re insane. After it was done, I realized I had no one in particular with whom to share all my lovely knowledge base, except for a few researchers who sent me emails. Then again, everything worked well enough for the readers.

In the late 2000s, there arose the “NoSQL movement,” coalescing around a collective desire of many programmers to move beyond the strictures of the relational model and unshackle themselves from SQL. Our data is varied and diverse, they said, even if the programmers weren’t that varied and diverse, and we are tired of pretending that one technology will address our need for speed. Dozens of new databases appeared, each with different merits. There were the key-value databases, like Kyoto Cabinet, which optimized for speed of retrieval. There were search-engine libraries, like Apache Lucene, which made it relatively easy to search through enormous corpora of text—your own Google. There was MongoDB, which allowed for “documents,” big arbitrary blobs of data, to be stored without nice rows and consistent structure. People debated, and continue to debate, the value of each. Advocates of relational databases often pipe up and say, “You could do that with [insert name of an SQL database]!” and then other people say, “Yes but why should you?” and it goes around and around.

There is as yet no absolute challenger to the relational model. When people think database, they still think SQL. But if there is a true challenger, it is in the graph model. Because graph data structures power social networks, and social networks are the dominant technological organism of the era.

It is a good era for graph databases. In 2010, Google bought a company called Metaweb, whose Freebase project was trying to create a huge open map of reality, and then it was, well, part of Google. Twitter released a graph database of its own that same year, called FlockDB. Neo4j is another popular open-sourced database, letting you store all of those knowledge-style relationships. A friend of mine from the Semantic Web days hung in there, and his company built a commercial graph database called Stardog; it has customers like JPMorgan Chase and NASA. There are knowledge graphs everywhere. And of course there’s Facebook, with its billion users and their countless likes, their lists of friends, and its enormous graph of information on just about everything. Facebook is a tremendous user of the graph model, the irony being that their store of facts is so huge, so world-spanning, that they need a flexible way to store all the data that it contains across thousands and thousands of machines. This service, the root source of truth, is provided by many, many copies of MySQL.
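How can a graph live inside a relational database? One common approach is a single table of edges, which can be sketched as follows. This is emphatically not Facebook's actual schema, which the text does not describe; the edges table, its columns, and the sample rows are hypothetical, and Python's built-in SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical edge table: every row is one link in the graph.
conn.execute("CREATE TABLE edges (src TEXT, kind TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges (src, kind, dst) VALUES (?, ?, ?)",
    [
        ("Paul", "friend", "Jim"),
        ("Jim", "friend", "Betsy"),
        ("Jim", "likes", "The Beatles"),
        ("Betsy", "likes", "The Beatles"),
    ],
)

# Walk two "friend" hops from a starting person by joining the table to itself.
friends_of_friends = conn.execute(
    """
    SELECT DISTINCT second.dst
    FROM edges AS first
    JOIN edges AS second ON second.src = first.dst
    WHERE first.src = ?
      AND first.kind = 'friend'
      AND second.kind = 'friend'
      AND second.dst != first.src
    ORDER BY second.dst
    """,
    ("Paul",),
).fetchall()

print(friends_of_friends)
```

Each extra hop costs another self-join, which hints at why dedicated graph databases exist, and also at why plain tables can still do the job when spread across enough machines.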

Sometimes I get a little wistful. The vision of a world of connected facts, one big, living library, remains beautiful, and unfulfilled. What we have instead are large pools of data—Amazon’s product catalog and reviews, Facebook’s set of likes and social connections, Twitter’s friends and followers, countless magazine and newspaper web sites. Each has its own pressing, commercial needs. They are not averse to sharing information, to seeing their signal spread far and wide. It’s just that everything needs to align with their business goals. That was the mistake I made in my twenties, thinking that somehow just sharing information would be enough, that information would seek out other information and a world library of open graph data would build itself. I don’t think that it is impossible that we’ll get there, someday; it’s just that without the motive power of capital, things move slowly. There are great islands of knowledge on the internet, but not one big pool of knowledge. I used to feel that such a thing was right around the corner. But the web was much smaller then, and now, well, it’s nice to imagine.
