Does Amazon’s Data Speak for Itself?

I have a copy of Amazon. Meaning that on my hard drive there is a massive chunk of Amazon’s product and reviews database: a listing of nine million or so products and 80 million or so reviews taken from 1996 to 2014. The names of all the books in that chunk, their sales ranks, their categories. Every pair of pants for kids, every sock. All the books about Hitler; all the books about snakes. All the different Lego sets. Whatever.

The way I came to be in possession of this thing is that someone tweeted that it existed. I visited a web page, sent an email to a researcher at UC San Diego, and was sent a link to download the data, with a request to cite a paper associated with it, which was presented at the 2015 SIGIR conference: “Image-Based Recommendations on Styles and Substitutes.” The data totaled about 20 gigabytes, compressed.

It’s not a perfect copy by any means, but neither is it a pirated one. Rather, it is “spidered” data, culled by automatically visiting Amazon’s web site and copying what is found, adding it up, aggregating it. One could do the same with Walmart.com, or with any big company. But Amazon is a special case: It is possibly the most purely optimized commercial enterprise in history, marrying hard computer science to ruthless labor practices in pursuit of delivering brown, branded boxes to anyone who might conceivably want them. It knows so much about us, and we know so little about it. Walmart has done terrible things for longer, but in comparison seems so amateur. Amazon is out for the world. And I write this as a hypocrite. Who knows how many Amazon boxes are on their way to my house? They show up daily sometimes. Fear is the coin flip of admiration.

In the data, the books don’t have authors, many prices are missing (and I can’t find any prices above $999.99), and there are other gaps besides. Nonetheless, it’s what was granted me. A conglomerate in a teacup. I decided to absorb the data into a database. The first draft of the code I wrote to do so informed me that it would take 25 days of processing time to complete. That was too long. Also I was out of hard drive space. So I went to a store and bought a computer, a big, boxy, unfashionable PC with a 4-GHz quad-core processor and ten terabytes of extra hard-drive space, installed Linux on it, and got the most recent version of the PostgreSQL database. I could have done all this in the cloud of course, but it’s harder to just mess around in the cloud, and there’s something very comfortable about having your own big machine next to your knee. Besides, the cloud I know best is Amazon’s, and I didn’t want to get conflicted.

With the help of that machine and quite a few database tricks to massage and extract the data, I got 25 days down to one, with searchable titles, descriptions, and reviews. Seven days of programming and one day of absorption to beat one day of programming and 25 days of absorption: a pretty familiar set of trade-offs. You’re always trying to balance your time against the computer’s, but there’s also the challenge of the thing. I probably should have just let it run for four weeks.
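
The essay doesn't say which tricks were involved, so what follows is only a guess at the general shape of one of them: streaming the compressed file and inserting rows in batches rather than one at a time. It is a minimal sketch, not the code described above. It assumes the dump is gzip-compressed with one JSON object per line, it uses Python's built-in sqlite3 as a stand-in for PostgreSQL so the example runs anywhere, and the field names (id, title, price) are hypothetical.

```python
import gzip
import io
import json
import sqlite3
from itertools import islice


def read_records(stream):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines stream."""
    with gzip.open(stream, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


def batches(records, size):
    """Group an iterator into lists of at most `size` items."""
    records = iter(records)
    while True:
        chunk = list(islice(records, size))
        if not chunk:
            return
        yield chunk


def load_products(stream, conn, batch_size=10_000):
    """Insert products in batches, committing once per batch; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(id TEXT PRIMARY KEY, title TEXT, price REAL)"
    )
    total = 0
    for chunk in batches(read_records(stream), batch_size):
        # Missing titles or prices become NULL, mirroring the gaps in the data.
        rows = [(r["id"], r.get("title"), r.get("price")) for r in chunk]
        conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
        conn.commit()
        total += len(rows)
    return total


def demo():
    """Load three invented sample records through an in-memory database."""
    sample = [
        {"id": "A1", "title": "Banana slicer", "price": 3.5},
        {"id": "A2", "title": "Ball pen"},  # price deliberately missing
        {"id": "A3", "title": "Gummi bears", "price": 12.0},
    ]
    raw = "\n".join(json.dumps(r) for r in sample).encode("utf-8")
    stream = io.BytesIO(gzip.compress(raw))
    conn = sqlite3.connect(":memory:")
    return load_products(stream, conn, batch_size=2), conn
```

Calling demo() loads the three sample rows in two batches. The point of batching is that the per-row cost of a commit is paid once per ten thousand rows instead of once per row, which is the kind of change that can turn weeks into a day.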

So now I have it. I have my very own local, diminished Amazon. Now what do I do? Do I set up shop? Not really; I can’t just reproduce their pages and reviews. Whenever I find myself in an unfamiliar database I search for the same damn things. Hitler always comes to mind, because Hitler shows up everywhere. What, I wondered, was the most expensive Hitler book I could find? Speeches and Proclamations 1932-1945, in translation, four volumes, $721.05. For the strapped, though, Mein Kampf is only 99 cents on Kindle. What about Roosevelt? $140.36, for a book called Allies at War. The Eleanor Roosevelt Encyclopedia runs $95.07.

Reviews are associated with a total number of votes, and this quickly reveals that the very topmost, thumbs-upped reviews are the joke ones, and the mean ones; for the Hutzler 571 Banana Slicer, “No more winning for you, Mr. Banana!” (52,861 votes); for the BIC Cristal for Her Ball Pen, a review of “Finally!” with 38,604 votes; for the Fifty Shades of Grey audiobook, “Did a teenager write this?” And of course there’s much fun to be had from a five-pound bag of Haribo Gummi Bears, which if eaten in quantity are a laxative, or the Playmobil Security Check Point playset. Books feature very little in the “most reviewed.” But then again few books are as hilarious as the idea of Gummi Bears causing terrible diarrhea, or as likely to inspire passion as a Kindle Fire HD 7”.
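
Finding the topmost reviews doesn't require sorting all 80 million of them. A rough sketch, assuming each review is a dict with a hypothetical "votes" field: the two vote counts below come from the paragraph above, and the third record is invented filler.

```python
import heapq

# Two vote counts are quoted in the essay; the last record is invented filler.
SAMPLE_REVIEWS = [
    {"product": "BIC Cristal for Her Ball Pen", "votes": 38604},
    {"product": "Placeholder product", "votes": 12},
    {"product": "Hutzler 571 Banana Slicer", "votes": 52861},
]


def top_reviews(reviews, k):
    """Return the k reviews with the most votes, highest first.

    heapq.nlargest keeps only k candidates in memory at a time, so it works
    on any iterable, including a generator streaming rows from disk.
    """
    return heapq.nlargest(k, reviews, key=lambda review: review["votes"])
```

With the sample above, top_reviews(SAMPLE_REVIEWS, 2) returns the banana slicer first and the pen second.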

Looking at the book rankings by popularity reveals very few secrets. The Alchemist: Doing fine. Heaven Is for Real is up there, even though heaven is possibly not for real. We like books for children; we like dying teens; we like dragons; we like sex and murder. In other categories—pet supplies, for example—it’s kitty litter that’s number one. Presumably 2015 was similar.

It gets stranger at the bottom. You can sort in reverse order. This is a computer. At the very end of the long tail you find the typical basement bin: How to Stay Sane in a Crazy World down at 15 million; or Creative Screwing, which is self-published and costs more than $700, and thus is also down around 15 million. You can see all the basic forces at work: At the top of the list there’s marketing, popularity, and relatively little regard for the literary; many of these books are garbage, and their popularity is immune to reviews. You have to go down the list to find the world of “quality.” Different ecosystems thrive in there, in among the rankings: the world of the careful sentences; the world of the graphic novels. To Amazon, though, or rather to its computers, it’s all one thing. If you’re a programmer, the difference between a can of oil and a book is fairly minimal. Each one is worthy of review, each one can be assigned a ranking. If you make more profit on the can of oil, you should focus your efforts there. Also, there may be two million books out of those nine million quality items, although no one knows the exact number. The real business is in the digital downloads, of course. Those are the best: Immediate gratification. No warehouses. The labor is purely the author’s, as is the marketing and promotion. Margins approaching infinity.

I kept looking and looking but finally I had to admit: I can’t climb this particular mountain. There’s no obvious path through this data. I could claim that it’s a mirror of capitalism, or the global marketplace, but I can’t prove that. The broad claims of the essayist are no match for the digital reality of a global megastructure.

I don’t have a good mental model for thinking across nine million objects, nor for exploring 80 million opinions. This is what people are talking about when they say “big data,” of course: No one knows what’s actually inside there, no one can make sense of all that stuff. No single human being could possibly read all of the reviews on Amazon in a single lifetime, and even reading the names of all the products would take six or seven months. Big data, for the most part, is made by humans—it is the record of what we clicked on, the banner ads we viewed, our paths through a site, multiplied by humanity. Sometimes it is seismic data or star charts too, but mostly what people are talking about with big data is data about human behavior that can be mined to create better predictive models for future human behavior.
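
The reading-time figures can be checked on the back of an envelope. The rates here are assumptions, not numbers from the data: about two seconds per product name and about one minute per review, reading around the clock with no sleep.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12


def months_to_read(items, seconds_each):
    """Months of non-stop reading needed for `items` things at a fixed pace."""
    return items * seconds_each / SECONDS_PER_DAY / DAYS_PER_MONTH


def years_to_read(items, seconds_each):
    """Years of non-stop reading needed for `items` things at a fixed pace."""
    return items * seconds_each / SECONDS_PER_DAY / DAYS_PER_YEAR


# Assumed pace: 2 seconds per product name, 60 seconds per review.
NAME_MONTHS = months_to_read(9_000_000, 2)
REVIEW_YEARS = years_to_read(80_000_000, 60)
```

Under those assumptions the nine million names take a little under seven months and the 80 million reviews take roughly a century and a half, which is consistent with both claims above.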

We can rank and sort and massage, let people rate each other’s reviews, and hope that order emerges. And then we old book-type people tend to show up and tut when people prefer The Alchemist to The Man Without Qualities, but if you take a breath, who wouldn’t? People like big, simple things with wizards. Wouldn’t you? It’s just that now we have proof.

What sense can I make of an object like this copy of Amazon? Well, simply because it exists, because it was created by researchers, because I was able to get it, I can intuit that we, meaning just regular people, were fascinated by the enormous social structures that the internet had wrought; and from a mixture of admiration and curiosity, we wanted to understand these new systems, and that perhaps the best way to understand them was not to wonder, or read the PR, or listen to Amazon when they told us that they wanted to deliver packages via drones, but just to look at what they put into the world. Just as they observe us, follow our habits, chase us across the internet with banner ads, we can aggregate them as they aggregate us, observing their behaviors to make the relationship between consumer and giant companies less asymmetrical.

I hope there are more of these, more models of giant organizations, more ways to explore them. These are the dominant organisms of our time, the Medicis and Rockefellers and Rothschilds of the moment, connected via credit card payments to hundreds of millions of lives. They can seem too big to scale, too massive to comprehend. And yet they’re all data. What carries them forward, year to year, is the persistence of the data, the fact that they know what people want, and how to get it, and how to hold it in a warehouse for the shortest amount of time possible. Just because I couldn’t scale the mountain doesn’t mean it can’t be scaled. It definitely can. We just don’t yet know how to read a database.