They’re all in it together.
The spooks and the social media titans and the online commerce goliaths
are collaborating to improve data-crunching software tools that enable
the tracking of our behavior in fantastically intimate ways that simply
weren’t possible as recently as four or five years ago. It’s a new
military industrial open source Big Data complex.
The gift economy has
delivered us the surveillance state.
Hadoop’s
earliest roots go back to 2002, when Doug Cutting, then the search
director at the Internet Archive, and Michael Cafarella, a graduate
student at the University of Washington, started working on an
open-source search engine called "Nutch."
But the project did not get
serious traction until Cutting joined Yahoo and began to merge his work
into Yahoo’s larger strategic goal of improving its search engine
technology so as to better compete with Google.
Significantly, Yahoo
executives decided not to make the project proprietary. In 2006, they
blessed the formation of Hadoop, an open-source project managed under
the auspices of the
Apache Software
Foundation. (For a much more detailed look at the project's origins,
see the four-part history of Hadoop at GigaOm.)
Hadoop is basically a nifty hack. The definition, per Wikipedia, is
surprisingly simple: "It supports the running of applications on large
clusters of commodity hardware."
Bottom line, Hadoop provides a means for
distributing both the storage and processing of an enormous amount of
data over lots and lots of relatively inexpensive computers.
Hadoop
turned out to be cheap, fast and scalable - meaning it could expand
smoothly in capacity as the flows of data it was crunching burgeoned in
size, simply through plugging in extra computers to the network. Hadoop
was also fundamentally modular - different parts of it could be easily
replaced by custom designed chunks of software, making it seamlessly
adaptable to the individual circumstances of different corporations - or
government agencies.
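For readers who want to see the idea of spreading processing over many cheap machines in concrete terms, here is a minimal sketch in plain Python. It is not Hadoop's API: the "machines" are simulated as slices of a list, and every function name is hypothetical, chosen only for illustration. It follows the map-and-reduce pattern that Hadoop's processing layer is built around, in which each machine tallies its own slice of the data and the partial tallies are then merged.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, num_nodes):
    """Deal records out round-robin, one chunk per simulated machine."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_chunk(chunk):
    """The 'map' step: a machine counts words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """The 'reduce' step: combine two partial tallies into one."""
    return left + right


def word_count(records, num_nodes=3):
    """Count words across all records by splitting, mapping, then merging."""
    chunks = split_into_chunks(records, num_nodes)
    # On a real cluster these per-chunk jobs would run in parallel on
    # separate computers; here they simply run one after another.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(merge_counts, partials, Counter())
```

The point of the sketch is that the answer does not depend on how many machines share the work, which is why capacity can grow by plugging in more of them.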
Hadoop’s debut was timely, addressing not
only the problems Yahoo faced in managing the enormous amounts of data
produced by its users, but also those that the entire Internet industry
was simultaneously struggling to cope with.
Basically, the Internet had
become a victim of its own success. The enormous flows of data generated
by users of the likes
of Facebook and Twitter far overwhelmed the
ability of those companies to make sense of it. There was too much
coming in too fast.
Hadoop helped companies cope with the tsunami - it was, in the words of
Jeff Hammerbacher, an early employee of Facebook, "our tool for
exploiting the unreasonable effectiveness of data."
Before Hadoop, you were at the mercy of your
data. After Hadoop, you were in charge.
You could figure out all kinds
of interesting things. You could recognize patterns in the data and
start to make inferences about what might happen if you made tweaks to
your product.
- What did users do when the interface was adjusted like this?
- What kinds of ads made them more likely to pull out their credit cards?
- What did that batch of millions of Verizon calls reveal about the
  formation of a potential terrorist cell?
Facebook wouldn’t be
able to exploit the insights of its so-called
social graph
without tools like Hadoop.
"Hadoop has become the de facto standard
tool for cost-effectively processing Big Data," says Raymie Stata, who
served as chief technology officer at Yahoo before eventually starting
his own Hadoop-focused start-up,
Altiscale.
And the significance of being able to cheaply process Big
Data, to accurately "measure" what your users are doing, he added, is a
"big deal."
"Once you can measure what’s happening ‘out
there’ - [you can] then use those measurements to understand and
ultimately influence what’s happening out there."
With engineers at multiple companies
recognizing that Hadoop offered solutions to the specific challenges
they faced on a daily basis, Hadoop quickly secured the critical mass of
cross-industry support necessary for an open-source software program to
become an essential part of Internet infrastructure.
Even engineers at Google chipped in, although Hadoop, at its core, was
basically an
attempt to reverse-engineer proprietary Google technology. But that’s
just how the Internet has historically worked.
For decades, so-called
gift economy collaboration, in which the community as a whole benefits
from the freely donated contributions of its members, has been a potent
driver of Internet software evolution.
As I wrote 16 years ago, when chronicling the birth of the Apache Web
server, the success of open source software "testifies to the enduring
vigor of the Internet’s cooperative, distributed approach to solving
problems."
Hadoop, which is at bottom a distributed approach to solving problems,
embodies that philosophy.
So, in a sense, Hadoop’s success was just
the same old story.
But back in the mid-’90s, around the time that one
of the first open source success stories, the Apache Web server, was
taking off, I’m not sure that anyone would have predicted that the
National Security Agency and CIA would end up becoming stalwart
participants in the gift economy.
Even though it makes total sense,
in principle, that the fruits of government-funded software
development should be shared with the general public, there’s still
something cognitively disjunctive about intelligence agencies that
shroud their every activity in great secrecy contributing to projects
built on openness and transparency.
On the one hand, employees of the NSA are appearing at conferences discussing how they have adapted Hadoop
to solve the problems of dealing with
unimaginably huge data sets, but on the other hand, we’re not
supposed to know anything about what they are actually doing with that
data.
The intertwining of the intelligence
agencies with the larger open source software community could hardly be
more incestuous. In 2008, a group of Yahoo employees that eventually
included Doug Cutting formed Cloudera, a start-up designed to
commercialize Hadoop.
The CIA, through its In-Q-Tel (named after James Bond’s Q
character) venture capital arm,
was an early investor in, and customer of, Cloudera.
The NSA built Accumulo, a significant piece of software that works
"on top" of Hadoop and is designed to add sophisticated security
controls managing how data could be accessed, and then promptly donated
that code to the Apache Software Foundation.
Later, a group of NSA software engineers
formed another spinoff company,
Sqrrl, to commercialize Accumulo.
What all this means is that the improvements
to tools that the NSA is making, with the aim of
more efficiently catching terrorists, are propagating into the
private sector where they will be used by Facebook and Netflix and Yahoo
to more accurately target ads or influence our purchasing behavior or
provide us with content algorithmically shaped
to our very specific desires. And vice versa. Innovations and
increased capabilities pioneered by private companies trickle back to
the NSA.
The collective boot-strapping never stops.
Again, in principle, there is nothing
necessarily wrong going on here. There is no one to blame. Some
of the fiercer apologists for unfettered free markets might complain
that government involvement in open source projects unfairly competes
with private sector proprietary businesses, but a much stronger case can
be made that any software development work that is funded by taxpayer
money should by definition be considered freely sharable with
the wider public.
The NSA should probably be applauded for
helping to improve Hadoop.
And if the capabilities unlocked by Hadoop
result in the prevention of some horrific terrorist act, then every
programmer who contributed a line of code to the project justly deserves
some congratulation.
But there’s also an intriguing inversion
occurring here of what, for better or worse, we might call the
purpose of the Internet. The Internet was initially created by the
U.S. government to facilitate the sharing of information between
geographically separate research centers.
The Internet took off in the
mid-’90s in large part because the general public recognized it as a
phenomenal tool for sharing information with each other. The fact that
so much of the Internet’s infrastructure was also built from code that
was freely shared seemed like a pleasing match of form and function.
Free software and open-source software
evolution is frequently driven not so much by hope for financial gain
but by individuals looking to solve their immediate engineering
problems.
Over time, on the Internet at large, one of those problems has
turned out to be the gnarly challenge of how to manage all the data
created by all those people sharing so promiscuously with each other.
Hadoop can justly be seen as the natural response to all that
promiscuous sharing. And it certainly helped solve the problems faced by
engineers at Facebook and elsewhere.
But what ended up getting enabled by
the success of Hadoop is something significantly different than good old
peer-to-peer sharing.
The ability to make sense out of petabytes of data
isn’t necessarily useful to you or me. But it’s god’s gift to the
profit-minded corporations and terrorist-seeking intelligence agencies
seeking to leverage the data we generate for their own purposes, to
measure our behavior and ultimately to influence it.
That could mean
Netflix figuring out exactly what combination of plot twists and acting
talent proves irresistible to streaming video watchers or Facebook
figuring out exactly how to stock our newsfeeds with advertisements that
generate acceptable click-through or Twitter knowing exactly where we
are on the surface of the planet so it can pop up a sponsored tweet
pushing a coupon for a happy hour at the bar just down the street - or
the NSA spotting a peculiar pattern of pressure cooker purchases.
This
is no longer about sharing information with each other; it’s about
manipulation, control and punishment. It’s about keeping stock prices
up. We’re a long, long way here from the ideal gift economy, where
everyone brings their home-cooked delicacy to the potlatch. We’ve
arrived at a destination where the tools offer more power to them
than to us.
I posed a version of this analysis to
Michael Cafarella, one of the original authors of Hadoop, now a computer
scientist at the University of Michigan.
He conceded that "there’s a certain irony that the open
ideas of open source have enabled the construction of systems that
can undermine openness so substantially."
But Raymie Stata, who has been closely
involved with the growth of Hadoop for the last seven years, warned
against "conflating ‘open source software’ with ‘Open Society.’"
"Everyone involved with Hadoop in the early
days certainly did believe that Hadoop, as a piece of open source
software, would make the world a better place.
I can’t say, back then,
that we saw Hadoop moving from cyberspace to the real world, but we did
recognize that it would become foundational to building Internet
applications of the future, and we wanted to contribute to advancing
that agenda.
"But individuals who find common ground in
contributing to open source projects do not, as a whole, share beliefs
on what constitutes the ideal ‘Open Society,’" said Stata.
"Is using Big Data to make inferences
about people a Bad Thing at all, no matter who does it? Or is it no
big deal? Or does it depend on who’s doing it, and for what reason
(and with what transparency)? Should we be more worried about Big
Business, or Big Government?"
"I guess in some ways this incident is
evidence that it’s hard to encode ideals in a piece of software," said
Cafarella.
"The right way to do that is via legislation."
Cafarella’s point is hard to dispute.
Brian Behlendorf, one of the founders of the
Apache Software Foundation, told
me that at one juncture, contributors to the various software projects
managed by Apache had argued over whether the license that determined
the rules for how their code could be shared should include restrictions
against organizations using that code for purposes deemed morally or
ethically unacceptable by the open source software programmer community.
But it was relatively quickly determined that to attempt such
restrictions would open up an impossible-to-resolve, subjective can of
worms. Society at large has to figure out what limits it wants to put on
the surveillance state, on what either Facebook or the NSA is allowed to
do.
It’s also important to acknowledge that as
users of online services, we benefit in many ways from our
instant-gratification, access-to-everything, always-on lives.
But still:
- When we first started to log on, did we realize what the tradeoffs
  would be?
- Did we know that we were entering the Panopticon?
- That we would be making it substantially easier than ever before for
  governments and businesses to track our behavior and monitor our every
  whim?
Behlendorf says we kind of did. He recalls
his days, fresh out of college in 1995, working for HotWired, Wired
magazine’s first foray into online publishing.
AT&T was running an ad on HotWired, under the theme
"Imagine the Future," that pictured an arm
with a "wrist-watch phone" on it.
"Someone printed it out," said Behlendorf,
"put it up on the wall, and wrote in black marker over the top of
the ad, ‘NSA primate tracking device.’"
And guess what? We went ahead and built it.