November 2008, from EndTheLie Website
Google is one of about four search engines that matter.
There are many more than four engines, but only
about four have the technology to crawl much of the web on a regular basis.
As of July 2003, Yahoo owned:
- Overture
- Alltheweb
- AltaVista
- Inktomi
...and finally dumped Google in February 2004.
Everything needed to turn Yahoo
into a major search engine was now under Yahoo's roof.
It is still possible that Yahoo will shoot themselves in the foot with all
of this firepower - monetizing everything appears to be high on their
agenda. But so far, after only a year, Yahoo has shown that their
main index search results are on a par with Google's. This is true despite
the fact that Yahoo has slipped some pay-per-click links into the
main index.
One reason for Yahoo's success is that Google's
main index, though free from paid results, has declined considerably since
early 2003. Amazingly, there is on average only a 20 percent overlap between
Yahoo's first 100 results and Google's first 100 results for the same search
- and still, Yahoo is just as good as Google.
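The overlap figure is simple set arithmetic. As a minimal sketch of how such a comparison could be computed (the function name and the sample result lists are hypothetical, and this is not necessarily how the original measurement was made):

```python
def overlap_percent(results_a, results_b, depth=100):
    """Percentage of the first `depth` results that two engines share.

    results_a, results_b: ranked lists of URLs (hypothetical inputs).
    depth: how many top results to compare; must be positive.
    """
    top_a = set(results_a[:depth])
    top_b = set(results_b[:depth])
    # URLs that appear in both top lists, regardless of position.
    shared = top_a & top_b
    return 100.0 * len(shared) / depth


# Toy example with five results per engine and one URL in common.
engine_one = ["a.example", "b.example", "c.example", "d.example", "e.example"]
engine_two = ["a.example", "v.example", "w.example", "x.example", "y.example"]
print(overlap_percent(engine_one, engine_two, depth=5))
```

Note that this measure ignores ranking position: a URL ranked first by one engine and hundredth by the other still counts as shared.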
These days there is so little room at the top of
the search results heap that any combination of algorithms will produce
acceptable results. The main difference now is in the depth of the crawl.
Microsoft recently developed their own engine because they found themselves
squeezed between the advertising engine of Overture and the search engine Inktomi - both of which became Yahoo property. In 2003 Microsoft began
experimenting with their own crawler. Their new engine was launched in early
2005.
If Microsoft puts their greed on a back burner for a few years, by
doing deep crawls and presenting a clean interface, they could do to Google
what they did to Netscape.
There is no "secret sauce" at Google - we now
believe it was all hype from the very beginning. (To the extent that there
ever was a secret sauce, the recipe is now known by countless ecommerce
spammers, which makes it a liability rather than an asset.)
Thousands of engineers in hundreds of companies
know how to design search engines. The only real questions are whether you
can commit the resources for a deep, consistent crawl of the web, and how
aggressively you want to use your search engine to make money.
That gives us Google, Yahoo, and Microsoft.
The last one worth watching is Teoma/AskJeeves. Their search technology is good, and they seem serious
about expanding their crawl. It remains to be seen how deeply and
consistently they will be able to crawl websites with thousands of pages.
Google is easily top dog. They provide about 75 percent of the external
referrals for most websites. There is no point in putting up a website apart
from Google. It's do or die with Google. If we're all very lucky, one of the
other three will soon offer some serious competition. If we're not lucky, we
will be uploading our websites to Google's servers by then, much like the
bloggers do at blogger.com (which was bought by Google in 2003).
It would mean the end of the web as we know it.
It is worthwhile to understand the pressures that the average, independent
webmaster is under. And given that Google is so dominant, it's important to
understand the pressures that are being brought to bear on Google, Inc. It
does not take too much imagination to recognize that there's a struggle
going on for the soul of the web, and the focal point of this struggle is
Google itself.
At one level, it's a struggle for advertising revenue. The pundits look at
only this level, and are unanimous that the only advertising model on the
web with any sort of future is one where little ads appear after being
triggered by keyword searches, or by the non-ad content of a web page. For
example, a search for Google Watch may show some ads on the right side of
the screen for wrist watches.
While the technique doesn't work for this
example, often it serves its purpose.
There is only so much pixeled real estate that
the average user can be expected to survey for a given search. Today up to
half of each screen is dedicated to paid ads on Google, as compared to the
ad-free original Google. Everyone wants a piece of this new wave in web
advertising, and Google is making a lot of money.
Unfortunately, early evidence suggests that Yahoo is less interested in pure
search algorithms than in acquiring market share in a pay-for-placement
and/or pay-for-inclusion revenue stream. The same may be true for Microsoft.
Even Google, dazzled by the sudden income from advertising, must be
wondering why they go to all that trouble and expense to crawl the
noncommercial sector.
Those public-sector sites, such as the org, edu
and gov domains, do not provide direct income, even though the web would be
unattractive without them. All the excitement over a revived online ad
market, pushed by pundits hoping for another dot-com gold rush, is beginning
to look like the days when AltaVista decided that portals were the Next Big
Thing.
That notion caused AltaVista to lose interest in
improving their crawling and searching - which is how Google succeeded in
the first place.
There has been almost no interest in establishing search engines that
specialize in public-sector websites.
- Where is the Library of Congress?
- Where are the millions of dollars doled out by the Ford Foundation?
- How about the United Nations?
- Why can't some enlightened European entity pick up the slack?
Everyone is asleep, while the Internet is
getting spammed to death.
At another level, it's a struggle over who will have the predominant
influence over the massive amounts of user data that Google collects. In the
past, discussions about privacy issues and the web have been about consumer
protection.
That continues to be of interest, but since 9/11
there is a new threat to privacy - the federal 'government.' Google has not
shown any inclination to declare for the rights of its users across the
globe, as opposed to the rights of the spies in Washington who would love
to have access to Google's user data.
Much of the struggle at this new level is unarticulated. For one thing, the
spies in Washington don't talk about it. Congress has given them new powers,
without debating the issues. Google, Inc. itself never comments about things
that matter. The struggle recognized by Google Watch has to do with the
clash of real forces, but right now all we can say is that potentially this
struggle could manifest itself in Google's boardroom.
The privacy struggle, which includes both the old issue of consumer
protection and this new issue of government surveillance, means that the
question of how Google treats the data it collects from users becomes
critical. Given that Google is so central to the web, whatever attitude it
takes toward privacy has massive implications for the rest of the web in
general, and for other search engines in particular.
Call it class warfare, if you like. Because that brings up the other major
gripe that Google Watch has with Google.
That's the PageRank problem - the fact that
Google's primary ranking algorithm has less to do with the quality of web
pages than it has to do with the "power popularity" of web pages. Their
approach to ranking is anti-democratic, in that already-powerful pages are
mathematically granted extra power to anoint other pages as powerful.
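The "power to anoint" can be seen in a small sketch of the textbook form of PageRank. This is only the simplified published idea, not Google's production ranking; the link graph, the page names, the damping value of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links: dict mapping a page name to the list of pages it links to
    (a hypothetical toy graph, not real crawl data).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its own rank on to the pages it links to,
                # so a link from a high-ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# "a" and "d" each have exactly one inbound link, but the link to "a"
# comes from the heavily linked "hub", while the link to "d" comes from
# a page that nobody links to.
toy_links = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "small": ["d"],
    "d": [],
}
ranks = pagerank(toy_links)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

In this toy graph, "a" ends up ranked well above "d" even though each has a single inbound link, because the link to "a" comes from an already-powerful page.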
It's not that we believe Google is evil. What we believe is that Google,
Inc. is at a fork in the road, and they have some big decisions to make.
This Google Watch site is trying to articulate and publicize the situation
at Google, and encourage more scrutiny of their operations. By doing this,
we hope to play a small part in maintaining the web as an information tool
that is more useful for the masses, than it is for the elites.
That's why we and over 500 others nominated Google for a Big Brother
award in 2003.
The nine points we raised in connection with
this nomination necessarily focused on privacy issues:
- Google's immortal cookie:
Google was the first search
engine to use a cookie that expires in 2038. This was at a time when
federal websites were prohibited from using persistent cookies
altogether. Now it's years later, and immortal cookies are
commonplace among search engines; Google set the standard because no
one bothered to challenge them. This cookie places a unique ID
number on your hard disk. Anytime you land on a Google page, you get
a Google cookie if you don't already have one. If you have one, they
read and record your unique ID number.
- Google records everything they can:
For all searches they record the
cookie ID, your IP address, the time and date, your search
terms, and your browser configuration. Increasingly, Google is
customizing results based on your IP address. This is referred to in
the industry as "IP delivery based on geolocation."
- Google retains all data indefinitely:
Google has no data retention
policies. There is evidence that they are able to easily access all
the user information they collect and save.
- Google won't say why they need this data:
Inquiries to Google about their
privacy policies are ignored. When the New York Times (2002-11-28)
asked Sergey Brin about whether Google ever gets subpoenaed for this
information, he had no comment.
- Google hires spooks:
Keyhole, Inc. was supported with funds from the CIA.
They developed a database of spy-in-the-sky
images from all over the world. Google acquired Keyhole in 2004, and
would like to hire more people with security clearances, so that
they can peddle their corporate assets to the spooks in Washington.
- Google's toolbar is spyware:
With the advanced features
enabled, Google's free toolbar for Explorer phones home with every
page you surf, and yes, it reads your cookie too. Their privacy
policy confesses this, but that's only because Alexa lost a
class-action lawsuit when their toolbar did the same thing, and
their privacy policy failed to explain this. Worse yet, Google's
toolbar updates to new versions quietly, and without asking. This
means that if you have the toolbar installed, Google essentially has
complete access to your hard disk every time you connect to Google
(which is many times a day). Most software vendors, and even
Microsoft, ask if you'd like an updated version. But not Google. Any
software that updates automatically presents a massive security
risk.
- Google's cache copy is illegal:
Judging from Ninth Circuit
precedent on the application of U.S. copyright laws to the Internet,
Google's cache copy appears to be illegal. The only way a webmaster
can avoid having his site cached on Google is to put a "noarchive"
meta tag in the head section of every page on his site. Surfers like the
cache, but webmasters don't. Many webmasters have deleted
questionable material from their sites, only to discover later that
the problem pages live merrily on in Google's cache. The cache copy
should be "opt-in" for webmasters, not "opt-out."
- Google is not your friend:
By now Google enjoys a 75 percent
monopoly for all external referrals to most websites. Webmasters
cannot avoid seeking Google's approval these days, assuming they
want to increase traffic to their site. If they try to take
advantage of some of the known weaknesses in Google's semi-secret
algorithms, they may find themselves penalized by Google, and their
traffic disappears. There are no detailed, published standards
issued by Google, and there is no appeal process for penalized
sites. Google is completely unaccountable. Most of the time Google
doesn't even answer email from webmasters.
- Google is a privacy time bomb:
With 200 million searches per
day, most from outside the U.S., Google amounts to a privacy
disaster waiting to happen. Those newly-commissioned data-mining
bureaucrats in Washington can only dream about the sort of slick
efficiency that Google has already achieved.
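The "noarchive" opt-out mentioned in the cache item above is a robots meta tag that has to appear on every page. As a minimal sketch, a webmaster could check a page for it with Python's standard html.parser module; the class and function names here are hypothetical, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without a value.
        attr = {name: (value or "") for name, value in attrs}
        # "robots" addresses all crawlers; "googlebot" addresses Google only.
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def opts_out_of_cache(html):
    """Return True if the page carries a noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


page_with_tag = (
    '<html><head><meta name="robots" content="noarchive"></head>'
    "<body>Hello</body></html>"
)
page_without_tag = "<html><head><title>Hello</title></head></html>"
print(opts_out_of_cache(page_with_tag))
print(opts_out_of_cache(page_without_tag))
```

Because the directive is per page, a check like this would have to run against every page on a site, which is the burden the "opt-out" complaint describes.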