Google and Privacy

I just came across a blog post about why Google is the biggest threat to Americans' privacy today, describing some testimony that Scott Cleland gave to the House of Representatives recently. It is disappointing to me that, among the very real privacy concerns, he raises some that aren't privacy concerns at all and some that are misleading.

I'm probably a bit more knowledgeable on the topic than most... After all, I actually worked there for a while, and had a job that potentially had me invading people's privacy: I was in charge of the Google log analysis and storage system. One of the things that frustrated me at the time was Google's data retention policy (at the time I worked at Google, there really wasn't one), and the sheer amount of data that Google collects is probably the single biggest theoretical threat to privacy out there.

But here, let's present some of the claims from the testimony:

The fact that Google’s web “crawlers” are the world’s most pervasive and invasive, Google indiscriminately searches websites for whatever it can find, and automatically assumes if their crawlers can find it, it must be “public” information. This indiscriminate web crawling has resulted in Google exposing private information like social security numbers, as Google did in making hundreds of California university students’ social security numbers public -- as reported by the Sacramento Bee (3-7-07.)

I don't know that Google's web crawlers are the world's most pervasive or invasive. As a matter of fact, publishers of information have a mechanism to prevent Google's crawler from accessing content they don't want to publish: it's called the Robots Exclusion Protocol (the robots.txt file), and I believe it was already a standard back in the days of AltaVista.
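To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler consults a site's robots.txt before fetching a page. It uses Python's standard urllib.robotparser module; the robots.txt contents, the example.edu URLs, and the is_allowed helper are made up for illustration and say nothing about how Google's crawler is actually written.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to keep crawlers
# out of a directory it does not want indexed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.edu/index.html",
        "https://example.edu/private/students.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

The point is that the opt-out is a two-line text file on the publisher's side; a crawler that honors the protocol never requests the excluded pages.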

If "indiscriminate web crawling" ends up with the dissemination of private information, such as the case of student social security numbers, isn't the major breach of privacy the one that was caused by the original publication of that data? There were other cases in the past where credit card numbers were exposed, and Google explicitly added code to the search engine to try and prevent people from finding lists of credit card numbers. This isn't something that gets a lot of press when Google gets slammed for privacy concerns, but it's a sign of how Google actually takes this sort of thing seriously: shortly after realizing that people were using the search engine for finding credit cards, those searches were rendered ineffective.
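I can't describe what Google's code actually did, but as a rough sketch of how one might recognize that a piece of text contains something shaped like a credit card number, the standard Luhn checksum is an obvious starting point. Everything below (the regular expression, the function names, the 13 to 19 digit range) is my own assumption for illustration, not Google's filter.

```python
import re

# Assumption: a candidate is 13 to 19 digits, optionally separated
# by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_card_number(text: str) -> bool:
    """Return True if text contains a digit run that passes the Luhn check."""
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            return True
    return False
```

A check like this has false positives (about one in ten random digit strings passes), so a real filter would need more context than the checksum alone.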

So this is perhaps one way in which Google can cause privacy problems, but not by creating those problems in the first place. This example only shows how Google can intensify existing privacy problems. And the good news about this is that once news gets out about such a breach, the original privacy breach is usually patched, preventing some other party from illicitly doing it. (Those social security numbers were up there before Google found them, and might still be up there if someone hadn't discovered them on Google.) So while this is an example of how privacy breaches can (and will) get magnified by Google, it's also an example of how privacy breaches get stopped by Google. And once the story got publicity, many other organizations that had similar information examined their web servers to try and prevent similar data breaches.

Let me illustrate this cultural disdain for privacy with three high-profile examples of Google proceeding full-speed-ahead with “beta” releases -- without regard to privacy implications of their actions.

  • Google introduced gmail, which enables Google to automatically read the content of users’ private gmail messages in order to send them “relevant” advertising – without meaningful internal privacy review. This caused a widely reported public uproar over users’ privacy being abused.
  • Google introduced Google Earth, which exposed the roof tops of the White House, public buildings and military installations, without meaningful internal review of the privacy, safety, or national security implications. The uproar that ensued over this suggests Google learned little from the gmail incident about the importance of internal review to address external concerns like privacy.
  • Google then introduced StreetView, which is video of people’s homes, apartments and neighborhoods, without meaningful internal review of the privacy or safety concerns involved. The uproar over this invasion of privacy is so significant that Google is very secretive about where and when Google’s “spycars” will be videoing a particular neighborhood in order to protect the safety of the Google drivers from irate residents.
    • The inescapable conclusion from this pattern of behavior is that Google’s culture exhibits a fundamental and sustained disdain for privacy.

Here, Cleland is discussing Google's culture of disdain for privacy. I find this characterization very interesting, because I think an accurate picture is much more complicated, and I don't think the examples he gives are very good ones.

Regarding Gmail, it's interesting that Cleland states that there was no "meaningful internal privacy review." Gmail got a lot of internal discussion, much of it about serving ads alongside e-mail. So there was plenty of meaningful internal review, and the real question is whether or not Gmail's advertising is a privacy breach.

Much of the internal review was done by engineers. I used the internal Gmail server exclusively for my work e-mail for months before Gmail's launch. And as an engineer, I understand that current e-mail systems, especially of the scale Google was trying to design, scan through e-mails to try and detect viruses and spam. Web based mail servers parse messages and reformat them to make them more suitable for display through a browser. Google's Gmail ad server is a similar sort of software-based scan, using technology similar to what a spam detector might use, but instead of identifying topics that are likely to be spam, identifies topics that are likely to have advertising potential and pushes out those ads.

Now, there's major potential for something that would be a horrible invasion of privacy: storing data about the subject of people's e-mails to build up a profile on what sorts of topics they're receiving e-mail on (and therefore likely interested in.) But Google doesn't do that. So this isn't a privacy breach, although we do have to trust Google not to collect this sort of data in the future.
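To illustrate the distinction, here is a toy sketch of a purely stateless scan: it looks at one message, picks ads by keyword, and keeps nothing afterwards. The ad inventory, the keyword matching, and the pick_ads function are all invented for this example; I'm not describing how Gmail's ad server is implemented.

```python
import re
from collections import Counter

# Hypothetical ad inventory: keyword -> ad text.
AD_INVENTORY = {
    "flights": "Cheap flights to Hawaii",
    "camera": "Digital camera sale",
}


def pick_ads(message_body: str, inventory=AD_INVENTORY, limit: int = 2) -> list:
    """Pick ads whose keyword appears in this one message.

    Nothing about the message or the reader is written anywhere;
    the word counts are discarded when the function returns.
    """
    words = Counter(re.findall(r"[a-z]+", message_body.lower()))
    hits = [(words[k], k) for k in inventory if words[k] > 0]
    hits.sort(reverse=True)  # most frequently mentioned keyword first
    return [inventory[k] for _, k in hits[:limit]]
```

The privacy-invading version would differ by one step: writing those keyword hits to a per-user record instead of throwing them away.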

Google Earth and Google Street View are interesting examples of potential breaches of privacy, but all of these photos were taken from publicly accessible areas. As far as I understand it, the law doesn't consider that an invasion of privacy. From where I'm sitting right now, I can take a picture of the San Francisco city skyline, and even though I might inadvertently include someone's open window in my shot, it shouldn't be considered a breach of privacy, because by leaving your window open, you're losing your reasonable expectation of privacy.

And is exposing the rooftops of government buildings a "privacy" issue? Not at all. Perhaps it's relevant to call it a national security issue, but I fail to see how it affects people's privacy.

All of the pictures that Google uses for Google Earth are commercially available. Are the companies selling these pictures to Google not guilty of breach of privacy because they're charging Google for it? Or is the fact that Google makes it so widely available at no cost the thing that turns it into a privacy breach? If there's any breach at all, it's being committed by Google's suppliers, not Google.

You can get similar pictures to Google's Street View by driving around and shooting them yourself on public property. It seems unreasonable to call this an invasion of privacy.

Another trust undermining aspect of Google’s business is the rampancy of fraud in Google’s model.

  • Most people are not aware that click-search is one of the most fraud-prone industries in America. Click Forensics, which is the leading industry tracker of web fraud, estimates that 28% of all Internet clicks are fraudulent.
  • The dirty little secret here is that the gross-revenue business model for search, which was pioneered by Google, makes money off of fraudulent clicks. In other words, Google’s gross revenue model does not have a financial incentive to be honest.
  • It is hard to imagine another legal industry in America that would tolerate a 28% gross fraud rate!

This is a bit misleading. First, notice the "28% of all Internet clicks are fraudulent" claim. (Does this mean Internet advertising clicks? I'll give that the benefit of the doubt, but it's a bit vague.) There's no real investigation of what this means, and Google is pretty aggressive in identifying fraudulent clicks.

Does this mean that 28% of what Google is charging for is fraudulent? Well, for one thing, that number is for the Internet as a whole, not just Google, but we can even give that the benefit of the doubt. Google's business is not as dependent on fraud as this line of reasoning might have you believe. The simple explanation? Google doesn't charge for all of those clicks.

Many of those clicks are "invalid" in the sense that it's easy to detect that they're not coming from a valid user. And even many of the ones that do get through are later detected as fraud; a statistic calculated in December 2006 estimated that what remained was as low as 0.2%.

There's also a market-based argument, which I've heard quite a few Google engineers explain, that the amount of money Google makes is independent of the amount of fraud in the system. I don't think it's worth going into the argument here, and since I don't know how much of Google's auction system is publicly disclosed, I'm not sure how much detail I could give. But suffice it to say that if that argument is correct (or even if it's incorrect, but Google management believes it), there's no benefit to Google in encouraging fraud, and there actually is a good-customer-service, good-PR benefit to discouraging fraud. While the gross revenue model might not incentivize honesty, it's not clear that it discourages it, and there are other factors encouraging honesty.

Google runs its not-for-profit Google.org as a for-profit division of Google, when every other corporation in America abides by the clear separation of for-profit and not-for-profit entities to avoid even the appearance of tax evasion or impropriety.

This one really mystifies me, because it seems to be mixing up a couple of things, some of which might be relevant, and some of which definitely aren't. For one thing, there's certainly no tax evasion going on, because Google.org explicitly chose not to incorporate as a 501(c)(3), so that it could try a model of charity that wasn't compatible with the IRS's definition of one. So they pay taxes. Even the implication that some of this might involve tax shenanigans seems a bit off-base to me.

Given that the usual reasons for separating a non-taxed entity from a taxed entity aren't even present here, since there is no exempt entity, it's not clear to me what "improprieties" might exist. There certainly are other corporations out there that think some aspect of what they're doing is a social good, even though the tax code doesn't agree and grant them tax-exempt status.

I've probably rambled on long enough... And I do agree with the main point Cleland was making in his post: Google is a big threat to our privacy (at least potentially). I'm not sure I think it's the biggest threat, because I think that government programs to automate spying on the citizenry might be a bigger threat, and it appears that the Bush administration is trying to engage in such things, so I think there's a convincing argument that Google's got a competitor when it comes to claims of who's the biggest potential threat...

But if you're going to make an argument about Google's privacy breaches, you don't need to resort to non sequiturs and exaggerations. Stick to the truth: they have a history of being secretive about the data they keep and how long they keep it, and they keep more data (and more different types of data, and more data that is theoretically linkable into a user profile) than any other corporation in the history of the world.
