This article was originally published with the title "Training Ground" in SA Special Editions Vol. 34 No. 1s (March 2025), p. 48.
Tech companies train their generative AI models on vast swathes of the Internet - and there's no real way to stop them.
These machine-learning models are capable of pumping out images and text only because they've been trained on mountains of real people's creative work, much of it copyrighted.
Major AI developers, including OpenAI, Meta and Stability AI, now face multiple lawsuits over this practice. Such legal claims are supported by independent analyses.
And training datasets for these models include more than books.
In the rush to build and train ever larger AI models, developers have swept up much of the searchable Internet. This approach not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online.
In addition, it means that supposedly neutral models could be trained on biased data.
A lack of corporate transparency makes it difficult to figure out exactly where companies are getting their training data - but Scientific American spoke with some AI experts who have a general idea.
But these companies don't assemble their training data by hand-picking individual sources, says Emily M. Bender, a linguist who studies computational linguistics and language technology at the University of Washington.
Instead developers amass their training sets by using automated tools that catalog and extract data from the Internet. Web "crawlers" travel from link to link, indexing the location of information in a database, and web "scrapers" download and extract that same information.
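To make that distinction concrete, here is a minimal sketch of a crawler and scraper working together, written in Python with the third-party requests and beautifulsoup4 packages. The seed URL, page limit and function name are placeholders for illustration; real collection pipelines work the same way in principle but at vastly larger scale.

```python
# Minimal illustration of a crawl-and-scrape loop (not any company's actual pipeline).
# Requires the third-party "requests" and "beautifulsoup4" packages.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_and_scrape(seed_url: str, max_pages: int = 10) -> dict[str, str]:
    """Follow links breadth-first ("crawling") and store each page's text ("scraping")."""
    seen, queue, corpus = set(), deque([seed_url]), {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable or malformed links
        soup = BeautifulSoup(page.text, "html.parser")
        corpus[url] = soup.get_text(" ", strip=True)      # the "scraping" step
        for link in soup.find_all("a", href=True):         # the "crawling" step
            queue.append(urljoin(url, link["href"]))
    return corpus

# Example (hypothetical seed URL):
# pages = crawl_and_scrape("https://example.com", max_pages=5)
```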
Other companies, however, turn to existing resources such as Common Crawl, a nonprofit repository of scraped web data, or LAION, a nonprofit dataset that pairs links to images with their text captions.
Neither Common Crawl nor LAION responded to requests for comment.
Companies that want to use LAION as an AI resource (it was part of the training set for image generator Stable Diffusion, Dodge says) can follow these links but must download the content themselves.
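As a rough illustration of what "following the links" involves, the sketch below reads a hypothetical index file of image URLs and captions - the general shape of a LAION-style release, though the real column names and file formats differ - and downloads each linked image. It is an assumption-laden example, not LAION's own tooling.

```python
# Sketch: download images listed in an index of URL/caption pairs.
# The CSV layout and column names ("url", "caption") are assumptions for illustration.
import csv
import pathlib

import requests

def download_linked_images(index_csv: str, out_dir: str = "images") -> None:
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    with open(index_csv, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            try:
                resp = requests.get(row["url"], timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # dead or blocked links are simply skipped
            (pathlib.Path(out_dir) / f"{i:06d}.jpg").write_bytes(resp.content)
            # The caption (row["caption"]) would be stored alongside the image
            # as the training label for a text-to-image model.
```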
Data that are viewable in a search engine or without logging into a site, such as a public LinkedIn profile, might be vacuumed up, Dodge says.
Then, he adds, scrapers sweep up just about anything that is freely available on the open web, including blogs, personal web pages and company sites.
This category includes anything on the popular photograph-sharing site Flickr, online marketplaces, voter-registration databases, government web pages, Wikipedia, Reddit, research repositories, news outlets and academic institutions.
Plus, there are pirated-content compilations and web archives, which often contain data that have since been removed from their original location on the web.
And scraped databases do not go away.
Some data crawlers and scrapers are even able to get past paywalls (including Scientific American's) by disguising themselves behind paid accounts, says Ben Zhao, a computer scientist at the University of Chicago.
Paywalled news sites were among the top data sources included in Google's C4 database (used to train Google's LLM T5 and Meta's LLaMA), according to a joint analysis by the Washington Post and the Allen Institute for AI.
Zhao points to one particularly striking example where an artist discovered that a private diagnostic medical image of her was included in the LAION database.
Reporting from Ars Technica confirmed the artist's account and that the same dataset contained medical-record photographs of thousands of other people as well.
It's impossible to know exactly how these images ended up being included in LAION, but Zhao points out that data get misplaced, privacy settings are often lax, and leaks and breaches are common.
Information not intended for the public Internet ends up there all the time.
But beyond these acknowledgments, companies have become increasingly cagey about revealing details on their datasets in recent years.
Meta offered a general data breakdown in its technical paper on the first version of LLaMA, but the release of LLaMA 2 a few months later included far less information.
Google, too, didn't specify the data sources for its PaLM 2 AI model beyond saying that much more data were used to train PaLM 2 than to train the original version of PaLM.
OpenAI wrote that it would not disclose any details on its training dataset or method for GPT-4, citing competition as a chief concern.
Why are dodgy training data a problem?
Many widely used generative AI models have blocks meant to prevent them from sharing identifying information about individuals, but researchers have repeatedly demonstrated ways to get around these restrictions.
For creative workers, Zhao says, a major worry is that generative models will produce images or text that closely mimic a particular person's work or style. But without transparency about data sources, it's difficult to blame such outputs on the AI's training; after all, it could be coincidentally "hallucinating" the problematic material.
Datasets such as Common Crawl, for instance, include white supremacist websites and hate speech. Pornography is also rampant online, and as a result, points out Meredith Broussard, a data journalism professor at New York University, AI image generators tend to produce sexualized images of women.
Bender echoes this concern and observes that the bias goes even deeper - down to who can post content to the Internet in the first place.
Online harassment compounds the problem by forcing marginalized groups out of some online spaces, Bender adds.
This exclusion means data scraped from the Internet fail to represent the full diversity of "the real world."
How can you protect your data from AI?
Zhao and his colleagues have developed a tool called Glaze, which can be used to make images effectively unreadable to AI models. But the researchers have been able to test its efficacy with only a subset of AI image generators, and its uses are limited.
For one thing, it can protect only images that haven't previously been posted online. Anything else might have already been vacuumed up into web scrapes and training datasets.
As for text, no such tool exists.
Website owners can post machine-readable notices - entries in a site's robots.txt file, for example - asking crawlers not to collect their content. It's up to the scraper developer, however, to opt to abide by these notices.
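For illustration, here is a sketch of what honoring such a notice could look like, using Python's standard urllib.robotparser to consult a site's robots.txt before fetching a page. The bot name and URL are hypothetical, and the broader point stands: nothing compels a scraper to run this check.

```python
# Sketch of a voluntary check: a well-behaved scraper consults a site's robots.txt
# before fetching a page. Compliance is entirely up to the scraper's developer.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the site's robots.txt permits this user agent to fetch the URL."""
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = RobotFileParser(urljoin(root, "/robots.txt"))
    try:
        parser.read()              # download and parse the notice
    except OSError:
        return True                # no robots.txt reachable; many crawlers simply proceed
    return parser.can_fetch(user_agent, url)

# Example (hypothetical URL and bot name):
# if allowed_to_fetch("https://example.com/article", "ExampleBot"):
#     ...fetch and scrape the page...
```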
So far, however, AI companies have pushed back on such requests by claiming the provenance of the data can't be proved - or by ignoring the requests altogether - says Jennifer King, a privacy and data researcher at Stanford University.
Currently there are no significant AI policies or legal rulings that would require tech companies to take such actions - and that means these businesses have no incentive to go back to the drawing board.