
Generative AI Models Are Sucking Up Data from All Over the Internet, Yours Included

In the rush to build and train ever larger AI models, developers have swept up much of the searchable Internet, quite possibly including some of your own public data—and potentially some of your private data as well.


October 23, 2023



By Sophie Bushwick, Lauren Leffer, Tulika Bose & Elah Feder edited by Jeffery DelViscio

[Illustration of a Bohr atom model spinning around the words Science Quickly with various science and medicine related icons around the text]

Apple | Spotify | YouTube | RSS

Sophie Bushwick: To train a large artificial intelligence model, you need lots of text and images created by actual humans. As the AI boom continues, it's becoming clearer that some of this data is coming from copyrighted sources. Now writers and artists are filing a spate of lawsuits to challenge how AI developers are using their work.

Lauren Leffer: But it's not just published authors and visual artists that should care about how generative AI is being trained. If you're listening to this podcast, you might want to take notice, too. I'm Lauren Leffer, the technology reporting fellow at Scientific American.

Bushwick: And I'm Sophie Bushwick, tech editor at Scientific American. You're listening to Tech, Quickly, the digital data diving version of Scientific American's Science, Quickly podcast.




So, Lauren, people often say that generative AI is trained on the whole Internet, but it seems like there's not a lot of clarity on what that means. When this came up in the office, lots of our colleagues had questions.

Leffer: People were asking about their individual social media profiles, password-protected content, old blogs, all sorts of stuff. It's hard to wrap your head around what online data means when, as Emily M. Bender, a computational linguist at University of Washington, told me, quote, “There's no one place where you can download the Internet.”

Bushwick: So let's dig into it. How are these AI companies getting their data?

Leffer: Well, it's done through automated programs called web crawlers and web scrapers. This is the same sort of technology that's long been used to build search engines. You can think of web crawlers like digital spiders moving along silk strands from URL to URL, cataloging the location of everything they come across.

Bushwick: Happy Halloween to us.

Leffer: Exactly. Spooky spiders on the Internet. Then web scrapers go in and download all that cataloged information.
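[The crawl-then-scrape loop Lauren describes can be sketched in a few lines of Python. This is a toy illustration over an in-memory "web," not any company's actual pipeline; the site and page names are invented for the example:]

```python
from collections import deque
from html.parser import HTMLParser

# Toy "web": URL -> HTML, standing in for real pages a crawler would fetch.
PAGES = {
    "site.example/": '<a href="site.example/about">About</a> Welcome!',
    "site.example/about": '<a href="site.example/">Home</a> We make things.',
    "site.example/private": "Login required.",  # never linked, so never found
}

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

def crawl(start):
    """Breadth-first crawl: follow links outward, downloading each page found."""
    seen, queue, corpus = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue
        corpus[url] = html  # the "scrape": keep a copy of what the crawler located
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus

corpus = crawl("site.example/")
```

Note that only pages reachable by following links get collected; the unlinked page is never seen, much as content behind a login tends to stay out of open crawls.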

Bushwick: And these tools are easily accessible.

Leffer: Right. There are a few different open-access web crawlers out there. For instance, there's one called Common Crawl, which we know OpenAI used to gather training data for at least one iteration of the large language model that powers ChatGPT.

Bushwick: What do you mean? At least one?

Leffer: Yeah. So the company, like many of its big tech peers, has gotten less transparent about training data over time. When OpenAI was developing GPT-3, it explained in a paper what it was using to train the model and even how it approached filtering that data. But with the release of GPT-3.5 and GPT-4, OpenAI offered far less information.

Bushwick: How much less are we talking?

Leffer: A lot less, almost none. The company's most recent technical report offers literally no details about the training process or the data used. OpenAI even acknowledges this directly in the paper, writing that "given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture, hardware, training compute, dataset construction, training method, or similar."

Bushwick: Wow. Okay, so we don't really have any information from the company on what fed the most recent version of ChatGPT.

Leffer: Right. But that doesn't mean we're completely in the dark. The largest sources of data likely stayed pretty consistent between GPT-3 and GPT-4 because it's really hard to find totally new data sources big enough to build generative AI models. Developers are trying to get more data, not less. GPT-4 probably relied, in part, on Common Crawl, too.

Bushwick: Okay, so Common Crawl and web crawlers, in general—they're a big part of the data gathering process. So what are they dredging up? I mean, is there anywhere that these little digital spiders can't go?

Leffer: Great question. There are certainly places that are harder to access than others. As a general rule, anything viewable in search engines is really easily vacuumed up, but content behind a login page is harder to get to. So information on a public LinkedIn profile might be included in Common Crawl's database, but a password-protected account likely isn't. But think about it for a minute.

Open data on the Internet includes things like photos uploaded to Flickr, online marketplaces, voter registration databases, government web pages, business sites, probably your employee bio, Wikipedia, Reddit, research repositories, news outlets. Plus there's tons of easily accessed pirated content and archived compilations, which might include that embarrassing personal blog you thought you deleted years ago.

Bushwick: Yikes. Okay, so it's a lot of data, but—okay. Looking on the bright side, at least it's not my old Facebook posts because those are private, right?

Leffer: I would love to say yes, but here's the thing. General web crawling might not include locked-down social media accounts or your private posts, but Facebook and Instagram are owned by Meta, which has its own large language model.

Bushwick: Ah, right.

Leffer: Right. And Meta is investing big money into further developing its AI.

Bushwick: On the last episode of Tech, Quickly, we talked about Amazon and Google incorporating user data into their AI models. So is Meta doing the same thing?

Leffer: Yes. Officially. The company admitted that it has used Instagram and Facebook posts to train its AI. So far Meta has said this is limited to public posts, but it's a little unclear how they're defining that. And of course, it could always change moving forward.

Bushwick: I find this creepy, but I think that some people might be wondering: So what? It makes sense that writers and artists wouldn't want their copyrighted work included here, especially when generative AI can spit out content that mimics their style. But why does it matter for anyone else? All of this information is online anyway, so it's not that private to begin with.

Leffer: True. It's already all available on the Internet, but you might be surprised by some of the material that emerges in these databases. Last year, one digital artist was tooling around with a visual database called LAION, spelled L-A-I-O-N...

Bushwick: Sure, that's not confusing.

Leffer: It's used to train popular image generators. The artist came across a medical photo of herself linked to her name. The picture had been taken in a hospital setting as part of her medical file, and at the time, she'd specifically signed a form indicating that she didn't consent to have that photo shared in any context. Yet somehow it ended up online.

[...]

