Hello,

Common Crawl is very likely the most influential nonprofit you have never heard of. For many years, the organization has been flying under the radar, crawling and archiving the internet and openly sharing that data with everyone. But everything changed when OpenAI revealed Common Crawl as the primary source of training data for GPT-3, the large language model (LLM) that still powers the free version of ChatGPT. Now Common Crawl's more than 9.5 petabytes of data is the go-to source for LLM builders, which is reason enough for us to investigate this influential dataset.

In our brand-new report, Mozilla finds that Common Crawl's outsized role in the generative AI boom has improved transparency and competition, but is also contributing to biased and opaque generative AI models. But better is possible. Read our report on Common Crawl to learn how one small nonprofit shapes generative AI as we know it.

Read the report →

Key findings

- Common Crawl's data is only a fraction of the internet: it primarily captures English-language content, and websites from digitally marginalized communities are less likely to be included.

- Automated filtering isn't cutting it; human curation is a must: Common Crawl's data contains hate speech and explicit content that is useful for many researchers but harmful when used to train consumer products without care.

- Common Crawl and LLM builders have a shared responsibility: Common Crawl should highlight its limitations and biases and push for transparency from builders. Builders, in turn, need to share how Common Crawl data was filtered and what measures they take to address harms from biased and explicit datasets.

Read the report to learn what can be done to build generative AI responsibly.

Read the full report →

Thanks for all you do for the internet.

Stefan Baack
Research and Data Analyst
Mozilla