Public datasets based on web crawls #27

tarkowski · 2024-04-20T19:46:58Z

Thank you for a thoughtful and important white paper. I would like to suggest that public datasets based on web crawls, such as Common Crawl, should be placed within the scope of the paper. While some AI developers crawl the web directly, others rely on such datasets to train their ML models. As such, these datasets constitute an important form of intermediation of web content for the purpose of AI training. Many of the challenges listed in the paper, and the suggested ways of mitigating them, apply to these datasets and to the organizations that build them and make them available. Therefore, bringing the ethical web principles to life also requires proper governance of this intermediary stage.

Also, building on issue #26, it would be worthwhile to consider whether such training datasets, as a representation of the web that needs to meet the same ethical requirements, should be governed as a public resource, as part of Web governance mechanisms and institutions.

tarkowski added the "enhancement" (New feature or request) label Apr 20, 2024
