Public datasets based on web crawls #27

Open
tarkowski opened this issue Apr 20, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@tarkowski

Thank you for a thoughtful and important white paper. I would like to suggest that public datasets based on web crawls, such as Common Crawl, be placed within the scope of the paper. While some AI developers crawl the web directly, others rely on such datasets to train their ML models. These datasets therefore constitute an important form of intermediation of web content for the purpose of AI training. Many of the challenges listed in the paper, and the suggested ways of mitigating them, apply to these datasets and to the organizations that build them and make them available. Putting the ethical web principles into practice therefore also requires proper governance of this intermediary stage.

Also, building on issue #26, it would be worthwhile to consider whether such training datasets, as a representation of the web that needs to meet the same ethical requirements, should be governed as a public resource, as part of Web governance mechanisms and institutions.

@tarkowski added the enhancement label on Apr 20, 2024