🔁 Hugging Face retweeted
Rohan Paul @rohanpaul_ai
Wow. This is a HUGE 24-trillion-token web dataset with document-level metadata available on @huggingface
apache-2.0 license
- collected from Common Crawl
- each document is labeled with a 12-field taxonomy covering topic, page type, complexity, and quality.
- Labels are
