Impressive results from TII's Falcon team (dataset hosted on Hugging Face): properly filtered and deduplicated web data alone can match or exceed the performance of models trained on highly curated corpora.
Dataset: https://huggingface.co/datasets/tiiuae/falcon-refinedweb
Paper: https://doi.org/10.48550/arXiv.2306.01116