@pydataamsterdam

Impressive results from Hugging Face: models trained on properly filtered web data can match or exceed the performance of commercial models trained on highly curated datasets.

Dataset: huggingface.co/datasets/tiiuae
Paper: doi.org/10.48550/arXiv.2306.01