Marc Lammers PRO
MarcusLammers
AI & ML interests
The future of compute isn’t linear, it is intelligent.
Recent Activity
published an article about 22 hours ago
GENESIS: A Constant-Time Compute Engine for Financial Infrastructure
commented on an article about 2 months ago
Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research
replied to omarkamali's post about 2 months ago
Another month, another Wikipedia Monthly release! 🎃
Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)
We are now sampling a random subset of each language with a reservoir sampling method to produce splits `1000`, `5000`, and `10000` in addition to the existing `train` split that contains all the data.
Now you can load the English (or your favorite language) subset in seconds:
`dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")`
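For readers unfamiliar with reservoir sampling, here is a minimal sketch of the classic single-pass variant (often called Algorithm R). The post does not say which variant or seed the dataset pipeline uses, so this is an illustration of the general technique only; the function name `reservoir_sample` and its `seed` parameter are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper illustrating Algorithm R; not the dataset's actual code.
    The stream is read once and only k items are held in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: draw a 1000-item subset from a stream of placeholder article titles.
titles = (f"article-{n}" for n in range(50_000))
sample = reservoir_sample(titles, k=1000, seed=0)
print(len(sample))  # 1000
```

Because the stream is consumed lazily, the same approach works when the full language dump is too large to hold in memory, which is presumably why it suits fixed-size splits such as `1000`, `5000`, and `10000`.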
Happy data engineering! 🧰
https://huggingface.co/datasets/omarkamali/wikipedia-monthly