Activity Feed

ajibawa-2023 
posted an update 8 days ago
Ruby-Code-Large
Dataset: ajibawa-2023/Ruby-Code-Large

Ruby-Code-Large is a large-scale corpus of Ruby programming language source code comprising 331,743 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, web application development, and software engineering automation within the Ruby ecosystem.

By offering a substantial, language-focused dataset, Ruby-Code-Large enables targeted experimentation in dynamic programming, object-oriented design, and rapid application development—areas where Ruby is widely used, particularly in web frameworks and scripting.

Ruby-Code-Large addresses the lack of large, curated, Ruby-specific datasets, enabling focused research on expressive syntax, metaprogramming, and high-level abstractions.
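
For readers who want to poke at the corpus, here is a minimal sketch of loading it with the Hugging Face `datasets` library; the split name and the column names inside each .jsonl record are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading all 331,743 samples up front.
ds = load_dataset("ajibawa-2023/Ruby-Code-Large", split="train", streaming=True)

# Inspect the first record to see the actual fields; the split name "train"
# and the record layout are assumptions, not confirmed by the post.
first = next(iter(ds))
print(list(first.keys()))
```
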
ajibawa-2023 
posted an update 9 days ago
Go-Code-Large
Dataset: ajibawa-2023/Go-Code-Large

Go-Code-Large is a large-scale corpus of Go (Golang) programming language source code, comprising 316,427 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, cloud-native systems, and modern backend software engineering.

By offering a focused and curated dataset for Go, this corpus enables experimentation in concurrent programming, distributed systems, and performance-oriented backend services—domains where Go is widely adopted.

Go-Code-Large addresses the relative scarcity of large, language-specific datasets for Go, enabling targeted research into idiomatic Go patterns, concurrency primitives, and scalable system design.
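
As a rough illustration of the targeted experimentation the post describes, the sketch below streams the corpus and keeps only samples that appear to use Go concurrency primitives; the `code` field name is an assumed schema detail.

```python
from itertools import islice
from datasets import load_dataset

ds = load_dataset("ajibawa-2023/Go-Code-Large", split="train", streaming=True)

# Keep samples that appear to use goroutines or channels; adjust the field
# name after inspecting one record, since the schema is not stated in the post.
def uses_concurrency(example):
    code = example.get("code", "")
    return "go func(" in code or "chan " in code

for sample in islice(ds.filter(uses_concurrency), 3):
    print(sample.get("code", "")[:200])
    print("---")
```
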
MikeDoes 
posted an update 16 days ago
What happens when PII masking is treated as a trainable behavior, not just a detection task?

A new reinforcement learning environment tackles this question using a dataset derived from ai4privacy/open-pii-masking-500k-ai4privacy, transformed into a verifier-based training and evaluation setup.

Instead of evaluating PII masking as a one-off redaction step, this environment frames privacy as something models must consistently optimize for under feedback. The task requires models to correctly identify sensitive spans, replace them with [PII] tags, and comply with strict output formatting — all scored through explicit reward signals.
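
As a rough sketch of what a verifier-based reward for this kind of task could look like (purely illustrative, not the environment's actual scoring code; the weights and the exact `[PII]` tagging convention are assumptions):

```python
import re

def pii_masking_reward(model_output: str, reference_masked: str) -> float:
    """Toy verifier-style reward: score span masking and output formatting.

    Illustrative only; the environment's real reward shaping is not shown in the post.
    """
    reward = 0.0
    output = model_output.strip()
    reference = reference_masked.strip()

    def tag_positions(text):
        return [m.start() for m in re.finditer(r"\[PII\]", text)]

    # Strict formatting: the reply should be the masked text alone, no preamble.
    if not output.lower().startswith(("here is", "sure", "masked:")):
        reward += 0.2

    # Span agreement: [PII] tags appear in the same positions as the reference.
    if tag_positions(output) == tag_positions(reference):
        reward += 0.5

    # No leakage: everything outside the tags matches the reference exactly.
    if re.sub(r"\[PII\]", "", output) == re.sub(r"\[PII\]", "", reference):
        reward += 0.3

    return reward

# A perfect masking earns the full reward of 1.0.
print(pii_masking_reward("Contact [PII] at [PII].", "Contact [PII] at [PII]."))
```
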

To make this realistic, the author filtered and normalized the dataset to focus on US-English examples, ensuring consistent masking targets while preserving the structural diversity needed to expose failure modes.

What's notable here isn't just the environment itself, but the shift in perspective.

By turning PII masking into a reinforcement learning problem, privacy stops being a static rule and becomes a behavior models are trained to maintain even under optimization pressure.

This is a strong example of how open privacy datasets can move beyond benchmarks and become infrastructure for new learning paradigms.

🔗 Explore the PII Masking RL environment on Prime Intellect:
https://app.primeintellect.ai/dashboard/environments/adamlucek/pii-masking
MikeDoes 
posted an update 17 days ago
PII leakage isn't just a model problem — it's a data problem.

A recent paper takes a hard look at how well current systems actually detect and redact personal data at scale. One of their key conclusions is something the privacy community keeps rediscovering: without large, structured, and diverse PII datasets, evaluation collapses into guesswork.

To ground their experiments, the authors benchmarked their approach using the 500K PII-Masking dataset from AI4Privacy, leveraging its scale and coverage to test real-world redaction behavior rather than toy examples.

What's interesting here isn't just the model performance — it's what the evaluation reveals.

The paper shows that many systems appear robust under narrow tests but fail once PII appears in varied formats, contexts, and combinations. This gap between "works in theory" and "works in practice" is exactly where privacy risks emerge.
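
A toy illustration of that gap (not code from the paper): a detector tuned to one phone-number format looks perfect on narrow tests and silently misses the same PII written differently.

```python
import re

# A "narrow" detector: only matches US-style 555-123-4567 phone numbers.
narrow_phone = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

samples = [
    "Call me at 555-123-4567.",        # caught by the narrow pattern
    "Call me at (555) 123 4567.",      # missed: same number, different punctuation
    "Reach me on +44 20 7946 0958.",   # missed: international format
]

for text in samples:
    print(narrow_phone.sub("[PII]", text))
```
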

This is the value of open, research-grade datasets:

They expose failure modes early

They make comparisons reproducible

They let the community measure progress honestly

When researchers build on shared data foundations, everyone benefits — from academic insight to safer downstream applications.

🔗 Read the full paper here: https://arxiv.org/abs/2407.08792
MikeDoes 
posted an update 27 days ago
Things our clients and open-source users actually said to us this year:

"Finally, someone built a synthetic PII training data for German."

"Does it cover have localised information? Not just the language, the actual format. That must have been a lot of work that we can save from our side."

"We operate in 12 EU countries. Your dataset is the only one that covers all of them which has helped us out a lot in compliance especially because it's synthetic."

Every language has strong PII localization: names, addresses, IDs, phone numbers, and dates in the real format of that country.

23 languages. 29 regions. 3 scripts. 1,428,143 examples.

100% synthetic. Zero real personal data. Free on Hugging Face.
MikeDoes 
posted an update 28 days ago
Ai4Privacy has been working on this for the past year. 🙏

Today we're releasing the PII Masking 2M Series, the world's largest open source privacy masking dataset. (Again. 🚀🚀)

🔢 2M+ synthetic examples
🌍 32 locales across Europe
🏷️ 98 entity types
🏥💬🏦💼📍 5 industry verticals: Health, Finance, Digital, Work, Location
✅ 1M+ entries freely available on Hugging Face

Every example is 100% synthetic. No real personal data. Built so you can train and evaluate PII detection models without the legal headaches. 🔒
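
To make the train/evaluate loop concrete, a record in this kind of masking dataset typically pairs raw and masked text with labelled spans; the field names and entity labels below are a hypothetical illustration, not the dataset's confirmed schema.

```python
# Hypothetical record shape for a synthetic masking example
# (field names and labels are illustrative, not the published schema).
example = {
    "source_text": "Anna Müller was admitted to St. Georg Klinik on 12.03.2024.",
    "masked_text": "[GIVENNAME] [SURNAME] was admitted to [HOSPITAL] on [DATE].",
    "privacy_mask": [
        {"label": "GIVENNAME", "value": "Anna"},
        {"label": "SURNAME", "value": "Müller"},
        {"label": "HOSPITAL", "value": "St. Georg Klinik"},
        {"label": "DATE", "value": "12.03.2024"},
    ],
    "locale": "de_DE",
    "industry": "Health",
}
```
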

Thank you for 15,000,000+ downloads across our datasets, models, and libraries. This one's for you. ❤️


#privacy #ai #opensource #nlp #gdpr #pii #huggingface #machinelearning
ajibawa-2023 
posted an update about 1 month ago
C-Code-Large
Dataset: ajibawa-2023/C-Code-Large

C-Code-Large is a large-scale corpus of C programming language source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.

By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.

C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.
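
Given the 4M+ sample count, one quick way to work with a manageable slice is to shuffle the stream within a buffer and take a subset; as with the other corpora, the split name and record fields are assumptions until you inspect a record.

```python
from itertools import islice
from datasets import load_dataset

ds = load_dataset("ajibawa-2023/C-Code-Large", split="train", streaming=True)

# Approximate random sampling over the stream: shuffle within a buffer,
# then take 1,000 records for a quick experiment.
subset = list(islice(ds.shuffle(seed=42, buffer_size=10_000), 1000))
print(len(subset), "samples; fields:", list(subset[0].keys()))
```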

codelion 
posted an update about 2 months ago
Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.

The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
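
A sketch of how that per-entry metadata could be used to slice the corpus; the exact column names (e.g. `domain`, `quality_score`) and the score scale are guesses based on the post, so check them against the dataset card first.

```python
from itertools import islice
from datasets import load_dataset

# Stream the 10B-token release rather than downloading it whole.
sutra = load_dataset("codelion/sutra-10B", split="train", streaming=True)

# Keep high-quality entries from one domain; field names and the 0-1 score
# scale are assumptions about the metadata described in the post.
filtered = sutra.filter(
    lambda ex: ex.get("domain") == "mathematics" and ex.get("quality_score", 0) >= 0.8
)

for row in islice(filtered, 3):
    print(row.get("complexity"), row.get("prerequisites"))
```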

We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.

Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
ajibawa-2023 
posted an update about 2 months ago
Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.
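
For pretraining-style use, such a corpus is typically tokenized and packed into fixed-length blocks; the sketch below assumes a `code` text field and uses an off-the-shelf tokenizer purely as an example.

```python
from itertools import islice
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for this sketch
ds = load_dataset("ajibawa-2023/Cpp-Code-Large", split="train", streaming=True)

BLOCK = 1024
token_stream = []
for sample in islice(ds, 100):  # small slice just for illustration
    token_stream.extend(tokenizer(sample.get("code", "")).input_ids)
    token_stream.append(tokenizer.eos_token_id)

# Pack the token stream into fixed-length pretraining blocks, dropping the remainder.
blocks = [token_stream[i:i + BLOCK] for i in range(0, len(token_stream) - BLOCK + 1, BLOCK)]
print(f"{len(blocks)} blocks of {BLOCK} tokens")
```
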
appvoid 
posted an update about 2 months ago
Let's keep the momentum going for small models. I just published dot, the first pretrained causal model trained on math/symbols rather than English. The goal is an agnostic few-shot meta-learner that learns from reality itself instead of language.

It's already decent at some tasks, with next version coming in a few weeks.


appvoid/dot
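
A minimal sketch for trying the checkpoint, assuming it loads with the standard transformers causal-LM classes (the post doesn't specify the architecture, so treat this as an assumption):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes appvoid/dot ships a standard causal-LM config and tokenizer on the Hub.
tok = AutoTokenizer.from_pretrained("appvoid/dot")
model = AutoModelForCausalLM.from_pretrained("appvoid/dot")

# A few-shot prompt made of symbols rather than English, per the post's framing.
prompt = "1+1=2\n2+2=4\n3+3="
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```
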
appvoid 
posted an update about 2 months ago
Are you ready for some ●s? Tomorrow will be a good day.
ajibawa-2023 
posted an update about 2 months ago
Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
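
As one more usage sketch, the corpus can be thinned to medium-length files before experimentation; the `code` field name remains an assumption about the .jsonl schema.

```python
from itertools import islice
from datasets import load_dataset

ds = load_dataset("ajibawa-2023/Python-Code-Large", split="train", streaming=True)

# Keep files between roughly 20 and 2,000 lines to drop tiny stubs and huge dumps.
def medium_sized(example):
    n_lines = example.get("code", "").count("\n")
    return 20 <= n_lines <= 2000

for sample in islice(ds.filter(medium_sized), 3):
    print(sample.get("code", "").splitlines()[0])
```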