Stanford AI

university

https://www.ai.stanford.edu

Activity Feed Request to join this org

AI & ML interests

None defined yet.

Recent Activity

Xkev submitted a paper about 1 month ago

Sparse Reward Subsystem in Large Language Models

ojayy authored a paper about 2 months ago

Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches

Kameshr authored a paper 4 months ago

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

View all activity

Papers

Sparse Reward Subsystem in Large Language Models

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

View all Papers

ajibawa-2023

posted an update 2 days ago

Post

2184

C-Code-Large
Dataset: ajibawa-2023/C-Code-Large

C-Code-Large is a large-scale corpus of C programming language source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.

By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.

C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.

ajibawa-2023

posted an update 15 days ago

Post

3745

Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.

3 replies

ajibawa-2023

posted an update 20 days ago

Post

3485

Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.

1 reply

ajibawa-2023

posted an update 24 days ago

Post

2571

PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.

ajibawa-2023

posted an update 29 days ago

Post

3254

JavaScript-Code-Large
ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments. .

ajibawa-2023

posted an update about 1 month ago

Post

3134

Java-Code-Large ( ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million java codes. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.

efecelik

posted an update about 1 month ago

Post

3070

The moment we've been waiting for — ACE-Step dropped their new model: Ace-Step 1.5 🎉
🔗 ACE-Step/Ace-Step1.5
And the best part? It's released under the MIT license.
We've already started integrating it into our project. Let's go 🚀

1 reply

Xkev

submitted a paper to Daily Papers about 1 month ago

Sparse Reward Subsystem in Large Language Models

Paper • 2602.00986 • Published Feb 1 • 13

efecelik

posted an update about 2 months ago

Post

1377

🎮 Introducing: Paper Popularity Game

Think you know which AI papers go viral? Test your instincts!
I built a little game where you try to guess the popularity of AI research papers from the Hugging Face Daily Papers feed.

How it works:
You'll see two papers side by side—read the titles, check the abstracts, and pick which one you think got more upvotes from the HF community.

It's a great way to discover trending AI research while having fun.
Tests your intuition about what the ML community finds interesting.

Try it out:
efecelik/paper-popularity-game
Would love to hear your high scores and feedback!

efecelik

posted an update about 2 months ago

Post

1640

Interesting paper: PhysRVG

The core idea: instead of treating physics as a soft condition the model can work around during optimization, enforce it strictly via reinforcement learning. The paper focuses on rigid body dynamics - collisions, pendulums, free fall, rolling.

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models (2601.11087)

2 replies

efecelik

posted an update about 2 months ago

Post

638

Having multiple perspectives helps me create more diverse, innovative projects but without deep mastery in one area, I never feel truly satisfied.

What's the better investment: going deep in one field, or staying broad across many?

2 replies

efecelik

posted an update 2 months ago

Post

2537

My First MCP Server: DataView
Browse HuggingFace datasets directly from your AI assistant.
-Search & filter datasets
-View rows & stats
-SQL queries & Parquet export
efecelik/dataview-mcp

efecelik

posted an update 2 months ago

Post

252

We Built a Music App with ACE-Step – Looking for Feedback

Hey everyone,

We've been building AceSteps – a platform where anyone can create music using the ACE-Step model ( ACE-Step/ACE-Step-v1-3.5B). You can mint your tracks as NFTs, tokenize them into 100,000 fractional shares, and trade them on Uniswap V4. When your song gets popular, token holders earn from ad revenue automatically. It's a Farcaster Mini-App on Base Network.

But we want to make it better, and we'd love your input:

What's the one feature that would make you actually use an AI music tool regularly?
Andd any suggestions on how we can make this model better? Actually sharing here for this question. 🤗

Any feedback, ideas, or critiques are welcome.
🔗 https://docs.acesteps.com/
🔗 https://docs.acesteps.com/pitch-deck.html
🔗 https://farcaster.xyz/?launchFrameUrl=https%3A%2F%2Fwww.acesteps.com%2F
🔗 https://www.acesteps.com

efecelik

posted an update 2 months ago

Post

2320

why ACE-Step model isn't popular that much? imo it makes really good music.
ACE-Step/ACE-Step-v1-3.5B

2 replies

KingNish

posted an update 3 months ago

Post

3244

Muon vs MuonClip vs Muon+Adamw

Muon has gone from an experiment to a mainstream optimizer, but does it hold up for fine‑tuning? We ran head‑to‑head tests on Qwen3‑4B (10k+ high‑quality instruction rows) to find out.

Short story: Pure Muon converged fastest at the start, but its gradient‑norm spikes made training unstable. MuonClip (Kimi K2’s clipping) stabilizes long pretraining runs, yet in our small‑scale fine‑tune it underperformed, lower token accuracy and slower convergence. The winner was the hybrid: Muon for 2D layers + AdamW for 1D layers. It delivered the best balance of stability and final performance and even beat vanilla AdamW.

Takeaway: for small-scale fine-tuning, hybrid = practical and reliable.

Next Step: scale to larger models/datasets to see if Muon’s spikes become catastrophic or if clipping wins out.

Full Blog Link: https://huggingface.co/blog/KingNish/optimizer-part1

KingNish

posted an update 3 months ago

Post

2717

I tested Muon vs MuonClip vs Muon+AdamW for fine-tuning LLMs
Just published a blog on that, Read here 👉 https://huggingface.co/blog/KingNish/optimizer-part1

1 reply

adityatadimeti

authored a paper 4 months ago

LFM2 Technical Report

Paper • 2511.23404 • Published Nov 28, 2025 • 54

Kameshr

authored a paper 4 months ago

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Paper • 2510.24081 • Published Oct 28, 2025 • 20

nouamanetazi

posted an update 5 months ago

Post

4581

After training 𝐒𝐦𝐨𝐥𝐋𝐌𝟑 on 𝟑𝟖𝟒 𝐇𝟏𝟎𝟎𝐬 for nearly a month, I've come to realize something most people overlook: 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐢𝐬 𝐭𝐡𝐞 𝐦𝐚𝐤𝐞-𝐨𝐫-𝐛𝐫𝐞𝐚𝐤 𝐟𝐚𝐜𝐭𝐨𝐫 𝐢𝐧 𝐋𝐋𝐌 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious 𝐍𝐂𝐂𝐋 𝐞𝐫𝐫𝐨𝐫𝐬, or when your expensive GPU cluster is running at 𝟔𝟎% 𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐜𝐲, the problem isn't your model. It's most probably a 𝐦𝐢𝐬𝐮𝐬𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐡𝐚𝐫𝐝𝐰𝐚𝐫𝐞. 🛠️

Questions that seemed simple but had no clear answers: Why is 𝐌𝐨𝐄 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐬𝐥𝐨𝐰𝐞𝐫 𝐭𝐡𝐚𝐧 𝐝𝐞𝐧𝐬𝐞 𝐦𝐨𝐝𝐞𝐥𝐬? Which 𝐍𝐂𝐂𝐋 𝐟𝐥𝐚𝐠𝐬 should we actually set? How often should we checkpoint without killing throughput?

That's why we built 𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐥𝐚𝐲𝐞𝐫 that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: 𝐇𝐁𝐌𝟑 𝐡𝐢𝐭𝐭𝐢𝐧𝐠 𝟑 𝐓𝐁/𝐬, 𝐍𝐕𝐋𝐢𝐧𝐤 𝟒.𝟎 𝐫𝐞𝐚𝐜𝐡𝐢𝐧𝐠 𝟕𝟖𝟔 𝐆𝐁/𝐬, 𝐏𝐂𝐈𝐞 𝐆𝐞𝐧𝟒 𝐚𝐭 𝟏𝟒.𝟐 𝐆𝐁/𝐬. Then we ran collective operations across 𝟏𝟐𝟖 𝐆𝐏𝐔𝐬 (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 𝟒𝟖𝟎 𝐆𝐁/𝐬 on a single node to 𝟑𝟐𝟎-𝟑𝟓𝟎 𝐆𝐁/𝐬 across 16 nodes.

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team

KingNish

posted an update 8 months ago

Post

2219

Wan 2.2 fast upto 10x faster than original wan 2.2

Model: FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers

Space: KingNish/wan2-2-fast

AI & ML interests

Recent Activity

Papers

Team members 443

Stanford's activity