ginigen-ai PRO
[Leaderboard update] Newly submitted models measured (2026-07-04)
Adding a GPU Without Building One
AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere: how efficiently you use the GPUs you already have.
Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more throughput out of the same GPU, which has the effect of plugging in one more "virtual GPU."
VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):
Qwen3.5-35B-A3B (MoE): 25.7 → 601 tok/s (23.4×)
Darwin-36B-Opus (in-house MoE): 25.0 → 280.8 tok/s (11.2×)
10,000+ tok/s peak aggregate under concurrency
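The multipliers above follow directly from the before/after throughput, and throughput in turn sets cost per token. A minimal sketch of that arithmetic, assuming a hypothetical GPU price of 5.00 per hour (an illustrative value, not a figure from these measurements) and steady single-stream throughput:

```python
def speedup(baseline_tok_s: float, accelerated_tok_s: float) -> float:
    """Throughput ratio between the accelerated and the baseline run."""
    return accelerated_tok_s / baseline_tok_s


def cost_per_million_tokens(gpu_price_per_hour: float, tok_s: float) -> float:
    """Cost of generating one million tokens at a steady tok_s rate."""
    tokens_per_hour = tok_s * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Throughput figures reported above (tok/s, before and after).
measurements = {
    "Qwen3.5-35B-A3B": (25.7, 601.0),
    "Darwin-36B-Opus": (25.0, 280.8),
}

HYPOTHETICAL_GPU_PRICE = 5.00  # assumed hourly price, for illustration only

for name, (before, after) in measurements.items():
    ratio = speedup(before, after)
    cost_before = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, before)
    cost_after = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, after)
    print(f"{name}: {ratio:.1f}x, {cost_before:.2f} -> {cost_after:.2f} per 1M tokens")
```

Because cost per token is inversely proportional to throughput, the cost ratio equals the speedup regardless of which hourly price is plugged in.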
The key: it's reproducible, because model and serving are shipped as one container.
docker pull vidraft/qwen35-vkae:601
Don't take our word for it; run it yourself. The mechanism will be released as a paper.
Leaderboard & demo: VIDraft/vkae
Article: https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard
New Benchmark Dataset
Add eval.yaml for OpenEvals shortlist registration
it dovetails with our invariant: Chitos never emits 'proven-safe' (absence is not a safety claim). That handles the "don't trust a green light" half; the coverage denominator you're pointing at is the other half.
How we mean to express the denominator: we model the attack surface as an enumerable space (reachable entry points × parameters/sinks × vuln classes). Then every 'not demonstrated' can carry coverage = exercised nodes / modeled nodes, per phase and per class. For example: "exercised N% of the modeled SQLi surface; the unreached region has this shape."
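A minimal sketch of that coverage ratio, assuming the modeled surface is a plain Cartesian product of entry points, parameters/sinks, and vuln classes. The function names, routes, and parameters are hypothetical and are not Chitos internals:

```python
from collections import defaultdict
from itertools import product


def modeled_surface(entry_points, sinks, vuln_classes):
    """Enumerate the modeled surface as (entry point, sink, class) nodes."""
    return set(product(entry_points, sinks, vuln_classes))


def coverage_by_class(modeled, exercised):
    """Return {vuln_class: (exercised count, modeled count, ratio)}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for node in modeled:
        totals[node[2]] += 1
        if node in exercised:
            hits[node[2]] += 1
    return {cls: (hits[cls], totals[cls], hits[cls] / totals[cls]) for cls in totals}


# Hypothetical surface: 2 entry points x 2 sinks x 2 classes = 8 nodes.
surface = modeled_surface(["/login", "/search"], ["q", "id"], ["sqli", "xss"])
exercised = {
    ("/login", "q", "sqli"),
    ("/search", "q", "sqli"),
    ("/search", "id", "sqli"),
    ("/search", "q", "xss"),
}

report = coverage_by_class(surface, exercised)
# The "shape" of the unreached region: which modeled SQLi nodes were never exercised.
unreached_sqli = sorted(node for node in surface - exercised if node[2] == "sqli")
print(report["sqli"], unreached_sqli)
```

The ratio is only as meaningful as the enumeration behind it: any node missing from `modeled_surface` never appears in the denominator.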
Candidly: today Chitos emits the numerator (what fired) and the invariant (no proven-safe). The modeled-surface denominator (coverage %) is what we're building toward. And we'd flag the deeper limit: it's coverage against the surface we modeled, not the true surface. Unknown-unknowns still escape, so the denominator is itself a function of the model, and we'd label it as such.
Genuinely keen on your take on modeling the surface, since that's what decides whether the denominator means anything.
Exactly. In a loop the signal's value becomes causal, not descriptive, and AUROC drops from being the score to being a prerequisite. Here's how we're designing the axis.
Holding the loop fixed: a record-and-replay harness freezes tool responses, with fixed seeds and temp=0 to make the trajectory deterministic, so the only toggled variable is the flag gate, replaying the same task instance.
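One way such a record-and-replay wrapper could look, as a sketch only: the `ReplayTool` class and its keying scheme are assumptions made for illustration, not the harness itself. Responses are recorded once against a canonical key and then served from the tape, so every arm sees identical tool outputs:

```python
import json


class ReplayTool:
    """Wraps a tool function; records responses once, then replays them frozen."""

    def __init__(self, tool_fn, tape=None):
        self.tool_fn = tool_fn
        self.tape = {} if tape is None else tape
        # When a tape is supplied, never fall through to the live tool.
        self.replay_only = tape is not None

    @staticmethod
    def _key(name, kwargs):
        # Canonical JSON so keyword-argument order does not change the key.
        return json.dumps([name, kwargs], sort_keys=True)

    def call(self, name, **kwargs):
        key = self._key(name, kwargs)
        if key in self.tape:
            return self.tape[key]
        if self.replay_only:
            raise KeyError(f"no recorded response for {key}")
        self.tape[key] = self.tool_fn(name, **kwargs)
        return self.tape[key]


# Hypothetical live tool whose output would otherwise drift between runs.
live_calls = []


def live_tool(name, **kwargs):
    live_calls.append(name)
    return {"tool": name, "result": len(live_calls)}


recorder = ReplayTool(live_tool)
first = recorder.call("search", query="x")

frozen = ReplayTool(live_tool, tape=recorder.tape)
replayed = frozen.call("search", query="x")
```

Keying on name and arguments alone collapses repeated identical calls into one response; a harness that needs call-order sensitivity would have to add a step index to the key.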
The reshuffle confound you name is exactly why we add a rate-matched random-gated arm: trigger a re-plan randomly at the same frequency the flag fires. Then:
flag-gated vs no-replan = total intervention effect
flag-gated vs rate-matched-random = isolates the signal's timing/selectivity contribution (subtracting the pure reshuffle effect)
flag-gated vs oracle (ground-truth wrong-step) = headroom vs perfect timing
Metrics are exactly yours (steps-to-recovery, wasted tool calls, final success), plus wasted-replan count (fires when unneeded) and whether the flag fires before the wrong step, not after. The bar: the flag must beat rate-matched-random for its accuracy to count as value, not just the act of replanning.
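The rate-matched control and the three contrasts can be sketched as follows. This assumes "same frequency" means the same number of re-plans per trajectory and that final success rate is the compared metric; the arm names and the success values are placeholders, not measurements:

```python
import random


def rate_matched_random_gate(n_steps, flag_steps, seed=0):
    """Pick as many re-plan steps as the flag fired, uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_steps), len(flag_steps)))


def contrasts(success):
    """Differences in success rate between the experimental arms."""
    return {
        # flag-gated vs no-replan: total intervention effect
        "total_effect": success["flag"] - success["no_replan"],
        # flag-gated vs rate-matched-random: timing/selectivity contribution
        "selectivity": success["flag"] - success["random"],
        # oracle vs flag-gated: headroom left against perfect timing
        "headroom": success["oracle"] - success["flag"],
    }


# The flag fired at steps 3, 7 and 11 of a 20-step trajectory (hypothetical).
gate = rate_matched_random_gate(n_steps=20, flag_steps=[3, 7, 11], seed=1)

# Placeholder success rates, only to show how the contrasts are read.
placeholder = {"no_replan": 0.40, "random": 0.45, "flag": 0.55, "oracle": 0.70}
result = contrasts(placeholder)
print(gate, result)
```

Under this reading, the flag only earns credit for its accuracy when `selectivity` is positive; a positive `total_effect` with zero `selectivity` would mean the benefit came from replanning itself.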
Candidly, this agent-loop axis is still in design (the current board stops at the single-forward boundary), and I'd genuinely value your input on the frozen-env + rate-matched control setup.
You've put your finger on exactly the right nerve, and it's also where Chitos parts ways with static analyzers.
Your critique targets static reachability proofs: the "safe" verdict inheriting the edges the call-graph never saw. Chitos's confirmed verdicts don't come from there. Phase 3 fires real payloads and observes real responses, so a confirmation is an executed round-trip, not an inferred reachable path. It's what the target actually did, not what our model claimed it would do. For positives, that sidesteps the call-graph-blindness problem.
Where your point lands fully is on negatives. That's precisely why Chitos never emits "proven-safe." Unconfirmed is reported as not demonstrated, never closed. Absence of a proof is not a safety claim, and we work hard not to blur that line in the UI.
On auditability, I completely agree. Today each finding already carries its attack vector, the payloads attempted, the response delta, and the verifier's reasoning. The next step is making that "what I tried and what I trusted" trail a first-class citizen for negatives too, because an un-auditable green light is, as you say, just a prettier suspect list. Thank you for the framing.
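As an illustration of how that invariant and audit trail could be encoded, here is a hypothetical record type. The class and field names are assumptions and do not reflect Chitos's actual schema; the point is that the verdict type simply has no "proven-safe" member, and negatives carry the same trail as positives:

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"
    NOT_DEMONSTRATED = "not demonstrated"
    # Deliberately no PROVEN_SAFE member: absence of a proof is not a safety claim.


@dataclass
class Finding:
    attack_vector: str
    payloads_attempted: list = field(default_factory=list)
    response_delta: str = ""
    verifier_reasoning: str = ""
    verdict: Verdict = Verdict.NOT_DEMONSTRATED

    def audit_trail(self) -> dict:
        """What was tried and what was trusted, for positives and negatives alike."""
        return {
            "verdict": self.verdict.value,
            "attack_vector": self.attack_vector,
            "payloads_attempted": list(self.payloads_attempted),
            "response_delta": self.response_delta,
            "verifier_reasoning": self.verifier_reasoning,
        }


# A negative result still records what was attempted (hypothetical values).
negative = Finding(
    attack_vector="query parameter 'id'",
    payloads_attempted=["' OR 1=1 --", "1; SELECT 1"],
    verifier_reasoning="no response difference observed for either payload",
)
trail = negative.audit_trail()
```

Making the unsafe state unrepresentable in the type is a design choice: a reporting layer cannot emit a verdict the enum does not contain.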
Appreciate it, this sounds genuinely complementary. Ours is internal metacognition (the model's own hidden state flagging P(wrong)); externalized epistemics is the external side, and an internal "I might be wrong here" signal is exactly what should trigger an external epistemic lookup. Happy to chat. Easiest is to reach us through the ginigen-ai HF org and we'll set something up.
Exactly, and to answer directly: we score calibrated confidence at the decision boundary, not abstention or post-hoc self-correction. The adapter reads the hidden state at the moment the answer is produced and emits P(wrong); we report the AUROC of that signal vs. actual correctness. So it's precisely the "confidence drops right before the wrong step" signal, measured predictively at generation time, not after the fact.
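For readers less familiar with the metric, AUROC here is the probability that a randomly chosen wrong answer receives a higher P(wrong) than a randomly chosen correct one. A minimal pairwise sketch of that definition, on made-up toy inputs (this is the textbook formulation, not the leaderboard's scoring code):

```python
def auroc(p_wrong, is_wrong):
    """Pairwise AUROC of a P(wrong) signal against actual errors; ties count half."""
    wrong_scores = [p for p, w in zip(p_wrong, is_wrong) if w]
    correct_scores = [p for p, w in zip(p_wrong, is_wrong) if not w]
    if not wrong_scores or not correct_scores:
        raise ValueError("need at least one wrong and one correct answer")
    wins = 0.0
    for pw in wrong_scores:
        for pc in correct_scores:
            if pw > pc:
                wins += 1.0
            elif pw == pc:
                wins += 0.5
    return wins / (len(wrong_scores) * len(correct_scores))


# Toy signals: one that separates errors perfectly, one that carries no information.
separating = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
uninformative = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

A value of 0.5 means the signal ranks wrong and correct answers no better than chance, which is why it reads as "doesn't know when it's wrong".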
Two axes we keep separate on purpose: trap_rate (single-step discrimination: does it resist the tempting distractor) and self-confidence AUROC / adapter Δ (can the internal state flag its own error). The JGOS-31B result is the whole point: near-perfect trap discrimination (0.005) yet AUROC ≈ 0.5 on free-form. It doesn't know when it's wrong, and a base-frozen adapter recovers a usable signal.
Where you're right and we don't score it yet: agent-loop self-correction, the "stop and re-plan vs. cascade through five tool calls" behavior. Our signal is single-step at the boundary; extending it to a multi-step abstention/re-plan axis is the natural next benchmark, and it's the version that actually bites in production. Would genuinely value your input on operationalizing that.
Does Your LLM Know *When It's About to Be Wrong*?
Most leaderboards measure accuracy. We measure metacognition: whether a model catches its own errors. Benchmark + leaderboard + adapters, all open.
The surprise: even a K-AI #1 model (JGOS-31B-Citizen) is the strongest on multiple-choice traps (trap_rate 0.005, about 2 misses in 400) yet blind to its own free-form mistakes (self-confidence AUROC = 0.5, pure random). A tiny base-frozen adapter recovers that signal.
Two independent axes (never compared across a row): ① trap_rate: does it fall for tempting trap options? (lower = stronger) ② adapter gain Δ: how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)
What's open: 300+100 trap problems (each with a hidden trap + TICOS type); a 24-model leaderboard; 11 per-model adapters. These are adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong)).
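The adapter architecture is not described in this post. As a sketch of the general idea only, the simplest thing that "reads the hidden state → P(wrong)" while leaving the base untouched is a logistic probe trained on stored hidden-state vectors. Everything below (function names, the 2-dimensional toy vectors, the hyperparameters) is a hypothetical illustration, not the released adapters:

```python
import math


def p_wrong(hidden, weights, bias):
    """Logistic probe: map a hidden-state vector to P(wrong) in (0, 1)."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(hiddens, wrong_labels, lr=0.5, epochs=200):
    """Fit probe weights by SGD on log loss; the hidden states are fixed inputs."""
    dim = len(hiddens[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for hidden, label in zip(hiddens, wrong_labels):
            err = p_wrong(hidden, weights, bias) - label  # gradient of log loss w.r.t. z
            weights = [w - lr * err * h for w, h in zip(weights, hidden)]
            bias -= lr * err
    return weights, bias


# Toy "hidden states" (made up): label 1 means the answer turned out wrong.
toy_hiddens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [1, 1, 0, 0]

weights, bias = train_probe(toy_hiddens, toy_labels)
flagged = p_wrong([1.0, 0.0], weights, bias)
trusted = p_wrong([0.0, 1.0], weights, bias)
```

Only the probe's weights are updated; the vectors are treated as read-only inputs, which is the sense in which the base model stays frozen in this sketch.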
Submit any HF model: it is auto-scored daily at 09:00 KST and added to the board.
Leaderboard: ginigen-ai/Metacognition-Leaderboard-Space
Benchmark: ginigen-ai/Metacognition-Bench
Adapters: FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961
Article: https://huggingface.co/blog/ginigen-ai/metacognition
Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).