ginigen-ai (ginigen-ai)

upvoted a collection about 12 hours ago

VKAE Accelerated

Fastest single-GPU serving of open models via VKAE. Live board: hf.co/spaces/VIDraft/vkae. Each entry ships as a model card plus a Docker image. • 2 items • Updated about 21 hours ago • 12

New activity in ginigen-ai/Metacognition-Leaderboard-Space about 18 hours ago

[Leaderboard update] Newly submitted models measured (2026-07-04)

#1 opened about 18 hours ago by ginigen-ai

upvoted an article 1 day ago

Article

Adding a GPU Without Building One

FINAL-Bench • 1 day ago • 15

reacted to SeaWolf-AI's post with ❤️ 1 day ago

Post

1673

🚀 Adding a GPU without building one

AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere — how efficiently you use the GPUs you already have.

Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU — the effect of plugging in one more "virtual GPU."

VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):

Qwen3.5-35B-A3B (MoE): 25.7 → 601 tok/s (23.4×)
Darwin-36B-Opus (in-house MoE): 25.0 → 280.8 tok/s (11.2×)
10,000+ tok/s peak aggregate under concurrency

The key: it's reproducible. Model and serving ship as one container.

docker pull vidraft/qwen35-vkae:601

Don't take our word for it; run it yourself. The mechanism will be released as a paper.

🏆 Leaderboard & demo 👉 VIDraft/vkae
Articles 👉 https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard
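The speedup figures quoted in the post are plain ratios of accelerated to baseline throughput, and the "cost per token" argument follows from the same numbers. A minimal sketch, assuming a hypothetical hourly GPU price (`gpu_hourly_usd` is illustrative and does not come from the post):

```python
# Baseline and accelerated throughput in tok/s, copied from the post above.
MEASURED = {
    "Qwen3.5-35B-A3B": (25.7, 601.0),
    "Darwin-36B-Opus": (25.0, 280.8),
}


def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Throughput ratio of the accelerated run over the baseline run."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return accelerated_tps / baseline_tps


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost of generating 1M tokens on one GPU running at a steady rate.

    gpu_hourly_usd is a hypothetical rental price, not a figure from the post.
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    for name, (base, fast) in MEASURED.items():
        print(f"{name}: {speedup(base, fast):.1f}x")
```

At a fixed GPU price, cost per token falls by the same factor as the speedup, which is the sense in which acceleration behaves like extra hardware.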

liked 2 models 1 day ago

FINAL-Bench/Darwin-36B-Opus-VKAE

Text Generation • Updated about 6 hours ago • 22

FINAL-Bench/Qwen3.5-35B-A3B-VKAE

Text Generation • Updated about 6 hours ago • 25

New activity in OpenEvals/README 1 day ago

New Benchmark Dataset

🚀 5

18

#2 opened 5 months ago by burtenshaw

New activity in ginigen-ai/Metacognition-Bench 1 day ago

Add eval.yaml for OpenEvals shortlist registration

1

#1 opened 1 day ago by openfree

updated a dataset 1 day ago

ginigen-ai/Metacognition-Bench

Updated 1 day ago • 171 • 27

updated a Space 1 day ago

Metacognition Leaderboard

🧠

28

Explore LLM metacognition rankings and submit your model

replied to SeaWolf-AI's post 1 day ago

It dovetails with our invariant: Chitos never emits 'proven-safe' (absence is not a safety claim). That handles the "don't trust a green light" half; the coverage denominator you're pointing at is the other half.

How we mean to express the denominator: we model the attack surface as an enumerable space — reachable entry points × parameters/sinks × vuln classes. Then every 'not demonstrated' can carry coverage = exercised nodes / modeled nodes, per phase and per class. e.g. "exercised N% of the modeled SQLi surface; the unreached region has this shape."

Candidly: today Chitos emits the numerator (what fired) and the invariant (no proven-safe). The modeled-surface denominator (coverage %) is what we're building toward. And we'd flag the deeper limit — it's coverage against the surface we modeled, not the true surface; unknown-unknowns still escape, so the denominator is itself a function of the model, and we'd label it as such.

Genuinely keen on your take on modeling the surface — that's what decides whether the denominator means anything.
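The coverage ratio described in this reply (exercised nodes / modeled nodes, per vuln class) can be sketched in a few lines. This is an illustration only, not Chitos code: the node shape `(entry_point, sink, vuln_class)` and the function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def model_surface(entry_points, sinks, vuln_classes):
    """Enumerate the modeled surface as (entry_point, sink, vuln_class) nodes."""
    return set(product(entry_points, sinks, vuln_classes))


def coverage_by_class(modeled, exercised):
    """Return {vuln_class: (exercised_count, modeled_count, ratio)}.

    Only nodes that belong to the model count toward the numerator, so the
    ratio is coverage against the modeled surface, not the true surface.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for node in modeled:
        totals[node[2]] += 1
    for node in exercised & modeled:
        hits[node[2]] += 1
    return {c: (hits[c], totals[c], hits[c] / totals[c]) for c in totals}


def unreached(modeled, exercised):
    """Modeled nodes that were never exercised: the shape of the gap."""
    return modeled - exercised
```

Because the denominator is built from `model_surface`, any entry point missing from the model is invisible to the ratio, which is the limit the reply itself flags.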

replied to their post 2 days ago

Exactly — in a loop the signal's value becomes causal, not descriptive, and AUROC drops from being the score to being a prerequisite. Here's how we're designing the axis.

Holding the loop fixed: a record-and-replay harness freezes tool responses, with fixed seeds and temp=0 to make the trajectory deterministic — so the only toggled variable is the flag gate, replaying the same task instance.

The reshuffle confound you name is exactly why we add a rate-matched random-gated arm — trigger a re-plan randomly at the same frequency the flag fires. Then:

flag-gated vs no-replan = total intervention effect
flag-gated vs rate-matched-random = isolates the signal's timing/selectivity contribution (subtracting the pure reshuffle effect)
flag-gated vs oracle (ground-truth wrong-step) = headroom vs perfect timing

Metrics are exactly yours (steps-to-recovery, wasted tool calls, final success), plus wasted-replan count (fires when unneeded) and whether the flag fires before the wrong step, not after. The bar: the flag must beat rate-matched-random for its accuracy to count as value, not just the act of replanning.

Candidly, this agent-loop axis is still in design (the current board stops at the single-forward boundary), and I'd genuinely value your input on the frozen-env + rate-matched control setup.
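The three comparisons in this reply can be written down as a small sketch. Since the axis is described as still in design, everything here is an assumption for illustration: the arm names, the choice to match the flag's fire count per trajectory, and the use of mean final success as the metric.

```python
import random
from statistics import mean


def rate_matched_gate(flag_fires, rng):
    """Random gate that fires as often as the flag did on one trajectory.

    flag_fires is a list of booleans, one per step. The result has the same
    length and the same number of True values, at uniformly random steps.
    """
    n_fire = sum(flag_fires)
    positions = set(rng.sample(range(len(flag_fires)), n_fire))
    return [i in positions for i in range(len(flag_fires))]


def decompose(success):
    """Split the intervention effect using mean final success per arm.

    success maps an arm name to a list of 0/1 outcomes over the same
    replayed task instances.
    """
    m = {arm: mean(outcomes) for arm, outcomes in success.items()}
    return {
        "total_effect": m["flag"] - m["no_replan"],
        "selectivity": m["flag"] - m["random"],
        "reshuffle_only": m["random"] - m["no_replan"],
        "headroom": m["oracle"] - m["flag"],
    }
```

Under this framing the flag's accuracy only counts as value when `selectivity` is positive; a positive `total_effect` with zero `selectivity` would mean the gain comes from replanning itself.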

liked a model 2 days ago

FINAL-Bench/metacog-adapter-JGOS-31B-Citizen

Updated 4 days ago • 36 • 20

replied to SeaWolf-AI's post 2 days ago

You've put your finger on exactly the right nerve — and it's also where Chitos parts ways with static analyzers.

Your critique targets static reachability proofs: the "safe" verdict inheriting the edges the call-graph never saw. Chitos's confirmed verdicts don't come from there. Phase 3 fires real payloads and observes real responses, so a confirmation is an executed round-trip, not an inferred reachable path — it's what the target actually did, not what our model claimed it would do. For positives, that sidesteps the call-graph-blindness problem.

Where your point lands fully is on negatives. That's precisely why Chitos never emits "proven-safe." Unconfirmed is reported as not demonstrated, never closed. Absence of a proof is not a safety claim — and we work hard not to blur that line in the UI.

On auditability, I completely agree. Today each finding already carries its attack vector, the payloads attempted, the response delta, and the verifier's reasoning. The next step is making that "what I tried and what I trusted" trail a first-class citizen for negatives too — because an un-auditable green light is, as you say, just a prettier suspect list. Thank you for the framing.

replied to their post 2 days ago

Appreciate it — sounds genuinely complementary. Ours is internal metacognition (the model's own hidden state flagging P(wrong)); externalized epistemics is the external side — and an internal "I might be wrong here" signal is exactly what should trigger an external epistemic lookup. Happy to chat. Easiest is to reach us through the ginigen-ai HF org and we'll set something up. 😄

replied to their post 2 days ago

Exactly — and to answer directly: we score calibrated confidence at the decision boundary, not abstention or post-hoc self-correction. The adapter reads the hidden state at the moment the answer is produced and emits P(wrong); we report the AUROC of that signal vs. actual correctness. So it's precisely the "confidence drops right before the wrong step" signal — measured predictively at generation time, not after the fact.

Two axes we keep separate on purpose: trap_rate (single-step discrimination — does it resist the tempting distractor) and self-confidence AUROC / adapter Δ (can the internal state flag its own error). The JGOS-31B result is the whole point — near-perfect trap discrimination (0.005) yet AUROC ≈ 0.5 on free-form: it doesn't know when it's wrong, and a base-frozen adapter recovers a usable signal.

Where you're right and we don't score it yet: agent-loop self-correction — the "stop and re-plan vs. cascade through five tool calls" behavior. Our signal is single-step at the boundary; extending it to a multi-step abstention/re-plan axis is the natural next benchmark, and it's the version that actually bites in production. Would genuinely value your input on operationalizing that.
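For readers less familiar with the metric, AUROC of a P(wrong) signal against actual correctness is the probability that a randomly chosen wrong answer receives a higher score than a randomly chosen correct one. A minimal pairwise sketch, assuming labels use 1 for wrong and 0 for correct (the benchmark's own scoring code may differ):

```python
def auroc(scores, labels):
    """AUROC of scores for separating label 1 (wrong) from label 0 (correct).

    Counts, over all (wrong, correct) pairs, how often the wrong answer got
    the higher P(wrong). Ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one wrong and one correct answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A signal that carries no information about its own errors lands at 0.5, which is the "pure random" reading reported for free-form answers.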

liked a Space 3 days ago

VKAE Leaderboard

🚀

22

VIDRAFT kernel-level inference acceleration engine

upvoted an article 3 days ago

Article

Does Your LLM Know When It's About to Be Wrong?

ginigen-ai • 3 days ago • 20

reacted to their post with ❤️ 3 days ago

Post

10315

🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even a K-AI #1 model (JGOS-31B-Citizen) is the strongest on multiple-choice traps (trap_rate 0.005 — ~2 misses in 400) yet blind to its own free-form mistakes (self-confidence AUROC = 0.5, pure random). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row):
① trap_rate: does it fall for tempting trap options? (lower = stronger)
② adapter gain Δ: how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)

What's open:
📊 300+100 trap problems (each with a hidden trap + TICOS type)
🏆 24-model leaderboard
🧩 11 per-model adapters. These are adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).
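The two axes in the post can be illustrated with a toy sketch. The post does not describe the adapter architecture, so the logistic probe below is a stand-in assumption for "reads the hidden state → P(wrong)"; `weights` and `bias` are hypothetical trained parameters, and the base model is never touched.

```python
import math


def trap_rate(fell_for_trap):
    """Fraction of trap problems where the model picked the trap option."""
    return sum(fell_for_trap) / len(fell_for_trap)


def p_wrong(hidden_state, weights, bias):
    """Logistic probe on a frozen hidden state: sigmoid(w . h + b).

    A stand-in for the adapter, not its actual architecture.
    """
    z = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

With 2 misses in 400 problems, `trap_rate` gives the 0.005 quoted in the post.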

11 replies


posted an update 3 days ago

🧠 Does your LLM know when it's about to be wrong?

