
Dipankar Sarkar PRO

dipankarsarkar

https://www.dipankar.cc

AI & ML interests

Building the AI-native stack. Agents as infrastructure, safety as architecture, performance as plumbing. I publish the receipts: papers, datasets, demos.

Recent Activity

replied to SeaWolf-AI's post 35 minutes ago

🚀 Adding a GPU without building one

AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere: how efficiently you use the GPUs you already have. Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token.

Inference acceleration uses software to pull several times more out of the same GPU, the effect of plugging in one more "virtual GPU."

VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):

- Qwen3.5-35B-A3B (MoE): 25.7 → 601 tok/s (23.4×)
- Darwin-36B-Opus (in-house MoE): 25.0 → 280.8 tok/s (11.2×)
- 10,000+ tok/s peak aggregate under concurrency

The key: it's reproducible. Model + serving shipped as one container.

docker pull vidraft/qwen35-vkae:601

Don't take our word for it: run it yourself. The mechanism will be released as a paper.

🏆 Leaderboard & demo 👉 https://huggingface.co/spaces/VIDraft/vkae
Articles 👉 https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard


Organizations

replied to SeaWolf-AI's post 35 minutes ago

The framing is the right one. Cost per token is the axis, not parameter count, and inference is where the bill actually runs. That part I will defend all day.

The 23.4x is where I would push. 25.7 to 601 tok/s on a 35B-A3B MoE (3B active) on a B200: the ceiling number is aggregate-under-concurrency, but 25.7 reads like single-stream unbatched generate. Those are different axes. Batching alone moves an MoE by an order of magnitude before any VKAE mechanism kicks in. The honest speedup is same-concurrency both sides, same batch, same harness, VKAE on vs off. Otherwise part of what you are measuring is 'we turned batching on.'

Same for 'no quality loss.' Measured on what? Aggregate perplexity or a task average sits still while long-context recall or the rare hard token quietly moves. The loss lives in the tail, not the mean.

When the paper lands, will the headline number be VKAE-on vs VKAE-off at identical concurrency, or optimized-throughput vs a single-stream baseline?
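A rough Python sketch of the matched-concurrency comparison asked for above. The throughput figures in `runs` are made-up placeholders, not measurements from the post, and `matched_speedups` is a hypothetical helper name. The only point is that a speedup is computed per concurrency level, mechanism on vs off, rather than across levels.

```python
def matched_speedups(runs):
    """Speedup of "on" over "off" at each concurrency level measured for both.

    runs: dict mapping (mode, concurrency) -> tokens/sec, mode in {"on", "off"}.
    """
    levels = sorted(c for (mode, c) in runs if mode == "off" and ("on", c) in runs)
    return {c: runs[("on", c)] / runs[("off", c)] for c in levels}


# Placeholder numbers for illustration only (not real measurements).
runs = {
    ("off", 1): 25.0, ("on", 1): 50.0,
    ("off", 32): 400.0, ("on", 32): 600.0,
}

speedups = matched_speedups(runs)

# The apples-to-oranges ratio: batched "on" against single-stream "off".
mismatched = runs[("on", 32)] / runs[("off", 1)]

print(speedups, mismatched)
```

With these placeholders the matched speedups are 2.0× and 1.5×, while the mismatched ratio is 24×, most of which is batching rather than the mechanism.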

replied to RDTvlokip's post 39 minutes ago

Fair. You did disclose all three, and I read the write-up now. The halting Q&A is the best part of it.

The regularizer finding is the real result. 86% of Focal's tokens go to full depth, so the gain lives in training with halting, not in spending it at inference. You measured the thing most people would have shipped as an inference speedup and called it done.

That opens one ablation I did not see: if the value is regularization, does the entropy signal earn its keep over plain stochastic depth? Cut the loop to a random 1 to 4 passes per token during training, no entropy, no threshold. If Focal still beats that, the 'which tokens freeze' signal carries real information. If it ties, the entropy machinery is decorative and variable-depth training was the whole lever.

Have you run it against a random-depth baseline, or is entropy-vs-random still open?
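A toy Python sketch of the random-depth baseline proposed above: each token gets a uniformly random 1 to 4 passes, with no entropy and no threshold. The function names and the per-token loop are illustrative assumptions. A real looped transformer mixes tokens through attention, so in practice the depth would act as a mask over which tokens keep updating.

```python
import random


def sample_random_depths(num_tokens, max_passes=4, seed=0):
    """Draw one loop count per token, uniform over 1..max_passes."""
    rng = random.Random(seed)
    return [rng.randint(1, max_passes) for _ in range(num_tokens)]


def apply_looped_block(states, depths, block):
    """Apply `block` to each token state `depth` times (toy, no token mixing)."""
    out = []
    for state, depth in zip(states, depths):
        for _ in range(depth):
            state = block(state)
        out.append(state)
    return out


depths = sample_random_depths(8)
# Toy "block" that adds 1, so each output equals the number of passes taken.
outputs = apply_looped_block([0] * 8, depths, lambda s: s + 1)
print(depths, outputs)
```

Swapping `sample_random_depths` for the entropy-driven halting rule, with everything else fixed, is the comparison that would separate "which tokens freeze" from variable depth alone.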

upvoted a paper about 2 hours ago

Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads

Paper • 2607.01002 • Published 5 days ago • 15

reacted to artificial-citizen's post with 🔥 about 2 hours ago

Post

A 9B running 306 tok/s on a single GPU.

We shipped calibrated NVFP4 + MTP builds of Ornith-1.0-9B, and found something along the way: NVFP4 and speculative decoding are multiplicative on Blackwell. MTP's verify step batches compute straight into the FP4 tensor cores — +52% lift vs +17% on Q4_K_M, acceptance-controlled so it's the kernels, not the draft head.

- GGUF (6.6 GB, MTP baked in): 306 tok/s on RTX PRO 6000 — faster than our 4B record
- vLLM (10.4 GB, W4A4 + MTP sidecar): ~1.5× bf16+MTP at 55% the VRAM
- Full release gate vs bf16 on the card: FC 96% (beats base), claw −0.028, coherence verified to 60K
- On Ampere? Honestly: use Q4_K_M. The FP4 win is Blackwell's tensor cores.

Every number traces to a row in protoLabsAI/lab-benchmarks (CC-BY-4.0).

🔗 huggingface.co/protoLabsAI/Ornith-1.0-9B-NVFP4
🔗 huggingface.co/protoLabsAI/Ornith-1.0-9B-MTP-GGUF

Want a different size/format? Open a discussion — we usually ship within 48h.

1 reply

replied to artificial-citizen's post about 2 hours ago

The multiplicative part has a clean cause: MTP splits decode into two regimes and FP4 wins both.

Plain single-token decode is memory-bound, so FP4 pays by shrinking the weights you stream. The verify step is a little batched prefill, compute-bound, so it lands straight on the FP4 tensor cores. Speculative decoding normally only buys you the memory-bound half. You are collecting the compute-bound half too.

Which is why the whole thing rides on acceptance holding. Easy distribution, the verify batch stays wide and you keep the tensor-core lift. Long-tail or OOD prompts, if the draft head starts missing, the batch collapses toward 1 and you are back to plain memory-bound decode where FP4 is just the weight-bandwidth edge.

Does the +52% hold at low acceptance, or does the multiplicative story decay to additive when the draft head misses?
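To put a number on "the batch collapses toward 1", here is a small Python sketch using the standard speculative-decoding estimate: with `draft_len` drafted tokens and an independent per-token acceptance probability `a`, the expected tokens emitted per verify step is (1 - a^(draft_len+1)) / (1 - a). The independence assumption and the draft length of 3 are simplifications chosen for illustration, not details from the Ornith release.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per verify step under i.i.d. acceptance."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        # Every draft token accepted, plus the one token the verifier adds.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


for a in (0.9, 0.6, 0.3, 0.0):
    print(a, round(expected_tokens_per_step(a, draft_len=3), 3))
```

As acceptance falls toward 0 the value approaches 1 token per step, which is plain single-token decode.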

reacted to satgeze's post with 🔥 about 3 hours ago

Post

satgeze/Ornith-1.0-35B-1M-GGUF
Ornith-1.0 with a 1,048,576-token context window, tested instead of claimed 🦜

Ornith is Qwen3.5-family under the hood, so YaRN factor 4 extends it from 262K native to exactly 1M. I baked that into the GGUF metadata (no fine-tuning, weights bit-identical) so llama.cpp and Ollama apply it with zero flags, then ran full needle-in-a-haystack ladders on my own hardware:

- satgeze/Ornith-1.0-35B-1M-GGUF: 10/10 needles at every rung from 32K through 1M, replicated with fresh seeds (M3 Max 128GB, ~6.8h cold 1M prefill)
- satgeze/Ornith-1.0-9B-1M-GGUF: perfect through 524K, honest 7/10 at 1M under Q4 + q8_0 KV, failure band charted in the card
- satgeze/Ornith-1.0-397B-1M-GGUF: IQ1_M through Q4_K_M as split GGUFs, coherence-gated

Also in the repos:

- Vision: Ornith kept the Qwen3.5 multimodal skeleton, so the VL vision tower (extracted by bartowski) attaches at runtime via llama-server --mmproj. OCR-tested on the 9B and 35B, mmproj files bundled.
- A measured residency matrix: on a single RTX 5090, every 9B quant up to Q6_K holds the full 1M window at 100 percent GPU, 162 to 244 tok/s.
- Quality gates: every low-bit quant passed a coherence test before upload. The 35B IQ1_S failed and was deleted rather than shipped.

Harness, method writeup, and raw per-needle data: https://github.com/satindergrewal/ornith-1m

All MIT. Credit to DeepReinforce for the models and bartowski for the imatrix quants and vision towers. If a config breaks retrieval for you, tell me and it goes in the card.

reacted to RDTvlokip's post with 🔥 about 3 hours ago

Post


I finally changed the architecture of my 15M French LLM. It worked. Then I almost fooled myself about how much, and catching that was the real win.

After proving last time that architecture is a threshold, not a lever, I got stubborn: could I change how the model learns? Four honest attempts, Lion, a sharper AdamW β2, multi-token prediction, LayerScale. Four failures. The bottleneck wasn't the learning rule either.

So I changed the shape of the computation instead: loop the same transformer blocks 4×, deeper reasoning, zero added parameters. It beat the baseline on perplexity, the first thing in the whole project to move that number. Then I added my own twist: let each token decide how deep to think, halting on its own entropy.

My first evaluation was spectacular. Coherence up 65%. Hallucinated names down 62%.

It was noise.

Eight prompts, one seed. I re-ran on 50 prompts × 200 tokens and watched the gains shrink to "modest", and on out-of-domain prompts recurrence actually made things worse. No universal winner. And none of it is new: it's Adaptive Computation Time (2016), the Universal Transformer (2018), and LoopViT (2026), recombined and measured honestly.

The real lesson:

A number from 8 prompts is a rumor. The eval harness that kills your own best result is worth more than the result it kills. Cite your lineage. Stay preliminary until multiple seeds say otherwise.

The three models are live. The write-up is honest about every caveat 👇

🔗 https://huggingface.co/blog/RDTvlokip/teaching-a-15m-french-llm-to-think-deeper

3 replies

replied to RDTvlokip's post about 3 hours ago

The harness that kills your own best result is the whole job. Nice.

The trap you caught has a direction, which is what makes it dangerous. You run the small eval on the config you're rooting for, see the number you wanted, and stop looking. Confirmation bias and low n multiply. The 8-prompt spectacular is never the config you doubt.

One thing I'd promote to first-class: report the seed band, not the mean. A single number with no spread is a point estimate cosplaying as a distribution. '+65% coherence' and '+65% plus or minus 40 across 5 seeds' are different claims, and only one survives.

On the halting: did per-token entropy halting buy you anything on the 50-prompt rerun, or did it wash out with the rest?
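A minimal Python sketch of reporting the seed band instead of a bare mean. The per-seed gains below are placeholder values invented for illustration, not results from the write-up, and `seed_band` is a hypothetical helper.

```python
import statistics


def seed_band(scores):
    """Return (mean, sample std, min, max) across seeds; needs >= 2 seeds."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)
    return mean, spread, min(scores), max(scores)


# Placeholder per-seed coherence gains in percent (illustrative only).
gains = [65.0, 10.0, 42.0, -5.0, 30.0]

mean, spread, lo, hi = seed_band(gains)
print(f"{mean:+.1f} ± {spread:.1f} (range {lo:+.1f} to {hi:+.1f}, n={len(gains)})")
```

Printing the range and `n` next to the mean makes it obvious when a headline number is one lucky seed.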

replied to Banaxi-Tech's post about 3 hours ago

Since KV1 is training-aware, not drop-in, you can't borrow a bigger model's competence for free. That's the real bind with going small.

So skip needle-in-haystack (it needs capability an 8M won't have) and use a long-range task an 8M CAN learn: induction/copy, where the target token depends on one planted ~60 positions back. Train on that, then compare 2-bit vs 16-bit K/V and measure recall of the planted token by depth.

Same size, and it isolates the cache under a real long-range dependency. KLD averages straight over that; a copy task can't hide it.

Would that fit your current training budget?
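One way the copy task could be generated, as a Python sketch. The layout (a reserved key token, random filler, the key repeated as the final query) and every name and default here are assumptions for illustration, not anything from the KV1 repo.

```python
import random


def make_copy_example(seq_len=128, distance=60, vocab_size=256, seed=0):
    """Return (tokens, target).

    The last token is a key that also appears `distance` positions earlier,
    immediately followed by `target`. Predicting `target` after the final key
    requires attending back across the cache.
    """
    if not 2 <= distance < seq_len:
        raise ValueError("distance must satisfy 2 <= distance < seq_len")
    rng = random.Random(seed)
    key = vocab_size  # reserved id outside the filler vocabulary
    tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
    target = rng.randrange(vocab_size)
    plant = seq_len - 1 - distance
    tokens[plant] = key
    tokens[plant + 1] = target
    tokens[-1] = key  # query position: the model should emit `target` next
    return tokens, target


tokens, target = make_copy_example()
print(len(tokens), target)
```

Sweeping `distance` and scoring exact-match recall under 2-bit and 16-bit K/V gives the recall-by-depth curve.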

upvoted 2 papers about 4 hours ago

AgenticDataBench: A Comprehensive Benchmark for Data Agents

Paper • 2607.01647 • Published 4 days ago • 29

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

Paper • 2606.27669 • Published 10 days ago • 9

reacted to Banaxi-Tech's post with 🤗 about 4 hours ago

Post


Today we are releasing BananaMind-KV1-8M-2Bit-Experimental, a KV-cache-aware trained model that stores its generation KV cache in 2-bit precision instead of the usual 16-bit precision.

Result: 5.33x smaller KV cache vs FP16, with 0.0916 mean KLD against a 16-bit KV cache reference on WikiText-2.

Model: BananaMind/BananaMind-KV1-8M-2Bit-Experimental

The important part: this is not just post-training KV cache quantization.
Instead we take the BitNet approach.

KV1 is trained with a 2-bit-aware K/V path. Instead of training a normal model and quantizing the cache afterwards, the model learns during training to operate under the low-bit KV constraint, closer in spirit to the BitNet idea of training for the low-bit regime.

During generation, each K/V vector is quantized into 4 affine levels and packed into uint8 tensors, with four 2-bit values stored per byte.
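A pure-Python sketch of what "4 affine levels, four 2-bit values per byte" can look like. Per-vector min/max scaling, the bit order inside each byte, and all function names are assumptions made for illustration. The KV1 code may differ on each of them.

```python
def quantize_2bit(vec):
    """Map floats to codes 0..3 on an affine grid spanning [min, max]."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 3 or 1.0  # 4 levels = 3 steps; guard constant vectors
    codes = [min(3, max(0, round((x - lo) / scale))) for x in vec]
    return codes, lo, scale


def pack(codes):
    """Pack four 2-bit codes per byte, first code in the lowest bits."""
    padded = codes + [0] * (-len(codes) % 4)
    return bytes(
        padded[i] | padded[i + 1] << 2 | padded[i + 2] << 4 | padded[i + 3] << 6
        for i in range(0, len(padded), 4)
    )


def unpack(packed, n):
    """Inverse of pack: recover the first n codes."""
    codes = [(b >> shift) & 3 for b in packed for shift in (0, 2, 4, 6)]
    return codes[:n]


def dequantize(codes, lo, scale):
    """Map codes back to floats on the affine grid."""
    return [lo + c * scale for c in codes]


vec = [0.0, 1.0, 2.0, 3.0, 3.0, 0.0, 1.5, 2.9]
codes, lo, scale = quantize_2bit(vec)
packed = pack(codes)
restored = dequantize(unpack(packed, len(vec)), lo, scale)
print(len(vec), "values ->", len(packed), "bytes")
```

The stored `lo` and `scale` are extra bytes on top of the packed codes, so the real shrink factor depends on how many values share one pair.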

WikiText-2 eval vs 16-bit KV cache reference:

Mean KLD: 0.0916 nats/token
Mean KLD: 0.1322 bits/token
Average KV cache shrink vs FP16: 5.33x
Evaluated positions: 372,675
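The two KLD lines above are the same quantity in different units. A quick Python check of the conversion, dividing nats by ln 2:

```python
import math

kld_nats = 0.0916
kld_bits = kld_nats / math.log(2)  # 1 nat = 1 / ln(2) bits

print(f"{kld_bits:.4f} bits/token")
```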

If this actually gets used in models like Qwen or Gemma, then it may be possible to run 128K or even 256K context on a normal machine!
Try it here: BananaMind/BananaMind-KV1-8M-2Bit-Experimental

Code: https://github.com/Banaxi-Tech/kv1

4 replies

replied to Banaxi-Tech's post about 4 hours ago

5.33x shrink is the easy number. The one I'd chase is task-level, not KLD.

0.0916 nats/token is an average over 372k positions, and a KV cache lives or dies on the rare position that mattered: the token at 90k you have to attend back to. Average KLD can look tiny while long-context recall quietly falls off, because the easy majority of positions dominate the mean.

Before "128K on a normal machine," I'd want a needle-in-a-haystack sweep: same passkey at depth 8k / 32k / 128k, 2-bit KV vs 16-bit, pass/fail not perplexity. That's the eval that proves the claim.

Have you run anything retrieval-shaped over long context yet, or just WikiText so far?
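A small Python sketch of the passkey sweep described above. Word count stands in for token count here, and the filler sentence, prompt wording, and function names are all placeholders I am assuming. A real harness would measure depth with the model's tokenizer.

```python
def build_passkey_prompt(passkey, depth, filler="The grass is green."):
    """Prompt with roughly `depth` filler words between passkey and question."""
    words = filler.split()
    reps = -(-depth // len(words))  # ceiling division
    haystack = (words * reps)[:depth]
    needle = f"The passkey is {passkey}."
    question = "What is the passkey?"
    return " ".join([needle, *haystack, question])


def score(answers, passkey):
    """Pass/fail per depth: True when the passkey appears in the answer text."""
    return {depth: passkey in text for depth, text in answers.items()}


prompt = build_passkey_prompt("48213", depth=10)
# Placeholder model outputs, only to show the pass/fail shape of the result.
results = score({8000: "It is 48213.", 32000: "I am not sure."}, "48213")
print(results)
```

Running the same prompts through the 2-bit and 16-bit KV paths and comparing the two pass/fail tables is the retrieval-shaped check, independent of average KLD.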

upvoted 4 papers about 5 hours ago

SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

Paper • 2607.01874 • Published 4 days ago • 15

replied to their post about 5 hours ago

Good catch, that path is dead. I folded the four standalone demos into one Space to stay under HF's 3-slot cpu-basic quota, so grite moved.

Live here, running over local git data (the grite tab):
https://huggingface.co/spaces/neullabs/agent-infra

Thanks for flagging it. Did it 404 for you, or hang on a cold boot?

upvoted 2 papers 2 days ago

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

Paper • 2607.02440 • Published 4 days ago • 43

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Paper • 2607.02512 • Published 4 days ago • 82
