"; */ ?>



13 Apr 26

Qwen 3.5 397B on a MacBook @ 29 tokens / s

A year ago I would just read about the 397B league of models. Today I can run one on my laptop. The combination of llama.cpp’s importance matrix (imatrix) with Unsloth’s per-model adaptive layer quantization is what makes it all possible.

Qwen3.5-397B-A17B is an 807GB beast that, even at Q4, would need more than 200GB of GPU RAM. Some people even call it “local”.

I have a fairly strong laptop: M5 Max MacBook Pro 128GB (40-core GPU). That’s 128GB of unified RAM, most of which I can dedicate to the GPU. So 200GB+ is not something I can comfortably run on it. I could via SSD + RAM, but it would be more of an experiment than putting the model to work.

But Qwen 397B is an MoE model: it has 512 experts and only routes to 10 of them per token, so the other 502 can be quantized way below Q4 without hurting that particular inference. imatrix keeps the important weights close to their original values, and Unsloth’s adaptive layer quantization figures out which layers can afford to be quantized more aggressively. The combination of it all squeezes 397B down to ~2 bits and into 106GB, which, according to my tests, retains a surprising amount of its intelligence. And on top of all that.. 106GB is my laptop territory.

The model is “Qwen3.5-397B-A17B-UD-IQ2_XXS”:

  • “UD” = Unsloth Dynamic 2.0, different layers are quantized differently
  • “I” / imatrix = most important weights are rounded to minimize their loss / error
  • “XXS”: extra, extra small.. 807GB => 106GB

After downloading it, before doing anything else I wanted to understand what it is I am going to ask my laptop to swallow:

$ ll -h ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS
total 224361224
-rw-r--r--  1 user  staff    10M Apr 12 18:50 Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf
-rw-r--r--  1 user  staff    46G Apr 12 20:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00003-of-00004.gguf
-rw-r--r--  1 user  staff    14G Apr 12 20:57 Qwen3.5-397B-A17B-UD-IQ2_XXS-00004-of-00004.gguf
-rw-r--r--  1 user  staff    46G Apr 12 21:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf

Here is the 106GB. The original 16-bit model is 807GB; if it were “just” quantized uniformly to 2 bits, it would take (397B * 2 bits) / 8 = ~99GB. But I am looking at 106GB, hence I wanted to look under the hood to see the actual quantization recipe for this model:

$ gguf-dump \
  ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf \
  2>&1 | head -200

Super interesting: the expert tensors (ffn_gate_exps, ffn_up_exps and ffn_down_exps) are quantized at ~2 bits, but the rest are kept at much higher precision. This is where the 7GB difference (99GB vs. 106GB) really pays off: 7GB of packed intelligence on top of the expert tensors.

Trial by Fire

By trial and error I found that a 16K context is the sweet spot for the 128GB of unified memory, but the GPU allotment needs to be moved up a little to fit it (it is around 96GB by default):

$ sudo sysctl iogpu.wired_limit_mb=122880

“llama.cpp” would be the best choice to run this model (since MLX does not quantize to IQ2_XXS):

$ llama-server \
  -m ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --temp 1.0 --top-p 0.95 --top-k 20
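
Once the server is up, anything that speaks the OpenAI chat API can talk to it (llama-server exposes /v1/chat/completions, port 8080 by default). Here is a minimal sketch of a client in Clojure, assuming clj-http and cheshire are on the classpath; the prompt is just a placeholder:

(require '[clj-http.client :as http]
         '[cheshire.core :as json])

;; one chat turn against the local llama-server (OpenAI-compatible endpoint)
(defn ask [prompt]
  (-> (http/post "http://localhost:8080/v1/chat/completions"
                 {:content-type :json
                  :body (json/generate-string
                          {:messages    [{:role "user" :content prompt}]
                           :temperature 1.0
                           :top_p       0.95})})
      :body
      (json/parse-string true)
      (get-in [:choices 0 :message :content])))

(ask "summarize what is due at school tomorrow")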

My current use case, as I described here, is finding the best model assembly to help me make sense of my kids’ school work and progress. If anything is super messy in terms of organization, variety of disconnected systems where the kids’ data lives, and communication inconsistencies, it is US public schools. A small army of Claude Sonnets does it well’ish, but it is really expensive, hence the hope that “Qwen3.5 397B” could be a drop-in replacement.

In order to make sense of which local models “do good” I built cupel: https://github.com/tolitius/cupel, and that is the next step: fire it up and test “Qwen3.5 397B” on multi-turn, tool use, etc.. tasks:

Qwen3.5 397B A17B on M5 Max MacBook 128GB

It is on par with “Qwen 3.5 122B 4bit“, but I suspect I need to work on more exquisite prompts to distill the difference.

And, after all the tests, I found “Qwen3.5 397B IQ2” to be.. amazing. Even at 2 bits, it is extremely intelligent, and is able to call tools, pass context between turns, organize a very messy set of tables into clean aggregates, etc.

What surprised me the most is the 29 tokens per second average generation speed:

prompt eval time =     269.46 ms /    33 tokens (    8.17 ms per token,   122.46 tokens per second)
       eval time =   79785.85 ms /  2458 tokens (   32.46 ms per token,    30.81 tokens per second)
      total time =   80055.31 ms /  2491 tokens

slot release: id 1 | task 7953 | stop processing: n_tokens = 2490, truncated = 0
srv update_slots: all slots are idle

This is one example from the “llama.cpp” logs. Prompt processing speed depends on batching and ranged from 80 to 330 tokens per second.

The disadvantages I can see so far:

  • I can’t really run it efficiently in the assembly, since it is the only model that fits / can be loaded; with 122B (65GB) I can still run more models side by side
  • I don’t expect it to handle large context well due to the hardware memory limitation
  • theoretically it would have a harder time with very specialized knowledge, where a specific expert is needed but its weights are “too crushed” to give a clean answer. But, just maybe, the “I” in “IQ2_XXS” makes sure that the important weights stay very close to their original values
  • under load I saw the speed dropping from 30 to 17 tokens per second. I suspect it is caused by the prompt cache filling up and triggering evictions, but it needs more research

But.. 512 experts, 397B of stored knowledge, 17B active parameters per token and all that at 29 tokens per second on a laptop.


9 Apr 26

US Public Schools meet Qwen 3.5 122B A10B

The last Llama (Scout/Maverick) was released a year ago. Since then US-based releases have been super rare: Granite 3.3, GPT-OSS 20B & 120B, Nemotron 3 Nano / Super and now Gemma 4. Meanwhile the Chinese open-model output is multiples of that: high quality Qwens, DeepSeeks, Kimis, MiniMaxes, GLMs, MiMos, Seeds, etc..

Gemma 4 is like a breath of fresh air. Not just the model itself, but the rollout, the beauty, the innovation: K=V in global attention, Per-Layer Embeddings, tri-modal minis (E4B, E2B), etc.

Most of my local LLM usage used to be via rented GPUs: Google Cloud, AWS, etc. But about a month ago I decided to bring it all home and bought a shiny M5 Max MacBook Pro 128GB. It is a beast of a laptop, and it also opens up the kind of models I can run locally: 128GB of unified RAM and all.

Besides the cost, the true benefit of running models locally is privacy. I never felt at ease sending my data to “OpenRouter => Model A” or even hosting it in AWS on P4d/P4de instances (NVIDIA A100): it is still my data, and it is not at home, where I am.

But my laptop is.

When it comes to LLMs, unless it is research or coding, finding real utility is difficult. But I have kids, and they have school, and if anything is super messy in terms of organization, variety of disconnected systems where the kids’ data lives, and communication inconsistencies, it is US public schools. Being a parent is fun, though, and this mess is a great fit for LLMs to make sense of. Local LLMs solve the last piece: my kids’ data stays on my laptop at home.

So it began. I loaded all I could onto my 128GB friendly beast and started looking at which models are good for what. The flow is not difficult:

  • go to many different school affiliated websites
    • some have APIs
    • some I need to screen scrape with Playwright
    • some are a little of both plus funky captchas and logins, etc.
  • then, when on “a” website
    • some teachers have things inside a slide deck on a “slide 13”
    • some in some obscure folders
    • others on different systems buried under many irrelevant links

LLMs need to scout all this ambiguity and come back to me with a clear signal of what is due tomorrow and this week; what the grades are, why they are what they are, etc. Again, a great use case for an LLM, since it is lots of unorganized text with a clear goal to optimize for.

You may be thinking just about now: “OpenClaw“. And you would be correct, this is what I started from, but then I realized that OpenClaw is only as good as the set of LLMs behind it. Also, if I schedule a vanilla OS cron job that invokes a “school skill”, the number of tokens sent to the LLM goes from 10K down to about 600. And while I do have OpenClaw running on a VPS / OpenRouter, this was not (maybe yet) a good use of it.

In order to rank local models I scavenged a few problems that, over the years, I had to solve with the big boys: Claude, OpenAI, Grok and Gemini. They are nice enough to record everything we talk about, which is anything but local, but in this case it gave me a chance to collect a few problems and convert them into prompts with rubrics.

I then wrote a script to start making sense of what works for me vs. what is advertised and/or works for others. The script grew fast, but it was missing a look and feel.

So I added UI to it: https://github.com/tolitius/cupel

Besides the usual general problems, I used a few specific prompts that had tool use and multi-turns (multiple steps composed via tool calling) focused specifically on school related activities.

After a few nights of trial and error, I found that “Qwen 3.5 122B A10B Q4” is the best and the closest to solving most of the tasks. A pleasant surprise, by the way, was “NVIDIA Nemotron 3 Super 120B A12B 4bit“. I really like this model, it is fast and unusually great. “Unusually” because previous Nemotrons did not stand out the way this one does.

And then Gemma 4 came around.

Interestingly, at least for my use case, “Qwen 3.5 122B A10B Q4” still performs better than “Gemma 4 26B A4B“, and about 50/50 accuracy-wise with “Gemma 4 31B“, but it wins hands down in speed. “Gemma 4 31B” at full precision is about 7 tokens per second on an M5 Max MacBook Pro 128GB, whereas “Qwen 3.5 122B A10B Q4” is 50 to 65 tokens / second.

here I tested Gemma 4 via OpenRouter to avoid any misconfiguration on my side (plus it is 2x faster)

But I suspect I still need to learn “The Way of Gemma” to make it work much better. It really is a giant leap forward given its size vs. quality. After all, at 31B, although dense, it stands side by side with 122B.


6 Jan 26

Mad Skills to Learn The Universe

Simple wins. Always. Because it is very composable, fun to work with, and organically evolves.

LLMs are great at language and text, but they don’t know how to do anything, so they rely on something running (e.g. a server) and listening to their output. If the output is formatted in a certain way, say:

sum(41, 1)

the server / listener can convert it to code, call a “sum” function and return 42.

In this particular case we would say that the LLM has a “sum” tool. And the LLM would have a prompt similar to this one, where it would identify from the user’s prompt that two numbers need to be added, and would “call a tool” with “sum(41, 1)“.
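
To make this concrete, here is a minimal sketch of such a listener in Clojure. The regex parsing and the tools map are purely illustrative (modern tool calling is structured JSON, not string matching):

(require '[clojure.string :as str])

;; the "tools" this listener knows how to call
(def tools {"sum" +})

;; turns the LLM output "sum(41, 1)" into an actual function call
(defn dispatch [llm-output]
  (let [[_ fname args] (re-find #"(\w+)\((.*)\)" llm-output)
        f      (get tools fname)
        parsed (map #(Long/parseLong (str/trim %)) (str/split args #","))]
    (apply f parsed)))

(dispatch "sum(41, 1)")   ;; => 42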

But that was a few years ago, when we would beg LLMs to call a tool and return the response in JSON, pretty please. And sometimes it would even work, and sometimes it would not.

Fast forward to now, we have Tools, MCP, Agents, LSP, + many more acronyms. Plus LLMs are much better at following our “please JSON” begging.

However, one capability, one discovery and innovation in this field stands out: Skills. The reason it stands out is because, compared to all the hype and acronyms, Skills are… simple.

Just a Directory

A Skill is effectively a directory. That’s it.

It doesn’t require a complex registration process or a heavy framework. It requires a folder, a markdown file, and maybe a script or two. It relies on a standard structure that the Agent (current fancy name for a “piece of code that calls LLM”) knows how to traverse.

This simplicity lowers the barrier to entry significantly. The best way to learn something is to do it, hence I needed a problem to solve.

The “Dad” Problem

My kids are navigating the ways of middle/high school: one is deep in ionic bonding in Chemistry, another one is grappling with graphing sine waves in Pre-Calculus.

I can explain these concepts to them with words, or I can scribble on a napkin (which I love doing), but some concepts just need to move. They need to be visualized. I could write a Python program or a D3.js visualization for every single problem, but that takes too long, even when done directly with “the big 4“.

I needed a way to say: “Show me a covalent bond in H₂O” and have the visualization write itself, test itself, run itself, and show me that interactive covalent bond moving step by step.

A great fit for a “Skill”, I thought, and built salvador that can do this:

> /visualize a covalent bond in H₂O


Salvador… Who?

Salvador is an Anthropic Skill (but I hear others are adopting skills as well) that lives in my terminal. Its job is to visualize physics, math, or any other Universe concepts on demand.

The beauty is in the organization and progressive disclosure. The Agent doesn’t load the entire “visualization knowledge base” into its context window at once, like it did with MCP. It discovers the skill via metadata and only loads the heavy instructions when I actually ask for a visualization.

Here is what it looks like on disk:

salvador/
├── SKILL.md                 <-- the brain
├── scripts/                 <-- the hands
│   ├── inspect.js
│   └── setup.sh
└── templates/               <-- the scaffolding
    └── visualization.html

The Brain (SKILL.md)

This is the core. It uses a markdown format with YAML frontmatter. But inside, it’s not code – it’s a codified intent. It defines the flow: analyze the ask, divide and conquer, plan the steps / structure, code the solution, verify the result, go back if needed, etc.
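
Roughly, a SKILL.md has the shape below. This is a simplified, made-up example, not salvador’s actual file; only the name / description frontmatter fields are the standard part, the steps are just an illustration of “codified intent”:

---
name: visualize
description: builds and verifies interactive visualizations of physics and math concepts
---

# visualize

1. analyze the ask: what concept is it, what needs to "move"
2. plan the scenes, the steps and the interactions
3. generate the code, starting from templates/visualization.html
4. run scripts/inspect.js: collect console errors and screenshots
5. fix and repeat until the result matches the intent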

The Hands (scripts/)

An agent without tools is just a chatbot. For salvador to work, it needs to “see” what it created.

A small script, inspect.js, lives inside and spins up a headless browser. When the Agent generates a visualization, it doesn’t “hope” it works. It runs inspect.js to capture console errors and take screenshots.

If the screen, or a sequence of screens, is not what it should be, whether the laws of Physics are off or a text overlay is misplaced, the Agent catches it, understands the problem, and rewrites the code.

The Loop

This is where the “agent” part becomes real. It’s not the usual “text in, code out”, it’s a loop:

Idea -> Code -> Inspect -> Fix (as many times as needed) -> Render

Now, when we work on grokking a Physics problem at the kitchen table, I have a lot more than a napkin drawing: me and Salvador are cooking:

$ claude
> /visualize the difference between gravity on Earth vs Mars using a bouncing ball

Salvador picks up the intent, writes the physics simulation, tests it, fixes the gravity constant if it got it wrong, and pops open a browser window:

Humans today create and “conduct” code with or without LLMs. But complementing it with a Skill unlocks learning The Universe quark by quark, lepton by lepton.

Composable. And very open for evolution, because it is.. Simple. and Simple Wins. Always.


23 Apr 15

Question Everything

Feeding Da Brain


In the 90s you would say: “I am a programmer”. Some would reply with “o.. k”. The more insightful would reply with a question: “which programming language?”.

21st century.. the socially accepted terminology has changed a bit, now you would say “I am a developer”. Some would ask “which programming language?”. The more insightful would reply with a question: “which out of these 42 languages do you use the most?”

The greatest thing about using several at the same time is that feeling of constant adjustment as I jump between the languages. It feels like my brain goes through exuberant synaptogenesis over and over again building those new formations.

   What's for dinner today honey?
   Asynchronous brain refactoring with a gentle touch of "mental polish"

With all these new synapses, I came to love the fact that something that seemed so holy and “crystal right” before, now gets questioned and can easily be dismissed. Was it wrong all along? No. Did it change? No. So what changed then? Well.. perception did.

Inmates of the “Gang of Four” Prison


Design patterns are these “ways” of doing things that cripple new programmers and imprison many senior ones. Instead of having the ability to think freely, we have all these “software standard patterns”, which mostly have to do with the limitations of the “technology at the time”.

Take the big guys, like C++ / Java / C#: while they have many great features and ideas, these languages have a horrible story of “behavior and state”: you always have to guard something, whether it is from multiple threads or from other people misusing it. The languages themselves promote reuse over decoupling: i.e. “let’s inherit that behavior”, etc..

So how do we overcome these risks and limitations? Simple: let’s create dozens of “ways” that all developers will follow to fight this together. Oh, yea, and let’s make them an industry standard, call them patterns, teach them in schools, and select people by how well they can “apply” these patterns to “any” problem at hand.

Not all developers bought into this cult of course. Here are Peter Norvig’s notes from 1996, where he “dismisses” 16 out of 23 patterns from the Gang of Four by just using functions, types, modules, etc.

Builder Pattern vs. Immutable Data Structures


The Builder pattern makes sense unless.. several things. There is a great short post, “Builders vs. Option Maps”, that talks about the Builder pattern’s limitations:

* Builders are too verbose
* Builders are not data structures
* Builders are mutable
* Builders can’t easily compose
* Builders are order dependent

Due to mutable data structures (in Java / C# / alike), Builders still make sense for things like Google protobufs with simple (i.e. primitive) types, but in most cases where immutable things need to be created it is best to use immutable data structures, for the above reasons.

While jumping between the languages, I often need to create things in Clojure that are implemented in Java with Builders. This is not always easy, especially when Builders rely on external state and/or when Builders need to be passed around (i.e. to achieve a certain level of composition).

Let’s say I need to create a notification client that, by design (on the Java side of things), takes some initial state (i.e. external system connection props), and then event handlers (callbacks) are registered on it one by one before it gets built, i.e. before it becomes a final, immutable notification client:

SomeClassBuilder builder = SomeClass.newBuilder()
                             .setState( state )
                             .setAnotherThing( thing );
 
builder.register( notificationHandlerOne );
builder.register( notificationHandlerTwo );
...
builder.register( notificationHandlerN );
 
builder.build();

In case you need to decouple the “register events” logic from this monolithic piece above, you would pass that builder to the caller, which would pass it down the chain. It is something that seems “normal” to do (at least to a “9 to 5” developer), since methods with side effects do not really raise any eyebrows in OO languages. In fact most methods in those languages have side effects.

I quite often see people designing builders such as the one above (with lots of external state), and when I need to use them in Clojure, it becomes very apparent that the above is not well designed:

;; creates a "mutable" builder..
(defn- make-bldr [state thing]
  (-> (SomeClass/newBuilder)
      (.setState state)
      (.setAnotherThing thing)))

;; wraps "builder.register(foo)" into a composable function
(defn register-event-handler! [bldr handler]
  ;; in case handler is just a Clojure function, wrap it with something ".register" will accept
  (.register bldr handler))

(defn notification-client [state thing]
  (let [bldr (make-bldr state thing)]
    ;; ... do things that would call "register-event-handler!" passing them the "bldr"
    (.build bldr)))

Things that immediately feel “off” are: returning a mutable builder from “make-bldr”, mutating that builder in “register-event-handler!”, and returning that mutated builder back.. Not something common in Clojure at all.

Again, the goal is to “decouple the logic to register events from the notification client creation”; if both could happen at the same time, the above Builder would work.

In Clojure it would just be a map. All data structures in Clojure are immutable, so there would be no intermediate mutable holder/builder, it would always be an immutable map. When all handlers are registered, this map would be passed to a function that would create a notification client with these handlers and start it with “state” and “thing”.
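
Here is a minimal sketch of that shape, reusing the hypothetical SomeClass from above (the function names are mine):

;; handlers accumulate in a plain, immutable map: no builder in sight
(defn register-event-handler [client handler]
  (update client :handlers (fnil conj []) handler))

;; only the very last step touches the Java builder, in one place
(defn notification-client [{:keys [state thing handlers]}]
  (let [bldr (-> (SomeClass/newBuilder)
                 (.setState state)
                 (.setAnotherThing thing))]
    (doseq [handler handlers]
      (.register bldr handler))
    (.build bldr)))

;; usage: the map can be passed around, merged, inspected.. and built once at the end
(-> {:state state :thing thing}
    (register-event-handler notification-handler-one)
    (register-event-handler notification-handler-two)
    notification-client)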

Mocking Suspicions


Another synapse formation, created from using many languages at the same time, convinced me that if I have to use a mock to test something, that something needs a close look and would potentially welcome refactoring.

The most common case for mocking is:

A method of a component "A" takes a component "B" that depends on a component "C".

So if I want to test A’s method, I can just mock B and not worry about C.

The flaw here is:

"B" that depends on a component "C".

These things are extremely beneficial to question. I used to use Spring a lot, and when I did, I loved it. I learned from it, taught it to others, and had a great sense of accomplishment when high complexity could be broken down into simple pieces and rewired together with Spring.

Time went on, and with Erlang or Clojure, or even Groovy for that matter, I used Spring less and less. I still use it for my Java work, but not as much. So if 10 years ago:

"B" that depends on a component "C".

was a way of life, now, every time I see it, I ask: why? Does “B” have to depend on “C”? Can “B” be stateless and take “C” whenever it needs it, rather than be injected with it and carry that state burden?

If before “B” was:

public class B {
 
  private final C c;
 
  public B ( C c ) {
    this.c = c;
  }
 
  public Profit doBusiness() {
    return new Profit( c.doYourBusiness() + 42 );
  }
}

Can it instead just be:

public final class B {
  public static Profit doBusiness( C c ) {
    return new Profit( c.doYourBusiness() + 42 );
  }
}

In most cases it can! It really can; the problem is that not enough of us question that dependency, but we should.

This does not mean “B” no longer depends on “C”, it means something more: there is no “B” (“there is no spoon..”), as it just becomes a module, which does not need to be passed around as an instance. The only thing that is left of “B” is “doBusiness( C c )”. Do we need to mock “C” now? Can it, its instance, disappear the way “B” did? Most likely, and even if it can’t for whatever reason (i.e. it is someone else’s code), I did question it, and so should you.


The more synapse formations I go through the better I learn to question pretty much everything. It is fun, and it pays back with beautiful revelations. I love my brain :)


6 Oct 10

Humans + Software = Zero

FACT:

Humans are stateful and mutable beings that have no problem processing many things concurrently and sharing state with others + they are usually “coupled”

GOAL:

And yet in software design we strive for being stateless, immutable and decoupled.


CONCLUSION I:

Humans should not design software

CONCLUSION II:

If humans were designed by software, that project sucked ( according to the GoF teachings )

CONCLUSION III:

Humans are not software, they are maybe… hardware

CONCLUSION IV:

Good software is going to lead to the race of Inhumans

CONCLUSION V:

Humans + Software = 0