

Ha ha! I actually finished it over the weekend. Now it’s onto the documentation…ICBF lol
I just tried to get shit GPT to do it this morning, as it’s generally pretty ok for that. As always, it produces real “page turners”. Here is its idea of a “lay explainer”:
Mixture of Assholes: Llama-swap + “MoA router”: making small local models act reliably (without pretending they’re bigger)
This project is a harness for local inference: llama-swap is the model traffic-cop, and the router is the conductor that decides what kind of work you want done (straight answer, self-critique loop, style rewrite, vision/OCR), when, and with what context. Vodka acts as the memory layer and handles context re-rolls.
The goal isn’t to manufacture genius. It’s to make local models behave predictably under hardware constraints by:
- making retrieval explicit (no “mystery memory”),
- keeping “fancy modes” opt-in,
- and making the seams inspectable when something goes wrong.
The shape is simple:
UI → Router (modes + RAG + memory plumbing) → llama-swap (model switching) → answer. ([GitHub][1])
The “what”: one OpenAI-style endpoint that routes workflows, not just models
At the front is an OpenAI-compatible POST /v1/chat/completions endpoint. From the client’s point of view, it’s “just chat completions” (optionally streaming). From the router’s point of view, each request can become a different workflow.
It also accepts OpenAI-style multimodal message blocks (text + image_url), which matters for the vision/OCR paths.
Under the hood, the router does three things:
- Decides the pipeline (Serious / Mentats / Fun / Vision / OCR)
- Builds an explicit FACTS block (RAG) if you’ve attached any KBs
- Calls llama-swap, which routes the request to the chosen local model backend behind an OpenAI-like interface ([GitHub][1])
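As a sketch of what a client sees, here is a minimal call against the router, assuming an illustrative base URL and model name (neither is the project’s actual default), plus a multimodal turn for the vision/OCR path. The `# ocr` / `# mentats` selectors are covered below.

```python
# Minimal client sketch: the router looks like any OpenAI-compatible server.
# ASSUMPTIONS: base_url, port, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Plain text turn; the leading "# mentats" selector picks the three-pass
# pipeline for this one message.
resp = client.chat.completions.create(
    model="local-default",  # llama-swap resolves this to a backend model
    messages=[{"role": "user", "content": "# mentats Why does my RAID resync crawl?"}],
)
print(resp.choices[0].message.content)

# Multimodal turn (text + image_url blocks), as used by the vision/OCR paths.
resp = client.chat.completions.create(
    model="local-default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "# ocr What does this receipt say?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
```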
The “why”: small models fail less when you make the seams visible
A lot of local “agent” setups fail in the same boring ways:
- they silently change behaviour,
- they smuggle half-remembered context,
- they hallucinate continuity.
This design makes those seams legible and user-controlled:
- You pick the mode explicitly (no silent “auto-escalation”).
- Retrieval is explicit and inspectable.
- There’s a “peek” path that can show what the RAG facts block would look like without answering — which is unbelievably useful for debugging.
The philosophy is basically: if the system is going to influence the answer, it should be inspectable, not mystical.
The “what’s cool”: you’re routing workflows (Serious / Mentats / Fun / Vision)
There are two layers of control:
A) Session commands (>…): change the router state
These change how the router behaves across turns (things like sticky fun mode, which KBs are attached, and some retrieval observability):
- `>status`: show session state (sticky mode, attached KBs, last RAG query/hits)
- `>fun` / `>fun off`: toggle sticky fun mode
- `>attach <kb>` / `>detach <kb|all>` / `>list_kb`: manage KBs per session
- `>ingest <kb>` / `>ingest_all`: ingest markdown into Qdrant
- `>peek <query>`: preview the would-be facts block
B) Per-turn selectors (#…): choose the pipeline for one message
- `# mentats …`: deep 3-pass “draft → critique → final”
- `# fun …`: answer, then rewrite in a persona voice
- `# vision …` / `# ocr …`: image paths
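A minimal sketch of how that two-layer dispatch could be parsed (function and field names here are hypothetical, not the router’s actual internals):

```python
# Hypothetical dispatch sketch: split ">" session commands from "#" per-turn
# selectors before any model is involved. Names are illustrative only.
PIPELINES = {"mentats", "fun", "vision", "ocr"}

def route(session: dict, text: str):
    stripped = text.strip()
    if stripped.startswith(">"):
        # Session commands mutate router state and never reach a model.
        cmd, _, arg = stripped[1:].partition(" ")
        if cmd == "attach":
            session.setdefault("kbs", set()).add(arg)
            return ("command", f"attached KB: {arg}")
        if cmd == "status":
            return ("command",
                    f"mode={session.get('mode', 'serious')} "
                    f"kbs={sorted(session.get('kbs', set()))}")
        return ("command", f"unknown command: {cmd}")
    if stripped.startswith("#"):
        # Per-turn selector: choose a pipeline for this one message only.
        sel, _, rest = stripped[1:].lstrip().partition(" ")
        pipeline = sel if sel in PIPELINES else "serious"
        return (pipeline, rest)
    # Default: sticky mode if set, else Serious.
    return (session.get("mode", "serious"), stripped)
```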
The three main pipelines (what they actually do)
1) Serious: the default “boring, reliable” answer
Serious is the default when you don’t ask for anything special. It can inject a FACTS block (RAG) and it receives a constraints block (which is currently a V1 placeholder). It also enforces a confidence/source line, appending one if the model leaves it out.
Docs vs implementation (minor note): the docs describe Serious as “query + blocks” oriented. The current implementation also has a compact context/transcript shaping step as part of prompt construction. Treat the code as the operational truth; the docs are describing the intended shape and may lag slightly in details as things settle.
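For a feel of the shape, a sketch of the Serious flow, assuming simple labelled blocks and a plain-text confidence check (the real prompt layout and marker strings are guesses):

```python
# Sketch of the Serious pipeline: explicit blocks in, one answer out.
# The block labels follow the write-up; the exact layout is assumed.
def serious(llm, query: str, facts: str, constraints: str = "") -> str:
    prompt = "\n\n".join(filter(None, [
        f"FACTS:\n{facts}" if facts else "",  # only present if KBs are attached
        f"CONSTRAINTS:\n{constraints}",       # V1: placeholder, currently empty
        f"QUERY:\n{query}",
    ]))
    answer = llm(prompt)
    # Enforce the trailing confidence/source line if the model omitted it.
    # "Confidence:" as the marker string is an assumption.
    if "Confidence:" not in answer:
        answer += "\nConfidence: unstated | Sources: none cited"
    return answer
```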
2) Mentats: explicit 3-pass “think → critique → final”
This is the “make the model check itself” harness:
- Thinker drafts using QUERY + FACTS + constraints
- Critic checks for overreach / violations
- Thinker produces the final, carrying forward a “FACTS_USED / CONSTRAINTS_USED” discipline
If the pipeline can’t complete cleanly (protocol errors), the router falls back to Serious.
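A compact sketch of that loop, with prompts paraphrased from the description above (the real protocol wording is certainly richer):

```python
# Sketch of the Mentats three-pass loop. The FACTS_USED / CONSTRAINTS_USED
# markers follow the write-up; everything else is paraphrased.
def mentats(llm, query: str, facts: str, constraints: str, fallback) -> str:
    draft = llm(f"QUERY:\n{query}\n\nFACTS:\n{facts}\n\nCONSTRAINTS:\n{constraints}\n\n"
                "Draft an answer. End with FACTS_USED and CONSTRAINTS_USED lines.")
    critique = llm(f"DRAFT:\n{draft}\n\nFACTS:\n{facts}\n\n"
                   "Flag overreach: claims not supported by FACTS, constraint violations.")
    final = llm(f"DRAFT:\n{draft}\n\nCRITIQUE:\n{critique}\n\n"
                "Produce the final answer. Keep the FACTS_USED / CONSTRAINTS_USED lines.")
    # Protocol check: if the discipline markers are missing, fall back to Serious.
    if "FACTS_USED" not in final:
        return fallback(query, facts)
    return final
```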
3) Fun: answer first, then do the performance
Fun is deliberately a post-processing transform:
- pass 1: generate the correct content (lower temperature)
- pass 2: rewrite in a persona voice (higher temperature), explicitly instructed not to change the technical meaning
This keeps “voice” from leaking into reasoning or memory. It’s: get it right first, then style it.
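Sketched out, assuming a generic `llm(prompt, temperature=...)` callable and illustrative temperature values:

```python
# Sketch of the Fun pipeline: correctness pass first, persona pass second.
# Temperatures are illustrative; the two-pass split is what matters.
def fun(llm, query: str, persona: str) -> str:
    content = llm(query, temperature=0.3)  # pass 1: get it right, low temperature
    return llm(                            # pass 2: restyle, higher temperature
        f"Rewrite the text below in the voice of {persona}. "
        f"Do not change its technical meaning.\n\n{content}",
        temperature=0.9,
    )
```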
RAG, but practical: Qdrant + opt-in KB (knowledge base) attach + “peek what you’re feeding me”
KBs are opt-in per session
Nothing is retrieved unless you attach KBs (`>attach linux`, etc.). The FACTS block is built only from attached KBs, and the router tracks the last query and hit counts for debugging.
Ingestion: “KB folder → chunks → vectors in Qdrant”
Ingestion walks markdown, chunks, embeds, and inserts into Qdrant tagged by KB. It’s simple and operational: turn a folder of docs into something you can retrieve from reliably.
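A minimal version of that walk, assuming `qdrant-client` and a small sentence-transformers embedder (model choice, collection name, fixed-size chunking, and naive point ids are all placeholders):

```python
# Sketch of "KB folder -> chunks -> vectors in Qdrant". Only the shape matters.
from pathlib import Path
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder, 384-dim
qdrant = QdrantClient("localhost", port=6333)

def ingest_kb(kb_name: str, folder: Path, chunk_chars: int = 1200) -> None:
    # Naive fixed-size chunking for illustration; real chunkers respect structure.
    chunks = []
    for md in folder.rglob("*.md"):
        text = md.read_text(encoding="utf-8")
        chunks += [(str(md), text[i:i + chunk_chars])
                   for i in range(0, len(text), chunk_chars)]
    if not qdrant.collection_exists("kb"):
        qdrant.create_collection(
            "kb", vectors_config=VectorParams(size=384, distance=Distance.COSINE))
    vectors = embedder.encode([chunk for _, chunk in chunks]).tolist()
    # Ids are sequential for simplicity; a real ingester would use stable ids.
    qdrant.upsert("kb", points=[
        PointStruct(id=i, vector=vec,
                    payload={"kb": kb_name, "source": src, "text": chunk})
        for i, (vec, (src, chunk)) in enumerate(zip(vectors, chunks))
    ])
```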
The KB refinery: SUMM → DISTILL → ingest
This is one of the more interesting ideas: treat the KB as a product, not a dump.
- SUMM produces a human-readable summary (strict: no fabrication, no silent renaming) from base text
- DISTILL produces dense, retrieval-shaped atoms (embedding-friendly headings/bullets, minimal noise)
- then ingest the distilled output
The key point: DISTILL isn’t “a nicer summary.” It’s explicitly trying to produce retrieval-friendly material.
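As a sketch, the two refinery passes are just two differently-constrained LLM calls (prompt wording paraphrased from the constraints above, not taken from the project):

```python
# Sketch of the refinery: SUMM for humans, DISTILL for the retriever.
def refine(llm, base_text: str) -> tuple[str, str]:
    summ = llm("Summarise the text for a human reader. Do not fabricate; "
               "do not silently rename anything.\n\n" + base_text)
    distill = llm("Rewrite the text as dense, retrieval-shaped atoms: short "
                  "self-contained headings and bullets, one fact per bullet, "
                  "minimal connective noise.\n\n" + base_text)
    return summ, distill  # only the DISTILL output gets ingested
```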
Vodka: deterministic memory plumbing (not “AI memory vibes”)
Vodka does two jobs:
- context reduction / stability: keep the effective context small and consistent
- explicit notes: store/retrieve nuggets on demand (`!!store`, `??recall`, plus cleanup commands), with TTL (facts expire unless used)
It can also leave internal breadcrumb markers and later expand them when building a transcript/context — those IDs aren’t surfaced unless you deliberately show them.
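A toy model of the explicit-notes side, with the decay rule “facts expire unless used” interpreted as a use-refreshed deadline (an assumption; Vodka’s actual semantics may differ):

```python
# Toy TTL-backed note store ("!!store" / "??recall"). Details are assumed.
import time

class NoteStore:
    def __init__(self, ttl_seconds: float = 7 * 24 * 3600):
        self.ttl = ttl_seconds
        self.notes: dict[str, tuple[str, float]] = {}  # key -> (text, last_used)

    def store(self, key: str, text: str) -> None:       # "!!store"
        self.notes[key] = (text, time.time())

    def recall(self, key: str) -> str | None:           # "??recall"
        hit = self.notes.get(key)
        if hit is None or time.time() - hit[1] > self.ttl:
            self.notes.pop(key, None)                   # expired: clean up
            return None
        self.notes[key] = (hit[0], time.time())         # use refreshes the TTL
        return hit[0]
```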
Roadmap reality check: what’s left for V1.1
- Constraints/GAG: placeholder in V1 (constraints block currently empty)
- Coder role: present in config but not wired yet


Through no fault of my own, the people I talk to (family) are tied to the FB ecosystem…and they are very resistant to leaving it.
2026 will have to be the year of “please, for the love of all that is holy, can we switch to Signal”