9ml - Machine Learning Models for Plan 9

Port of llama2.c to Plan 9 (9front).

Rules

  1. NEVER skip tests. All tests are mandatory. If a test cannot run, fix the environment - do not skip.
  2. All changes must be tested. Run tests before committing:
    • Linux host: make test (or cd test && gcc -o harness *.c -lm && ./harness)
    • Plan 9 native: mk (compiles all targets)
  3. Tests must pass. Do not merge if tests fail.

Quick Start

Running Tests (Linux Host)

# Build and run all tests
make test

# Or manually:
cd test
gcc -o harness *.c -lm
./harness

The test harness:

  1. Starts a Plan 9 QEMU VM
  2. Compiles and runs tests inside Plan 9
  3. Compares output against C reference implementations
  4. Reports pass/fail

Building in Plan 9

# Clone repo and build natively in Plan 9
mk            # Build all: run, runq, export, tests
mk clean      # Clean all build artifacts

Supported Model Formats

9ml supports safetensors and GGUF model formats:

# Run with safetensors model
./run model.safetensors -z tokenizer.bin -n 50

# Run with GGUF model (supports quantized models)
./run model-Q8_0.gguf -z tokenizer.bin -n 50

Project Structure

9ml/
├── src/
│   ├── run.c              # FP32 inference (ported)
│   ├── runq.c             # INT8 quantized inference (ported)
│   ├── model.c            # Model loading helpers
│   ├── modelq.c           # Quantized model helpers
│   ├── export.c           # Model export/conversion tool
│   ├── llmfs.c            # 9P file server for multi-model inference
│   ├── simd.h             # SIMD function declarations
│   ├── simd_amd64.s       # SSE2 assembly implementations
│   ├── simdq_amd64.s      # Quantized SIMD (stub, uses C fallback)
│   ├── parallel.h         # Thread pool declarations
│   ├── parallel.c         # Thread pool implementation
│   ├── arch/              # Model architecture plugins
│   │   ├── arch.h         # Architecture interface
│   │   ├── arch.c         # Plugin registry
│   │   └── llama2.c       # LLaMA 2 architecture
│   ├── format/            # File format parsers
│   │   ├── gguf.c         # GGUF format parser
│   │   └── safetensors.c  # Safetensors parser
│   ├── pool/              # Model pool management
│   │   ├── pool.h         # Pool interface
│   │   └── pool.c         # LRU eviction, memory tracking
│   ├── mkfile             # Plan 9 build file
│   └── tests/             # Plan 9 test source files
│       ├── mkfile         # Plan 9 test build file
│       ├── test_*.c       # Various unit tests
│       └── ...
├── test/                  # C test harness (Linux host)
│   ├── harness.c          # Main test driver (supports dual-VM testing)
│   ├── reference.c/h      # Reference implementations
│   ├── qemu.c/h           # QEMU VM management (supports socket networking)
│   └── fat.c/h            # FAT disk operations (mtools)
├── qemu/
│   ├── 9front.qcow2       # VM disk image
│   └── shared.img         # FAT disk for file sharing
├── models/                # Model files (download separately)
│   ├── *.safetensors      # Safetensors format (HuggingFace)
│   └── *.gguf             # GGUF format (llama.cpp)
├── mkfile                 # Root Plan 9 build file
└── tokenizer.bin          # Tokenizer data

Running Tests

Linux Host Testing

The test harness compiles and runs all tests in a Plan 9 QEMU VM:

make test

Requirements:

  • qemu-system-x86_64
  • mtools (mcopy, mkfs.vfat)
  • curl (for downloading 9front if needed)

Test Coverage

Test Description
rmsnorm RMS normalization
softmax Softmax function
matmul Matrix multiplication
rng Random number generator (xorshift)
model_loading Config and weights loading
generation End-to-end text generation (FP32)
generation_simd FP32 generation with SIMD optimizations
quantize INT8 quantize/dequantize roundtrip
quantized_matmul Quantized matrix multiplication
generation_quantized End-to-end text generation (Q8_0, must match FP32)
llmfs_local 9P file server local mount and generation
llmfs_remote Dual-VM remote 9P inference (CPU serves, terminal mounts)
benchmark Performance benchmark (scalar vs SIMD vs threaded)
simd_validation SIMD correctness vs scalar baseline
simd_debug Minimal SIMD debug test
softmax_simd Softmax SIMD optimization tests
rmsnorm_simd RMSNorm SIMD optimization tests
arch_detect Architecture registry and constants
format_detect File format detection (GGUF, safetensors)
softmax_benchmark Softmax performance benchmark
softmax_accuracy Softmax numerical accuracy tests
gguf_dequant GGUF Q4_0/Q8_0 dequantization
gguf_parse GGUF header and metadata parsing
http HTTP client (Plan 9 dial)
safetensors Safetensors format parsing
pool_lru Model pool LRU eviction and reference counting

Building in Plan 9

Using mkfiles

# Build everything (from repo root)
mk

# Build only src targets (run, runq, export)
cd src && mk

# Build only test binaries
cd src/tests && mk

# Clean
mk clean

Architecture

9front uses amd64 (64-bit):

  • Compiler: 6c (NOT 8c which is for 386)
  • Linker: 6l (NOT 8l)
  • Object files: .6 extension

Manual Compilation

# Compile
6c -w program.c

# Link
6l -o program program.6

# Combined
6c -w program.c && 6l -o program program.6

Performance Optimizations

The inference engine supports SIMD vectorization and multi-threading for improved performance.

SIMD (SSE2)

Matrix-vector multiplication is accelerated using SSE2 packed float instructions:

Operation Implementation Speedup
matmul SSE2 assembly (simd_amd64.s) ~5.7x
dot_product SSE2 assembly ~4x
rmsnorm SSE2 assembly ~3x
softmax C with 4x unrolling (needs exp()) ~2x
vec_add, vec_scale SSE2 assembly ~4x

The SIMD implementation uses:

  • 8-element unrolled loops with 2 accumulators
  • 4-element cleanup loop
  • Scalar remainder for non-aligned sizes
  • Horizontal sum via SHUFPS for final reduction

Thread Pool

Parallel execution uses Plan 9's libthread:

  • Auto-detects CPU count from /dev/sysstat
  • Channel-based work distribution
  • Parallel attention head computation

Benchmark Results (stories15M, 1024x1024 matmul)

Mode GFLOPS Speedup
Scalar (1 thread) 3.4 1.0x
SIMD (1 thread) 19.0 5.7x
SIMD (4 threads) 18.7 5.6x

Note: Multi-threading overhead can exceed benefit for small matrices.

Runtime Configuration

/* In model.c / modelq.c */
extern OptConfig opt_config;
opt_config.use_simd = 1;    /* Enable SIMD (default) */
opt_config.nthreads = 4;    /* Set thread count (0 = auto) */

Command-line flags:

./run model.safetensors -z tok.bin --no-simd    # Disable SIMD
./run model.safetensors -z tok.bin --threads 2  # Set thread count

Running Inference

FP32 Inference (safetensors)

In Plan 9:

6c -w run.c && 6l -o run run.6
./run model.safetensors -z tokenizer.bin -n 50 -i 'Once upon a time'

Quantized Inference (GGUF)

Run with a GGUF Q8_0 model (8-bit quantization):

./run model-Q8_0.gguf -z tokenizer.bin -n 50 -i 'Once upon a time'

GGUF supports various quantization levels (Q4_0, Q8_0, etc.) for reduced memory and faster inference.

Command Line Options

Option Description
-t <float> Temperature (0.0 = greedy, 1.0 = default)
-p <float> Top-p sampling (0.9 = default)
-s <int> Random seed
-n <int> Number of tokens to generate
-i <string> Input prompt
-z <string> Path to tokenizer
-m generate|chat Mode (default: generate)

Model Export Tool

The export tool can inspect model files:

# Show model info (works on Linux or Plan 9)
./export info model.safetensors
./export info model-Q8_0.gguf

Note: For quantized models, download pre-quantized GGUF files directly from HuggingFace rather than converting. The llama.cpp project provides tools for creating GGUF files with various quantization levels (Q4_0, Q8_0, etc.).

Build on Linux:

gcc -o export src/export.c -lm

Build on Plan 9:

6c export.c && 6l -o export export.6

Supported Model Formats

9ml supports two model formats with automatic detection:

Format Detection

The model loader automatically detects format based on file magic:

  1. GGUF: Magic 0x46554747 ("GGUF" in little-endian)
  2. Safetensors: 8-byte header size followed by JSON metadata

config.json Support

For safetensors models, the loader can read a config.json file in the same directory to get model configuration. Supported fields:

  • model_type: Maps to architecture ("llama" → ARCH_LLAMA2)
  • rope_theta: RoPE frequency base (default: 10000.0)
  • num_hidden_layers: Number of transformer layers
  • num_attention_heads: Number of attention heads
  • num_key_value_heads: Number of KV heads (for GQA)
  • hidden_size: Embedding dimension
  • intermediate_size: FFN hidden dimension
  • max_position_embeddings: Maximum sequence length

If config.json is not present, values are inferred from tensor shapes.

Format Comparison

Feature Safetensors GGUF
Source HuggingFace llama.cpp
Precision FP32/FP16 FP32/FP16/Quantized
Quantization No Q4_0, Q8_0, etc.
Metadata JSON header Key-value pairs
Tensor names HuggingFace names GGML names
File size (15M) ~60 MB ~15-17 MB (Q8_0)

Tensor Name Mapping

Different formats use different tensor names:

Weight Safetensors GGUF
Token embeddings model.embed_tokens.weight token_embd.weight
Q projection model.layers.N.self_attn.q_proj.weight blk.N.attn_q.weight
K projection model.layers.N.self_attn.k_proj.weight blk.N.attn_k.weight
V projection model.layers.N.self_attn.v_proj.weight blk.N.attn_v.weight
O projection model.layers.N.self_attn.o_proj.weight blk.N.attn_output.weight
Gate (w1) model.layers.N.mlp.gate_proj.weight blk.N.ffn_gate.weight
Up (w3) model.layers.N.mlp.up_proj.weight blk.N.ffn_up.weight
Down (w2) model.layers.N.mlp.down_proj.weight blk.N.ffn_down.weight
Attn norm model.layers.N.input_layernorm.weight blk.N.attn_norm.weight
FFN norm model.layers.N.post_attention_layernorm.weight blk.N.ffn_norm.weight
Final norm model.norm.weight output_norm.weight
Output lm_head.weight output.weight

GGUF Attention Weight Interleaving

Important: llama.cpp's GGUF converter interleaves Q and K attention weights for rotary position embeddings. Within each attention head of head_dim rows:

  • GGUF rows 0,2,4,... contain original rows 0,1,2,...
  • GGUF rows 1,3,5,... contain original rows head_dim/2, head_dim/2+1,...

The 9ml GGUF loader automatically de-interleaves these weights during loading to match the standard llama2.c layout.

Creating GGUF Files

To create a GGUF from a HuggingFace model using llama.cpp:

# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build

# Convert to F32 GGUF
python convert_hf_to_gguf.py /path/to/hf-model --outtype f32

# Quantize to Q8_0
./build/bin/llama-quantize model-f32.gguf model-Q8_0.gguf Q8_0

Tied Embeddings

Some models share weights between token embeddings and output projection (tie_word_embeddings: true). The loaders handle this:

  • Safetensors: Only model.embed_tokens.weight present, used for both
  • GGUF: Only output.weight present, used for both (loader falls back)

LLM File Server (llmfs)

A 9P file server that exposes LLM inference as a Plan 9 filesystem, enabling distributed inference across machines. Supports multiple models with LRU eviction and per-session model binding.

File System Structure

/mnt/llm/
    ctl             # RW: load, unload, limit, models commands
    info            # R:  loaded models, available models, memory usage
    clone           # R:  read to create new session, returns ID
    0/              # Session 0 directory
        ctl         # RW: model, temp, topp, seed, steps, generate, reset, close
        info        # R:  model, config, and status (idle|generating|done|error)
        prompt      # W:  write prompt text
        output      # R:  complete output (blocks until done)
        stream      # R:  streaming output (returns tokens as generated)
    1/              # Session 1...

Building llmfs

In Plan 9:

6c -w llmfs.c && 6l -o llmfs llmfs.6

Local Usage

# Start the file server
./llmfs -s llm

# Mount it
mount /srv/llm /mnt/llm

# Load a model (name, model file, tokenizer)
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl

# Check pool info
cat /mnt/llm/info

# Create a session
session=`{cat /mnt/llm/clone}

# Bind model to session and configure
echo 'model small' > /mnt/llm/$session/ctl
echo 'temp 0.0' > /mnt/llm/$session/ctl
echo 'steps 50' > /mnt/llm/$session/ctl

# Set prompt and generate
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo 'generate' > /mnt/llm/$session/ctl

# Read output (blocks until complete)
cat /mnt/llm/$session/output

# Check session info (includes status)
cat /mnt/llm/$session/info

Remote Usage (Distributed Inference)

On the server machine (cpu):

# Start llmfs
./llmfs -s llm
echo 'load small stories15M.safetensors tokenizer.bin' > /srv/llm/ctl

# Export over network
aux/listen1 -tv tcp!*!564 /bin/exportfs -r /srv/llm

On the client machine (terminal):

# Connect to remote server
srv tcp!cpu!564 llm
mount /srv/llm /mnt/llm

# Use it as if local
cat /mnt/llm/clone
echo 'model small' > /mnt/llm/0/ctl
echo 'Once upon a time' > /mnt/llm/0/prompt
echo generate > /mnt/llm/0/ctl
cat /mnt/llm/0/output

Server Control Commands

Command Description
load <name> <model> <tokenizer> Load model into pool with given name
unload <name> Unload model from pool (fails if in use)
limit <max_models> <max_memory> Set pool limits (models and bytes)
models <path> Set directory to scan for available models

Session Control Commands

Command Description
model <name> Bind session to named model
temp <float> Set temperature (0.0 = greedy)
topp <float> Set top-p sampling (0.0-1.0)
seed <int> Set random seed
steps <int> Set max tokens to generate
generate Start generation
reset Reset session state
close Close session

Info File Format

The /mnt/llm/info file shows pool status:

loaded:
  small: 60.5MB (0 refs)
  large: 13.5GB (2 refs)
available:
  tinyllama.gguf
  mistral-7b.safetensors
memory: 13.56GB / 16.00GB
limit: 8 models

The /mnt/llm/N/info file shows session status:

model: small
temp: 0.8
topp: 0.9
seed: 12345
steps: 256
status: done 202.50 tok/s

Multi-Model Usage

# Load multiple models into the pool
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl
echo 'load large llama2-7b.gguf tokenizer.bin' > /mnt/llm/ctl

# Check pool status
cat /mnt/llm/info

# Create session and bind to specific model
session=`{cat /mnt/llm/clone}
echo 'model large' > /mnt/llm/$session/ctl

# Generate using bound model
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo 'generate' > /mnt/llm/$session/ctl
cat /mnt/llm/$session/output

# Check session info (includes bound model)
cat /mnt/llm/$session/info

The pool uses LRU eviction: when memory or model count limits are reached, the least recently used models with zero references are unloaded. Sessions hold references to their bound models, preventing eviction while in use.

Downloading Models

HuggingFace integration for automatic model downloads is not yet available. HuggingFace now uses git-lfs and the xet format for model storage, which requires additional tooling support.

To use models with 9ml:

  1. Download manually from HuggingFace website or using huggingface-cli:

    # Install huggingface-cli
    pip install huggingface-hub
    
    # Download safetensors model
    huggingface-cli download Xenova/llama2.c-stories15M model.safetensors --local-dir models/
    
    # Download GGUF model (pre-quantized)
    huggingface-cli download tensorblock/Xenova_llama2.c-stories15M-GGUF \
        llama2.c-stories15M-Q8_0.gguf --local-dir models/
  2. Copy to shared disk for use in Plan 9:

    # Copy model files to the FAT shared disk
    mcopy -i qemu/shared.img models/model.safetensors models/tokenizer.bin ::
  3. Load in Plan 9 via llmfs:

    echo 'load mymodel /mnt/host/model.safetensors /mnt/host/tokenizer.bin' > /mnt/llm/ctl
    

Plan 9 C Porting Guide

Headers

#include <u.h>
#include <libc.h>

Type Mappings

POSIX Plan 9
int8_t schar
uint8_t uchar
int32_t int
uint32_t uint
int64_t vlong
uint64_t uvlong
ssize_t vlong
size_t ulong
NULL nil

Function Mappings

POSIX Plan 9
printf(...) print(...)
fprintf(stderr, ...) fprint(2, ...)
exit(0) exits(0)
exit(1) exits("error")
clock_gettime() nsec()
mmap() Use open() + read() + malloc()

Main Function

void
main(int argc, char *argv[])
{
    // ... code ...
    exits(0);  // or exits("error message")
}

Critical: Struct Padding

Plan 9 may pad structs differently. When reading binary files with struct headers:

// WRONG - struct may be padded
read(fd, &config, sizeof(Config));

// RIGHT - read raw bytes, then copy fields
#define CONFIG_FILE_SIZE (7 * sizeof(int))
int buf[7];
read(fd, buf, CONFIG_FILE_SIZE);
config.dim = buf[0];
config.hidden_dim = buf[1];
// ...

No OpenMP

Remove all #pragma omp directives - Plan 9 doesn't support OpenMP.

Plan 9 amd64 Assembly Calling Convention

When writing assembly for Plan 9 amd64:

First argument:     BP (RARG register)
Subsequent args:    Stack at 16(SP), 24(SP), 32(SP), 40(SP)...
Return value:       AX (integer), X0 (float)
Callee-saved:       None (caller saves all)

Stack layout (verified empirically):

0(SP)   = return address
8(SP)   = padding/frame
16(SP)  = 2nd argument
24(SP)  = 3rd argument
32(SP)  = 4th argument
40(SP)  = 5th argument

Important: Use SUBL/TESTL/JLE pattern instead of CMPL/JGE for loop comparisons - the Plan 9 assembler's comparison semantics differ from standard x86.

SIMD Assembly Implementation Notes

The simd_amd64.s file contains SSE2 implementations for performance-critical operations:

Frame Size Matters

Use $0 frame size (no local stack variables) for simple functions:

TEXT matmul_simd(SB), $0    // Works - no local frame
TEXT rmsnorm_simd(SB), $0   // Works - no local frame

Using $8 or other frame sizes changes stack argument offsets and can cause memory faults. If you need temp storage, use registers instead of stack.

BYTE-Encoded Instructions

Plan 9 assembler doesn't support all SSE instructions. Use BYTE encoding:

// CVTSI2SS R14, X1 (convert int64 in R14 to float in X1)
// F3 49 0F 2A CE = REX.WB prefix + opcode + ModR/M
BYTE $0xF3; BYTE $0x49; BYTE $0x0F; BYTE $0x2A; BYTE $0xCE

// MOVD R8d, X0 (move 32-bit from R8 to XMM0)
// 66 41 0F 6E C0 = operand-size + REX.B + opcode + ModR/M
BYTE $0x66; BYTE $0x41; BYTE $0x0F; BYTE $0x6E; BYTE $0xC0

// SQRTSS X0, X1 (sqrt of X0 into X1)
// F3 0F 51 C8
BYTE $0xF3; BYTE $0x0F; BYTE $0x51; BYTE $0xC8

// RSQRTSS X0, X1 (approximate 1/sqrt of X0 into X1)
// F3 0F 52 C8
BYTE $0xF3; BYTE $0x0F; BYTE $0x52; BYTE $0xC8

Approximate vs Exact Instructions

  • RSQRTSS - approximate reciprocal sqrt, relative error ~0.0004 (fast but inaccurate)
  • SQRTSS + DIVSS - exact sqrt (slower but matches scalar sqrtf())

For rmsnorm, use exact SQRTSS to match scalar output:

// Compute exact 1/sqrt
SQRTSS X0, X1           // X1 = sqrt(X0)
MOVL $0x3F800000, R8    // 1.0f in IEEE 754
MOVD R8, X0             // X0 = 1.0
DIVSS X1, X0            // X0 = 1.0 / sqrt(...)

Plan 9 FPU Exception Handling

Plan 9 enables floating-point exceptions by default. To disable:

setfcr(getfcr() & ~(FPINVAL|FPZDIV|FPOVFL|FPUNFL|FPINEX));

This affects the x87 FPU; SSE's MXCSR register is typically initialized with exceptions already masked (0x1F80), so SSE code usually needs no change.

Debugging Tips

  1. Memory faults often indicate wrong stack offsets - verify with matmul_simd pattern
  2. Denormal exceptions suggest RSQRTSS with very small values - use SQRTSS+DIVSS
  3. Add pointer validation at function entry for debugging:
TESTQ DI, DI
JZ bad_ptr
TESTQ SI, SI
JZ bad_ptr

No bsearch

Implement binary search manually:

int str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {
    int lo = 0, hi = vocab_size - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(str, sorted_vocab[mid].str);
        if (cmp == 0) return sorted_vocab[mid].id;
        if (cmp < 0) hi = mid - 1;
        else lo = mid + 1;
    }
    return -1;
}

QEMU VM (for Linux host testing)

The test harness manages QEMU automatically. For manual debugging:

Boot Sequence

  1. bootargs prompt -> Press Enter (accept default)
  2. user prompt -> Press Enter (accept default: glenda)
  3. Reach term% prompt (rc shell)

Mounting Shared Disk in Plan 9

dossrv -f /dev/sdG0/data shared
mount -c /srv/shared /mnt/host

Troubleshooting

"file does not exist" in Plan 9

The shared disk may not be mounted. Mount it manually:

dossrv -f /dev/sdG0/data shared
mount -c /srv/shared /mnt/host

Compilation errors about missing functions

Common missing functions in Plan 9:

  • bsearch - implement manually (see above)
  • round - use floor(x + 0.5f)
  • sqrtf, expf, etc. - use sqrt, exp (Plan 9's libc provides only the double versions)

Generation output is garbage

Check struct padding - use raw byte reading for binary file headers.


Resources