Port of llama2.c to Plan 9 (9front).
- NEVER skip tests. All tests are mandatory. If a test cannot run, fix the environment - do not skip.
- All changes must be tested. Run tests before committing:
  - Linux host: `make test` (or `cd test && gcc -o harness *.c -lm && ./harness`)
  - Plan 9 native: `mk` (compiles all targets)
- Tests must pass. Do not merge if tests fail.
```
# Build and run all tests
make test

# Or manually:
cd test
gcc -o harness *.c -lm
./harness
```

The test harness:
- Starts a Plan 9 QEMU VM
- Compiles and runs tests inside Plan 9
- Compares output against C reference implementations
- Reports pass/fail
```
# Clone repo and build natively in Plan 9
mk        # Build all: run, runq, export, tests
mk clean  # Clean all build artifacts
```

9ml supports safetensors and GGUF model formats:
```
# Run with safetensors model
./run model.safetensors -z tokenizer.bin -n 50

# Run with GGUF model (supports quantized models)
./run model-Q8_0.gguf -z tokenizer.bin -n 50
```

```
9ml/
├── src/
│   ├── run.c             # FP32 inference (ported)
│   ├── runq.c            # INT8 quantized inference (ported)
│   ├── model.c           # Model loading helpers
│   ├── modelq.c          # Quantized model helpers
│   ├── export.c          # Model export/conversion tool
│   ├── llmfs.c           # 9P file server for multi-model inference
│   ├── simd.h            # SIMD function declarations
│   ├── simd_amd64.s      # SSE2 assembly implementations
│   ├── simdq_amd64.s     # Quantized SIMD (stub, uses C fallback)
│   ├── parallel.h        # Thread pool declarations
│   ├── parallel.c        # Thread pool implementation
│   ├── arch/             # Model architecture plugins
│   │   ├── arch.h        # Architecture interface
│   │   ├── arch.c        # Plugin registry
│   │   └── llama2.c      # LLaMA 2 architecture
│   ├── format/           # File format parsers
│   │   ├── gguf.c        # GGUF format parser
│   │   └── safetensors.c # Safetensors parser
│   ├── pool/             # Model pool management
│   │   ├── pool.h        # Pool interface
│   │   └── pool.c        # LRU eviction, memory tracking
│   ├── mkfile            # Plan 9 build file
│   └── tests/            # Plan 9 test source files
│       ├── mkfile        # Plan 9 test build file
│       ├── test_*.c      # Various unit tests
│       └── ...
├── test/                 # C test harness (Linux host)
│   ├── harness.c         # Main test driver (supports dual-VM testing)
│   ├── reference.c/h     # Reference implementations
│   ├── qemu.c/h          # QEMU VM management (supports socket networking)
│   └── fat.c/h           # FAT disk operations (mtools)
├── qemu/
│   ├── 9front.qcow2      # VM disk image
│   └── shared.img        # FAT disk for file sharing
├── models/               # Model files (download separately)
│   ├── *.safetensors     # Safetensors format (HuggingFace)
│   └── *.gguf            # GGUF format (llama.cpp)
├── mkfile                # Root Plan 9 build file
└── tokenizer.bin         # Tokenizer data
```
The test harness compiles and runs all tests in a Plan 9 QEMU VM:
```
make test
```

Requirements:
- `qemu-system-x86_64`
- `mtools` (mcopy, mkfs.vfat)
- `curl` (for downloading 9front if needed)
| Test | Description |
|---|---|
| rmsnorm | RMS normalization |
| softmax | Softmax function |
| matmul | Matrix multiplication |
| rng | Random number generator (xorshift) |
| model_loading | Config and weights loading |
| generation | End-to-end text generation (FP32) |
| generation_simd | FP32 generation with SIMD optimizations |
| quantize | INT8 quantize/dequantize roundtrip |
| quantized_matmul | Quantized matrix multiplication |
| generation_quantized | End-to-end text generation (Q8_0, must match FP32) |
| llmfs_local | 9P file server local mount and generation |
| llmfs_remote | Dual-VM remote 9P inference (CPU serves, terminal mounts) |
| benchmark | Performance benchmark (scalar vs SIMD vs threaded) |
| simd_validation | SIMD correctness vs scalar baseline |
| simd_debug | Minimal SIMD debug test |
| softmax_simd | Softmax SIMD optimization tests |
| rmsnorm_simd | RMSNorm SIMD optimization tests |
| arch_detect | Architecture registry and constants |
| format_detect | File format detection (GGUF, safetensors) |
| softmax_benchmark | Softmax performance benchmark |
| softmax_accuracy | Softmax numerical accuracy tests |
| gguf_dequant | GGUF Q4_0/Q8_0 dequantization |
| gguf_parse | GGUF header and metadata parsing |
| http | HTTP client (Plan 9 dial) |
| safetensors | Safetensors format parsing |
| pool_lru | Model pool LRU eviction and reference counting |
```
# Build everything (from repo root)
mk

# Build only src targets (run, runq, export)
cd src && mk

# Build only test binaries
cd src/tests && mk

# Clean
mk clean
```

9front uses amd64 (64-bit):
- Compiler: `6c` (NOT `8c`, which is for 386)
- Linker: `6l` (NOT `8l`)
- Object files: `.6` extension
```
# Compile
6c -w program.c

# Link
6l -o program program.6

# Combined
6c -w program.c && 6l -o program program.6
```

The inference engine supports SIMD vectorization and multi-threading for improved performance.
Matrix-vector multiplication is accelerated using SSE2 packed float instructions:
| Operation | Implementation | Speedup |
|---|---|---|
| matmul | SSE2 assembly (simd_amd64.s) | ~5.7x |
| dot_product | SSE2 assembly | ~4x |
| rmsnorm | SSE2 assembly | ~3x |
| softmax | C with 4x unrolling (needs exp()) | ~2x |
| vec_add, vec_scale | SSE2 assembly | ~4x |
The SIMD implementation uses:
- 8-element unrolled loops with 2 accumulators
- 4-element cleanup loop
- Scalar remainder for non-aligned sizes
- Horizontal sum via SHUFPS for final reduction
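The same loop structure can be mirrored in portable C. This is an illustrative sketch of the strategy listed above, not the actual simd_amd64.s code: an 8-element unrolled main loop with two independent accumulators, a 4-element cleanup loop, and a scalar remainder.

```c
/* Scalar mirror of the SIMD dot-product structure: illustrative only. */
float
dot_unrolled(const float *a, const float *b, int n)
{
	float acc0 = 0.0f, acc1 = 0.0f;
	int i = 0;

	/* 8-element unrolled main loop, two independent accumulators */
	for (; i + 8 <= n; i += 8) {
		acc0 += a[i+0]*b[i+0] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
		acc1 += a[i+4]*b[i+4] + a[i+5]*b[i+5] + a[i+6]*b[i+6] + a[i+7]*b[i+7];
	}
	/* 4-element cleanup loop */
	for (; i + 4 <= n; i += 4)
		acc0 += a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
	/* scalar remainder for non-aligned sizes */
	for (; i < n; i++)
		acc1 += a[i]*b[i];

	return acc0 + acc1;	/* plays the role of the SHUFPS horizontal sum */
}
```

The two accumulators break the dependency chain between additions, which is where most of the assembly version's speedup comes from.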
Parallel execution uses Plan 9's libthread:
- Auto-detects CPU count from `/dev/sysstat`
- Channel-based work distribution
- Parallel attention head computation
| Mode | GFLOPS | Speedup |
|---|---|---|
| Scalar (1 thread) | 3.4 | 1.0x |
| SIMD (1 thread) | 19.0 | 5.7x |
| SIMD (4 threads) | 18.7 | 5.6x |
Note: Multi-threading overhead can exceed benefit for small matrices.
```c
/* In model.c / modelq.c */
extern OptConfig opt_config;
opt_config.use_simd = 1;   /* Enable SIMD (default) */
opt_config.nthreads = 4;   /* Set thread count (0 = auto) */
```

Command-line flags:

```
./run model.safetensors -z tok.bin --no-simd    # Disable SIMD
./run model.safetensors -z tok.bin --threads 2  # Set thread count
```
In Plan 9:
```
6c -w run.c && 6l -o run run.6
./run model.safetensors -z tokenizer.bin -n 50 -i 'Once upon a time'
```
Run with a GGUF Q8_0 model (8-bit quantization):
```
./run model-Q8_0.gguf -z tokenizer.bin -n 50 -i 'Once upon a time'
```
GGUF supports various quantization levels (Q4_0, Q8_0, etc.) for reduced memory and faster inference.
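For reference, GGML's Q8_0 layout groups weights into blocks of 32 int8 values sharing one per-block scale. Below is a minimal dequantization sketch: the real format stores the scale as fp16, so the plain `float` scale and the `BlockQ8_0` name here are simplifications for portability, not the actual gguf.c code.

```c
#include <stdint.h>

enum { QK8_0 = 32 };		/* elements per Q8_0 block */

typedef struct {
	float  d;		/* block scale (fp16 in the actual file) */
	int8_t qs[QK8_0];	/* quantized values */
} BlockQ8_0;

/* Expand nblocks Q8_0 blocks into nblocks*QK8_0 floats: y = q * scale. */
void
dequant_q8_0(const BlockQ8_0 *blocks, int nblocks, float *out)
{
	int i, j;
	for (i = 0; i < nblocks; i++)
		for (j = 0; j < QK8_0; j++)
			out[i * QK8_0 + j] = blocks[i].qs[j] * blocks[i].d;
}
```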
| Option | Description |
|---|---|
| `-t <float>` | Temperature (0.0 = greedy, 1.0 = default) |
| `-p <float>` | Top-p sampling (0.9 = default) |
| `-s <int>` | Random seed |
| `-n <int>` | Number of tokens to generate |
| `-i <string>` | Input prompt |
| `-z <string>` | Path to tokenizer |
| `-m generate\|chat` | Mode (default: generate) |
The export tool can inspect model files:
```
# Show model info (works on Linux or Plan 9)
./export info model.safetensors
./export info model-Q8_0.gguf
```

Note: For quantized models, download pre-quantized GGUF files directly from HuggingFace rather than converting. The llama.cpp project provides tools for creating GGUF files with various quantization levels (Q4_0, Q8_0, etc.).

Build on Linux:

```
gcc -o export src/export.c -lm
```

Build on Plan 9:

```
6c export.c && 6l -o export export.6
```
9ml supports two model formats with automatic detection:
The model loader automatically detects format based on file magic:
- GGUF: Magic `0x46554747` ("GGUF" in little-endian)
- Safetensors: 8-byte header size followed by JSON metadata
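A minimal sketch of magic-based detection along these lines. The names are illustrative; the real loader in src/format/ does fuller validation, and the little-endian read via memcpy assumes an amd64-style byte order.

```c
#include <stdint.h>
#include <string.h>

enum { FMT_UNKNOWN, FMT_GGUF, FMT_SAFETENSORS };

/* Sniff the format from the first few header bytes of a model file. */
int
detect_format(const unsigned char *hdr, int nhdr)
{
	uint32_t magic;

	if (nhdr < 9)
		return FMT_UNKNOWN;
	memcpy(&magic, hdr, 4);		/* little-endian read on amd64 */
	if (magic == 0x46554747)	/* "GGUF" */
		return FMT_GGUF;
	/* safetensors: 8-byte little-endian JSON header size, then '{' */
	if (hdr[8] == '{')
		return FMT_SAFETENSORS;
	return FMT_UNKNOWN;
}
```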
For safetensors models, the loader can read a config.json file in the same directory to get model configuration. Supported fields:
- `model_type`: Maps to architecture ("llama" → ARCH_LLAMA2)
- `rope_theta`: RoPE frequency base (default: 10000.0)
- `num_hidden_layers`: Number of transformer layers
- `num_attention_heads`: Number of attention heads
- `num_key_value_heads`: Number of KV heads (for GQA)
- `hidden_size`: Embedding dimension
- `intermediate_size`: FFN hidden dimension
- `max_position_embeddings`: Maximum sequence length
If config.json is not present, values are inferred from tensor shapes.
| Feature | Safetensors | GGUF |
|---|---|---|
| Source | HuggingFace | llama.cpp |
| Precision | FP32/FP16 | FP32/FP16/Quantized |
| Quantization | No | Q4_0, Q8_0, etc. |
| Metadata | JSON header | Key-value pairs |
| Tensor names | HuggingFace names | GGML names |
| File size (15M) | ~60 MB | ~15-17 MB (Q8_0) |
Different formats use different tensor names:
| Weight | Safetensors | GGUF |
|---|---|---|
| Token embeddings | `model.embed_tokens.weight` | `token_embd.weight` |
| Q projection | `model.layers.N.self_attn.q_proj.weight` | `blk.N.attn_q.weight` |
| K projection | `model.layers.N.self_attn.k_proj.weight` | `blk.N.attn_k.weight` |
| V projection | `model.layers.N.self_attn.v_proj.weight` | `blk.N.attn_v.weight` |
| O projection | `model.layers.N.self_attn.o_proj.weight` | `blk.N.attn_output.weight` |
| Gate (w1) | `model.layers.N.mlp.gate_proj.weight` | `blk.N.ffn_gate.weight` |
| Up (w3) | `model.layers.N.mlp.up_proj.weight` | `blk.N.ffn_up.weight` |
| Down (w2) | `model.layers.N.mlp.down_proj.weight` | `blk.N.ffn_down.weight` |
| Attn norm | `model.layers.N.input_layernorm.weight` | `blk.N.attn_norm.weight` |
| FFN norm | `model.layers.N.post_attention_layernorm.weight` | `blk.N.ffn_norm.weight` |
| Final norm | `model.norm.weight` | `output_norm.weight` |
| Output | `lm_head.weight` | `output.weight` |
Important: llama.cpp's GGUF converter interleaves Q and K attention weights for rotary position embeddings. Within each attention head of head_dim rows:
- GGUF rows 0,2,4,... contain original rows 0,1,2,...
- GGUF rows 1,3,5,... contain original rows head_dim/2, head_dim/2+1,...
The 9ml GGUF loader automatically de-interleaves these weights during loading to match the standard llama2.c layout.
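The de-interleaving amounts to a fixed row permutation per head: GGUF row 2j holds original row j, and GGUF row 2j+1 holds original row head_dim/2 + j. A hedged sketch of that permutation for one head (illustrative function, not the actual loader code):

```c
#include <string.h>

/* Copy one head's rows from GGUF interleaved order back to the
 * standard llama2.c layout. row_len is the number of floats per row. */
void
deinterleave_head(float *dst, const float *src, int head_dim, int row_len)
{
	int j;

	for (j = 0; j < head_dim / 2; j++) {
		/* GGUF row 2j -> original row j */
		memcpy(dst + (size_t)j * row_len,
		       src + (size_t)(2 * j) * row_len,
		       row_len * sizeof(float));
		/* GGUF row 2j+1 -> original row head_dim/2 + j */
		memcpy(dst + (size_t)(head_dim / 2 + j) * row_len,
		       src + (size_t)(2 * j + 1) * row_len,
		       row_len * sizeof(float));
	}
}
```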
To create a GGUF from a HuggingFace model using llama.cpp:
```
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build

# Convert to F32 GGUF
python convert_hf_to_gguf.py /path/to/hf-model --outtype f32

# Quantize to Q8_0
./build/bin/llama-quantize model-f32.gguf model-Q8_0.gguf Q8_0
```

Some models share weights between token embeddings and output projection (`tie_word_embeddings: true`). The loaders handle this:
- Safetensors: Only `model.embed_tokens.weight` present, used for both
- GGUF: Only `output.weight` present, used for both (loader falls back)
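What the fallback amounts to can be sketched in a few lines. The struct and field names here are illustrative (`wcls` follows llama2.c's naming), not the actual loader's data layout:

```c
#include <stddef.h>

typedef struct {
	float *token_embedding;	/* always present */
	float *wcls;		/* output projection; NULL when weights are tied */
} Weights;

/* If the output projection tensor was absent from the file, reuse the
 * token embedding table for the final logits projection. */
void
resolve_tied_weights(Weights *w)
{
	if (w->wcls == NULL)
		w->wcls = w->token_embedding;
}
```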
A 9P file server that exposes LLM inference as a Plan 9 filesystem, enabling distributed inference across machines. Supports multiple models with LRU eviction and per-session model binding.
```
/mnt/llm/
    ctl        # RW: load, unload, limit, models commands
    info       # R: loaded models, available models, memory usage
    clone      # R: read to create new session, returns ID
    0/         # Session 0 directory
        ctl    # RW: model, temp, topp, seed, steps, generate, reset, close
        info   # R: model, config, and status (idle|generating|done|error)
        prompt # W: write prompt text
        output # R: complete output (blocks until done)
        stream # R: streaming output (returns tokens as generated)
    1/         # Session 1...
```
In Plan 9:
```
6c -w llmfs.c && 6l -o llmfs llmfs.6
```
```
# Start the file server
./llmfs -s llm

# Mount it
mount /srv/llm /mnt/llm

# Load a model (name, model file, tokenizer)
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl

# Check pool info
cat /mnt/llm/info

# Create a session
session=`{cat /mnt/llm/clone}

# Bind model to session and configure
echo 'model small' > /mnt/llm/$session/ctl
echo 'temp 0.0' > /mnt/llm/$session/ctl
echo 'steps 50' > /mnt/llm/$session/ctl

# Set prompt and generate
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo 'generate' > /mnt/llm/$session/ctl

# Read output (blocks until complete)
cat /mnt/llm/$session/output

# Check session info (includes status)
cat /mnt/llm/$session/info
```
On the server machine (cpu):
```
# Start llmfs, mount it, and load a model
./llmfs -s llm
mount /srv/llm /mnt/llm
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl

# Export over network
aux/listen1 -tv tcp!*!564 /bin/exportfs -r /mnt/llm
```
On the client machine (terminal):
```
# Connect to remote server
srv tcp!cpu!564 llm
mount /srv/llm /mnt/llm

# Use it as if local
cat /mnt/llm/clone
echo 'model small' > /mnt/llm/0/ctl
echo 'Once upon a time' > /mnt/llm/0/prompt
echo generate > /mnt/llm/0/ctl
cat /mnt/llm/0/output
```
| Command | Description |
|---|---|
| `load <name> <model> <tokenizer>` | Load model into pool with given name |
| `unload <name>` | Unload model from pool (fails if in use) |
| `limit <max_models> <max_memory>` | Set pool limits (models and bytes) |
| `models <path>` | Set directory to scan for available models |
| Command | Description |
|---|---|
| `model <name>` | Bind session to named model |
| `temp <float>` | Set temperature (0.0 = greedy) |
| `topp <float>` | Set top-p sampling (0.0-1.0) |
| `seed <int>` | Set random seed |
| `steps <int>` | Set max tokens to generate |
| `generate` | Start generation |
| `reset` | Reset session state |
| `close` | Close session |
The /mnt/llm/info file shows pool status:
```
loaded:
    small: 60.5MB (0 refs)
    large: 13.5GB (2 refs)
available:
    tinyllama.gguf
    mistral-7b.safetensors
memory: 13.56GB / 16.00GB
limit: 8 models
```
The /mnt/llm/N/info file shows session status:
```
model: small
temp: 0.8
topp: 0.9
seed: 12345
steps: 256
status: done 202.50 tok/s
```
```
# Load multiple models into the pool
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl
echo 'load large llama2-7b.gguf tokenizer.bin' > /mnt/llm/ctl

# Check pool status
cat /mnt/llm/info

# Create session and bind to specific model
session=`{cat /mnt/llm/clone}
echo 'model large' > /mnt/llm/$session/ctl

# Generate using bound model
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo 'generate' > /mnt/llm/$session/ctl
cat /mnt/llm/$session/output

# Check session info (includes bound model)
cat /mnt/llm/$session/info
```
The pool uses LRU eviction: when memory or model count limits are reached, the least recently used models with zero references are unloaded. Sessions hold references to their bound models, preventing eviction while in use.
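The eviction rule can be sketched as a small victim-selection function: scan loaded models, skip anything with a nonzero refcount, and pick the least recently used of the rest. Names here are illustrative, not the actual pool.c API.

```c
typedef struct {
	int  refs;	/* sessions currently bound to this model */
	long last_use;	/* monotonically increasing use counter */
	int  loaded;
} PoolSlot;

/* Return the index of the eviction victim, or -1 if every loaded
 * model is still referenced (nothing can be evicted). */
int
pick_victim(PoolSlot *slots, int n)
{
	int i, victim = -1;

	for (i = 0; i < n; i++) {
		if (!slots[i].loaded || slots[i].refs > 0)
			continue;	/* in use or not loaded: skip */
		if (victim < 0 || slots[i].last_use < slots[victim].last_use)
			victim = i;	/* oldest unreferenced model so far */
	}
	return victim;
}
```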
HuggingFace integration for automatic model downloads is not yet available. HuggingFace now uses git-lfs and the xet format for model storage, which requires additional tooling support.
To use models with 9ml:
1. Download manually from the HuggingFace website or using `huggingface-cli`:

   ```
   # Install huggingface-cli
   pip install huggingface-hub

   # Download safetensors model
   huggingface-cli download Xenova/llama2.c-stories15M model.safetensors --local-dir models/

   # Download GGUF model (pre-quantized)
   huggingface-cli download tensorblock/Xenova_llama2.c-stories15M-GGUF \
       llama2.c-stories15M-Q8_0.gguf --local-dir models/
   ```

2. Copy to the shared disk for use in Plan 9:

   ```
   # Copy model files to the FAT shared disk
   mcopy -i qemu/shared.img models/model.safetensors models/tokenizer.bin ::
   ```

3. Load in Plan 9 via llmfs:

   ```
   echo 'load mymodel /mnt/host/model.safetensors /mnt/host/tokenizer.bin' > /mnt/llm/ctl
   ```
```c
#include <u.h>
#include <libc.h>
```

| POSIX | Plan 9 |
|---|---|
| `int8_t` | `schar` |
| `uint8_t` | `uchar` |
| `int32_t` | `int` |
| `uint32_t` | `uint` |
| `int64_t` | `vlong` |
| `uint64_t` | `uvlong` |
| `ssize_t` | `vlong` |
| `size_t` | `ulong` |
| `NULL` | `nil` |
| POSIX | Plan 9 |
|---|---|
| `printf(...)` | `print(...)` |
| `fprintf(stderr, ...)` | `fprint(2, ...)` |
| `exit(0)` | `exits(0)` |
| `exit(1)` | `exits("error")` |
| `clock_gettime()` | `nsec()` |
| `mmap()` | Use `open()` + `read()` + `malloc()` |
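The mmap() replacement in the last row amounts to slurping the whole file into a malloc'd buffer. POSIX calls are used in this sketch so it runs on the Linux host; the Plan 9 version has the same shape with `seek()` in place of `lseek()`.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read an entire file into a malloc'd buffer, setting *len to its size.
 * Returns NULL on any error. Illustrative sketch of the open+read+malloc
 * pattern, not the actual loader code. */
unsigned char *
load_file(const char *path, long *len)
{
	int fd;
	long n;
	unsigned char *buf;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;
	n = lseek(fd, 0, SEEK_END);	/* find file size */
	lseek(fd, 0, SEEK_SET);		/* rewind for reading */
	buf = malloc(n > 0 ? n : 1);
	if (buf == NULL || read(fd, buf, n) != n) {
		free(buf);
		close(fd);
		return NULL;
	}
	close(fd);
	*len = n;
	return buf;
}
```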
```c
void
main(int argc, char *argv[])
{
	// ... code ...
	exits(0);	// or exits("error message")
}
```

Plan 9 may pad structs differently. When reading binary files with struct headers:
```c
// WRONG - struct may be padded
read(fd, &config, sizeof(Config));

// RIGHT - read raw bytes, then copy fields
#define CONFIG_FILE_SIZE (7 * sizeof(int))
int buf[7];
read(fd, buf, CONFIG_FILE_SIZE);
config.dim = buf[0];
config.hidden_dim = buf[1];
// ...
```

Remove all `#pragma omp` directives - Plan 9 doesn't support OpenMP.
When writing assembly for Plan 9 amd64:
- First argument: BP (the RARG register)
- Subsequent args: stack, at 16(SP), 24(SP), 32(SP), 40(SP)...
- Return value: AX (integer), X0 (float)
- Callee-saved: none (caller saves all)
Stack layout (verified empirically):
```
 0(SP) = return address
 8(SP) = padding/frame
16(SP) = 2nd argument
24(SP) = 3rd argument
32(SP) = 4th argument
40(SP) = 5th argument
```
Important: Use SUBL/TESTL/JLE pattern instead of CMPL/JGE for loop comparisons - the Plan 9 assembler's comparison semantics differ from standard x86.
The simd_amd64.s file contains SSE2 implementations for performance-critical operations:
Use $0 frame size (no local stack variables) for simple functions:

```
TEXT matmul_simd(SB), $0    // Works - no local frame
TEXT rmsnorm_simd(SB), $0   // Works - no local frame
```

Using $8 or other frame sizes changes stack argument offsets and can cause memory faults. If you need temporary storage, use registers instead of the stack.
Plan 9 assembler doesn't support all SSE instructions. Use BYTE encoding:
```
// CVTSI2SS R14, X1 (convert int64 in R14 to float in X1)
// F3 49 0F 2A CE = REX.WB prefix + opcode + ModR/M
BYTE $0xF3; BYTE $0x49; BYTE $0x0F; BYTE $0x2A; BYTE $0xCE

// MOVD R8d, X0 (move 32-bit from R8 to XMM0)
// 66 41 0F 6E C0 = operand-size + REX.B + opcode + ModR/M
BYTE $0x66; BYTE $0x41; BYTE $0x0F; BYTE $0x6E; BYTE $0xC0

// SQRTSS X0, X1 (sqrt of X0 into X1)
// F3 0F 51 C8
BYTE $0xF3; BYTE $0x0F; BYTE $0x51; BYTE $0xC8

// RSQRTSS X0, X1 (approximate 1/sqrt of X0 into X1)
// F3 0F 52 C8
BYTE $0xF3; BYTE $0x0F; BYTE $0x52; BYTE $0xC8
```

- `RSQRTSS` - approximate reciprocal sqrt, relative error ~0.0004 (fast but inaccurate)
- `SQRTSS` + `DIVSS` - exact sqrt (slower but matches scalar `sqrtf()`)
For rmsnorm, use exact SQRTSS to match scalar output:
```
// Compute exact 1/sqrt
SQRTSS X0, X1           // X1 = sqrt(X0)
MOVL $0x3F800000, R8    // 1.0f in IEEE 754
MOVD R8, X0             // X0 = 1.0
DIVSS X1, X0            // X0 = 1.0 / sqrt(...)
```

Plan 9 enables floating-point exceptions by default. To disable:

```c
setfcr(getfcr() & ~(FPINVAL|FPZDIV|FPOVFL|FPUNFL|FPINEX));
```

This affects the x87 FPU, but SSE MXCSR is typically initialized with exceptions masked (0x1F80).
- Memory faults often indicate wrong stack offsets - verify with matmul_simd pattern
- Denormal exceptions suggest RSQRTSS with very small values - use SQRTSS+DIVSS
- Add pointer validation at function entry for debugging:

```
TESTQ DI, DI
JZ bad_ptr
TESTQ SI, SI
JZ bad_ptr
```

Implement binary search manually:
```c
int
str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size)
{
	int lo = 0, hi = vocab_size - 1;
	while (lo <= hi) {
		int mid = (lo + hi) / 2;
		int cmp = strcmp(str, sorted_vocab[mid].str);
		if (cmp == 0)
			return sorted_vocab[mid].id;
		if (cmp < 0)
			hi = mid - 1;
		else
			lo = mid + 1;
	}
	return -1;
}
```

The test harness manages QEMU automatically. For manual debugging:
- `bootargs` prompt -> Press Enter (accept default)
- `user` prompt -> Press Enter (accept default: glenda)
- Reach `term%` prompt (rc shell)
```
dossrv -f /dev/sdG0/data shared
mount -c /srv/shared /mnt/host
```
The shared disk may not be mounted. Mount it manually:
```
dossrv -f /dev/sdG0/data shared
mount -c /srv/shared /mnt/host
```
Common missing functions in Plan 9:

- `bsearch` - implement manually
- `round` - use `floor(x + 0.5f)`
- `sqrtf`/`expf`/etc - use `sqrt`/`exp` (Plan 9 has double versions)
Check struct padding - use raw byte reading for binary file headers.