9ml - Machine Learning Models for Plan 9

Port of llama2.c to Plan 9 (9front).

Rules

  1. NEVER skip tests. All tests are mandatory. If a test cannot run, fix the environment - do not skip.
  2. All changes must be tested. Run tests before committing:
    • Linux host: make test (or cd test && gcc -o harness *.c -lm && ./harness)
    • Plan 9 native: mk (compiles all targets)
  3. Tests must pass. Do not merge if tests fail.

Quick Start

Running Tests (Linux Host)

# Build and run all tests
make test

# Or manually:
cd test
gcc -o harness *.c -lm
./harness

The test harness:

  1. Starts a Plan 9 QEMU VM
  2. Compiles and runs tests inside Plan 9
  3. Compares output against C reference implementations
  4. Reports pass/fail

Building in Plan 9

# Clone repo and build natively in Plan 9
mk            # Build all: run, runq, export, tests
mk clean      # Clean all build artifacts

Supported Model Formats

9ml supports safetensors and GGUF model formats:

# Run with safetensors model
./run model.safetensors -z tokenizer.bin -n 50

# Run with GGUF model (supports quantized models)
./run model-Q8_0.gguf -z tokenizer.bin -n 50

Project Structure

9ml/
├── src/
│   ├── run.c              # FP32 inference (ported)
│   ├── runq.c             # INT8 quantized inference (ported)
│   ├── model.c            # Model loading helpers
│   ├── modelq.c           # Quantized model helpers
│   ├── export.c           # Model export/conversion tool
│   ├── llmfs.c            # 9P file server for multi-model inference
│   ├── simd.h             # SIMD function declarations
│   ├── simd_amd64.s       # SSE2 assembly implementations
│   ├── simdq_amd64.s      # Quantized SIMD (stub, uses C fallback)
│   ├── parallel.h         # Thread pool declarations
│   ├── parallel.c         # Thread pool implementation
│   ├── arch/              # Model architecture plugins
│   │   ├── arch.h         # Architecture interface
│   │   ├── arch.c         # Plugin registry
│   │   └── llama2.c       # LLaMA 2 architecture
│   ├── format/            # File format parsers
│   │   ├── gguf.c         # GGUF format parser
│   │   └── safetensors.c  # Safetensors parser
│   ├── pool/              # Model pool management
│   │   ├── pool.h         # Pool interface
│   │   └── pool.c         # LRU eviction, memory tracking
│   ├── mkfile             # Plan 9 build file
│   └── tests/             # Plan 9 test source files
│       ├── mkfile         # Plan 9 test build file
│       ├── test_*.c       # Various unit tests
│       └── ...
├── test/                  # C test harness (Linux host)
│   ├── harness.c          # Main test driver (supports dual-VM testing)
│   ├── reference.c/h      # Reference implementations
│   ├── qemu.c/h           # QEMU VM management (supports socket networking)
│   └── fat.c/h            # FAT disk operations (mtools)
├── qemu/
│   ├── 9front.qcow2       # VM disk image
│   └── shared.img         # FAT disk for file sharing
├── models/                # Model files (download separately)
│   ├── *.safetensors      # Safetensors format (HuggingFace)
│   └── *.gguf             # GGUF format (llama.cpp)
├── mkfile                 # Root Plan 9 build file
└── tokenizer.bin          # Tokenizer data

Running Tests

Linux Host Testing

The test harness compiles and runs all tests in a Plan 9 QEMU VM:

make test

Requirements:

  • qemu-system-x86_64
  • mtools (mcopy, mkfs.vfat)
  • curl (for downloading 9front if needed)

Test Coverage

Test Description
rmsnorm RMS normalization
softmax Softmax function
matmul Matrix multiplication
rng Random number generator (xorshift)
model_loading Config and weights loading
generation End-to-end text generation (FP32)
generation_simd FP32 generation with SIMD optimizations
quantize INT8 quantize/dequantize roundtrip
quantized_matmul Quantized matrix multiplication
generation_quantized End-to-end text generation (Q8_0, must match FP32)
llmfs_local 9P file server local mount and generation
llmfs_remote Dual-VM remote 9P inference (CPU serves, terminal mounts)
benchmark Performance benchmark (scalar vs SIMD vs threaded)
simd_validation SIMD correctness vs scalar baseline
simd_debug Minimal SIMD debug test
softmax_simd Softmax SIMD optimization tests
rmsnorm_simd RMSNorm SIMD optimization tests
arch_detect Architecture registry and constants
format_detect File format detection (GGUF, safetensors)
softmax_benchmark Softmax performance benchmark
softmax_accuracy Softmax numerical accuracy tests
gguf_dequant GGUF Q4_0/Q8_0 dequantization
gguf_parse GGUF header and metadata parsing
http HTTP client (Plan 9 dial)
safetensors Safetensors format parsing
pool_lru Model pool LRU eviction and reference counting

Building in Plan 9

Using mkfiles

# Build everything (from repo root)
mk

# Build only src targets (run, runq, export)
cd src && mk

# Build only test binaries
cd src/tests && mk

# Clean
mk clean

Architecture

9front uses amd64 (64-bit):

  • Compiler: 6c (NOT 8c which is for 386)
  • Linker: 6l (NOT 8l)
  • Object files: .6 extension

Manual Compilation

# Compile
6c -w program.c

# Link
6l -o program program.6

# Combined
6c -w program.c && 6l -o program program.6

Performance Optimizations

The inference engine supports SIMD vectorization and multi-threading for improved performance.

SIMD (SSE2)

Matrix-vector multiplication is accelerated using SSE2 packed float instructions:

Operation Implementation Speedup
matmul SSE2 assembly (simd_amd64.s) ~5.7x
dot_product SSE2 assembly ~4x
rmsnorm SSE2 assembly ~3x
softmax C with 4x unrolling (needs exp()) ~2x
vec_add, vec_scale SSE2 assembly ~4x

The SIMD implementation uses:

  • 8-element unrolled loops with 2 accumulators
  • 4-element cleanup loop
  • Scalar remainder for non-aligned sizes
  • Horizontal sum via SHUFPS for final reduction

Thread Pool

Parallel execution uses Plan 9's libthread:

  • Auto-detects CPU count from /dev/sysstat
  • Channel-based work distribution
  • Parallel attention head computation

Benchmark Results (stories15M, 1024x1024 matmul)

Mode GFLOPS Speedup
Scalar (1 thread) 3.4 1.0x
SIMD (1 thread) 19.0 5.7x
SIMD (4 threads) 18.7 5.6x

Note: Multi-threading overhead can exceed benefit for small matrices.

Runtime Configuration

/* In model.c / modelq.c */
extern OptConfig opt_config;
opt_config.use_simd = 1;    /* Enable SIMD (default) */
opt_config.nthreads = 4;    /* Set thread count (0 = auto) */

Command-line flags:

./run model.safetensors -z tok.bin --no-simd    # Disable SIMD
./run model.safetensors -z tok.bin --threads 2  # Set thread count

Running Inference

FP32 Inference (safetensors)

In Plan 9:

6c -w run.c && 6l -o run run.6
./run model.safetensors -z tokenizer.bin -n 50 -i 'Once upon a time'

Quantized Inference (GGUF)

Run with a GGUF Q8_0 model (8-bit quantization):

./run model-Q8_0.gguf -z tokenizer.bin -n 50 -i 'Once upon a time'

GGUF supports various quantization levels (Q4_0, Q8_0, etc.) for reduced memory and faster inference.

Command Line Options

Option Description
-t <float> Temperature (0.0 = greedy, 1.0 = default)
-p <float> Top-p sampling (0.9 = default)
-s <int> Random seed
-n <int> Number of tokens to generate
-i <string> Input prompt
-z <string> Path to tokenizer
-m generate|chat Mode (default: generate)

Model Export Tool

The export tool can inspect model files:

# Show model info (works on Linux or Plan 9)
./export info model.safetensors
./export info model-Q8_0.gguf

Note: For quantized models, download pre-quantized GGUF files directly from HuggingFace rather than converting. The llama.cpp project provides tools for creating GGUF files with various quantization levels (Q4_0, Q8_0, etc.).

Build on Linux:

gcc -o export src/export.c -lm

Build on Plan 9:

6c export.c && 6l -o export export.6

Supported Model Formats

9ml supports two model formats with automatic detection:

Format Detection

The model loader automatically detects format based on file magic:

  1. GGUF: Magic 0x46554747 ("GGUF" in little-endian)
  2. Safetensors: 8-byte header size followed by JSON metadata

config.json Support

For safetensors models, the loader can read a config.json file in the same directory to get model configuration. Supported fields:

  • model_type: Maps to architecture ("llama" → ARCH_LLAMA2)
  • rope_theta: RoPE frequency base (default: 10000.0)
  • num_hidden_layers: Number of transformer layers
  • num_attention_heads: Number of attention heads
  • num_key_value_heads: Number of KV heads (for GQA)
  • hidden_size: Embedding dimension
  • intermediate_size: FFN hidden dimension
  • max_position_embeddings: Maximum sequence length

If config.json is not present, values are inferred from tensor shapes.

Format Comparison

Feature Safetensors GGUF
Source HuggingFace llama.cpp
Precision FP32/FP16 FP32/FP16/Quantized
Quantization No Q4_0, Q8_0, etc.
Metadata JSON header Key-value pairs
Tensor names HuggingFace names GGML names
File size (15M) ~60 MB ~15-17 MB (Q8_0)

Tensor Name Mapping

Different formats use different tensor names:

Weight Safetensors GGUF
Token embeddings model.embed_tokens.weight token_embd.weight
Q projection model.layers.N.self_attn.q_proj.weight blk.N.attn_q.weight
K projection model.layers.N.self_attn.k_proj.weight blk.N.attn_k.weight
V projection model.layers.N.self_attn.v_proj.weight blk.N.attn_v.weight
O projection model.layers.N.self_attn.o_proj.weight blk.N.attn_output.weight
Gate (w1) model.layers.N.mlp.gate_proj.weight blk.N.ffn_gate.weight
Up (w3) model.layers.N.mlp.up_proj.weight blk.N.ffn_up.weight
Down (w2) model.layers.N.mlp.down_proj.weight blk.N.ffn_down.weight
Attn norm model.layers.N.input_layernorm.weight blk.N.attn_norm.weight
FFN norm model.layers.N.post_attention_layernorm.weight blk.N.ffn_norm.weight
Final norm model.norm.weight output_norm.weight
Output lm_head.weight output.weight

GGUF Attention Weight Interleaving

Important: llama.cpp's GGUF converter interleaves Q and K attention weights for rotary position embeddings. Within each attention head of head_dim rows:

  • GGUF rows 0,2,4,... contain original rows 0,1,2,...
  • GGUF rows 1,3,5,... contain original rows head_dim/2, head_dim/2+1,...

The 9ml GGUF loader automatically de-interleaves these weights during loading to match the standard llama2.c layout.

Creating GGUF Files

To create a GGUF from a HuggingFace model using llama.cpp:

# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build

# Convert to F32 GGUF
python convert_hf_to_gguf.py /path/to/hf-model --outtype f32

# Quantize to Q8_0
./build/bin/llama-quantize model-f32.gguf model-Q8_0.gguf Q8_0

Tied Embeddings

Some models share weights between token embeddings and output projection (tie_word_embeddings: true). The loaders handle this:

  • Safetensors: Only model.embed_tokens.weight present, used for both
  • GGUF: Only output.weight present, used for both (loader falls back)

LLM File Server (llmfs)

A 9P file server that exposes LLM inference as a Plan 9 filesystem, enabling distributed inference across machines. Supports multiple models with LRU eviction and per-session model binding.

File System Structure

/mnt/llm/
    ctl             # RW: load, unload, limit, models commands
    info            # R:  loaded models, available models, memory usage
    clone           # R:  read to create new session, returns ID
    0/              # Session 0 directory
        ctl         # RW: model, temp, topp, seed, steps, generate, reset, close
        info        # R:  model, config, and status (idle|generating|done|error)
        prompt      # W:  write prompt text
        output      # R:  complete output (blocks until done)
        stream      # R:  streaming output (returns tokens as generated)
    1/              # Session 1...

Building llmfs

In Plan 9:

6c -w llmfs.c && 6l -o llmfs llmfs.6

Local Usage

# Start the file server
./llmfs -s llm

# Mount it
mount /srv/llm /mnt/llm

# Load a model (name, model file, tokenizer)
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl

# Check pool info
cat /mnt/llm/info

# Create a session
session=`{cat /mnt/llm/clone}

# Bind model to session and configure
echo 'model small' > /mnt/llm/$session/ctl
echo 'temp 0.0' > /mnt/llm/$session/ctl
echo 'steps 50' > /mnt/llm/$session/ctl

# Set prompt and generate
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo 'generate' > /mnt/llm/$session/ctl

# Read output (blocks until complete)
cat /mnt/llm/$session/output

# Check session info (includes status)
cat /mnt/llm/$session/info

Remote Usage (Distributed Inference)

On the server machine (cpu):

# Start llmfs
./llmfs -s llm
echo 'load small stories15M.safetensors tokenizer.bin' > /srv/llm/ctl

# Export over network
aux/listen1 -tv tcp!*!564 /bin/exportfs -r /srv/llm

On the client machine (terminal):

# Connect to remote server
srv tcp!cpu!564 llm
mount /srv/llm /mnt/llm

# Use it as if local
cat /mnt/llm/clone
echo 'model small' > /mnt/llm/0/ctl
echo 'Once upon a time' > /mnt/llm/0/prompt
echo generate > /mnt/llm/0/ctl
cat /mnt/llm/0/output

Server Control Commands

Command Description
load <name> <model> <tokenizer> Load model into pool with given name
unload <name> Unload model from pool (fails if in use)
limit <max_models> <max_memory> Set pool limits (models and bytes)
models <path> Set directory to scan for available models

Session Control Commands

Command Description
model <name> Bind session to named model
temp <float> Set temperature (0.0 = greedy)
topp <float> Set top-p sampling (0.0-1.0)
seed <int> Set random seed
steps <int> Set max tokens to generate
generate Start generation
reset Reset session state
close Close session

Info File Format

The /mnt/llm/info file shows pool status:

loaded:
  small: 60.5MB (0 refs)
  large: 13.5GB (2 refs)
available:
  tinyllama.gguf
  mistral-7b.safetensors
memory: 13.56GB / 16.00GB
limit: 8 models

The /mnt/llm/N/info file shows session status:

model: small
temp: 0.8
topp: 0.9
seed: 12345
steps: 256
status: done 202.50 tok/s

Multi-Model Usage

# Load multiple models into the pool
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl
echo 'load large llama2-7b.gguf tokenizer.bin' > /mnt/llm/ctl

# Check pool status
cat /mnt/llm/info

# Create session and bind to specific model
session=`{cat /mnt/llm/clone}
echo 'model large' > /mnt/llm/$session/ctl

# Generate using bound model
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo 'generate' > /mnt/llm/$session/ctl
cat /mnt/llm/$session/output

# Check session info (includes bound model)
cat /mnt/llm/$session/info

The pool uses LRU eviction: when memory or model count limits are reached, the least recently used models with zero references are unloaded. Sessions hold references to their bound models, preventing eviction while in use.

Downloading Models

HuggingFace integration for automatic model downloads is not yet available. HuggingFace now uses git-lfs and the xet format for model storage, which requires additional tooling support.

To use models with 9ml:

  1. Download manually from HuggingFace website or using huggingface-cli:

    # Install huggingface-cli
    pip install huggingface-hub
    
    # Download safetensors model
    huggingface-cli download Xenova/llama2.c-stories15M model.safetensors --local-dir models/
    
    # Download GGUF model (pre-quantized)
    huggingface-cli download tensorblock/Xenova_llama2.c-stories15M-GGUF \
        llama2.c-stories15M-Q8_0.gguf --local-dir models/
  2. Copy to shared disk for use in Plan 9:

    # Copy model files to the FAT shared disk
    mcopy -i qemu/shared.img models/model.safetensors models/tokenizer.bin ::
  3. Load in Plan 9 via llmfs:

    echo 'load mymodel /mnt/host/model.safetensors /mnt/host/tokenizer.bin' > /mnt/llm/ctl
    

Plan 9 C Porting Guide

Headers

#include <u.h>
#include <libc.h>

Type Mappings

POSIX Plan 9
int8_t schar
uint8_t uchar
int32_t int
uint32_t uint
int64_t vlong
uint64_t uvlong
ssize_t vlong
size_t ulong
NULL nil

Function Mappings

POSIX Plan 9
printf(...) print(...)
fprintf(stderr, ...) fprint(2, ...)
exit(0) exits(0)
exit(1) exits("error")
clock_gettime() nsec()
mmap() Use open() + read() + malloc()

Main Function

void
main(int argc, char *argv[])
{
    // ... code ...
    exits(0);  // or exits("error message")
}

Critical: Struct Padding

Plan 9 may pad structs differently. When reading binary files with struct headers:

// WRONG - struct may be padded
read(fd, &config, sizeof(Config));

// RIGHT - read raw bytes, then copy fields
#define CONFIG_FILE_SIZE (7 * sizeof(int))
int buf[7];
read(fd, buf, CONFIG_FILE_SIZE);
config.dim = buf[0];
config.hidden_dim = buf[1];
// ...

No OpenMP

Remove all #pragma omp directives - Plan 9 doesn't support OpenMP.

Plan 9 amd64 Assembly Calling Convention

When writing assembly for Plan 9 amd64:

First argument:     BP (RARG register)
Subsequent args:    Stack at 16(SP), 24(SP), 32(SP), 40(SP)...
Return value:       AX (integer), X0 (float)
Callee-saved:       None (caller saves all)

Stack layout (verified empirically):

0(SP)   = return address
8(SP)   = padding/frame
16(SP)  = 2nd argument
24(SP)  = 3rd argument
32(SP)  = 4th argument
40(SP)  = 5th argument

Important: Use SUBL/TESTL/JLE pattern instead of CMPL/JGE for loop comparisons - the Plan 9 assembler's comparison semantics differ from standard x86.

SIMD Assembly Implementation Notes

The simd_amd64.s file contains SSE2 implementations for performance-critical operations:

Frame Size Matters

Use $0 frame size (no local stack variables) for simple functions:

TEXT matmul_simd(SB), $0    // Works - no local frame
TEXT rmsnorm_simd(SB), $0   // Works - no local frame

Using $8 or other frame sizes changes stack argument offsets and can cause memory faults. If you need temp storage, use registers instead of stack.

BYTE-Encoded Instructions

Plan 9 assembler doesn't support all SSE instructions. Use BYTE encoding:

// CVTSI2SS R14, X1 (convert int64 in R14 to float in X1)
// F3 49 0F 2A CE = REX.WB prefix + opcode + ModR/M
BYTE $0xF3; BYTE $0x49; BYTE $0x0F; BYTE $0x2A; BYTE $0xCE

// MOVD R8d, X0 (move 32-bit from R8 to XMM0)
// 66 41 0F 6E C0 = operand-size + REX.B + opcode + ModR/M
BYTE $0x66; BYTE $0x41; BYTE $0x0F; BYTE $0x6E; BYTE $0xC0

// SQRTSS X0, X1 (sqrt of X0 into X1)
// F3 0F 51 C8
BYTE $0xF3; BYTE $0x0F; BYTE $0x51; BYTE $0xC8

// RSQRTSS X0, X1 (approximate 1/sqrt of X0 into X1)
// F3 0F 52 C8
BYTE $0xF3; BYTE $0x0F; BYTE $0x52; BYTE $0xC8

Approximate vs Exact Instructions

  • RSQRTSS - approximate reciprocal sqrt, relative error ~0.0004 (fast but inaccurate)
  • SQRTSS + DIVSS - exact sqrt (slower but matches scalar sqrtf())

For rmsnorm, use exact SQRTSS to match scalar output:

// Compute exact 1/sqrt
SQRTSS X0, X1           // X1 = sqrt(X0)
MOVL $0x3F800000, R8    // 1.0f in IEEE 754
MOVD R8, X0             // X0 = 1.0
DIVSS X1, X0            // X0 = 1.0 / sqrt(...)

Plan 9 FPU Exception Handling

Plan 9 enables floating-point exceptions by default. To disable:

setfcr(getfcr() & ~(FPINVAL|FPZDIV|FPOVFL|FPUNFL|FPINEX));

This affects the x87 FPU; SSE's MXCSR register is typically initialized with exceptions already masked (0x1F80), so SSE code usually needs no change.

Debugging Tips

  1. Memory faults often indicate wrong stack offsets - verify with matmul_simd pattern
  2. Denormal exceptions suggest RSQRTSS with very small values - use SQRTSS+DIVSS
  3. Add pointer validation at function entry for debugging:
TESTQ DI, DI
JZ bad_ptr
TESTQ SI, SI
JZ bad_ptr

No bsearch

Implement binary search manually:

int str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {
    int lo = 0, hi = vocab_size - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(str, sorted_vocab[mid].str);
        if (cmp == 0) return sorted_vocab[mid].id;
        if (cmp < 0) hi = mid - 1;
        else lo = mid + 1;
    }
    return -1;
}

QEMU VM (for Linux host testing)

The test harness manages QEMU automatically. For manual debugging:

Boot Sequence

  1. bootargs prompt -> Press Enter (accept default)
  2. user prompt -> Press Enter (accept default: glenda)
  3. Reach term% prompt (rc shell)

Mounting Shared Disk in Plan 9

dossrv -f /dev/sdG0/data shared
mount -c /srv/shared /mnt/host

Troubleshooting

"file does not exist" in Plan 9

The shared disk may not be mounted. Mount it manually:

dossrv -f /dev/sdG0/data shared
mount -c /srv/shared /mnt/host

Compilation errors about missing functions

Common missing functions in Plan 9:

  • bsearch - implement manually (see above)
  • round - use floor(x + 0.5f)
  • sqrtf, expf, etc. - use sqrt, exp (Plan 9's libc provides only the double versions)

Generation output is garbage

Check struct padding - use raw byte reading for binary file headers.


Resources