Technical notes, architecture snapshots, and decisions accumulated during development. This is the living record of what Optimo was, is, and is becoming.
```mermaid
flowchart TD
    A[main.rs<br>Bootstrap Orchestrator]
    A --> B[app_state.rs<br>Application State]
    A --> C[pipeline.rs<br>Async Orchestration]
    A --> D[fold.rs<br>Deterministic Reducer]
    A --> E[observation.rs<br>Observation Model]
    A --> F[persistence.rs<br>Persistence Boundary]
    C --> G[OCR Pipeline]
    G --> H[Tokio spawn_blocking + Rayon par_iter]
    H --> D
    D --> F
    F --> I[data/observations.jsonl]
```
The architecture separates:
- orchestration
- deterministic logic
- observation
- persistence
- Reducer purity — the reducer must remain pure, deterministic, and free of side effects.
- External metadata injection — timestamps, ids, and other non-deterministic metadata must come from the runtime layer.
- Persistence boundary isolation — storage concerns must stay outside the core.
- Derived event model — events must be derived from reducer results, not emitted as side effects.
- First-class observability — observations are part of the system contract.
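These principles can be condensed into a few lines: the fold stays a pure function of state and evidence, and the runtime layer attaches timestamps and ids around the pure result. A minimal sketch; `CoreState`, `Evidence`, and `observe` are illustrative names, not Optimo's actual types.

```rust
// Illustrative sketch of reducer purity: the core fold is a pure function of
// (state, evidence); ids and timestamps are injected by the runtime layer.
#[derive(Debug, Clone, PartialEq)]
struct CoreState { total_votes: u64 }

struct Evidence { weight: u64 }

// Pure core: no clocks, no randomness, no I/O.
fn fold(state: CoreState, ev: &Evidence) -> CoreState {
    CoreState { total_votes: state.total_votes + ev.weight }
}

// Runtime boundary: non-deterministic metadata wraps the pure result.
struct ObservationRecord { state: CoreState, timestamp_ms: u64, id: u64 }

fn observe(state: CoreState, timestamp_ms: u64, id: u64) -> ObservationRecord {
    ObservationRecord { state, timestamp_ms, id }
}

fn main() {
    let s = fold(CoreState { total_votes: 0 }, &Evidence { weight: 3 });
    // Determinism: same input, same state.
    assert_eq!(s, fold(CoreState { total_votes: 0 }, &Evidence { weight: 3 }));
    let rec = observe(s, 1_700_000_000_000, 42);
    assert_eq!(rec.state.total_votes, 3);
}
```

Because events are derived from reducer results rather than emitted as side effects, the `observe` step can be replayed or dropped without touching the core state transition.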
```
src/
  main.rs             # Bootstrap and runtime startup
  app_state.rs        # Application state (paths, dirs, OCR language)
  pipeline.rs         # Async orchestration (Tokio + Rayon boundary)
  fold.rs             # Deterministic weighted positional reducer
  observation.rs      # Observation model and validation rules
  persistence.rs      # Persistence boundary (JSONL + SQLite)
  timequake.rs        # Temporal replay core (deterministic, no I/O)
  aggregate_state.rs  # Fold-derived deterministic state
  snapshot.rs         # Structural projection + rehydration payload
  profile.rs          # Ingestion profile (enum + config)
  config.rs           # Config resolution (CLI > ENV > FILE > DEFAULT)
  ocrys/
    mod.rs            # OCR facade
    tesseract.rs      # Tesseract CLI integration
    normalize.rs      # Canonical line normalization
    types.rs          # OCRDocument / OCRPage / OCRLine
scripts/
  setup_data.sh           # Prepare data directories
  process_all.sh          # Run all images via Docker
  process_all_local.sh    # Run all images locally via cargo
```
- `main.rs` loads `AppState` and parses input document paths.
- `pipeline.rs` schedules one async task per document using `JoinSet`.
- Each document crosses into CPU workers using `spawn_blocking`.
- Rayon executes OCR variants in parallel: `original`, `high_contrast`, `rotated`.
- `fold.rs` merges variant outputs using inline weighted positional voting:
  - group evidence by logical position (page, line)
  - normalize text (NFC, trim, whitespace collapse, decimal comma harmonization)
  - cluster similar candidates via Jaro-Winkler
  - accumulate confidence weights incrementally
  - recompute winner and convergence/ambiguity scores after each vote
- The final observation is appended by `persistence.rs`.
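The handoff shape above can be sketched without dependencies. The real pipeline uses Tokio's `JoinSet` plus `spawn_blocking` to cross into Rayon's `par_iter`; this illustration stands in std threads for both layers so it stays self-contained.

```rust
// Dependency-free sketch of the orchestration shape: one task per document
// (JoinSet analogue), and inside each task the OCR variants run on parallel
// CPU workers (spawn_blocking + par_iter analogue). Names are illustrative.
use std::thread;

fn run_variant(doc: &str, variant: &str) -> String {
    // Placeholder for a real OCR invocation.
    format!("{doc}/{variant}")
}

fn process_document(doc: &str) -> Vec<String> {
    let variants = ["original", "high_contrast", "rotated"];
    thread::scope(|s| {
        // Fan out: each variant on its own CPU worker.
        let handles = variants.map(|v| s.spawn(move || run_variant(doc, v)));
        // Fan in: collect variant outputs for the fold step.
        handles.map(|h| h.join().unwrap()).to_vec()
    })
}

fn main() {
    let docs = ["a.png", "b.png"];
    let results: Vec<Vec<String>> = thread::scope(|s| {
        // One lightweight task per document.
        let handles = docs.map(|d| s.spawn(move || process_document(d)));
        handles.map(|h| h.join().unwrap()).to_vec()
    });
    assert_eq!(results[0], vec!["a.png/original", "a.png/high_contrast", "a.png/rotated"]);
}
```

The key property preserved from the real design is the clean boundary: orchestration fans out and joins, while the per-variant work is a pure function of its inputs.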
```mermaid
flowchart TD
    A[OCR Variant Output<br>original / contrast / rotated]
    A --> B[Line Extraction]
    B --> C[Positional Alignment<br>page × line index]
    C --> D[Normalize<br>NFC · trim · decimal · whitespace]
    D --> E[Cluster Matching<br>Jaro-Winkler ≥ 0.90]
    E --> F[Inline Vote<br>accumulate confidence weight]
    F --> G[Live Winner + Metrics<br>convergence · ambiguity bps]
    G --> H[AggregateState]
    H --> I[Observation]
    I --> J[persistence.rs]
    J --> K[data/observations.jsonl]
```
Input: `Vec<OCRDocument>`
Output: `AggregateState`
Guarantees:
- deterministic: same input → same output
- order-independent: stable under permutation
- replayable: no I/O, no timestamps, no randomness
- live convergence: metrics updated per incoming line
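The determinism and order-independence guarantees can be checked mechanically. A minimal sketch, assuming the engine sorts evidence into a canonical order before folding; the `Line` type and `reduce` function here are illustrative, not the real API.

```rust
// Sketch of the order-independence guarantee: evidence is brought into a
// deterministic canonical order before folding, so every permutation of the
// same input produces the identical output.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Line { page: u32, index: u32, text: String }

fn reduce(mut lines: Vec<Line>) -> Vec<Line> {
    lines.sort();  // canonical order: stable under input permutation
    lines.dedup(); // exact duplicates cannot inflate the result
    lines
}

fn main() {
    let a = Line { page: 1, index: 0, text: "Rck 30".into() };
    let b = Line { page: 1, index: 1, text: "total".into() };
    // Same evidence, two arrival orders, one result.
    assert_eq!(reduce(vec![a.clone(), b.clone()]), reduce(vec![b, a]));
}
```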
- Deterministic replay from genesis (events ordered by timestamp + id)
- Checkpoint + tail replay (latest snapshot + events after cutoff)
- Rigorous snapshot hydration:
  - validates `schema_version`, `document_id`/source coherence, confidence match
  - fails explicitly before any reducer contamination
  - separates projection (reporting) from rehydration (fold resume)
- Equivalence test: genesis and checkpoint+tail replay produce identical final state ✓
- Failure mode tests: 5 tests guarantee no panic, no zombie state on corruption
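The genesis vs checkpoint+tail equivalence reduces to a simple algebraic fact about left folds. A sketch with an integer state standing in for `AggregateState`; `apply` and `replay` are illustrative names.

```rust
// Sketch of checkpoint + tail replay equivalence: replaying all events from
// genesis equals replaying the tail on top of a checkpoint state, because a
// left fold composes along any cut point in the event sequence.
fn apply(state: i64, event: i64) -> i64 { state + event }

fn replay(genesis: i64, events: &[i64]) -> i64 {
    events.iter().fold(genesis, |s, &e| apply(s, e))
}

fn main() {
    let events = [3, -1, 4, 1, 5];
    let cutoff = 3;
    let full = replay(0, &events);
    // Checkpoint = state after the first `cutoff` events; tail = the rest.
    let checkpoint = replay(0, &events[..cutoff]);
    let resumed = replay(checkpoint, &events[cutoff..]);
    assert_eq!(full, resumed);
}
```

This only holds when `apply` is deterministic and free of external inputs, which is exactly why the replay core bans I/O, timestamps, and randomness.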
```sh
cargo test timequake::tests
```

All 5 tests pass ✓
- Schema Evolution — versioned migrations for snapshot format
- Integrity Hash Chain — snapshot_hash + tail_chain_hash for audit
- Observation Replay — emit_observation in replay flow with deterministic metadata
```sh
cargo run -- fixtures/sample.png
cargo run -- --replay
cargo run -- --replay <document_uuid>
./scripts/process_all_local.sh fixtures
```

```sh
docker build -t optimo:latest .
docker run --rm \
  -v "$(pwd)/fixtures:/app/fixtures:ro" \
  -v "$(pwd)/data:/app/data" \
  optimo:latest /app/fixtures/sample.png
./scripts/process_all.sh fixtures
```

```
data/observations.jsonl   # append-only decision records
data/ocrys/latest/        # OCR artifacts per run
```
- Default OCR language: `ita`
- Persistence: JSONL (primary) + SQLite (parallel, queryable)
- `observation.rs` defines a typed `OcrObservation` for structured audit
- `timequake.rs` is the canonical replay engine; no business logic, no I/O
- `profile.rs` drives normalization policy per ingestion source
- Config precedence: CLI > ENV > FILE (`optimo.yml`) > DEFAULT
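The precedence chain maps naturally onto `Option::or`. A hedged sketch; the `resolve` helper is hypothetical, not Optimo's actual config API.

```rust
// Illustrative sketch of CLI > ENV > FILE > DEFAULT resolution: each layer is
// an Option, and the first populated layer wins.
fn resolve(cli: Option<&str>, env: Option<&str>, file: Option<&str>, default: &str) -> String {
    cli.or(env).or(file).unwrap_or(default).to_string()
}

fn main() {
    // CLI beats everything.
    assert_eq!(resolve(Some("eng"), Some("ita"), None, "ita"), "eng");
    // File value surfaces only when CLI and ENV are absent.
    assert_eq!(resolve(None, None, Some("deu"), "ita"), "deu");
    // Nothing set: fall through to the default.
    assert_eq!(resolve(None, None, None, "ita"), "ita");
}
```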
OCR is currently used only as a pipeline stress-test and input generator.
Long-term objective: a deterministic document analysis engine where parsing, validation, rule evaluation, and structural checks run through the same reducer/observation pipeline — without modifying the deterministic core.
The fold engine matured from a weighted voting prototype into a semantically strict evidence reducer. Seven distinct properties were strengthened:
Algebraic properties — formally tested
- Commutativity: reducing `[A, B, C]` in any permutation produces identical output
- Idempotence: duplicate documents do not create phantom clusters or inflate scores
- Monotonicity: additional confirming evidence never decreases the convergence score
- Batch/incremental equivalence: `reduce([A,B,C])` ≡ `empty.update(A).update(B).update(C)` within ±1 bps
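Batch/incremental equivalence in miniature: the toy `State` below is exact, whereas the real engine's fixed-point metrics are what motivate the ±1 bps tolerance. All names here are illustrative.

```rust
// Sketch of batch/incremental equivalence: folding a whole batch from the
// empty state equals chaining single updates, because the fold is just the
// repeated application of the same update function.
#[derive(Default, Debug, PartialEq)]
struct State { votes: u64, weight: u64 }

impl State {
    fn update(mut self, w: u64) -> State {
        self.votes += 1;
        self.weight += w;
        self
    }
}

fn reduce(batch: &[u64]) -> State {
    batch.iter().fold(State::default(), |s, &w| s.update(w))
}

fn main() {
    let batch = [10, 20, 30];
    assert_eq!(reduce(&batch), State::default().update(10).update(20).update(30));
}
```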
Fuzzy clustering with anti-homoglyph guardrails
- Jaro-Winkler clustering on NFC-normalized, case-folded cluster keys
- `same_script_family()` guard blocks Cyrillic–Latin homoglyph injection (e.g. `і` U+0456 silently merging with `i`)
- Case-insensitive matching with original-variant preservation in the winner display
- `sanitize_variant_display()` strips control characters and zero-width codepoints before BTreeMap storage
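A hand-rolled sketch of the script-family guard. The real `same_script_family()` may consult richer Unicode script data; this version only distinguishes the two families named above and is purely illustrative.

```rust
// Sketch of an anti-homoglyph guard: two candidates may only cluster when
// their letters come from the same script family, so Cyrillic і (U+0456)
// cannot silently merge with Latin i.
#[derive(Debug, PartialEq)]
enum ScriptFamily { Latin, Cyrillic, Other }

fn script_of(c: char) -> ScriptFamily {
    match c {
        'a'..='z' | 'A'..='Z' => ScriptFamily::Latin,
        '\u{0400}'..='\u{04FF}' => ScriptFamily::Cyrillic, // Cyrillic block
        _ => ScriptFamily::Other,
    }
}

fn same_script_family(a: &str, b: &str) -> bool {
    // First letter-like character decides the family of each string.
    let fam = |s: &str| s.chars().map(script_of).find(|f| *f != ScriptFamily::Other);
    match (fam(a), fam(b)) {
        (Some(x), Some(y)) => x == y,
        _ => true, // no letters on one side: nothing to guard against
    }
}

fn main() {
    // Cyrillic і must not be allowed to merge with Latin i.
    assert!(!same_script_family("і", "i"));
    // Digits and spaces do not block same-family matches.
    assert!(same_script_family("Rck30", "Rck 30"));
}
```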
Hard idempotency via source fingerprint
- Items are deduplicated by `(position, cluster_key, source)` before entering the fold
- The same OCR source cannot vote twice for the same cluster at the same position
- Deduplication is order-independent (applied after the deterministic sort)
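The fingerprint dedup can be sketched with a `BTreeSet`, whose deterministic iteration order fits the reducer's replay constraints. `dedup_votes` and the tuple layout are illustrative.

```rust
use std::collections::BTreeSet;

// Sketch of hard idempotency: every vote carries a (position, cluster_key,
// source) fingerprint, and a repeat fingerprint is dropped before the fold.
type Fingerprint = ((u32, u32), String, String); // ((page, line), key, source)

fn dedup_votes(votes: Vec<Fingerprint>) -> Vec<Fingerprint> {
    let mut seen = BTreeSet::new();
    // insert() returns false for an already-seen fingerprint, filtering it out.
    votes.into_iter().filter(|v| seen.insert(v.clone())).collect()
}

fn main() {
    let v = |line: u32, key: &str, src: &str| -> Fingerprint {
        ((1, line), key.to_string(), src.to_string())
    };
    let votes = vec![
        v(0, "rck 30", "original"),
        v(0, "rck 30", "original"), // same source, same cluster, same position
        v(0, "rck 30", "rotated"),  // different source: a legitimate second vote
    ];
    assert_eq!(dedup_votes(votes).len(), 2);
}
```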
collision_rate_bps — runtime convergence metric
- `InlineFoldState` tracks `total_votes` and `collisions` (votes that merged into an existing cluster)
- Exposed as `AggregateState.collision_rate_bps` (serializable; `#[serde(default)]` yields 0 for backward-compatible snapshots)
- A high collision rate indicates dense, well-converging inputs
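The metric itself is a guarded integer ratio in basis points. A sketch; this free function mirrors, but is not, the `InlineFoldState` implementation.

```rust
// Sketch of the collision-rate metric in basis points (1 bps = 0.01%),
// with an explicit guard so an empty fold never divides by zero.
fn collision_rate_bps(collisions: u64, total_votes: u64) -> u64 {
    if total_votes == 0 {
        0 // empty fold: define the rate as zero rather than panic
    } else {
        collisions * 10_000 / total_votes
    }
}

fn main() {
    assert_eq!(collision_rate_bps(0, 0), 0);        // no votes yet
    assert_eq!(collision_rate_bps(25, 100), 2_500); // 25% of votes merged into existing clusters
}
```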
Configurable similarity_threshold per document type
- `IngestionProfile` gains a `similarity_threshold: f64` field
- `tesseract` / `legacy_import`: 0.90 — tolerant of OCR noise
- `carbo` / `strict`: 0.95 — tighter; `"Rck30"` ≠ `"Rck 30"`
- New public API: `reduce_documents_with_profile(docs, profile)`
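The per-profile thresholds are sketched below as a method rather than the `similarity_threshold: f64` field the notes describe; the variant names and values follow the notes, everything else is illustrative.

```rust
// Sketch of per-profile similarity thresholds: noisy OCR sources cluster
// more tolerantly (0.90) than strict structured imports (0.95).
enum IngestionProfile { Tesseract, LegacyImport, Carbo, Strict }

impl IngestionProfile {
    fn similarity_threshold(&self) -> f64 {
        match self {
            // Tolerant of OCR noise.
            IngestionProfile::Tesseract | IngestionProfile::LegacyImport => 0.90,
            // Tighter: "Rck30" and "Rck 30" stay distinct clusters.
            IngestionProfile::Carbo | IngestionProfile::Strict => 0.95,
        }
    }
}

fn main() {
    assert!(IngestionProfile::Tesseract.similarity_threshold()
        < IngestionProfile::Strict.similarity_threshold());
}
```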
Image preprocessing pipeline
- `src/ocrys/preprocess.rs`: ROI extraction → grayscale → Otsu thresholding → optional resize → PNG
- `Roi { x, y, width, height, kind: RegionKind, resize: Option<(u32, u32)> }`
- `RegionKind`: `TitleBlock`, `InvoiceTotals`, `StructuralTable`, `Signature`
- Otsu implementation with explicit fallback to 128 for empty/flat histograms (no division by zero)
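The Otsu step with its flat-histogram fallback can be sketched directly. This is a generic textbook Otsu over a 256-bin histogram, not Optimo's exact code; only the fallback-to-128 behavior follows the notes.

```rust
// Sketch of Otsu thresholding: pick the threshold maximizing between-class
// variance, with an explicit fallback to 128 when the histogram is empty or
// flat (fewer than two populated bins), so no division by zero can occur.
fn otsu_threshold(hist: &[u64; 256]) -> u8 {
    let total: u64 = hist.iter().sum();
    if total == 0 || hist.iter().filter(|&&c| c > 0).count() < 2 {
        return 128; // no meaningful split exists
    }
    let sum_all: f64 = hist.iter().enumerate().map(|(i, &c)| i as f64 * c as f64).sum();
    let (mut w_b, mut sum_b, mut best_t, mut best_var) = (0f64, 0f64, 128u8, -1f64);
    for t in 0..256 {
        w_b += hist[t] as f64;            // background weight
        if w_b == 0.0 { continue; }
        let w_f = total as f64 - w_b;     // foreground weight
        if w_f == 0.0 { break; }
        sum_b += t as f64 * hist[t] as f64;
        let m_b = sum_b / w_b;            // background mean
        let m_f = (sum_all - sum_b) / w_f; // foreground mean
        let between = w_b * w_f * (m_b - m_f) * (m_b - m_f);
        if between > best_var {
            best_var = between;
            best_t = t as u8;
        }
    }
    best_t
}

fn main() {
    assert_eq!(otsu_threshold(&[0u64; 256]), 128); // empty histogram: fallback
    let mut bimodal = [0u64; 256];
    bimodal[10] = 100;
    bimodal[200] = 100;
    let t = otsu_threshold(&bimodal);
    assert!(t >= 10 && t < 200); // threshold lands between the two modes
}
```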
83 tests — all passing
```
fold::tests                9  core reducer invariants
fold_adversarial::tests   17  arithmetic safety, degenerate inputs, normalization, clustering, security
fold_properties::tests     7  algebraic property proofs
ocrys::preprocess::tests   8  ROI pipeline and Otsu thresholding
aggregate_state::tests     6  snapshot hydration and profile enforcement
timequake::tests           5  replay engine equivalence
+ config, profile, snapshot, observation, normalize
```

Run the full suite with `cargo test`.