The official home of the world’s most extensive mainframe modernization suite

The benchmark

Same program.
Same question. Different species.

We run the gauntlet the way your team would: one production-scale COBOL program, one instruction — extract the business rules — and the receipts on the table.

[Chart, internal benchmark: generic LLM vs. Palm Ark on the same 50,000-line program and the same question. Generic LLM: three runs, three answers. Palm Ark: three runs, one truth.]

The model races ahead —
and tops out near 30.

Ark climbs to ~95 —
with the source attached.

Then the real test:
ask three times.

Accuracy · lineage · determinism

Repeatability is the benchmark

Accuracy is only half the result. The half that decides audits is repeatability: ask a generative model three times and you get three artifacts; nobody can sign off on a moving target. Ark’s pipeline is deterministic — three runs, one hash, one truth your reviewers can pin to the wall.
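For readers who want to see what "three runs, one hash" can look like in practice, here is a minimal sketch of a repeatability check. It is an illustration only, not Ark's pipeline: `artifact_hash`, `is_repeatable`, and the stand-in `toy_extract` are hypothetical names, and the sketch assumes the extractor returns JSON-serializable rules.

```python
import hashlib
import json


def artifact_hash(rules):
    """Hash the extracted rules in a canonical form, so equal content yields equal digests."""
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_repeatable(extract, source, runs=3):
    """Run `extract` on the same source `runs` times; True only if every run hashes identically."""
    digests = {artifact_hash(extract(source)) for _ in range(runs)}
    return len(digests) == 1


def toy_extract(source):
    """Hypothetical stand-in extractor: treats every line starting with IF as a 'rule'."""
    return [
        {"line": number, "rule": text.strip()}
        for number, text in enumerate(source.splitlines(), start=1)
        if text.strip().upper().startswith("IF ")
    ]


if __name__ == "__main__":
    sample = "IF BALANCE < 0\n    MOVE 'Y' TO OVERDRAWN\n"
    print(is_repeatable(toy_extract, sample))
```

To try this against a real tool, swap `toy_extract` for a function that calls the extractor under test. A generative model sampled at a non-zero temperature would typically produce several distinct digests, which is the "moving target" described above.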

We’ll happily reproduce this on your code, against any model you choose, inside your perimeter — that’s the standing offer behind the POC. Figures shown (~95% vs ~30%) are from our internal benchmark; ask us for the methodology.

Bring any model. We’ll bring the receipts.

Run it on your code