There are two kinds of things we ask language models to do, and we mostly only train for one of them.
The first is to produce the best answer: write the email, draft the function, explain the concept, summarize the report. The frontier has become extraordinary at this, and almost all of post-training is pointed at it. A good answer is fluent, well-judged, and singular: there is one response, and it should be as good as possible.
The second is to find everything. Review a redlined contract and flag every deviation from the governing terms. Check a filing against last quarter's and surface every change. Diligence a counterparty's markup against your playbook and miss nothing. Audit a statement before it goes out. Here the unit of value is not the quality of a single response; it is coverage. A brilliant memo that catches six of ten issues is a failure, even if every sentence is perfect, because the four it missed are still in the document when it gets signed.
These are different problems. A model trained to produce the best answer has every incentive, on a "find everything" task, to do the wrong thing: to summarize, to prioritize the most salient issues, to spend its tokens on the three findings that make the best memo and move on. That is excellent behavior when you want an answer and exactly the wrong behavior when you want a review.
We think "find everything" deserves its own kind of model, trained against its own objective. We call them review models — models measured on recall against a standard rather than on the quality of a single response. Polyclerk is our first one: a small, open mixture-of-experts model, post-trained with reinforcement learning whose reward is coverage, that reads whole documents and adjudicates them, clause by clause, against a reference.
We want to be precise about what Polyclerk is and is not. It is not smarter than a frontier model; on open-ended legal reasoning it is not close. What it does is out-cover the frontier, per dollar, on document review against a standard — and on some review tasks it beats a frontier flagship outright, at a fraction of the size and serving cost. That is a narrower claim than "better than the frontier," and it is the true one.
Ask a frontier model to "review this contract" and you get a polished memo about the issues that matter most. Ask a careful associate the same thing and you get a markup that walks the document in order and stops at every clause that departs from what was agreed — the buried damages-cap tweak, the quietly narrowed indemnity, the disclaimer deleted three sections after the headline change, the cross-reference that now points at the wrong section.
The gap between those two outputs is usually read as a gap in intelligence. It is not. It is a gap in objective.
Consider the asymmetry a reviewer actually faces. On a "find everything" task, the cost of a missed issue is unbounded — it is the indemnity you didn't flag that bankrupts the deal, the changed seat that makes the award unenforceable. The cost of one extra true finding is a sentence the reader skims. When the downside of a miss dwarfs the downside of verbosity, the correct policy is to be exhaustive: flag the tenth issue with the same diligence as the first. But a model optimized to produce the single best response is optimized for the opposite — for the response that reads best, which means selecting, compressing, and leading with the strongest points. The very behaviors that make a great answer make a leaky review.
There is a parallel here to a long-standing idea in the study of collaboration: that effective joint work depends on grounding — on both parties sharing what each knows, rather than one party deciding what is worth mentioning. A reviewer's job is to surface the full ground, not to curate it. The frontier's instinct, honed by answer-quality training, is to curate. For review, curation is the bug.
So the question we set out to answer was not "how do we make the model smarter?" It was: what happens if you make coverage the thing the model is rewarded for — directly, at the level of the individual issue — and stop rewarding it for writing the better memo?
Give Polyclerk a document and a reference standard — a governing agreement, a prior version, a negotiation playbook, a house template — and it produces a structured review. It walks the document in order and, for each deviation it finds, it states the clause at issue (quoted verbatim), what the standard says, the nature of the gap, a severity, why the gap matters to the side you represent, and a specific, paste-ready edit. It ends with the issues it could not resolve and a suggested order of attention.
The structure is deliberate. A review is not prose; it is a list of findings that someone has to act on, one at a time, and possibly defend. Verbatim quotes make each finding checkable. A per-finding severity keeps the reader from drowning. A concrete edit turns "consider revising" into something that can go straight into a markup.
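The finding structure described above maps naturally onto a small record type. What follows is a minimal sketch, not Polyclerk's actual output schema: the field names, the `Severity` levels, and the `quote_is_verbatim` helper are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical severity scale; the source does not specify the levels.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Finding:
    """One deviation from the standard (field names are illustrative)."""
    clause_quote: str      # clause at issue, quoted verbatim from the document
    standard_says: str     # what the reference standard provides
    gap: str               # nature of the deviation
    severity: Severity
    why_it_matters: str    # impact on the side the reviewer represents
    proposed_edit: str     # paste-ready replacement language


def _normalize(text: str) -> str:
    # Collapse whitespace so line wrapping does not defeat the comparison.
    return " ".join(text.split())


def quote_is_verbatim(finding: Finding, document: str) -> bool:
    """Return True if the quoted clause appears in the document text."""
    return _normalize(finding.clause_quote) in _normalize(document)
```

A check like `quote_is_verbatim` is what "checkable" can mean in practice: a finding whose quote cannot be located in the source is a candidate for rejection before a human ever reads it.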
A representative case makes the difference concrete. A counterparty returns a redlined agreement. The governing term sheet calls for a New York seat, a three-arbitrator panel, and no damages cap. The redline quietly changes the seat to Zurich, drops to a sole arbitrator, and — several sections away from the dispute-resolution clause — inserts a damages cap and a consequential-damages waiver. Asked to review this, a frontier model reliably catches the headline changes and writes a sharp paragraph about the seat and the panel. Polyclerk catches those and the buried cap and the waiver, and ties each one back to the controlling provision in the term sheet — not because it reasons better about arbitration, but because its training rewarded catching the tenth deviation exactly as much as the first, and penalized leaving one on the table.
The shape generalizes well beyond contracts. A 10-K section checked against the prior year's language. An LP statement audited line by line before distribution. A counterparty markup triaged against a house template. A policy document checked against a regulatory standard. Anywhere there is a document and a standard to hold it to, the task is coverage, and coverage is what Polyclerk is for.
Polyclerk does one thing — review a document against a standard and surface everything that deviates — but that one shape underlies a surprising fraction of professional document work. Below is the surface it covers, organized by the pattern. It doubles as our roadmap: contract and redline review is what we have benchmarked and built; the rest is where the same recipe takes us, because each new domain is the same training recipe pointed at a new corpus.
Contract & commercial review. Review an inbound agreement against your playbook and flag every off-standard term — vendor agreements, NDAs, SaaS and master service agreements. Triage a counterparty's redline and surface every change, with each tied to the position it departs from. Compare a draft against the last version and report exactly what moved. This is the task Polyclerk was trained and measured on.
Diligence & deal review. Run a markup against the governing document and flag every deviation (the arbitration task it beats the frontier on). Extract and normalize key terms across a whole data room. Build a material-contract schedule. Check closing documents against a conditions-precedent checklist. Surface diligence issues across many agreements at once — the work where missing one is the whole risk.
Compliance & regulatory gap-checks. Check a policy, product, or contract against a regulation and surface the gaps. Diff a policy against its prior version or against a newly published rule. Review a data-processing agreement against your requirements. Run AI-governance and vendor-AI reviews against a standard. The structure is identical to contract review — a document, a standard, find everything that doesn't conform.
Litigation & investigations document review. Privilege-log review, claim charts that map allegations to evidence, subpoena triage, chronology extraction across a document set. All coverage tasks: enumerate completely, against a reference.
Financial-report review (next). Earnings releases against prior quarters and consensus, statement audits before distribution, filings against the prior year. Same shape, new domain — with hard guardrails on figures, because numbers are a sharper failure mode than prose.
Where a review model stops. Polyclerk reviews; it does not draft, schedule, or advise. Drafting an agreement from scratch, running a standing watch on a docket or a regulatory feed, routing approvals, building a financial model, or giving an open-ended "is this a problem?" opinion are not review-model tasks — route those to a frontier model, an agent, or a person. The boundary is part of the design. A review model earns its place by being narrow and reliable inside the lines, not by pretending the lines aren't there.
The throughline across all of it: document + standard → find everything. That is one capability, and building it once lets it spread across domains — which is why we think of Polyclerk not as a single model but as the first of a family.
Polyclerk is an off-policy reinforcement-learning post-train of an open mid-size mixture-of-experts model — small enough to serve at a small fraction of a frontier flagship's cost. Five choices define it.
The Harvey LAB legal work-product benchmark scores a deliverable not holistically but against a checklist of discrete, ground-truth criteria: "did the review flag that the seat conflicts with §12.2 of the LLC agreement," "did it identify the reduction from three arbitrators to one," "did it catch the damages cap," and so on — dozens of such criteria per task. This is unusually well-suited to training a reviewer, because it lets us define the reward as exactly the thing we care about: criterion-recall, the fraction of a task's ground-truth issues the model's output actually catches.
That is the whole idea, and it is worth dwelling on what it changes. The model is not rewarded for writing a better memo, for fluency, or for confidence. It is rewarded for missing less. The criterion — not the response — becomes the unit of optimization. A response that catches nine of ten issues in clumsy prose beats a beautifully written one that catches six. Over training, this pushes the model away from the frontier's curate-and-summarize instinct and toward the reviewer's walk-everything instinct.
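As a concrete illustration of the reward, here is a minimal sketch of criterion-recall as the text defines it: the fraction of a task's ground-truth criteria that a review satisfies. The function name, the criterion IDs, and the example verdicts are assumptions; how a judge produces each met or not-met verdict is outside the sketch.

```python
from typing import Mapping


def criterion_recall(judgments: Mapping[str, bool]) -> float:
    """Fraction of a task's ground-truth criteria that the review satisfied.

    `judgments` maps each criterion ID to the judge's met / not-met verdict.
    """
    if not judgments:
        raise ValueError("a task needs at least one criterion")
    return sum(judgments.values()) / len(judgments)


# Hypothetical verdicts for two reviews of the same ten-criterion task.
clumsy_but_thorough = {f"c{i}": i != 9 for i in range(10)}  # misses one issue
polished_but_leaky = {f"c{i}": i < 6 for i in range(10)}    # misses four issues
```

Nothing about prose quality enters the score, so under this reward the nine-of-ten review outranks the six-of-ten one regardless of how either reads.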
We use an off-policy RL method (OAPL — off-policy advantage learning). Rather than the usual on-policy loop, it trains on a fixed pool of (response, reward) groups: for each task, several candidate reviews and the criterion-recall reward each one earned. The update computes an advantage for each candidate relative to a soft-best baseline over its group — intuitively, a smooth version of "how much better than the group's typical response was this one?" — with an importance-style correction so the current policy can learn from responses it did not itself generate. The practical payoff is that the expensive part — generating and grading candidates — happens once, up front, and the training is stable and sample-efficient on top of it.
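The text does not give the formula for the soft-best baseline, so the sketch below is an assumption, not the OAPL update itself. It uses one common smooth maximum, a temperature-scaled log-mean-exp over the group's rewards, and subtracts it from each candidate's reward. The temperature `beta` and both function names are hypothetical, and the importance-style correction is deliberately left out.

```python
import math
from typing import Sequence


def soft_best_baseline(rewards: Sequence[float], beta: float) -> float:
    """Temperature-smoothed maximum: beta * log(mean(exp(r / beta))).

    Small beta approaches max(rewards); large beta approaches the mean.
    """
    if not rewards or beta <= 0:
        raise ValueError("need at least one reward and beta > 0")
    top = max(rewards)
    # Subtract the max before exponentiating for numerical stability.
    mean_exp = sum(math.exp((r - top) / beta) for r in rewards) / len(rewards)
    return top + beta * math.log(mean_exp)


def group_advantages(rewards: Sequence[float], beta: float = 0.1) -> list[float]:
    """Advantage of each candidate relative to its group's soft-best baseline."""
    baseline = soft_best_baseline(rewards, beta)
    return [r - baseline for r in rewards]
```

Under this assumed form, the baseline sits between the group mean and the group maximum, so only candidates close to the best review in their group earn a positive advantage.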
Two integrity choices matter here, and we made them deliberately:
- Every training response was generated by the model's own lineage, not distilled from a frontier model. We are not teaching Polyclerk to imitate a bigger model's reviews; we are teaching it to prefer its own better reviews over its own worse ones. The ceiling of this method is the best the model can already do on a good day — and a large part of the gain is simply making the good-day behavior the default.
- Each task was graded by a single judge. Mixing judges within a task's candidate group contaminates the comparison the advantage is built on, so we never did it; any task whose grades straddled two judges was quarantined and regraded clean.
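The single-judge rule above is easy to enforce mechanically before training. This is a minimal sketch under assumed names: the `GradedResponse` record layout and the `split_by_judge_consistency` function are hypothetical, not the authors' pipeline.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class GradedResponse(NamedTuple):
    # Hypothetical record layout for one graded candidate review.
    task_id: str
    judge_id: str
    reward: float


def split_by_judge_consistency(records: Iterable[GradedResponse]):
    """Partition tasks into clean groups (one judge) and quarantined ones."""
    by_task: dict[str, list[GradedResponse]] = defaultdict(list)
    for rec in records:
        by_task[rec.task_id].append(rec)

    clean, quarantined = {}, {}
    for task_id, group in by_task.items():
        judges = {rec.judge_id for rec in group}
        # A group graded by more than one judge is held back for regrading.
        (clean if len(judges) == 1 else quarantined)[task_id] = group
    return clean, quarantined
```

Only the `clean` groups would feed the advantage computation; quarantined tasks go back for regrading under one judge.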
This is also why we can be honest about the model's ceiling. We measured the headroom directly: the gap between the average candidate review and the best of several candidates on the same task is real and sizable. OAPL's job is to move the model toward its own best-of-N behavior. It captures a good share of that headroom — not all of it, and never more than the model's own samples contain.
The conventional way to put a long document in front of a model is to chunk it, embed the chunks, and retrieve the ones that look relevant to the query. For review, that is precisely backwards. The issues you most need to catch are the ones that don't look relevant — the quiet deletion, the off-headline insertion, the cross-reference that breaks. A retriever's entire job is to discard what doesn't match the query, which means it is structurally inclined to throw away exactly the findings a review exists to surface.
Polyclerk reads the entire document in context and walks it in order. In our testing this was not a marginal preference but a dominant one: whole-document review beat retrieval-based review on every task we measured, often by enormous margins. For a reviewer, the right architecture is the simplest one — put the whole thing in front of the model and make it look at all of it.
The single largest quality lever we found had nothing to do with the model. Legal redlines live in tracked changes, and the standard way of converting a document to text silently drops the inserted and deleted runs — which is to say it drops exactly the values a markup review is about. The model was being asked to review a redline it literally could not see.
Once we preserved tracked changes as explicit, in-line markers — {+inserted text+} and {-deleted text-}, in document order, so the model could read what changed and where — recall on markup tasks rose sharply. On a small set of markup tasks, criterion-recall went from roughly 62% to 84% from this change alone. A mid-size model that can see the redline beats a larger model that cannot. The lesson generalizes past law: what the model reads is a bigger lever than how big the model is. Before reaching for a bigger model, make sure the one you have can actually see the thing it is supposed to review.
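To make the conversion concrete: in WordprocessingML (the XML inside a .docx file), tracked insertions are wrapped in `w:ins` elements and tracked deletions in `w:del`, with deleted text stored in `w:delText`. The sketch below renders one paragraph with the in-line markers described above. It is a simplified illustration, not the authors' converter: it handles only run-level insertions and deletions that are direct children of a paragraph, and ignores moves, comments, tables, and formatting changes. The function names and the sample paragraph are assumptions.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _text(node: ET.Element, tag: str) -> str:
    # Concatenate every w:t (or w:delText) descendant in document order.
    return "".join(t.text or "" for t in node.iter(f"{{{W}}}{tag}"))


def paragraph_with_markers(p: ET.Element) -> str:
    """Render one w:p element, keeping insertions and deletions visible."""
    parts = []
    for child in p:
        if child.tag == f"{{{W}}}ins":
            parts.append("{+" + _text(child, "t") + "+}")
        elif child.tag == f"{{{W}}}del":
            parts.append("{-" + _text(child, "delText") + "-}")
        elif child.tag == f"{{{W}}}r":
            parts.append(_text(child, "t"))
    return "".join(parts)


# Hypothetical paragraph in which the seat is changed from New York to Zurich.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">The seat of arbitration shall be </w:t></w:r>
  <w:del w:id="1" w:author="A"><w:r><w:delText>New York</w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="A"><w:r><w:t>Zurich</w:t></w:r></w:ins>
  <w:r><w:t>.</w:t></w:r>
</w:p>
"""

rendered = paragraph_with_markers(ET.fromstring(SAMPLE.strip()))
# rendered == "The seat of arbitration shall be {-New York-}{+Zurich+}."
```

A converter that reads only `w:t` runs outside `w:ins` and `w:del` would emit "The seat of arbitration shall be ." for this paragraph, which is the failure the text describes: both the old and the new seat vanish.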
Review means long documents, and attention cost grows quadratically with length, so naively a reviewer looks expensive to train. It is less expensive than it looks. Splitting the long-context computation across context-parallel ranks turned out to be superlinearly helpful: at our worst-case document length, eight accelerators with context parallelism trained about 2.3× faster than four — and used fewer accelerator-seconds in total — because dividing the quadratic attention work across ranks more than offsets the cost of the extra devices. Long-context post-training is more affordable than the scaling intuition suggests, which matters if review models are going to be cheap enough to run on everything.
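The claim that eight accelerators used fewer accelerator-seconds than four follows from simple arithmetic on the reported 2.3× speedup. The sketch below spells it out; the function name is an assumption, and the only inputs are the device counts and the speedup quoted above.

```python
def accelerator_seconds_ratio(devices_before: int, devices_after: int, speedup: float) -> float:
    """Total accelerator-seconds after scaling, relative to before.

    Wall-clock time shrinks by `speedup` while the device count grows by
    devices_after / devices_before; the quotient is the cost ratio.
    """
    if devices_before <= 0 or devices_after <= 0 or speedup <= 0:
        raise ValueError("device counts and speedup must be positive")
    return (devices_after / devices_before) / speedup


# Figures from the text: 4 -> 8 accelerators, about 2.3x faster.
ratio = accelerator_seconds_ratio(devices_before=4, devices_after=8, speedup=2.3)
```

Here `ratio` is 2 / 2.3, roughly 0.87, so the eight-device run spends about 13% fewer accelerator-seconds. The ratio drops below 1.0 whenever the speedup exceeds the factor by which the device count grew, which is what "superlinear" means in this setting.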
We evaluate on held-out "analyze" tasks from the Harvey LAB benchmark — tasks the model never saw during training — scored by criterion-recall with a single consistent judge. The figures below are on 11 held-out analyze tasks.
| Model | Criterion-recall |
|---|---|
| Base model (pre-Polyclerk) | 73.6% |
| Polyclerk | 79.5% |
| Frontier flagship (Opus 4.8) | 90.7% |
Criterion-recall on 11 held-out Harvey LAB "analyze" tasks, single consistent judge.
Polyclerk improves on its own base by about six points and closes roughly one-third of the gap between the base model and a frontier flagship — at a fraction of the size and serving cost. That is the headline we would stand behind: a small, open model, post-trained for coverage, captures a meaningful share of frontier review quality cheaply, on tasks it has never seen.
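The "about six points" and "roughly one-third" figures can be reproduced directly from the table. The variable names below are illustrative; the three inputs are the reported criterion-recall values.

```python
# Criterion-recall in percent, as reported in the table above.
base, polyclerk, frontier = 73.6, 79.5, 90.7

gain = polyclerk - base        # points gained over the base model
gap = frontier - base          # base-to-frontier gap, in points
share_closed = gain / gap      # fraction of that gap Polyclerk closes
```

With these inputs the gain is 5.9 points against a 17.1-point gap, a share of about 0.35.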
The sharper result is at the task level. On the arbitration markup task (read a redlined arbitration agreement and flag every deviation from the governing LLC agreement), Polyclerk beat the frontier flagship by 20.3 points, and it also edged the flagship on a banking markup task. These are not cherry-picked curiosities; they are exactly the tasks the model is built for. The rubric rewards catching every discrete deviation; the redline structure is preserved so the model can see what changed; and exhaustiveness, not open-ended judgment, is what's being measured. Where the task genuinely is coverage, a specialized small model can match or beat the frontier.
It is just as important to say where Polyclerk trails, and why. On held-out tasks that are less checklist and more open-ended legal reasoning (apply a body of doctrine to a novel fact pattern, construct an argument, weigh considerations the rubric can't enumerate), the frontier flagship wins decisively. Those tasks reward judgment, and judgment is not what coverage training buys you. Polyclerk's profile is sharp and legible: strong where the work is "find everything against a standard," weaker where the work is "decide what to think."
The task-level pattern is the whole thesis in miniature. A review model wins precisely when three conditions hold: the task is scored on coverage against a ground truth, the structure the task depends on is preserved in what the model reads, and the answer is enumeration rather than judgment. When all three hold — a redline against a governing document, a filing against its prior version — the frontier's answer-quality instinct is a liability and the reviewer's exhaustiveness is the whole game, and a small model trained for exhaustiveness can win.
When any of the three fails — the standard is implicit, the structure is lost in conversion, or the real work is deciding what matters — the advantage evaporates, and a frontier model's judgment reasserts itself. This is not a disappointment; it is a map. It tells you exactly which work to route to a review model and which to keep with a frontier model, and it tells you how to make a borderline task winnable: supply the standard, preserve the structure, and frame the task as coverage.
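The three conditions read naturally as a routing rule. The sketch below is one hedged way to encode it; the `TaskProfile` flags and the two route labels are hypothetical, and a real router would also need the agent and human destinations mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    # Hypothetical flags a caller would set when triaging a task.
    has_explicit_standard: bool  # a governing document, prior version, or playbook is supplied
    structure_preserved: bool    # e.g. tracked changes survive conversion to text
    is_enumeration: bool         # the answer is a list of deviations, not a judgment


def route(task: TaskProfile) -> str:
    """Send a task to the review model only when all three conditions hold."""
    if task.has_explicit_standard and task.structure_preserved and task.is_enumeration:
        return "review_model"
    return "frontier_model"
```

The same flags double as a checklist for rescuing a borderline task: each `False` names the thing to supply, preserve, or reframe before routing again.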
A review model earns trust by being candid about where it doesn't have any. We tested Polyclerk adversarially, and its weaknesses are real and characterizable. We would rather you learn them here than in production.
Coverage is not judgment. Polyclerk is an exhaustive issue-spotter, not a senior associate, and the distinction shows up in predictable ways. On counterintuitive clauses it can get the direction of effect wrong — for instance, reading the deletion of a disclaimer as the loss of a protection rather than the assumption of a new liability, which is its opposite. On a tracked change that both deletes and inserts, it can judge the inserted text in isolation and miss that the deletion removed more — labeling a narrowed indemnity as "expanded." And it does not reliably reason about survivorship: that a protection deleted in one section may be preserved, untouched, in another. It is strong at finding what changed and weaker at adjudicating what the change means. Treat its output as a high-recall first pass, not a final opinion.
It needs to be told whose side it's on. Which party you represent is frequently not determinable from a document alone — even frontier models only guess at it and flag the guess. Polyclerk's inference of the side is unstable from run to run; the same contract can come back reviewed from either perspective, and because a side flip inverts what counts as favorable versus adverse, it rewrites the entire review. The remedy is operational, not magical: pass the client identity into the request. In production this is not optional.
Out of its domain, it is just a base model. Polyclerk's edge comes from training on legal work product. Point it at a financial statement and you get the underlying open model with a good prompt — competent, but with no special edge and a materially higher risk of fabricating figures. Numbers are exact in a way prose is not; a hallucinated indemnity reads as a debatable opinion, but a hallucinated revenue figure reads as a fact and gets acted on. Outside its training domain, a review model needs hard guardrails: quote figures verbatim, never compute, and flag rather than calculate.
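The "quote figures verbatim" guardrail can be checked after the fact. The sketch below flags any figure in a review that never appears in the source text. The regular expression and the function name are illustrative assumptions, and the check is deliberately narrow: it confirms that a figure exists in the source, not that it is attached to the right line item.

```python
import re

# Matches figures such as 1,250,000 or 12.5 or 84%. The pattern is an
# illustrative assumption, not a complete grammar for financial figures.
_FIGURE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def unsupported_figures(review_text: str, source_text: str) -> list[str]:
    """Return figures in the review that never appear verbatim in the source."""
    source_figures = set(_FIGURE.findall(source_text))
    return [fig for fig in _FIGURE.findall(review_text) if fig not in source_figures]
```

Any non-empty result is a reason to send the finding to the verifier rather than to the reader.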
It wants a reference and a verifier. Coverage-against-a-standard is where it shines, which means it genuinely wants a standard — a governing document, a prior version, a playbook. Given one, the hardest judgment calls collapse into lookups ("does this clause match the standard?"), and several of the failure modes above shrink. Without one, it falls back on its own judgment, which is the part it is weakest at. And because its errors cluster in direction-of-effect, the single highest-leverage thing you can put around it is a lightweight verification pass — human or frontier-model — that checks the calls it cannot reliably check itself.
That last point is not a caveat so much as the intended design. Polyclerk does the heavy, exhaustive, expensive-at-frontier-prices work of coverage; a thin verifier checks direction. The model is the part that has to read every clause; the verifier is the part that has to be right about the handful that are subtle. Splitting the labor this way is what makes the economics work.
Everything above implies a deployment pattern, and it is simple:
- Give it the whole document, not a retrieval. Convert the document to text in document order; do not chunk-and-retrieve.
- Preserve the structure the task depends on. For redlines, that means converting tracked changes to explicit {+/-} markers; if you drop them, you have removed the thing the review is about. Include comments and notes inline.
- Pass the reference standard. A playbook, the governing agreement, the prior version: whatever the document is supposed to conform to. This is the largest single lever on output quality, because it turns judgment into lookup.
- Pass the client identity. Don't make the model guess whose side it's on; tell it.
- Put a verifier after it. A human review, or a single frontier-model pass that checks each finding's direction-of-effect. Let the small model do the coverage and the verifier do the adjudication.
- Guardrail the numbers, especially out of domain: quote figures verbatim, never compute.
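The checklist above can be enforced at the point where a request is assembled. The following is a minimal sketch with hypothetical names throughout: `ReviewRequest`, its fields, and the prompt wording are assumptions, and no real Polyclerk API is implied. It shows the whole document, the standard, and the client identity being required up front rather than left for the model to guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRequest:
    # All field names are hypothetical; no real Polyclerk API is implied.
    document_text: str  # the whole document, in order, with {+...+} and {-...-} markers
    standard_text: str  # playbook, governing agreement, or prior version
    client: str         # the party the review is written for


def build_prompt(req: ReviewRequest) -> str:
    """Assemble a single whole-document prompt; nothing is chunked or retrieved."""
    for name in ("document_text", "standard_text", "client"):
        if not getattr(req, name).strip():
            raise ValueError(f"{name} is required")
    return (
        f"You represent: {req.client}\n\n"
        f"REFERENCE STANDARD:\n{req.standard_text}\n\n"
        f"DOCUMENT UNDER REVIEW:\n{req.document_text}\n\n"
        "List every deviation from the reference standard, in document order. "
        "Quote figures verbatim and do not compute new ones."
    )
```

Failing fast on a missing client or standard is the design choice worth keeping, whatever the actual request format turns out to be: both omissions degrade the review silently if they are allowed through.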
Run this way, a review model is cheap enough to put in front of every document rather than triaging which ones are worth a frontier call — which is the actual unlock. The value is not that it reviews better than the frontier on any single document; it is that it reviews nearly as well for little enough that you can review everything.
Polyclerk is a small, open, cheap-to-serve model that, given a document and a standard, reviews it with coverage approaching — and on some tasks exceeding — a frontier flagship, for a fraction of the cost. We do not position it as a replacement for a frontier model's judgment, and the limitations section is not a disclaimer bolted on at the end — it is the operating envelope, and the product is designed around it: the reviewer that misses less, paired with a verifier that decides.
We think review models are a category, not a one-off. The recipe is general: make the reward a recall criterion, train off-policy on the model's own graded work, read the whole document, preserve the structure the task is actually about, and deploy with a reference and a verifier. None of that is specific to law. We are applying it next to financial-report review, where the documents are dense, the standards are explicit — prior quarters, consensus estimates, the model under coverage — and "find everything" is, once again, the job. The number-fabrication risk makes the guardrails non-negotiable there, which is part of what we are working on.
If you have a corpus where the real work is coverage against a standard — contracts against playbooks, filings against priors, statements against templates — we would like to hear about it.
Try it. The fastest way to see what a review model does is to give it a marked-up document and a standard and watch it walk the redline. Bring a redlined contract to the legal review surface (preserve the tracked changes), tell it the standard and whose side you're on, and read the markup it returns. Financial-report review is coming to the same surface next. We are most interested in the cases where it surprises you — both the deviation it caught that you'd have missed, and the call it got wrong — because the second kind is how the next model gets trained.
Polyclerk is an off-policy RL (OAPL) post-train of an open mixture-of-experts base, evaluated on the Harvey LAB legal work-product benchmark. Reported figures are criterion-recall on held-out "analyze" tasks under a single consistent judge; task-level results are labeled as such and computed on small N. Training used only responses generated by the model's own lineage — no frontier-model distillation — and single-judge grading per task. We report the operating envelope, including failure modes, deliberately: a review model is only as useful as it is honest about what it misses.