The problem
Contracts are dense, clause-heavy artifacts written by lawyers for lawyers. Most of the people who actually sign contracts — founders, freelancers, engineers, small-business owners — read them once, decide they're "probably fine," and move on. The clauses that hurt most aren't the obviously hostile ones. They're the buried ones: an unbounded indemnification, a quietly perpetual non-compete, a unilateral termination right with no equivalent on your side, a "MAY at our discretion" verb that should have been "WILL."
Generic LLM summaries don't help. Ask ChatGPT to "summarize this contract" and you'll get a polite paragraph about scope and term. The clauses that matter are the ones a summary smooths over.
I wanted a tool that did the opposite of summarize: extract every clause, treat each one adversarially, and tell me what a lawyer would tell me to push back on.
Why this build matters to me
This sits at the exact intersection I work at — security thinking applied to AI systems. "Adversarial scoring" isn't summarization. It's the same red-team mindset I apply to LLM evaluation, pointed at a different artifact. A contract clause is just a payload that the signer has to evaluate for hidden harm. The same skills transfer.
I also wanted to ship something working, not just framework documentation. The Codex Creator Challenge gave me a 48-hour window. That's a useful constraint — it forces design decisions you'd otherwise defer.
The 3-agent architecture
The pipeline is three specialized agents passing structured data to each other, not one mega-prompt trying to do everything:
Agent 1 — Extractor
Reads the raw contract text and emits a structured list of clauses. Each clause is tagged with its position, its category (indemnification, term, payment, IP, termination, etc.), and a verbatim quote of the source language.
Why a separate agent: clause extraction is a different competence from risk analysis. Mixing them in one prompt produces shallow scores, or analyses written about clauses the model has paraphrased and subtly changed. Separating them keeps the source clause text honest.
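To make that concrete, here's a rough sketch of the shape of the Extractor's output; the field names are mine for illustration, not necessarily what the repo uses.

```python
from dataclasses import dataclass

# Hypothetical shape of one extracted clause; the actual schema lives in the repo.
@dataclass
class Clause:
    position: int   # where the clause appears in the source document
    category: str   # e.g. "indemnification", "term", "payment", "IP", "termination"
    text: str       # verbatim quote of the source language, never a paraphrase
```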
Agent 2 — Adversarial Scorer
Takes one clause at a time and treats it the way a red teamer treats a model output: what's the worst-case interpretation here, and what's the path to that outcome? Outputs a structured risk score (low / medium / high / critical) plus a 1–2 sentence "if this goes wrong" scenario specific to that clause.
Why this design: scoring per clause rather than holistically forces the model to actually engage with each piece of language rather than defaulting to a vibe-summary. It's also debuggable — if a particular score looks wrong, you can read the exact prompt + the exact clause and figure out why.
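A minimal sketch of what one per-clause scoring call could look like, assuming the OpenAI Python SDK with JSON-mode output; the system prompt, model choice, and field names here are illustrative, not lifted from the repo.

```python
import json

from openai import OpenAI

client = OpenAI()

SCORER_SYSTEM = (
    "You are an adversarial contract reviewer. For the single clause provided, "
    'return JSON with "risk" (one of: low, medium, high, critical) and "scenario" '
    "(1-2 sentences: the worst-case outcome this language enables and how it happens)."
)

def call_json_agent(system_prompt: str, user_content: str) -> dict:
    # Shared pattern for all three agents: one system prompt per agent,
    # JSON-mode output, one clause per call.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice here is illustrative
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def score_clause(clause_text: str) -> dict:
    return call_json_agent(SCORER_SYSTEM, clause_text)
```

Because the model only ever sees one clause per call, it has to engage with that specific language; it can't retreat to a contract-level vibe-summary.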
Agent 3 — Negotiator
For each medium/high/critical clause, drafts a counter-proposal: alternative language plus a 1-line script for what to say in the negotiation conversation. Designed to be copy-pasteable into an email or a redline.
Why this matters: most contract-review tools stop at "here's what's risky." That's the easy half. The hard half is "here's what to ask for instead." The Negotiator agent closes the loop — you don't just learn there's a problem, you walk away with the language to fix it.
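Continuing the same sketch (and reusing the call_json_agent helper from the Scorer example), the gating and counter-proposal step might look roughly like this; again, the prompt and field names are mine, not the repo's.

```python
NEGOTIATOR_SYSTEM = (
    "You are negotiating on behalf of the signer. Given a clause and the risk "
    'scenario flagged for it, return JSON with "alternative_language" (redline-ready '
    'replacement text) and "script" (one sentence the signer can say or email to '
    "request the change)."
)

def negotiate_clause(clause_text: str, scored: dict) -> dict | None:
    # Only clauses the Scorer rated medium or worse get a counter-proposal.
    if scored["risk"] not in {"medium", "high", "critical"}:
        return None
    prompt = f"Clause:\n{clause_text}\n\nRisk scenario:\n{scored['scenario']}"
    return call_json_agent(NEGOTIATOR_SYSTEM, prompt)
```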
The stack and the build
- Streamlit for the UI — paste-text-in, get-cards-out. No file upload, no auth, no persistence. Get out of the user's way.
- OpenAI API for the three agents — different system prompts, structured JSON output schema, parallel calls per clause for the scorer/negotiator pair.
- Python orchestration tying it together. ~600 lines. The agents themselves are mostly prompt design plus schema validation; the interesting code is the orchestration that handles partial failures and clause batching.
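As a sketch of what that orchestration could look like, here's a per-clause fan-out with partial-failure handling, built on the Clause, score_clause, and negotiate_clause sketches above; the real code in the repo will differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyze(clauses: list[Clause], max_workers: int = 8) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fan out: one scorer/negotiator pair per clause, run concurrently.
        futures = {pool.submit(score_clause, c.text): c for c in clauses}
        for fut in as_completed(futures):
            clause = futures[fut]
            try:
                scored = fut.result()
                proposal = negotiate_clause(clause.text, scored)
            except Exception as exc:
                # Partial failure: record the error for this clause and keep going,
                # so the UI can still render results for everything else.
                results.append({"clause": clause, "error": str(exc)})
                continue
            results.append({"clause": clause, "score": scored, "proposal": proposal})
    # Restore document order for display.
    results.sort(key=lambda r: r["clause"].position)
    return results
```

The per-future try/except is the partial-failure handling: one malformed JSON response or timeout becomes an error card on that clause instead of sinking the whole analysis.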
Total build time: ~36 hours of actual coding spread across the 48-hour window. The remaining 12 hours were prompt iteration — getting the Adversarial Scorer to be specific rather than generic was 80% of the design work.
What surprised me
The Negotiator was the easiest agent to make good. I expected it to be the hardest, because "draft contract language" sounds like a senior-lawyer skill. In practice, once the Scorer had identified what was wrong with a clause, getting the model to draft alternative language was straightforward. Most of the difficulty in legal drafting is identifying the issue, not phrasing the fix.
The Extractor was harder than expected. Contract structure is inconsistent: some contracts are cleanly formatted with numbered clauses, others are walls of text. Some use defined terms; others reference "the Party of the First Part." Getting the extractor to produce a clean, structured list across that variance took more iteration than the scoring or drafting steps.
Per-clause scoring beat holistic scoring. An early prototype let one agent see the whole contract and rate each clause in context. The output looked sophisticated but degraded quickly on long contracts — clauses near the end got shallow analysis. Splitting into per-clause calls (slower, more API spend) produced consistently better outputs.
What I'd do differently
A second-opinion agent. Right now the Scorer's call is final. Adding a second agent that argues the opposite position ("here's why this clause is fine") and a third that reconciles them would catch overcalls. This is the same pattern as having two reviewers + a tiebreaker on a security finding.
A redline export. Currently the Negotiator's output sits in the UI. Letting users export a clean redlined version of the original contract (with track-changes-style markup) would close another step in the workflow.
Domain specialization. A SaaS contract has different risk patterns than a freelance work-for-hire agreement than a residential lease. Letting the user pick a contract type up front, and routing to a domain-specific Scorer prompt, would tighten outputs significantly.
Persistence. The current build is stateless — close the tab and your analysis is gone. For real use this needs at least optional save-to-account.
Why I built this and not something else
Two reasons. First, contracts are a domain where adversarial thinking actually maps cleanly: every clause has a counter-party who wrote it for their benefit, not yours. That's the same mental model I use in security work, just applied to a different surface.
Second, I wanted something useful to non-security people. Most of my work is illegible to the people I love and respect — my mother, my pastor, my friends from before this career. "I evaluate adversarial robustness in frontier model deployments" means nothing to them. "It reads contracts and tells you what to push back on" they understood immediately. Some of them have used it.
That's a small thing, but it's the thing.
Try it
Live demo at adverse-insight.streamlit.app — paste any contract you're allowed to share, see what comes back. Code at github.com/chima-ukachukwu-sec/adverse-insight.
Feedback welcome at chima.ukachukwu.sec@gmail.com.