1. Overview
This agent takes a single text prompt containing one or more precise, verifiable instruction constraints (for example: "include keyword 'paradox' exactly seven times", "respond with exactly three numbered bullet points", "nest parentheses at least 5 levels deep") and returns a text response that satisfies every constraint in the prompt. The agent is built for the IFBench benchmark but generalizes to any spec-driven task where following precise output formatting rules matters more than freeform creativity.
2. Business Value
- Benchmark credibility — Provides a single Logic agent to run against Allen AI's IFBench benchmark, producing a publishable score that demonstrates Logic's instruction-following capability relative to frontier models (OpenAI o3 at 69.3%, Claude 4 Sonnet at 42.3% on the official leaderboard).
- Spec fidelity showcase — Directly exercises Logic's core value proposition: give Logic a precise specification, get an output that follows the spec.
- Product regression guardrail — Once running, this agent can be re-invoked against the full IFBench dataset whenever a new Logic release ships, catching regressions in how carefully the platform enforces output constraints.
3. Operational Context
- When it runs: On-demand, invoked per IFBench prompt during benchmark runs.
- Who uses it: Logic's product marketing team for benchmark evaluation. Not customer-facing.
- How often: ~30 invocations per pilot cycle, ~335 for a full IFBench run.
4. Inputs
- Prompt (text, required) — The full text of the IFBench prompt. Contains the underlying question or task along with one or more precise, verifiable constraints the response must satisfy.
- Constraint Hints (text, optional) — A plain-English summary of the constraints, pre-extracted from the prompt. Used only as a clarification aid.
5. Outputs
- Response (text) — The agent's response to the prompt. Must satisfy every constraint specified in the prompt. No preamble like "Sure, here is my response:" — just the raw answer. Length and structure are determined by the constraints.
- Constraints Identified (text) — A plain-text list of every constraint the agent identified in the prompt, one per line.
- Self-Check (text) — A plain-text report of whether the Response satisfies each identified constraint. Format: "[PASS] <constraint>" or "[FAIL] <constraint> — <why>".
6. Detailed Plan & Execution Steps
- Parse the prompt. Read the prompt in full. Identify the underlying question or task (the part that is not a constraint). Identify every constraint on the response: word counts, keyword frequencies, formatting rules, structural rules, ratios, sentence composition, punctuation, code-like constraints.
- List the constraints. Produce a structured list of every constraint found in the prompt, one line per constraint.
- Draft a first response. Write an initial response that answers the underlying question while obeying every identified constraint. When constraints conflict with natural flow, constraints always win.
- Self-check every constraint. For each constraint, mechanically verify the draft response satisfies it. Count keywords exactly. Count words, sentences, paragraphs, bullets, numbers, punctuation marks, pronouns, and conjunctions. Check format rules character by character.
- Redraft if any constraint fails. If any constraint is violated, redraft and repeat the self-check. Continue until every constraint passes or three attempts have been made; if failures remain after the third attempt, return the best draft and record each remaining failure in Self-Check.
- Return the response, the constraint list, and the self-check. The Response is the primary deliverable. The other two are debugging aids.
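The self-check step above can be sketched for the keyword-frequency case. This is a minimal illustration, not the agent's implementation: `count_keyword` and `self_check` are hypothetical helper names, and only exact-count keyword constraints are modeled (real IFBench prompts include many other constraint types).

```python
import re

def count_keyword(text, keyword):
    """Whole-word, case-sensitive count of a keyword in text."""
    return len(re.findall(rf"\b{re.escape(keyword)}\b", text))

def self_check(response, constraints):
    """Verify each keyword constraint; return (passed, report_line) pairs.

    constraints maps keyword -> required exact count. Report lines follow
    the spec's Self-Check format: [PASS]/[FAIL] per constraint."""
    results = []
    for keyword, required in constraints.items():
        actual = count_keyword(response, keyword)
        if actual == required:
            results.append((True, f"[PASS] keyword {keyword}={required}"))
        else:
            results.append(
                (False, f"[FAIL] keyword {keyword}={required} — found {actual}")
            )
    return results

constraints = {"meridian": 1, "gossamer": 2}
draft = "The meridian glowed while gossamer threads met gossamer light."
for passed, line in self_check(draft, constraints):
    print(line)
```

If any line reports [FAIL], the loop in the plan above redrafts and re-checks.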
7. Validation & Quality Checks
- The Response answers the underlying question or task from the prompt.
- Every constraint in the prompt is satisfied by the Response.
- The Response contains no preamble, apology, meta-commentary, or "here is my response" wrapping.
- Keyword counts are exact. If the prompt says "exactly 3", the count is 3.
- Word counts, sentence counts, paragraph counts, bullet counts, and numeric ranges fall within the specified bounds.
- Formatting rules (newlines, indentation, bullets, options, parentheses nesting) match exactly.
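As one illustration of a mechanical formatting check, parentheses-nesting depth can be measured with a single scan. A minimal sketch; the function name and the example string are hypothetical:

```python
def max_paren_depth(text):
    """Return the maximum nesting depth of parentheses in text."""
    depth = best = 0
    for ch in text:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth = max(depth - 1, 0)  # ignore unbalanced closers
    return best

print(max_paren_depth("a (b (c (d (e (f)))))"))  # → 5
```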
8. Special Rules / Edge Cases
- Exact counts are strict. "Exactly 3" means 3. "At least 5" means 5+. "Between 30 and 73 words" means ≥30 AND ≤73.
- Keyword counts are word-boundary matches. "Include keyword 'eclipse' three times" means the literal token "eclipse" appears exactly 3 times as a whole word.
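A word-boundary match can be sketched with a regular expression. Note that "eclipsed" does not count toward "eclipse"; the helper name here is hypothetical:

```python
import re

def whole_word_count(text, word):
    """Count whole-word occurrences only; substrings do not match."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text))

print(whole_word_count("The eclipse eclipsed the eclipse.", "eclipse"))  # → 2
```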
- Constraints override natural prose. If the prompt says "each word must start with the next letter of the alphabet", the response becomes a carefully ordered list of words, even if that sacrifices coherence.
- Conflicting constraints. Prioritize the more specific constraint. Note the conflict in Self-Check.
- Constraints involving computation over the response itself (palindromes, stop-word ratios, prime-length words): do the math before finalizing. These are the constraint types where models most often fail.
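For instance, a "prime-length words" constraint can be verified before finalizing. A minimal sketch under simplifying assumptions: whitespace tokenization and stripping of common trailing punctuation, with hypothetical helper names:

```python
def is_prime(n):
    """True if n is a prime number."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def all_words_prime_length(text):
    """True if every word (punctuation stripped) has a prime length."""
    words = [w.strip(".,;:!?") for w in text.split()]
    return all(is_prime(len(w)) for w in words if w)

print(all_words_prime_length("cat ran far"))  # all length 3 → True
```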
- No chain-of-thought leakage. The Response must not contain the agent's reasoning, the constraint list, or the self-check. Those go in separate output fields.
9. Example
Input
- Prompt: "What is the female equivalent to chivalry? Include keyword meridian once in your response, keyword gossamer twice in your response, keyword eclipse three times in your response, keyword threshold five times in your response, and keyword cascade seven times in your response."
Execution
- Identifies constraints: keyword meridian=1, gossamer=2, eclipse=3, threshold=5, cascade=7.
- Drafts a short essay on courtly etiquette, weaving keywords at required frequencies.
- Self-checks keyword counts, adjusts until exact.
- Returns the polished response plus the constraint list and passing self-check.
Output
- Response: (the essay with exact keyword frequencies)
- Constraints Identified: keyword meridian=1, gossamer=2, eclipse=3, threshold=5, cascade=7
- Self-Check: [PASS] meridian=1 / [PASS] gossamer=2 / [PASS] eclipse=3 / [PASS] threshold=5 / [PASS] cascade=7

