Building a Zendesk Training Gym for LLM Agents: A Curriculum Engineer's Field Notes
How I built a mock Zendesk environment with state machine verification, Boltzmann adaptive curriculum, and WebRL self-evolving tasks — a functional RLVR training gym built in two days for a Curriculum Engineer interview.
Update (March 2026): Since the original publication, I trained Qwen2.5-1.5B on this gym using GRPO with six different reward and curriculum configurations. The best variant (binary accuracy reward) reached 86% task completion — including 70% on hard multi-step tasks. Results and analysis are in Section 10. The environment design, reward shaping analysis, and curriculum framework described below remain unchanged.
1. Why Training Environments Matter More Than Compute
There is a seductive idea circulating in the ML community: that the primary bottleneck to capable AI agents is compute. Train longer, on bigger models, with more RLHF signal, and the capability emerges. After spending months building an IT ticketing system and then a training environment on top of it, I am convinced this framing is dangerously incomplete.
The bottleneck is not compute. It is verifiable signal.
DeepSeek-R1 did not become remarkable because it ran on more chips. It became remarkable because its training tasks — mathematical proofs, coding challenges — have a property almost unique in the space of natural language: you can check the answer without a human. The answer is either correct or it is not. This binary verifiability is what makes RLVR (Reinforcement Learning from Verifiable Rewards) work at all.
When I started thinking about training LLM agents to operate inside enterprise software tools — Zendesk, Jira, Salesforce — I realized these environments have exactly the same property. An IT ticket is either assigned or it is not. A workflow state is either in_progress or it is not. You can check the database. No human judge required.
That realization led directly to what I built: a mock Zendesk API in FastAPI, designed from first principles as a training environment for LLM agents, with a five-task curriculum, hybrid reward shaping, state-based verification, and GRPO-compatible normalization. Here is how it came together — and more importantly, why each piece was designed the way it was.
Key Insight: Enterprise SaaS tools are underexplored RLVR environments. They have verifiable state, structured APIs, and real-world complexity gradients — everything you need for curriculum learning, without the ambiguity of open-ended language tasks.
2. What Is RLVR and Why Zendesk?
RLVR — Reinforcement Learning from Verifiable Rewards — is the subset of RL where the reward signal comes from a deterministic verifier rather than a trained reward model or human rater. The canonical examples are math (check the final numeric answer) and code (run the tests). The appeal is obvious: no reward model drift, no preference dataset, no annotation cost.
For agentic tasks, the literature has formalized this under the POMDP framework. AgentBench defines LLM agent evaluation as a Partially Observable Markov Decision Process: the agent sees only the API responses (observation), not the full system state (the database). It takes actions by calling endpoints. The environment transitions deterministically. A verifier checks the final state.
Zendesk (or a mock of it) fits this formalization almost perfectly:
- State space: ticket records in a database (status, assignee, priority, metadata)
- Action space: HTTP calls to a REST API (dual action space: the agent reasons in text, then executes a tool call)
- Observations: JSON responses from the API
- Transitions: deterministic, enforced by the state machine
- Verifier: check the actual database state, not the agent's claim
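The loop this formalization implies can be sketched end to end. This is a minimal, self-contained illustration, not code from the repo: `MockZendesk` stands in for the FastAPI app and `agent_policy` for the LLM; in the real system these messages travel over HTTP.

```python
# Minimal sketch of the POMDP loop: action in, observation out, verifier
# reads hidden state. MockZendesk and agent_policy are illustrative stand-ins.
class MockZendesk:
    """Toy environment: the hidden state is the ticket table."""

    def __init__(self):
        self.tickets = {}   # hidden state (the agent never sees this dict)
        self.next_id = 1

    def step(self, action: dict) -> dict:
        """Apply one action; return only the observation (a JSON-like dict)."""
        if (action["method"], action["endpoint"]) == ("POST", "/tickets"):
            tid, self.next_id = self.next_id, self.next_id + 1
            self.tickets[tid] = {"id": tid, "status": "pending", **action["params"]}
            return {"status_code": 201, "body": self.tickets[tid]}
        return {"status_code": 404, "body": {"error": "unknown endpoint"}}

def agent_policy(observation) -> dict:
    # A real agent conditions on past observations; here, one fixed action.
    return {"method": "POST", "endpoint": "/tickets",
            "params": {"subject": "VPN down", "priority": "urgent"}}

def verify(env: MockZendesk) -> bool:
    # The verifier reads the hidden state directly, never the agent's claim.
    return any(t["priority"] == "urgent" for t in env.tickets.values())

env = MockZendesk()
obs = env.step(agent_policy(None))   # one-step episode (a Task 1 analogue)
```

The agent only ever sees `obs`; the verifier only ever sees `env.tickets`. That asymmetry is the whole point.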
The additional advantage of an IT ticketing domain is that it has natural difficulty gradients. Creating a ticket is easy. Following a multi-step lifecycle, discovering error constraints, and rebalancing agent workloads across state-machine boundaries is hard. That gradient is exactly what curriculum learning requires.
Key Insight: The best RLVR environments share three properties: verifiable final state, deterministic transitions, and natural difficulty gradients. IT ticketing systems have all three built in by design — because humans built them to enforce exactly these properties.
3. The Environment: POMDP Meets IT Ticketing
API surface
The mock Zendesk API is a FastAPI application with seven core endpoints:
| Method | Endpoint | Description |
| ------ | -------- | ----------- |
| POST | /tickets | Create a new ticket |
| GET | /tickets/{id} | Retrieve ticket state |
| POST | /tickets/{id}/assign | Assign to an agent (pending → assigned) |
| POST | /tickets/{id}/start | Begin work (assigned → in_progress) |
| POST | /tickets/{id}/resolve | Mark resolved (in_progress → resolved) |
| POST | /tickets/{id}/close | Close ticket (resolved → closed) |
| POST | /tickets/{id}/reopen | Attempt to reopen (always fails if closed) |
The state machine
Every ticket follows a strict finite state machine. Transitions outside the valid graph return HTTP 400 with an explanatory error message. This is not an accident of design — it is the core pedagogical mechanism. The agent must discover the graph through interaction:
pending → assigned → in_progress → resolved → closed → reopen ✗

```python
# core/state_machine.py
from datetime import datetime

from fastapi import HTTPException

VALID_TRANSITIONS = {
    "pending": ["assigned"],
    "assigned": ["in_progress"],
    "in_progress": ["resolved"],
    "resolved": ["closed"],
    "closed": [],  # terminal state — no valid transitions
}

def transition(ticket: Ticket, target: str) -> Ticket:  # Ticket: the ORM model
    if target not in VALID_TRANSITIONS[ticket.status]:
        raise HTTPException(
            status_code=400,
            detail=f"Cannot transition from {ticket.status!r} to {target!r}",
        )
    ticket.status = target
    ticket.updated_at = datetime.utcnow()
    return ticket
```

Trajectory recorder middleware
Every HTTP call made during a training session is intercepted by middleware and appended to a session trajectory. This is the raw training data. The design follows WebArena's principle of full reproducibility: every interaction is logged with enough context to replay or analyze the episode.
```python
# middleware/recorder.py
from dataclasses import dataclass

@dataclass
class Step:
    step_num: int
    method: str
    endpoint: str
    params: dict
    status_code: int
    success: bool
    step_reward: float
    timestamp: str

# Saved per session as JSONL training data
# session_id → List[Step] → final ORM reward → merged trajectory
```

The trajectory format is intentionally compatible with standard RLVR training pipelines. Each completed session produces a JSON file: a list of steps with step rewards, plus the final outcome reward and the GRPO-normalized advantage. You can pipe this directly into a training loop.
4. Designing the 5-Task Curriculum
Vygotsky's Zone of Proximal Development — the idea that optimal learning happens at tasks that are neither trivially easy nor hopelessly hard — is not just educational philosophy. It has mathematical backing in the RL literature. Three independent papers (VCRL, SPEED-RL, and ODF) have each shown, via different methods, that training signal variance is maximized at a task success rate of approximately 50%. Too easy, and all rollouts are positive — no gradient. Too hard, and all rollouts fail — again no gradient.
WebRL operationalized this with a concrete curriculum filter: include tasks with success rate between 5% and 75%. Their self-evolving curriculum improved web agent performance from 4.8% to 42.4% — a 783% increase. The insight was not a new algorithm; it was a principled way to select which tasks to train on.
With this in mind, I designed five tasks ordered by difficulty, each targeting a specific competency:
| Task | Difficulty | Steps | Competency Targeted |
| ------ | ----------- | ------- | --------------------- |
| 1 — Create urgent ticket | Easy | 1 | Basic API structure, parameter formatting |
| 2 — Create + assign | Easy | 2 | Sequential state transitions, agent IDs |
| 3 — Full lifecycle | Medium | 5 | Ordered FSM traversal, no shortcuts |
| 4 — Reopen constraint | Hard | 7 | Error recovery, terminal state understanding |
| 5 — Workload rebalancing | Hard | 9 | Multi-ticket reasoning, state-machine constraints |
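The table above can be captured in a small declarative spec. This is a hypothetical schema sketch: the `TaskSpec` fields, the prompt strings, and the `max_steps` values (set to roughly 2× the expected steps) are my assumptions, not the repo's actual task format.

```python
# Hypothetical task-definition schema for the five-task curriculum.
# Field names and max_steps values are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    difficulty: str      # "easy" | "medium" | "hard"
    expected_steps: int  # length of the optimal trajectory
    max_steps: int       # truncation limit (feeds the reward shaping below)
    prompt: str

CURRICULUM = [
    TaskSpec("task_1", "easy",   1, 3,  "Create an urgent ticket for a VPN outage."),
    TaskSpec("task_2", "easy",   2, 5,  "Create a ticket and assign it to IT001."),
    TaskSpec("task_3", "medium", 5, 10, "Take a ticket through the full lifecycle."),
    TaskSpec("task_4", "hard",   7, 14, "Handle a complaint about a closed ticket."),
    TaskSpec("task_5", "hard",   9, 18, "Rebalance IT001's in-progress workload."),
]
```

Keeping the spec declarative means the curriculum manager can reason over difficulty and step budgets without knowing anything about the API surface.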
Task 4: The Closed Ticket Trap
Task 4 deserves special attention because it was inspired by a real pattern I observed while building the IT ticketing system: users frequently re-report issues after a ticket is prematurely closed by a support agent. The correct behavior is to create a new ticket referencing the original, not to attempt to reopen the closed one.
The task script deliberately leads the agent into the wall: it asks the agent to handle a complaint about a closed ticket. A naive agent will try POST /tickets/42/reopen. This returns HTTP 400. The agent must then understand the error, accept the terminal state, and create a new ticket with a reference note. This is a seven-step task precisely because of the detour through the expected failure.
Task 5: The Reassignment Deadlock
Task 5 was inspired by a constraint I discovered while implementing the state machine: you cannot directly reassign an in_progress ticket. The state machine only allows assignment from pending or after reverting to assigned — which is itself not a valid transition from in_progress. So an agent facing "IT001 has three urgent in-progress tickets, reassign the oldest to IT002" must reason about the graph topology before taking any action.
The correct solution is to resolve the ticket, close it, and create a new one assigned to IT002 — a non-obvious sequence that requires both state-machine knowledge and multi-ticket context tracking. Nine steps. No shortcuts.
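One plausible nine-step decomposition of that solution, written against the endpoints from Section 3. The specific ticket IDs and the exact ordering (three inspection reads, one deliberate probe, then the resolve/close/recreate sequence) are my reconstruction, not the canonical task script.

```python
# A hypothetical reference trajectory for Task 5. Ticket IDs (101-104)
# and step ordering are illustrative, not taken from the repo.
REFERENCE_SOLUTION = [
    ("GET",  "/tickets/101"),          # inspect IT001's three in-progress tickets
    ("GET",  "/tickets/102"),
    ("GET",  "/tickets/103"),
    ("POST", "/tickets/101/assign"),   # expected failure: in_progress ticket
    ("POST", "/tickets/101/resolve"),  # oldest ticket: resolve first ...
    ("POST", "/tickets/101/close"),    # ... then close (terminal state)
    ("POST", "/tickets"),              # recreate the work item as a new ticket
    ("GET",  "/tickets/104"),          # confirm the new ticket exists ...
    ("POST", "/tickets/104/assign"),   # ... and assign it to IT002 (from pending)
]
```

Note the probe at step 4: the failed assign is part of the intended trajectory, which is why the reward design below treats expected failures specially.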
Key Insight: The best curriculum tasks come from real operational experience. Both Task 4 and Task 5 encode constraints that bite real support engineers — which means an agent that masters them is learning something genuinely useful, not just a synthetic benchmark.
5. Reward Shaping: Beyond ±1
The simplest possible reward for an agentic task is binary: +1 if the final state matches the goal, -1 otherwise. DeepSeek-R1 uses exactly this for math and coding, and it works because those tasks have a clear, single-step verifiable endpoint. Agentic tasks are different: the agent makes many sequential decisions, and a pure outcome reward gives no gradient signal until the very end of a long trajectory.
OpenAI's "Let's Verify Step by Step" showed that Process Reward Models (PRM) outperform Outcome Reward Models (ORM) on mathematical reasoning: 78.2% vs 72.4% on the MATH benchmark. The key advantage is that PRM provides dense signal at every step, not just at the end. For a 9-step agentic task, this matters enormously.
I implemented a hybrid ORM + PRM reward (v4 of the reward function):
```python
# rewards/hybrid_v4.py

# ── Outcome Reward (ORM) ──────────────────────────────────────────
def outcome_reward(success: bool, n_steps: int,
                   expected: int, max_steps: int) -> float:
    if not success:
        return -1.0
    if n_steps > max_steps:
        return 0.0   # suspected reward hacking — data discarded, not penalized
    if n_steps <= expected + 1:
        return +1.0  # optimal
    return +0.5      # correct but inefficient

# ── Step Reward (PRM) ─────────────────────────────────────────────
def step_reward(status_code: int,
                is_expected_failure: bool) -> float:
    if 200 <= status_code < 300:
        return +0.10  # successful API call
    if is_expected_failure:
        return +0.05  # expected error — agent is probing the constraint
    return -0.05      # unexpected failure

# ── Combined ──────────────────────────────────────────────────────
def final_reward(orm: float, process_rewards: list[float]) -> float:
    prm_raw = sum(process_rewards)
    prm_clamp = max(-1.0, min(1.0, prm_raw))
    return orm * 0.7 + prm_clamp * 0.3
```

Why 0.0 and not -1.0 for excessive steps
The reward for exceeding max_steps is zero, not negative. This is a deliberate choice inspired by DAPO's analysis of reward hacking in RLVR. If you penalize a model heavily for taking too many steps, you train it to prefer inaction. The model learns that the safest strategy is to do nothing, or to complete tasks minimally and exit early — even when more exploration was needed. Setting the reward to zero means the data is ignored during training (no gradient signal), which is neutral rather than actively counterproductive.
Key Insight: In reward design, the question is not just "what reward does the right behavior get?" but "what behavior does the wrong reward encourage?" A large negative penalty for over-stepping trains a passive, exploration-averse model. Zero is a much safer default for out-of-distribution behavior you want to discard.
Expected failures: teaching the constraint
Tasks 4 and 5 both involve calling an endpoint that will fail — and should fail. In Task 4, calling /reopen on a closed ticket is the expected probe. In Task 5, calling /assign on an in-progress ticket is the expected boundary discovery. These are defined in an EXPECTED_FAILURES registry per task, and the PRM gives a small positive reward (+0.05) when these expected failures occur.
```python
# rewards/expected_failures.py
EXPECTED_FAILURES = {
    "task_4": [
        {"endpoint": r"/tickets/\d+/reopen", "status": 400}
    ],
    "task_5": [
        {"endpoint": r"/tickets/\d+/assign", "status": 400,
         "context": "ticket_status == 'in_progress'"}
    ],
}
```

This design follows the VPRM principle: verifiable process reward models where each step signal is derived from deterministic rules (the state machine and the expected-failure registry), not from a neural network judge. The result is a dense, trustworthy reward signal that is both interpretable and manipulation-resistant.
6. Verifier Design: The Ceiling of Your Training
I have come to believe that the verifier is the most underappreciated component in an RLVR pipeline. A weak verifier sets a hard ceiling on what your model can learn — and worse, it can actively mislead training with false positives.
TinyV found that 38.5% of rule-based rejections in standard RLVR evaluation were actually correct answers that the rule had misclassified. The "Pitfalls of Rule- and Model-based Verifiers" paper reported that rule-based verifiers achieve precision above 99% but recall of only 78–92%. And LLM judge verifiers are easily exploited: a single stray { character in the output can shift a model-based judge's score substantially.
The solution for an API-based training environment is straightforward, and it is the single most important architectural decision in the whole system: verify against the actual database state, not the agent's claim about that state.
```python
# verifier/state_verifier.py
from sqlalchemy.orm import Session

def verify_task_3(session_id: str, ticket_id: int, db: Session) -> bool:
    """
    Task 3: full lifecycle pending → closed.
    We do NOT ask the agent what happened.
    We read the database directly.
    """
    ticket = db.query(Ticket).filter(Ticket.id == ticket_id).first()
    if not ticket:
        return False
    return (
        ticket.status == "closed"
        and ticket.assignee_id is not None
        and ticket.closed_at is not None
    )

def verify_task_4(session_id: str, original_id: int, db: Session) -> bool:
    """
    Task 4: original ticket remains closed; a new ticket references it.
    We check both conditions in the DB, not in the agent's output text.
    """
    original = db.query(Ticket).filter(Ticket.id == original_id).first()
    if not original or original.status != "closed":
        return False  # must remain closed
    followup = (
        db.query(Ticket)
        .filter(Ticket.related_ticket_id == original_id)
        .first()
    )
    return followup is not None  # new ticket must exist with reference
```

This approach eliminates the entire class of false-positive verifier errors. The agent cannot hallucinate a successful resolution. The state either exists in the database or it does not. Every reward signal in this system derives from a database read, making it as close to a ground-truth verifier as you can get without a human in the loop.
Key Insight: In API-based environments, state-based verification is trivially achievable and should always be preferred over parsing agent output. The agent's claims about what happened are irrelevant. The database state is the ground truth. This completely bypasses the recall problem that plagues rule-based text verifiers.
7. GRPO Normalization: Making Rewards Comparable Across Tasks
When you have five tasks with different step counts, different expected rewards, and different difficulty levels, raw reward values are not directly comparable across tasks. A +0.7 on Task 1 (one step, trivial) and a +0.7 on Task 5 (nine steps, hard) carry very different information about policy quality.
DeepSeek-R1 addressed this with GRPO (Group Relative Policy Optimization), which normalizes rewards within a group of rollouts for the same prompt. The advantage for each rollout is:
advantage_i = (reward_i − mean(rewards in group)) / std(rewards in group)
I implemented this as a standalone endpoint so that the normalization step can be applied after each training batch without being coupled to any specific training framework:
```python
# api/normalize.py — GET /session/normalize
from collections import defaultdict
from statistics import mean, pstdev

def grpo_normalize(trajectories: list[Trajectory]) -> list[NormalizedTrajectory]:
    # Group final rewards by task_id
    groups: dict[str, list[float]] = defaultdict(list)
    for t in trajectories:
        groups[t.task_id].append(t.final_reward)
    stats = {
        task_id: (mean(rewards), pstdev(rewards) or 1e-8)
        for task_id, rewards in groups.items()
    }
    return [
        NormalizedTrajectory(
            **t.__dict__,
            advantage=(t.final_reward - stats[t.task_id][0]) / stats[t.task_id][1],
        )
        for t in trajectories
    ]
```

The key design decision here is grouping by task_id rather than across all tasks simultaneously. This ensures that the normalization preserves relative quality within each task while still allowing inter-task comparison at the training loop level. An agent that is mediocre at Task 5 but excellent at Task 1 should not have its Task 5 performance inflated just because Task 1 rewards are high.
Key Insight: GRPO normalization is not just a mathematical nicety — it is what allows a curriculum with heterogeneous task difficulty to produce a stable gradient signal. Without it, hard tasks with low absolute rewards would be systematically underweighted during training.
8. Adaptive Curriculum: What Comes Next
The current adaptive curriculum endpoint (GET /curriculum/next) uses a simple threshold rule: if a task's recent success rate exceeds 80%, the agent is promoted to the next task. If it falls below 40%, the agent is demoted. This works but is crude. It treats the curriculum as a linear sequence and ignores inter-task relationships.
Three papers describe more sophisticated approaches that I plan to implement:
DAPO: Binary Batch Filtering
DAPO uses a simpler filter than a full curriculum algorithm: reject any training batch where all rollouts pass (100% success) or all fail (0% success). These batches carry no gradient information and waste compute. DAPO reported 50% fewer training steps than DeepSeek-R1-Zero on Qwen-32B. The implementation here would be straightforward: sample k rollouts per task, discard the batch if the variance is zero.
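A minimal sketch of that filter, with illustrative reward groups (the reward values and task names here are made up for the example):

```python
# DAPO-style batch filtering: drop any rollout group whose rewards are
# all identical — zero variance means zero gradient signal.
def keep_batch(rewards: list[float]) -> bool:
    """True iff the group of k rollout rewards carries gradient signal."""
    return len(set(rewards)) > 1   # variance is zero iff all rewards are equal

batches = {
    "task_1": [1.0, 1.0, 1.0, 1.0],     # mastered: all pass → discard
    "task_3": [1.0, -1.0, 0.5, -1.0],   # mixed outcomes → keep
    "task_5": [-1.0, -1.0, -1.0, -1.0], # hopeless: all fail → discard
}
kept = {task: r for task, r in batches.items() if keep_batch(r)}
```

Exact float equality is fine here because the reward function emits a small discrete set of values; with continuous rewards you would compare the variance against a small epsilon instead.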
SEC: Boltzmann Task Sampling
SEC (Self-Paced Episodic Curriculum) frames task selection as a Multi-Armed Bandit problem. Each task has a Q-value, updated based on reward signals. Task selection follows Boltzmann (softmax) sampling:
P(task_i) ∝ exp(Q(task_i) / τ)
where τ is a temperature parameter. High τ = uniform exploration; low τ = exploitation of the best current task. The key property is that mastered tasks are naturally de-prioritized as their Q-values converge — no manual threshold needed. I have designed the /curriculum/next endpoint to accept an optional algorithm=sec parameter that will activate this path once implemented.
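A minimal sketch of Boltzmann sampling over per-task Q-values. The Q-values and temperatures below are illustrative, and the max-subtraction is a standard numerical-stability trick rather than anything SEC-specific.

```python
# Softmax (Boltzmann) task sampling: P(task_i) ∝ exp(Q(task_i) / τ).
import math
import random

def boltzmann_probs(q: dict[str, float], tau: float) -> dict[str, float]:
    m = max(q.values())   # subtract the max Q for numerical stability
    exps = {task: math.exp((v - m) / tau) for task, v in q.items()}
    z = sum(exps.values())
    return {task: e / z for task, e in exps.items()}

def sample_task(q: dict[str, float], tau: float, rng=random) -> str:
    probs = boltzmann_probs(q, tau)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

q_values = {"task_3": 0.9, "task_4": 0.4, "task_5": 0.1}
probs = boltzmann_probs(q_values, tau=0.5)   # low τ: concentrated on task_3
```

Raising τ flattens the distribution toward uniform exploration; lowering it concentrates sampling on the highest-Q task, which is exactly the exploration/exploitation dial described above.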
VCRL: Variance-Based Curriculum with Replay
VCRL adds a replay memory to the curriculum: tasks that are neither mastered nor hopeless are kept in a priority queue, weighted by reward variance. High variance = still learning = high priority. The paper reports +4.67 points over GSPO baseline on reasoning benchmarks. The replay mechanism is particularly relevant for Tasks 4 and 5, which may require many revisits before the agent internalizes the state-machine constraints.
```python
# curriculum/adaptive.py — current + planned
class CurriculumManager:
    def next_task(self, agent_id: str, algorithm: str = "threshold") -> str:
        if algorithm == "threshold":
            return self._threshold_select(agent_id)        # current
        elif algorithm == "sec":
            return self._boltzmann_select(agent_id)        # planned
        elif algorithm == "vcrl":
            return self._variance_replay_select(agent_id)  # planned
        raise ValueError(f"Unknown algorithm: {algorithm}")

    def _threshold_select(self, agent_id: str) -> str:
        # 80% → promote, 40% → demote
        recent = self.get_recent_results(agent_id, n=10)
        rate = sum(1 for r in recent if r.success) / len(recent)
        if rate > 0.80:
            return self.promote(agent_id)
        if rate < 0.40:
            return self.demote(agent_id)
        return self.current_task(agent_id)
```

Key Insight: The threshold curriculum is a reasonable baseline, but it has a fundamental flaw: it treats the curriculum as linear. Real agents benefit from non-linear revisitation — sometimes going back to Task 2 mid-way through Task 5 because a previously mastered competency has drifted. Boltzmann sampling naturally handles this; thresholds do not.
9. Lessons Learned and What I'd Do in Production
Lesson 1: The state machine is the curriculum
Before I built the explicit task definitions, I was thinking about the curriculum as a separate layer on top of the API. Wrong framing. The state machine is the curriculum. Every illegal transition is a learning signal. Every valid path through the FSM defines a competency level. The curriculum tasks are just named paths through a graph I had already designed.
This has a practical implication: the difficulty of a new task can be estimated from graph properties (number of edges traversed, branching factor at each node, number of dead ends encountered) before running any rollouts. You can pre-annotate the task difficulty analytically.
Lesson 2: Expected failures are underrated as a training signal
I almost removed the EXPECTED_FAILURES registry in an early version of the reward function, treating all 400 errors as equally negative. That would have been a mistake. Tasks 4 and 5 require the agent to intentionally hit a boundary to confirm its model of the state machine. Penalizing that probe is exactly backwards. The expected-failure positive reward is small (+0.05) but directionally correct — and directional correctness is what drives the gradient.
Lesson 3: The verifier architecture is the long pole in the tent
In early design, I considered a text-parsing verifier: extract the final ticket status from the agent's response text, match against expected. This is the approach most benchmark harnesses take, and — as TinyV documented — it fails at a 38.5% false-negative rate. The state-based database verifier takes two extra lines of code and eliminates the entire problem category.
What I'd do in production
The current implementation uses an in-memory SQLite database (via SQLAlchemy) and stores trajectories as flat JSON files. For a production training pipeline, I would change three things:
- PostgreSQL instead of SQLite — for parallel rollout support. When you run 64 episodes simultaneously (the DAPO batch size recommendation), SQLite's write lock becomes a bottleneck. PostgreSQL handles concurrent writes correctly.
- Redis Queue for trajectory buffering — decouple the rollout collection from the normalization and training steps. Episodes complete at different times; a queue absorbs the variance and allows batch assembly by task_id for GRPO normalization.
- Docker containers per episode — the WebArena approach. Each rollout gets a fresh container with a clean database. No state leakage between episodes. The current prototype uses session isolation, which is sufficient for sequential rollouts but would break under parallelism.
```python
# production/architecture.py (sketch)

# Layer 1: Rollout Workers (Docker containers)
#   - Each worker: fresh FastAPI + fresh SQLite (or PostgreSQL schema)
#   - Agent submits actions → worker records trajectory → pushes to Redis Queue

# Layer 2: Trajectory Collector (Redis Queue)
#   - Assembles batches by task_id
#   - Triggers GRPO normalization when batch_size ≥ N

# Layer 3: Training Adapter
#   - Reads normalized trajectories
#   - Formats as (prompt, action_sequence, advantage) tuples
#   - Feeds to PPO / GRPO / DAPO training loop

# Layer 4: Curriculum Controller
#   - Monitors per-agent per-task success rates
#   - Adjusts task assignment weights (SEC Boltzmann sampling)
#   - Writes new task assignments back to Rollout Worker queue
```

On τ-bench and the evaluation gap
τ-bench introduced the pass^k metric for agentic evaluation: an agent passes a task only if it succeeds consistently across k trials, not just once. This matters because LLM agents have high variance — a single success is not a reliable signal. My evaluation endpoint currently reports single-run success rates. In production, I would implement pass^5: five independent rollouts per task, require at least four successes. This separates genuine competency from lucky single-run performance.
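The pass^k estimate can be computed combinatorially from c successes in n trials. The form below mirrors the unbiased pass@k estimator from the code-generation literature; I am assuming τ-bench's estimator takes the same shape, so treat this as a sketch.

```python
# pass^k sketch: estimated probability that k trials, drawn without
# replacement from n observed trials with c successes, all succeed.
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """C(c, k) / C(n, k); 0.0 when fewer than k successes were observed."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)
```

Under the "five rollouts, require at least four successes" rule described above, a task with 4/5 successes still scores zero on strict pass^5, which is exactly the gap between consistent competency and a lucky majority.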
The Gymnasium paper also raised a distinction I had not thought about carefully: terminated vs. truncated episodes. A terminated episode means the environment reached a natural end state (ticket closed, task verified). A truncated episode means we hit max_steps and stopped artificially. These two outcomes should be handled differently in the training loop — truncated episodes should not update the value baseline in the same way as terminated ones. I currently conflate the two; this is a known limitation.
10. Training Results: The Gym Meets the Athlete
After building the environment described above, I ran the full RLVR training loop: Qwen2.5-1.5B as the base model, GRPO as the RL algorithm, and six different configurations varying reward function, curriculum strategy, and training duration. Each variant trained on 100-300 tasks per epoch, evaluated on 30 held-out tasks (10 easy, 10 medium, 10 hard).
The Comparison Table
| Variant | Reward Design | Easy | Medium | Hard | Total |
| --------- | -------------- | ------ | -------- | ------ | ----------- |
| v1 (baseline) | format only | 10/10 | 9/10 | 0/10 | 19/30 = 63% |
| v2a | format + binary accuracy | 9/10 | 10/10 | 7/10 | 26/30 = 86% |
| v2b | v2a + Boltzmann curriculum (1 epoch) | 10/10 | 9/10 | 0/10 | 19/30 = 63% |
| v2c | format + step-by-step partial credit | 8/10 | 0/10 | 0/10 | 8/30 = 26% |
| v3a (SFT) | supervised fine-tuning | 10/10 | 4/10 | 0/10 | 14/30 = 46% |
| v4 | format + execution-based stepwise | 8/10 | 3/10 | 0/10 | 11/30 = 36% |
Key Findings
Binary rewards crush partial rewards. The clearest signal in the data: v2a (binary accuracy) at 86% vs. v2c (step-by-step partial credit) at 26%. This is not a marginal difference — it is a 3.3x gap. The partial credit variants (v2c, v4) performed worse than the baseline that used no accuracy signal at all.
Why? The partial reward created a shortcut the model exploited. When you give 0.5 reward for completing step 1 of a 2-step task, the model learns to reliably produce step 1 and stop. The error logs confirm this: the most common failure mode for v2c/v4 was "Expected 2 actions, got 1." The model became an expert at the first API call and a dropout at everything after.
This is exactly what DeepSeek-R1 and subsequent RLVR papers predict. Binary verifiable rewards force the model to solve the complete task or get nothing. There is no gradient signal for half-measures. The model either learns the full action sequence or the reward is indistinguishable from random noise.
RLVR outperforms SFT. Even the GRPO baseline (v1, 63%) outperformed supervised fine-tuning (v3a, 46%). SFT teaches the model to imitate surface patterns; RLVR teaches it to satisfy constraints. On easy tasks they are comparable (10/10 vs 10/10). The gap emerges on medium tasks (9/10 vs 4/10) where the model needs to chain actions correctly, not just produce plausible-looking output.
Hard tasks require the right reward. Only v2a solved hard tasks (7/10). Every other variant scored 0/10 on hard. This suggests that hard multi-step tasks (4-5 API calls with state dependencies) require both the correct reward signal and sufficient training signal to learn the full action chain. Partial rewards actively prevent this learning.
What This Means for Curriculum Engineering
The Boltzmann curriculum (v2b) showed no improvement over baseline in a single epoch — but this is expected. Curriculum effects compound over training time; with only 100 gradient steps, the difficulty distribution barely shifts. The curriculum infrastructure is correct (I verified the temperature annealing and difficulty sampling); it simply needs more training epochs to demonstrate its value.
The more important lesson: reward design dominates curriculum design. Even a perfect curriculum cannot compensate for a reward function that creates shortcuts. Get the reward right first, then optimize the curriculum.
11. Conclusion
Building this training environment taught me something I didn't expect: the hard problems in RLVR are not the machine learning problems. The ML is almost embarrassingly straightforward once you have the right substrate. The hard problems are the engineering problems: what is the right state machine, how do you define an expected failure, how do you verify state without trusting the agent's self-report, and how do you design a curriculum that stays in the Zone of Proximal Development across a wide range of agent ability levels.
These are fundamentally domain engineering questions. The papers referenced throughout this post — WebRL, DAPO, VCRL, TinyV — are doing the theoretical work that makes the domain engineering choices legible. But the choices themselves require someone who understands the domain: in this case, what it actually feels like to work in IT support, to hit a state-machine constraint at 5pm on a Friday, to discover that a closed ticket cannot be reopened and you have to file a new one with a reference note.
That operational knowledge is, in a real sense, the curriculum. The code is just how you make it legible to a training loop.
If you're building a similar environment for a different enterprise domain — CRM, ITSM, project management — the framework is directly transferable. Find the state machine, annotate the valid transitions, identify the boundary conditions that trap real users, build the tasks from those boundary conditions, and verify against database state. The reward shaping and curriculum algorithm can follow the literature almost exactly.
The environment — not the compute — is the bottleneck. Build it carefully.
References
- DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, January 2025.
- Liu, X. et al. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688. ICLR 2024.
- Carta, T. et al. The Landscape of Agentic Reinforcement Learning for LLMs. arXiv:2509.02547, September 2025.
- Zhou, S. et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854. ICLR 2024.
- Vygotsky, L. S. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, 1978. (Zone of Proximal Development.)
- Chen, Y. et al. VCRL: Variance-Based Curriculum Reinforcement Learning. arXiv:2509.19803, September 2025.
- Li, H. et al. SPEED-RL: Two-Phase Screening for Efficient RL Training. arXiv:2506.09016, June 2025.
- Zhang, W. et al. ODF: Optimal Data Filtering for Reinforcement Learning of LLMs. arXiv:2504.03380, April 2025.
- Qi, Z. et al. WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. arXiv:2411.02337. ICLR 2025.
- Lightman, H. et al. Let's Verify Step by Step. arXiv:2305.20050. OpenAI, 2023.
- Yu, Q. et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476, March 2025.
- Wang, Y. et al. VPRM: Verifiable Process Reward Models for Step-Level RLVR. arXiv:2601.17223, January 2026.
- Li, R. et al. TinyV: Rethinking Verifier Design for Scalable RLVR. arXiv:2505.14625. ICML 2025.
- Zeng, A. et al. Pitfalls of Rule- and Model-based Verifiers in RLVR. arXiv:2505.22203, May 2025.
- Klink, P. et al. SEC: Self-Paced Episodic Curriculum via Multi-Armed Bandits. arXiv:2505.14970, May 2025.
- Yao, S. et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045, June 2024.
- Towers, M. et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv:2407.17032. NeurIPS 2025.