<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Agentic Reinforcement Learning: Prompt Engineering with LLMs]]></title><description><![CDATA[Prompt Engineering with LLMs]]></description><link>https://agenticreinforcementlearning.substack.com/s/prompt-engineering-with-llms</link><image><url>https://substackcdn.com/image/fetch/$s_!Alpq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F420f4c55-6bac-4a7f-a289-344e1058f2b9_3458x3458.png</url><title>Agentic Reinforcement Learning: Prompt Engineering with LLMs</title><link>https://agenticreinforcementlearning.substack.com/s/prompt-engineering-with-llms</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 01:16:10 GMT</lastBuildDate><atom:link href="https://agenticreinforcementlearning.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Nik Bear Brown]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[agenticreinforcementlearning@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[agenticreinforcementlearning@substack.com]]></itunes:email><itunes:name><![CDATA[Nik Bear Brown]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nik Bear Brown]]></itunes:author><googleplay:owner><![CDATA[agenticreinforcementlearning@substack.com]]></googleplay:owner><googleplay:email><![CDATA[agenticreinforcementlearning@substack.com]]></googleplay:email><googleplay:author><![CDATA[Nik Bear Brown]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Stochastic Machine: LLM Output Generation]]></title><description><![CDATA[Chapter 1 - Cheat 
Sheet]]></description><link>https://agenticreinforcementlearning.substack.com/p/the-stochastic-machine-llm-output</link><guid isPermaLink="false">https://agenticreinforcementlearning.substack.com/p/the-stochastic-machine-llm-output</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Wed, 18 Feb 2026 21:17:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Alpq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F420f4c55-6bac-4a7f-a289-344e1058f2b9_3458x3458.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Cheat Sheet &#8212; Nik Bear Brown, Feb 2026</h3><div><hr></div><h2>Core Insight</h2><p>LLMs <strong>don&#8217;t retrieve answers &#8212; they sample them.</strong> Every token is a probabilistic draw from a shaped distribution. Same question + same model = different draw each time.</p><blockquote><p>&#8220;You are not configuring an intelligence. You are reshaping a distribution.&#8221;</p></blockquote><div><hr></div><h2>Key Concepts</h2><h3>Logits &#8594; Probabilities (Softmax)</h3><p>Raw scores (logits) are converted to probabilities via softmax. 
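</p><p>A minimal sketch of the conversion in Python (toy logits, not taken from any real model):</p><pre><code>import math

def softmax(logits, T=1.0):
    # subtract the max for numerical stability; only relative gaps matter
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.3, 2.1, 1.5, 0.9]      # toy scores for four candidate tokens
probs = softmax(logits)            # probs[0] is roughly 0.996: a spike
delta = logits[0] - logits[1]      # gap of 6.2 between top two: decisive</code></pre><p>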
Only <em>relative differences</em> between logits matter &#8212; one dominant token = a spike in the distribution.</p><p><strong>Confidence signal:</strong> &#916; = z<sup>(1)</sup> &#8722; z<sup>(2)</sup> (gap between top two logits)</p><ul><li><p>Large &#916; &#8594; decisive, low variance</p></li><li><p>Small &#916; &#8594; uncertain, near coin-flip</p></li></ul><div><hr></div><h2>The Three Sampling Controls</h2><table><thead><tr><th>Parameter</th><th>What It Does</th><th>Low Value</th><th>High Value</th></tr></thead><tbody><tr><td><strong>Temperature (T)</strong></td><td>Reshapes the whole distribution</td><td>Sharpens spike &#8594; deterministic</td><td>Flattens &#8594; creative/chaotic</td></tr><tr><td><strong>Top-K</strong></td><td>Keeps only K most probable tokens</td><td>Brittle, repetitive</td><td>More variety</td></tr><tr><td><strong>Top-P (nucleus)</strong></td><td>Keeps tokens until cumulative prob &#8805; p</td><td>Tight nucleus</td><td>Wide nucleus</td></tr></tbody></table><p><strong>Production default:</strong> Temperature shapes &#8594; Top-K or Top-P prunes &#8594; sample from remainder.</p><div><hr></div><h2>Temperature Behavior</h2><table><thead><tr><th>T Value</th><th>Behavior</th><th>Use When</th></tr></thead><tbody><tr><td>T = 0</td><td>Greedy decoding, fully deterministic</td><td>Auditable facts, code</td></tr><tr><td>T = 0.7</td><td>Balanced accuracy + fluency</td><td>Most tasks</td></tr><tr><td>T = 1.0+</td><td>High variance, samples from long tail</td><td>Brainstorming, creative</td></tr></tbody></table><p><strong>Entropy H(p<sub>T</sub>):</strong> Rises with temperature. Higher entropy = more output variance = more drift risk.</p><div><hr></div><h2>Top-K vs. Top-P</h2><table><thead><tr><th></th><th>Top-K</th><th>Top-P (Nucleus)</th></tr></thead><tbody><tr><td><strong>Set size</strong></td><td>Fixed (always K tokens)</td><td>Adaptive (varies with confidence)</td></tr><tr><td><strong>Confident model</strong></td><td>May include garbage tokens</td><td>Automatically tightens</td></tr><tr><td><strong>Uncertain model</strong></td><td>May cut off good tokens</td><td>Automatically expands</td></tr><tr><td><strong>Best for</strong></td><td>Predictable truncation</td><td>Adapting to model&#8217;s own uncertainty</td></tr></tbody></table><p><strong>Winner for most cases:</strong> Top-P &#8212; it self-adjusts.</p><div><hr></div><h2>Expert Default Bias</h2><p>No matter what persona/proficiency level you assign, a well-trained model&#8217;s logit gap for a <em>correct simple answer</em> is so large that sampling can&#8217;t reach the wrong answer. 
You can change <strong>presentation style</strong> &#8212; not <strong>logical execution</strong> &#8212; without fine-tuning.</p><div><hr></div><h2>Agentic Sampling Loop</h2><p>Treat sampling parameters as a <strong>feedback control surface</strong>, not fixed settings.</p><pre><code><code>draft = generate(prompt, T=0.7, top_p=0.9)
score = evaluate(draft)   # confidence, hallucination, constraints

if low_confidence:                 # lower T, tighten top_p, regenerate
    draft = generate(prompt, T=0.3, top_p=0.8)
if too_boring:                     # creative task: raise T, widen top_p
    draft = generate(prompt, T=1.0, top_p=0.95)

return best(drafts)</code></code></pre><p><strong>Self-consistency decoding:</strong> Generate N diverse paths &#8594; majority vote &#8594; +17.7% accuracy on factual tasks.</p><div><hr></div><h2>CASE Framework (Exemplar Selection)</h2><p>Choosing few-shot examples = multi-arm bandit problem.</p><ul><li><p><strong>Arm</strong> = a subset of candidate examples</p></li><li><p><strong>Reward</strong> = model accuracy on that subset</p></li><li><p><strong>CASE result:</strong> 87% fewer model calls, 7&#215; faster, +15% accuracy vs. brute force</p></li></ul><div><hr></div><h2>Design Rules by Task</h2><table><thead><tr><th>Task Type</th><th>Temperature</th><th>Truncation</th><th>Strategy</th></tr></thead><tbody><tr><td>Auditable facts</td><td>Low (&#8804; 0.3)</td><td>Tight Top-P</td><td>Single pass + verify</td></tr><tr><td>General Q&amp;A</td><td>0.5&#8211;0.7</td><td>Top-P 0.9</td><td>Single pass</td></tr><tr><td>Brainstorming</td><td>0.8&#8211;1.0</td><td>Wide Top-P</td><td>Multiple draws, pick best</td></tr><tr><td>Creative writing</td><td>High in creative sections, low in factual anchors</td><td>Top-P</td><td>Agentic loop</td></tr><tr><td>Few-shot prompting</td><td>0.7</td><td>Top-P 0.9</td><td>CASE bandit selection</td></tr></tbody></table><div><hr></div><h2>The Wonder Woman Audit (Mental Model)</h2><ul><li><p><strong>T=0:</strong> Locks on canonical facts. Reliable but brittle.</p></li><li><p><strong>T=0.7:</strong> Accurate, varies in framing. Sweet spot.</p></li><li><p><strong>T=1.2:</strong> Samples from legitimate-but-irrelevant regions. &#8220;Bizarro Wonder Woman enters the frame.&#8221;</p></li></ul><p><em>The model didn&#8217;t invent Bizarro Wonder Woman &#8212; it sampled too far into a real but wrong part of its distribution.</em></p><div><hr></div><h2>One-Line Summary</h2><p><strong>LLMs are stochastic machines. 
Temperature, Top-K, and Top-P are your steering wheel &#8212; use them intentionally, or accept whatever the distribution gives you.</strong></p>]]></content:encoded></item><item><title><![CDATA[The Stochastic Machine: Understanding LLM Output Generation]]></title><description><![CDATA[Why Every Answer Is a Roll of the Dice&#8212;and How to Load Them]]></description><link>https://agenticreinforcementlearning.substack.com/p/the-stochastic-machine-understanding</link><guid isPermaLink="false">https://agenticreinforcementlearning.substack.com/p/the-stochastic-machine-understanding</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Wed, 18 Feb 2026 21:15:06 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/188425848/00eeba2e04621170909109e95fe6a547.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>You ask a language model whether Wonder Woman&#8217;s 2017 box office gross could buy a Boeing 737. The answer comes back instantly, confident, grammatically immaculate: <em>Yes, at approximately $822 million, the film&#8217;s receipts comfortably exceed the $90&#8211;$115 million cost of a standard 737.</em></p><p>Now ask again. Same question. Same model. Different session.</p><p>The answer comes back again. Still correct. Still confident. Slightly different phrasing, maybe a detail about director Patty Jenkins, maybe a specification of the 737 MAX variant.</p><p>Ask a third time, but this time you&#8217;ve turned a dial&#8212;one you probably didn&#8217;t know existed&#8212;toward its upper limit.</p><p>This time, the answer drifts. <em>Bizarro Wonder Woman</em> appears somewhere. The Justice League is mentioned. The logical chain between box office receipts and aircraft costs starts to loosen, and you begin to suspect the model has forgotten what it was being asked.</p><p>Same model. Same question. Three different answers.</p><p>This is not a glitch. 
This is the machine working exactly as designed.</p><div><hr></div><p>Here is the thing most people get wrong about language models: they treat them like search engines with better grammar. <em>It retrieved the answer. It stored the fact. It found the quote.</em></p><p>That framing is wrong in a precise, measurable, consequential way.</p><p>In standard inference, an LLM does not retrieve a stored response. It generates a sequence one token at a time, making a probabilistic choice at each step about what comes next. The mathematical statement is compact but carries enormous weight:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H5aW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H5aW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 424w, https://substackcdn.com/image/fetch/$s_!H5aW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 848w, https://substackcdn.com/image/fetch/$s_!H5aW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 1272w, https://substackcdn.com/image/fetch/$s_!H5aW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!H5aW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png" width="1456" height="151" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:151,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13836,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikbearbrown.substack.com/i/188425420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!H5aW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 424w, https://substackcdn.com/image/fetch/$s_!H5aW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 848w, https://substackcdn.com/image/fetch/$s_!H5aW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 1272w, 
https://substackcdn.com/image/fetch/$s_!H5aW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5c662f1-1ab4-4bb5-9aa9-b84bb4b242ce_2142x222.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>x<sub>t</sub> is the next token. x<sub>&lt;t</sub> is everything that came before&#8212;your entire prompt, every word already generated. p<sub>&#952;</sub> is a probability distribution shaped by the model&#8217;s learned weights. And the &#8764; symbol is the most important character in that equation. It means <em>sampled from</em>. Not <em>equal to</em>. Not <em>retrieved from</em>. Sampled from.</p><p>The output is a draw. Every time.</p><p>Even when you see the same answer twice, you are not seeing retrieval. You are seeing a distribution so sharply peaked that the same token wins the draw twice. The moment you push that distribution toward flatness&#8212;toward equal probability across many options&#8212;the draw starts landing somewhere new.</p><div><hr></div><p>To understand how that distribution gets shaped, you need to follow the math one step further.</p><p>At each generation step, the model produces a vector of raw numbers&#8212;one number per word in its vocabulary, which for modern models means tens of thousands of numbers. These are called <strong>logits</strong>. A logit of 8.3 for the token <em>&#8220;Boeing&#8221;</em> and 2.1 for the token <em>&#8220;airplane&#8221;</em> doesn&#8217;t mean the model is 8.3% confident in one and 2.1% in the other. Logits aren&#8217;t probabilities. They&#8217;re scores. 
To convert them into something you can sample from, you apply the softmax function:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4fWB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4fWB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 424w, https://substackcdn.com/image/fetch/$s_!4fWB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 848w, https://substackcdn.com/image/fetch/$s_!4fWB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 1272w, https://substackcdn.com/image/fetch/$s_!4fWB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4fWB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png" width="1456" height="153" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:153,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10462,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikbearbrown.substack.com/i/188425420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4fWB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 424w, https://substackcdn.com/image/fetch/$s_!4fWB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 848w, https://substackcdn.com/image/fetch/$s_!4fWB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 1272w, https://substackcdn.com/image/fetch/$s_!4fWB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ae293c-1c48-4c9c-b2b8-52cd4b6cda3c_2154x226.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Two things happen inside this equation that matter enormously.</p><p>First, only <em>differences</em> between logits matter. 
Add 100 to every number in the vector and the probabilities don&#8217;t change. The distribution is about relative standing, not absolute value.</p><p>Second, a single token can dominate. If one logit is significantly higher than all others, the exponential function amplifies that gap dramatically. The token with the highest logit absorbs most of the probability mass. The distribution becomes a spike.</p><p>The gap between the top two logits&#8212;call it &#916; = z<sup>(1)</sup> &#8722; z<sup>(2)</sup>&#8212;is your informal confidence meter. Large &#916; means the model is decisive: this token, almost certainly, and here is why. Small &#916; means it&#8217;s genuinely uncertain: any of three or four tokens could come next, and which one does is close to a coin flip.</p><div><hr></div><p>That spike&#8212;or that coin flip&#8212;is what temperature controls.</p><p>Temperature T is a single number, but it reshapes the entire distribution:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iHxd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iHxd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 424w, https://substackcdn.com/image/fetch/$s_!iHxd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 848w, 
https://substackcdn.com/image/fetch/$s_!iHxd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 1272w, https://substackcdn.com/image/fetch/$s_!iHxd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iHxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png" width="1456" height="211" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:211,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29665,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikbearbrown.substack.com/i/188425420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!iHxd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 424w, 
https://substackcdn.com/image/fetch/$s_!iHxd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 848w, https://substackcdn.com/image/fetch/$s_!iHxd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 1272w, https://substackcdn.com/image/fetch/$s_!iHxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F743fa3f8-802d-4d11-a586-ea65cf8c7427_2138x310.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>When T&lt;1, you divide the logits by a number less than one, which means you <em>increase</em> the gaps between them. The distribution sharpens. The most likely token becomes even more dominant. Set T close to zero and you&#8217;ve essentially eliminated randomness: the model picks the highest-scoring token every single time. This is called greedy decoding, and it&#8217;s what most people imagine when they think about a &#8220;correct&#8221; AI answer.</p><p>When T&gt;1, you divide by a number greater than one, shrinking the gaps. The distribution flattens. Tokens that had small logits get relatively larger. Tokens that had large logits lose their dominance. You start sampling from the long tail&#8212;the weird, the unexpected, the sometimes inspired, the sometimes incoherent.</p><p>This flattening is measurable. 
The formal tool is entropy:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N52q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N52q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 424w, https://substackcdn.com/image/fetch/$s_!N52q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 848w, https://substackcdn.com/image/fetch/$s_!N52q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 1272w, https://substackcdn.com/image/fetch/$s_!N52q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N52q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png" width="1456" height="201" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:201,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22571,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikbearbrown.substack.com/i/188425420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!N52q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 424w, https://substackcdn.com/image/fetch/$s_!N52q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 848w, https://substackcdn.com/image/fetch/$s_!N52q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 1272w, https://substackcdn.com/image/fetch/$s_!N52q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe376307-6ccb-40f4-956d-06c8285fb511_2178x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>High entropy means the distribution is spread out&#8212;many tokens have non-negligible probability. Low entropy means it&#8217;s concentrated. 
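</p><p>A few lines of Python make the relationship concrete (toy logits, hypothetical values; <code>softmax</code> is the standard temperature-scaled version):</p><pre><code>import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    # Shannon entropy in bits: 0 for a pure spike, log2(N) for uniform
    return -sum(q * math.log2(q) for q in p if q)

logits = [8.3, 2.1, 1.5, 0.9]
for T in (0.5, 1.0, 2.0):
    print(T, round(entropy(softmax(logits, T)), 3))
# entropy climbs as T climbs: the distribution is flattening</code></pre><p>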
As temperature rises, entropy rises with it. As entropy rises, output variance rises with it. The model starts generating completions that differ not just in phrasing but in substance.</p><p>This is exactly what you see in the Wonder Woman experiment. At T = 0, the model locks onto the canonical facts. At T = 0.7, it stays accurate but starts varying the framing&#8212;different emphasis, different supporting details. At T = 1.2, it begins sampling from tokens that are plausible in some loose semantic sense but no longer reliably anchored to the factual core. The box office number drifts. The aircraft cost gets fuzzy. <em>Bizarro Wonder Woman</em> enters the frame.</p><div><hr></div><p>Temperature reshapes the whole distribution. But sometimes you don&#8217;t want to reshape&#8212;you want to <em>truncate</em>. That&#8217;s what Top-K and Top-P sampling do.</p><p><strong>Top-K</strong> keeps only the K most probable tokens and throws the rest away. If K = 50, you never sample from the bottom 99.9% of the vocabulary. The tail is gone. 
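</p><p>In code, the Top-K step is a sort, a slice, and a renormalization (a sketch over toy probabilities, not a production decoder):</p><pre><code>def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize over them."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]   # toy 6-token vocabulary
filtered = top_k_filter(probs, k=3)            # tokens 0, 1, 2 survive; tail dropped</code></pre><p>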
You renormalize across those 50 tokens and sample from that restricted set:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZiHS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZiHS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 424w, https://substackcdn.com/image/fetch/$s_!ZiHS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 848w, https://substackcdn.com/image/fetch/$s_!ZiHS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 1272w, https://substackcdn.com/image/fetch/$s_!ZiHS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZiHS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png" width="1456" height="213" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:213,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34951,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikbearbrown.substack.com/i/188425420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZiHS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 424w, https://substackcdn.com/image/fetch/$s_!ZiHS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 848w, https://substackcdn.com/image/fetch/$s_!ZiHS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 1272w, https://substackcdn.com/image/fetch/$s_!ZiHS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825ad0a4-63d8-4704-a3df-6e550b002845_2066x302.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Top-P</strong> (nucleus sampling) takes a more adaptive approach. 
Instead of a fixed count, it asks: what is the smallest set of tokens whose cumulative probability meets some threshold p? Sort tokens by probability, add them up until you hit the threshold, keep that set. When the model is confident&#8212;distribution is already spiked&#8212;the nucleus is small. When the model is uncertain&#8212;distribution is flat&#8212;the nucleus grows. The method adapts to the model&#8217;s own uncertainty in real time.</p><p>The practical difference matters. Top-K with K = 3 on a highly confident model and a highly uncertain model gives you very different experiences. Top-P with p = 0.9 automatically adjusts: tight when the model knows, loose when it doesn&#8217;t.</p><p>Most production systems combine all three: temperature shapes the distribution, then a truncation rule prunes it, then you sample from what&#8217;s left. The output you receive is a single draw from that shaped, pruned distribution.</p><div><hr></div><p>Now here is where this stops being a theoretical exercise and starts being a design problem.</p><p>Researchers studying exemplar selection for few-shot learning discovered that choosing <em>which examples to include in a prompt</em> is itself a stochastic optimization problem. Because the model&#8217;s response to any given set of examples is a probabilistic event, finding the &#8220;best&#8221; prompt isn&#8217;t a search through a fixed landscape&#8212;it&#8217;s an optimization problem under uncertainty.</p><p>The CASE framework&#8212;Challenger Arm Sampling for Exemplar selection&#8212;treats this as a multi-arm bandit. Each possible subset of examples is an &#8220;arm.&#8221; Pulling the arm means calling the model with that prompt and observing the reward: did the model get the answer right? The challenge is finding the best arms from an exponentially large space while minimizing the number of expensive model calls.</p><p>The results are stark. 
CASE reduces the number of required model calls by 87% compared to state-of-the-art methods. It runs up to 7x faster. It improves task accuracy by up to 15.19%. The mechanism is principled: a gap-index-based exploration strategy that maintains a &#8220;challenger set&#8221; of next-best candidates, iteratively evaluated against current leaders. The stochasticity of the machine is not fought&#8212;it&#8217;s incorporated into the optimization itself.</p><p>The deeper lesson: the same probabilistic nature that makes outputs variable is what makes systematic optimization possible. You can&#8217;t do a bandit optimization on a deterministic lookup table.</p><div><hr></div><p>There is a limit to how far sampling parameters can push a model, and the math of Expert Default Bias reveals exactly where that limit sits.</p><p>Consider a simple experiment: prompt GPT-4o to simulate students of different proficiency levels solving a straightforward arithmetic problem. A student who earns A grades. A student who earns D grades. Ask both to calculate the earnings for 50 minutes of babysitting at $12 per hour.</p><p>The A-student response is clean: 50/60 = 5/6; 5/6 &#215; 12 = $10.00. The D-student response hedges: <em>&#8220;Maybe like $10? I think she earned $10.&#8221;</em> Different tone, different confidence markers&#8212;but the same answer. The underlying calculation never changes. Not for the B-student, not for the C-student, not for the D-student.</p><p>This is not a failure of prompt engineering. It is a feature of the distribution. The probability mass for the correct answer to a simple arithmetic problem is so dominant in a well-trained model that no temperature setting, no persona prompt, no amount of stylistic instruction can shift the sampling process toward a <em>realistic</em> arithmetic error. The logit gap for the correct token is enormous.
The tail&#8212;where the wrong answers live&#8212;is so suppressed it might as well not exist.</p><p>What you can change with sampling parameters is the <em>presentation</em> of competence. What you cannot change, without specialized fine-tuning, is the underlying logical execution.</p><div><hr></div><p>Suppose the stochastic nature of the machine isn&#8217;t a design flaw to be suppressed. Suppose it&#8217;s a control surface.</p><p>This is the core insight of agentic AI: treat sampling parameters as variables in a feedback loop, not fixed settings on a dial.</p><p>An agentic system generates a draft under some initial settings. It evaluates that draft&#8212;using confidence scoring, constraint checking, factuality heuristics. Based on what it finds, it adjusts the settings and generates again.</p><p>The confidence signal can come directly from the logits. Define the gap at each generation step:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ETrn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ETrn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 424w, https://substackcdn.com/image/fetch/$s_!ETrn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 848w, https://substackcdn.com/image/fetch/$s_!ETrn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 
1272w, https://substackcdn.com/image/fetch/$s_!ETrn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ETrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png" width="1456" height="127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:127,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15319,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://nikbearbrown.substack.com/i/188425420?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ETrn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 424w, https://substackcdn.com/image/fetch/$s_!ETrn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 848w, 
https://substackcdn.com/image/fetch/$s_!ETrn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 1272w, https://substackcdn.com/image/fetch/$s_!ETrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3176da0-c48d-4b3a-8e5c-416da4c7916a_2128x186.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Average these gaps across the full generation. A low average means the model was frequently uncertain&#8212;the distribution was often nearly flat, the draws were often close calls. A high average means it was decisive throughout.</p><p>The adaptive policy follows naturally. High confidence &#8594; modest temperature, allow Top-P for fluency. Low confidence &#8594; lower temperature to reduce randomness, or switch to a verify-then-write mode. Creative task &#8594; raise temperature in designated sections, keep it low in factual sections.</p><pre><code><code>given prompt c
settings := (T=0.7, top_p=0.9)

draft := generate(c, settings)
score := evaluate(draft)   # constraints, hallucination heuristics, etc.
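# one way evaluate() can derive low_confidence from the average
# logit gap defined above (the 2.0 threshold is illustrative):
#   avg_gap := mean over steps t of (top-1 logit minus top-2 logit)
#   score.low_confidence := avg_gap < 2.0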

if score.low_confidence:
    settings.T := max(0.2, settings.T - 0.3)
    settings.top_p := min(settings.top_p, 0.8)   # tighten the nucleus too
    draft2 := generate(c, settings)

if score.too_boring and task_is_creative:
    settings.T := settings.T + 0.3
    settings.top_p := min(0.99, settings.top_p + 0.05)
    draft3 := generate(c, settings)

return best(draft, draft2, draft3)   # skipping any draft not generated</code></code></pre><p>The agent isn&#8217;t adding new knowledge. It&#8217;s steering sampling based on feedback. The difference between a model and an agent isn&#8217;t intelligence&#8212;it&#8217;s the feedback loop.</p><p>This approach has measurable payoffs. Systems using self-consistency decoding&#8212;generating multiple diverse reasoning paths and selecting by majority vote&#8212;show +17.7% accuracy gains on tool-based factual datasets. The stochasticity is not eliminated. It&#8217;s harnessed for exploration, then collapsed via aggregation into a reliable answer.</p><div><hr></div><p>The Wonder Woman experiment comes full circle here.</p><p>At T = 0, the model is not thinking. It is converging on the most probable continuation of its training data. The answer is reliable but brittle&#8212;a single path through a probability landscape that was shaped to produce exactly this output.</p><p>At T = 1.2, the model is not hallucinating randomly. It is sampling from regions of the distribution that exist, that were shaped by real training data, but that are rarely the right answer to this question. <em>Bizarro Wonder Woman</em> is in there because the model has read about Bizarro Wonder Woman. The problem isn&#8217;t invention&#8212;it&#8217;s sampling too far into the legitimate but irrelevant.</p><p>The design insight: <em>you are not configuring an intelligence. You are reshaping a distribution.</em> The intelligence&#8212;such as it is&#8212;emerges from the shape of what was learned. Your job, as a user or a system builder, is to choose the right sampling regime for the task at hand.</p><p>For auditable facts: low temperature, tight truncation, verify outputs. For brainstorming: moderate temperature, wide nucleus, expect variance. For creative synthesis: high temperature in the creative sections, low temperature in the factual anchors, aggregate across multiple draws.</p><p>The model is a stochastic machine.
It has always been a stochastic machine. The only question is whether you are steering it or just hoping.</p><div><hr></div><h2>LLM Exercises</h2><p><strong>Exercise 1 &#8212; Temperature Variance Lab</strong></p><p>Choose a prompt with both a factual component (a specific number, date, or name) and a stylistic component (a tone, a metaphor, a voice). Run N = 20 generations at T &#8712; {0, 0.7, 1.2} with identical truncation settings.</p><p>Compute the unique completion rate:</p><p>U = (number of unique outputs) / N</p><p>Also compute the average pairwise Levenshtein distance across completions at each temperature. Plot U and mean distance as functions of T.</p><p><em>Prompt to Claude:</em> &#8220;I&#8217;m running this experiment and have my completion data as a list of strings. Write Python code that computes U and the average pairwise Levenshtein distance for a list of N text completions. Then help me interpret the results&#8212;what does a U close to 1.0 mean for system reliability versus a U close to 1/N?&#8221;</p><div><hr></div><p><strong>Exercise 2 &#8212; Top-K vs. Top-P Failure Mode Mapping</strong></p><p>Hold temperature fixed at T = 0.7. Vary K &#8712; {3, 20, 200} and p &#8712; {0.5, 0.9, 0.99} across the same creative writing prompt.</p><p>Your goal is to map the <em>failure modes</em>, not just the outputs. For each setting, identify whether the failure (if any) is: (a) repetition and brittleness, (b) incoherence and drift, or (c) neither.</p><p><em>Prompt to Claude:</em> &#8220;Here are six completions from a creative writing prompt under different Top-K and Top-P settings. For each one, tell me: is this failure mode (a) repetitive/brittle, (b) incoherent/drifting, or (c) neither?
Then explain what the sampling setting was likely doing mathematically to produce that outcome.&#8221;</p><div><hr></div><p><strong>Exercise 3 &#8212; Expert Default Bias Investigation</strong></p><p>Replicate the proficiency role-play experiment. Prompt a capable LLM (GPT-4o, Claude, or equivalent) to simulate four students&#8212;A, B, C, and D proficiency&#8212;solving the same arithmetic problem. Record not just the final answer but the intermediate steps.</p><p>Now try a <em>harder</em> problem&#8212;one where a realistic intermediate error is possible (e.g., a multi-step percentage calculation). Does the Expert Default Bias hold? At what point does the model begin to generate plausible errors?</p><p><em>Prompt to Claude:</em> &#8220;I&#8217;ve run the proficiency role-play experiment and collected outputs. Help me analyze whether the model&#8217;s logical execution actually changed between proficiency levels, or only the surface presentation. What statistical test would distinguish genuine logical variation from stylistic variation?&#8221;</p><div><hr></div><p><strong>Exercise 4 &#8212; Design an Agentic Sampling Policy</strong></p><p>Define a two-phase generation task: (a) produce a structured outline for a technical explanation, (b) write the final explanation for a non-expert audience.</p><p>For each phase, specify: temperature, truncation method (Top-K or Top-P), number of samples to generate, and any evaluation criterion used to select among samples.</p><p>Justify each choice in terms of the confidence signal &#916;<sub>t</sub> and the entropy H(p<sub>T</sub>) you would expect in each phase.</p><p><em>Prompt to Claude:</em> &#8220;Review my agentic sampling policy specification.
For each phase, tell me: (1) whether my temperature choice is appropriate given the expected entropy of the task, (2) whether Top-K or Top-P is better suited and why, and (3) what evaluation criterion you would use to select the best output from multiple samples.&#8221;</p><div><hr></div><p><strong>Exercise 5 &#8212; CASE Bandit Framing</strong></p><p>You are building a few-shot prompt for a classification task. You have 20 candidate examples and need to choose 5. Brute-force search requires testing C(20, 5) = 15,504 combinations.</p><p>Frame this as a multi-arm bandit problem. Define: what is the &#8220;arm,&#8221; what is the &#8220;reward,&#8221; what is the exploration-exploitation tradeoff, and how would you implement a simple &#949;-greedy strategy to reduce the number of required LLM calls?</p><p><em>Prompt to Claude:</em> &#8220;I&#8217;ve framed my exemplar selection problem as a multi-arm bandit with arms defined as [your definition]. Help me implement a basic &#949;-greedy search that: (1) starts with random sampling of arm subsets, (2) exploits the current best arm with probability 1 &#8722; &#949;, and (3) tracks running accuracy estimates. Then explain where the CASE framework&#8217;s gap-index strategy improves on this naive approach.&#8221;</p><div><hr></div><p><strong>Exercise 6 &#8212; Wonder Woman Stochastic Audit</strong></p><p>Use the Wonder Woman/Boeing 737 prompt as a benchmark.
Run it 10 times at T = 0 and 10 times at T = 1.0.</p><p>For each output, classify every factual claim as: (a) correct and consistent, (b) correct but phrased differently, (c) correct but with irrelevant additions, or (d) factually drifted.</p><p>Compute the drift rate D = (number of category-(d) outputs) / N at each temperature.</p><p><em>Prompt to Claude:</em> &#8220;Here are my 20 Wonder Woman/Boeing completions with their temperature settings. Classify each factual claim using the four-category scheme and compute the drift rate at each temperature. Then explain: at what point in the probability distribution does &#8216;irrelevant but accurate&#8217; (category c) transition to &#8216;factually drifted&#8217; (category d)? What does this tell us about where the model&#8217;s knowledge of these two facts actually lives in its weights?&#8221;</p>]]></content:encoded></item><item><title><![CDATA[Prompt Engineering with LLMs]]></title><description><![CDATA[A Rigorous Framework for Designing, Evaluating, and Scaling Language Model Interactions]]></description><link>https://agenticreinforcementlearning.substack.com/p/prompt-engineering-with-llms</link><guid isPermaLink="false">https://agenticreinforcementlearning.substack.com/p/prompt-engineering-with-llms</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Wed, 18 Feb 2026 20:53:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Alpq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F420f4c55-6bac-4a7f-a289-344e1058f2b9_3458x3458.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>A Rigorous Framework for Designing, Evaluating, and Scaling Language Model Interactions</h3><p><strong>Author:</strong> Professor Nik Bear Brown<br><strong>Institution:</strong> Northeastern University, College of Engineering<br><strong>Series:</strong> INFO 7375 &#8212; Prompt
Engineering for Generative AI</p><div><hr></div><h2>Table of Contents</h2><div><hr></div><h3>Preface: Why Prompt Engineering Is an Engineering Discipline</h3><div><hr></div><h2>PART I &#8212; FOUNDATIONS: HOW LANGUAGE MODELS THINK (AND DON&#8217;T)</h2><div><hr></div><p><strong>Chapter 1: The Stochastic Machine &#8212; Understanding LLM Output Generation</strong></p><ul><li><p><strong>Core Claim:</strong> LLMs do not retrieve answers &#8212; they sample from learned probability distributions, making every output a probabilistic event, not a deterministic lookup.</p></li><li><p><strong>Logical Method:</strong> Deductive reasoning from first principles of softmax probability and logit distributions to explain temperature, Top-K, and Top-P sampling mechanics.</p></li><li><p><strong>Methodological Soundness:</strong> Grounded in the mathematical formalism of transformer decoding; claims are falsifiable through controlled temperature experiments producing measurable output variance.</p></li><li><p><strong>Use of LLMs:</strong> Demonstrations using the same prompt at temperature 0, 0.7, and 1.2 to empirically observe the distribution-flattening effect; &#8220;Wonder Woman&#8221; stochastic variance case study.</p></li><li><p><strong>Use of Agentic AI:</strong> Agentic loops that adaptively adjust sampling parameters mid-task based on confidence scoring of prior outputs.</p></li></ul><div><hr></div><p><strong>Chapter 2: Hallucination &#8212; When Plausibility Beats Truth</strong></p><ul><li><p><strong>Core Claim:</strong> Hallucination is not a bug to be patched but an emergent property of next-token prediction trained on human text, where fluency and factual grounding are orthogonal objectives.</p></li><li><p><strong>Logical Method:</strong> Inductive reasoning from documented hallucination cases (fabricated citations, false statistics) to a general model of why plausibility diverges from accuracy.</p></li><li><p><strong>Methodological Soundness:</strong> Uses 
reproducible prompt experiments that reliably elicit hallucination; distinguishes between factual hallucination, attribution hallucination, and coherence hallucination as distinct failure modes.</p></li><li><p><strong>Use of LLMs:</strong> Live prompt experiments asking models about obscure facts, forcing citation generation, and auditing outputs against ground truth.</p></li><li><p><strong>Use of Agentic AI:</strong> Fact Check List Pattern embedded in agentic pipelines; agents that cross-reference claims against retrieval systems before surfacing outputs.</p></li></ul><div><hr></div><p><strong>Chapter 3: The Chinese Room and the Limits of Syntax</strong></p><ul><li><p><strong>Core Claim:</strong> LLMs are powerful symbol manipulators, but Searle&#8217;s Chinese Room argument reminds us that syntactic competence does not entail semantic understanding &#8212; a distinction with profound implications for prompt design.</p></li><li><p><strong>Logical Method:</strong> Philosophical deduction from Searle&#8217;s thought experiment mapped onto transformer architecture; argument by analogy and systematic counter-argument.</p></li><li><p><strong>Methodological Soundness:</strong> Engages with standard objections (Systems Reply, Robot Reply) and situates them within empirical NLP findings on commonsense reasoning failures.</p></li><li><p><strong>Use of LLMs:</strong> Prompts designed to probe &#8220;understanding&#8221; vs. 
pattern matching &#8212; e.g., Winograd schemas, counterfactual reasoning tasks, and symbol substitution experiments.</p></li><li><p><strong>Use of Agentic AI:</strong> Discussion of how agentic AI architectures (tool-use, memory, planning) partially address the Chinese Room limitation by anchoring symbolic operations to real-world feedback loops.</p></li></ul><div><hr></div><p><strong>Chapter 4: Sycophancy and Computational Skepticism</strong></p><ul><li><p><strong>Core Claim:</strong> LLMs are trained to maximize approval, not accuracy &#8212; making sycophancy a systematic bias that users must actively counteract through computational skepticism.</p></li><li><p><strong>Logical Method:</strong> Causal reasoning tracing sycophancy from RLHF reward structures through to output-level behaviors; proposes computational skepticism as the epistemically correct response.</p></li><li><p><strong>Methodological Soundness:</strong> Empirically grounded in documented RLHF alignment research; sycophancy is operationalized as a measurable divergence between model agreement rate and ground-truth accuracy rate.</p></li><li><p><strong>Use of LLMs:</strong> Experiments where the user asserts a false claim and observes whether the model capitulates; A/B testing prompts with and without explicit anti-sycophancy constraints.</p></li><li><p><strong>Use of Agentic AI:</strong> Agents designed with dissenting sub-agents (red-team agents) that challenge primary agent outputs before they surface to the user.</p></li></ul><div><hr></div><h2>PART II &#8212; PROMPT ENGINEERING FRAMEWORKS</h2><div><hr></div><p><strong>Chapter 5: The Architect Mindset &#8212; Designing Prompts as Systems</strong></p><ul><li><p><strong>Core Claim:</strong> Effective prompt engineering requires the Architect Mindset &#8212; designing the full prompt system (root prompts, constraints, persona, format) rather than crafting individual queries ad hoc.</p></li><li><p><strong>Logical Method:</strong> Analogical 
reasoning from software architecture to prompt architecture; introduces the distinction between Architect-level design decisions and User-level interaction patterns.</p></li><li><p><strong>Methodological Soundness:</strong> Framework validated through comparative studies of ad hoc prompting vs. structured prompt system design across multiple task types.</p></li><li><p><strong>Use of LLMs:</strong> Constructing and testing root prompts that persist behavioral constraints across multi-turn conversations; the Wordsville case study as a worked example.</p></li><li><p><strong>Use of Agentic AI:</strong> System-level prompt stacks in agentic pipelines where the Architect layer governs agent scope, persona, and guardrails across an entire autonomous task.</p></li></ul><div><hr></div><p><strong>Chapter 6: Persona Patterns &#8212; Shaping Who the Model Is and Who It Speaks To</strong></p><ul><li><p><strong>Core Claim:</strong> Two distinct and frequently confused patterns &#8212; the Persona Pattern (who the AI <em>is</em>) and the Audience Persona Pattern (who the AI speaks <em>to</em>) &#8212; each activate different latent behaviors in the model and must be applied with precision.</p></li><li><p><strong>Logical Method:</strong> Taxonomic differentiation through definition, contrast, and worked examples; logical analysis of how each pattern affects the model&#8217;s internal framing of the task.</p></li><li><p><strong>Methodological Soundness:</strong> Distinctions are operationally testable: swapping Persona for Audience Persona in identical prompts produces measurably different outputs in vocabulary, register, and depth.</p></li><li><p><strong>Use of LLMs:</strong> The Jane Austen prompt as a Persona Pattern exemplar; the non-technical executive explanation as an Audience Persona exemplar; side-by-side output comparison.</p></li><li><p><strong>Use of Agentic AI:</strong> Assigning persistent personas to specialized sub-agents (e.g., &#8220;You are a compliance 
auditor&#8221;) within a multi-agent orchestration framework.</p></li></ul><div><hr></div><p><strong>Chapter 7: Constraint Engineering &#8212; Negative Prompts, Root Prompts, and Semantic Focus</strong></p><ul><li><p><strong>Core Claim:</strong> What you tell a model <em>not</em> to do is as architecturally important as what you tell it to do &#8212; negative constraints sharpen semantic focus and reduce output entropy in ways positive instructions alone cannot achieve.</p></li><li><p><strong>Logical Method:</strong> Contrastive analysis demonstrating how negative constraints reduce the probability mass on undesired output regions; grounded in the mechanics of conditional probability in token generation.</p></li><li><p><strong>Methodological Soundness:</strong> Controlled experiments measuring output variance and constraint adherence with and without negative prompts across diverse task types.</p></li><li><p><strong>Use of LLMs:</strong> Examples of prompts with and without negative constraints (&#8220;avoid jargon,&#8221; &#8220;do not hedge,&#8221; &#8220;never use bullet points&#8221;) with annotated output comparisons.</p></li><li><p><strong>Use of Agentic AI:</strong> Encoding negative constraints as persistent guardrails in root prompts of agentic systems; Semantic Filter agents that post-process outputs to enforce content policies.</p></li></ul><div><hr></div><p><strong>Chapter 8: PAST, PLFR, and Structured Prompt Frameworks</strong></p><ul><li><p><strong>Core Claim:</strong> Structured prompt frameworks &#8212; PAST (Problem, Action, Steps, Task) and PLFR (Prompt, Logic, Format, Result) &#8212; impose logical scaffolding that reduces ambiguity and improves output reliability at scale.</p></li><li><p><strong>Logical Method:</strong> Deductive derivation of each framework&#8217;s component logic; analysis of how structured decomposition maps complex user intent onto model-interpretable instruction sequences.</p></li><li><p><strong>Methodological
Soundness:</strong> Frameworks evaluated on prompt clarity, output reproducibility, and task completion rate compared to unstructured prompts; edge cases and failure modes documented.</p></li><li><p><strong>Use of LLMs:</strong> Applying PAST and PLFR to real professional tasks (data analysis requests, report generation, code review) with before/after output quality assessments.</p></li><li><p><strong>Use of Agentic AI:</strong> Using PAST as the task decomposition backbone for agentic planning modules; PLFR as a structured format for agent-to-agent communication within a pipeline.</p></li></ul><div><hr></div><h2>PART III &#8212; ADVANCED PATTERNS AND ARCHITECTURES</h2><div><hr></div><p><strong>Chapter 9: Interaction Patterns &#8212; Flipping the Conversation</strong></p><ul><li><p><strong>Core Claim:</strong> The most powerful prompt patterns are not declarative but interactive &#8212; patterns like Flipped Interaction, Ask for Input, and Cognitive Verifier shift the model from passive responder to active collaborator, dramatically improving output quality on ill-specified tasks.</p></li><li><p><strong>Logical Method:</strong> Taxonomic classification of interaction patterns by their epistemic function (information elicitation, task clarification, reasoning decomposition); argument that interactivity reduces underdetermination of the prompt.</p></li><li><p><strong>Methodological Soundness:</strong> Empirical comparison of direct-answer prompting vs. 
interactive pattern prompting on open-ended tasks; measured by output relevance and task completion accuracy.</p></li><li><p><strong>Use of LLMs:</strong> Worked examples of Flipped Interaction (model interviews the user), Cognitive Verifier (model generates sub-questions), and Question Refinement (model proposes a better question).</p></li><li><p><strong>Use of Agentic AI:</strong> Multi-turn agentic workflows where agents autonomously apply Ask for Input before executing tasks, reducing error rates from ambiguous instructions.</p></li></ul><div><hr></div><p><strong>Chapter 10: ReAct &#8212; Grounding Reasoning in the Real World</strong></p><ul><li><p><strong>Core Claim:</strong> The ReAct (Reasoning and Acting) framework solves a fundamental limitation of pure chain-of-thought prompting by interleaving thought with action &#8212; allowing models to query external tools and ground each reasoning step in real observations rather than internal hallucination.</p></li><li><p><strong>Logical Method:</strong> Mechanistic analysis of the Thought &#8594; Action &#8594; Observation loop; logical argument for why external grounding reduces cumulative reasoning error compared to closed-chain inference.</p></li><li><p><strong>Methodological Soundness:</strong> Referenced against the original ReAct paper (Yao et al., 2022); brittleness under few-shot distribution shift is explicitly documented as a known limitation.</p></li><li><p><strong>Use of LLMs:</strong> Step-by-step ReAct trace examples for web search, database lookup, and code execution tasks; few-shot example construction and its effect on task accuracy.</p></li><li><p><strong>Use of Agentic AI:</strong> ReAct as the core reasoning architecture for autonomous agents; multi-step agentic tasks (e.g., research summarization, data validation pipelines) implemented as ReAct loops with tool registries.</p></li></ul><div><hr></div><p><strong>Chapter 11: The Prompt Stack &#8212; Scalable AI 
Infrastructure</strong></p><ul><li><p><strong>Core Claim:</strong> Production AI systems cannot be built on single-turn prompts &#8212; the Prompt Stack architecture (pre-prompts, meta-prompts, user prompts, output filters) provides the modular infrastructure required for maintainable, scalable, and auditable LLM deployments.</p></li><li><p><strong>Logical Method:</strong> Systems engineering reasoning applied to prompt design; argument by analogy to software layered architecture (OS, middleware, application).</p></li><li><p><strong>Methodological Soundness:</strong> Framework evaluated against real deployment scenarios (chatbots, RAG pipelines, multi-agent systems) for modularity, debuggability, and component replaceability.</p></li><li><p><strong>Use of LLMs:</strong> Building a complete Prompt Stack for a realistic use case (e.g., a university academic advisor bot); demonstrating how each stack layer can be updated independently.</p></li><li><p><strong>Use of Agentic AI:</strong> Prompt Stack as the governance layer for multi-agent systems &#8212; pre-prompts encode agent identity, meta-prompts encode task routing logic, user prompts carry dynamic context.</p></li></ul><div><hr></div><p><strong>Chapter 12: Chain-of-Thought, Few-Shot, and Meta Language Creation</strong></p><ul><li><p><strong>Core Claim:</strong> Chain-of-thought prompting, few-shot exemplars, and Meta Language Creation are complementary techniques that exploit the model&#8217;s in-context learning capacity &#8212; but each has distinct failure modes that disciplined prompt engineers must anticipate.</p></li><li><p><strong>Logical Method:</strong> Comparative analysis of zero-shot, few-shot, and chain-of-thought prompting on a common benchmark task set; formal definition of Meta Language Creation as a prompt-level DSL (domain-specific language).</p></li><li><p><strong>Methodological Soundness:</strong> Claims grounded in empirical NLP literature (Wei et al. on chain-of-thought; Brown et al. 
on few-shot); failure modes (exemplar bias, format overfitting) are operationalized and demonstrated.</p></li><li><p><strong>Use of LLMs:</strong> Constructing few-shot exemplars for a structured data extraction task; building a Meta Language for a recurring analytical workflow (e.g., &#8220;ANALYZE:[text] FORMAT:[table] VERIFY:[sources]&#8221;).</p></li><li><p><strong>Use of Agentic AI:</strong> Agents that dynamically construct chain-of-thought prompts for sub-tasks; Meta Language as a structured communication protocol between orchestrator and worker agents.</p></li></ul><div><hr></div><h2>PART IV &#8212; FINE-TUNING, ALIGNMENT, AND SCALING</h2><div><hr></div><p><strong>Chapter 13: SFT vs. RAG &#8212; When to Bake Knowledge In vs. Retrieve It</strong></p><ul><li><p><strong>Core Claim:</strong> Supervised Fine-Tuning and Retrieval-Augmented Generation are not competing alternatives but complementary strategies with distinct cost-benefit profiles &#8212; the choice between them is an engineering decision driven by knowledge volatility, latency requirements, and update frequency.</p></li><li><p><strong>Logical Method:</strong> Decision-theoretic framework comparing SFT and RAG on five axes: knowledge freshness, inference cost, training cost, hallucination risk, and deployment complexity.</p></li><li><p><strong>Methodological Soundness:</strong> Framework validated through case studies spanning static knowledge domains (SFT preferred) and dynamic knowledge domains (RAG preferred); hybrid architectures addressed.</p></li><li><p><strong>Use of LLMs:</strong> Prompt experiments illustrating the knowledge boundary of a pre-trained model vs. a RAG-augmented pipeline on current events and proprietary data tasks.</p></li><li><p><strong>Use of Agentic AI:</strong> Agentic systems that dynamically route queries to fine-tuned models vs. 
RAG pipelines based on query classification; the &#8220;80 Days to Stay&#8221; SEC data retrieval system as a worked case study.</p></li></ul><div><hr></div><p><strong>Chapter 14: LoRA and QLoRA &#8212; Parameter-Efficient Fine-Tuning</strong></p><ul><li><p><strong>Core Claim:</strong> LoRA and QLoRA democratize fine-tuning by decomposing weight updates into low-rank matrices (LoRA) and combining that decomposition with 4-bit quantization (QLoRA) &#8212; reducing GPU memory requirements by orders of magnitude without proportional loss in task performance.</p></li><li><p><strong>Logical Method:</strong> Mathematical derivation of the low-rank decomposition (W = W&#8320; + BA); analysis of how NF4 quantization and Paged Optimizers address memory bottlenecks in QLoRA.</p></li><li><p><strong>Methodological Soundness:</strong> Claims grounded in the original LoRA (Hu et al., 2021) and QLoRA (Dettmers et al., 2023) papers; parameter reduction ratios (10&#215;&#8211;100&#215;) are empirically sourced and reproducible.</p></li><li><p><strong>Use of LLMs:</strong> Step-by-step walkthrough of fine-tuning a base LLM with QLoRA on a custom instruction dataset; comparison of fine-tuned vs.
base model output on held-out test prompts.</p></li><li><p><strong>Use of Agentic AI:</strong> Deploying LoRA-adapted models as specialized agents (e.g., a domain-specific compliance agent); adapter swapping in multi-agent systems where different tasks route to different LoRA modules.</p></li></ul><div><hr></div><p><strong>Chapter 15: RLHF, DPO, and Alignment &#8212; Training Models to Do What We Mean</strong></p><ul><li><p><strong>Core Claim:</strong> RLHF is the dominant method for aligning LLM behavior with human values, but it introduces systematic risks &#8212; sycophancy, reward hacking, and value misspecification &#8212; that DPO and Constitutional AI partially mitigate through architectural alternatives.</p></li><li><p><strong>Logical Method:</strong> Causal chain analysis from reward model construction through policy gradient updates to observed alignment failures; formal comparison of RLHF, DPO, and Constitutional AI objective functions.</p></li><li><p><strong>Methodological Soundness:</strong> Grounded in Ouyang et al. (InstructGPT), Rafailov et al. (DPO), and Bai et al. (Constitutional AI); failure modes are empirically documented, not merely theoretical.</p></li><li><p><strong>Use of LLMs:</strong> Experiments probing sycophancy in RLHF-trained vs. 
DPO-trained models; prompts designed to elicit reward hacking behavior; Fact Check List as a prompt-level sycophancy mitigation.</p></li><li><p><strong>Use of Agentic AI:</strong> Red-team agents in alignment evaluation pipelines; agentic Constitutional AI implementations where agents self-critique outputs against a defined principle set before delivery.</p></li></ul><div><hr></div><p><strong>Chapter 16: Catastrophic Forgetting, Chinchilla, and the Science of Scaling</strong></p><ul><li><p><strong>Core Claim:</strong> Scaling language models is not simply a matter of adding parameters &#8212; the Chinchilla Scaling Law establishes that optimal performance requires proportional scaling of both model size and training data, while catastrophic forgetting imposes hard constraints on sequential fine-tuning.</p></li><li><p><strong>Logical Method:</strong> Empirical induction from scaling law experiments (Hoffmann et al., 2022) to general principles; mechanistic analysis of catastrophic forgetting as gradient interference in neural weight space.</p></li><li><p><strong>Methodological Soundness:</strong> Chinchilla compute-optimal ratio (approximately 20 tokens per parameter) is empirically derived and contrasted against the GPT-3 paradigm; inference-aware scaling is introduced as a practical complement.</p></li><li><p><strong>Use of LLMs:</strong> Case studies comparing models trained under Chinchilla-optimal vs. 
under-data-trained regimes on downstream benchmarks; prompts that surface capability degradation from catastrophic forgetting.</p></li><li><p><strong>Use of Agentic AI:</strong> Inference-aware scaling decisions for deploying agents at different capability tiers; continual learning architectures for agents that must update without forgetting prior task competencies.</p></li></ul><div><hr></div><h2>PART V &#8212; SYNTHESIS AND PROFESSIONAL PRACTICE</h2><div><hr></div><p><strong>Chapter 17: Cross-Module Integration &#8212; Connecting Prompt Design to Model Behavior</strong></p><ul><li><p><strong>Core Claim:</strong> The most consequential prompt engineering decisions sit at the intersection of modules &#8212; understanding how RLHF-induced sycophancy interacts with Persona Patterns, or how ReAct brittleness connects to few-shot design, unlocks the next level of engineering judgment.</p></li><li><p><strong>Logical Method:</strong> Systematic cross-mapping of concepts across all four course modules; identification of second-order interaction effects that single-module analysis misses.</p></li><li><p><strong>Methodological Soundness:</strong> Each cross-module claim is supported by a concrete, reproducible prompt experiment that isolates the interaction effect from confounding variables.</p></li><li><p><strong>Use of LLMs:</strong> Multi-concept prompt designs that intentionally activate cross-module dynamics; the defense bot and Humanitarians AI volunteer onboarding system as integrated case studies.</p></li><li><p><strong>Use of Agentic AI:</strong> Full agentic system design exercises that require simultaneous application of prompt stacks, fine-tuned models, ReAct reasoning, and alignment-aware output evaluation.</p></li></ul><div><hr></div><p><strong>Chapter 18: Ethical Prompt Engineering &#8212; Power, Bias, and Responsibility</strong></p><ul><li><p><strong>Core Claim:</strong> Prompt engineers wield disproportionate power over model behavior, and with that power 
comes an obligation to understand how prompts can amplify bias, marginalize voices, or weaponize AI systems &#8212; making ethics not an addendum but a design constraint.</p></li><li><p><strong>Logical Method:</strong> Normative ethical reasoning (consequentialist and deontological frameworks) applied to prompt design decisions; argument that ethical constraints are formally analogous to negative prompt constraints.</p></li><li><p><strong>Methodological Soundness:</strong> Grounded in documented cases of prompt injection, jailbreaking, persona manipulation, and biased output amplification; proposes auditable prompt design practices as structural mitigations.</p></li><li><p><strong>Use of LLMs:</strong> Prompt auditing exercises that surface bias in output distributions; red-teaming prompts to expose safety vulnerabilities in deployed systems.</p></li><li><p><strong>Use of Agentic AI:</strong> Ethical guardrail architectures for autonomous agents; Constitutional AI principles as an operationalized ethics layer; human-in-the-loop checkpoints in high-stakes agentic workflows.</p></li></ul><div><hr></div><p><strong>Chapter 19: Building Production Prompt Systems &#8212; From Prototype to Pipeline</strong></p><ul><li><p><strong>Core Claim:</strong> A prompt that works in a notebook is not a production system &#8212; moving from prototype to pipeline requires versioning, evaluation frameworks, monitoring, and modular architecture that treats prompts as first-class software artifacts.</p></li><li><p><strong>Logical Method:</strong> Software engineering principles (modularity, testability, observability) translated into prompt system design requirements; lifecycle model from initial design through deployment and iteration.</p></li><li><p><strong>Methodological Soundness:</strong> Framework instantiated through a complete worked example (a production student advising agent) with evaluation rubrics, version history, and monitoring dashboards.</p></li><li><p><strong>Use of 
LLMs:</strong> Prompt versioning and regression testing strategies; automated output evaluation using LLM-as-judge pipelines; A/B testing frameworks for prompt variants.</p></li><li><p><strong>Use of Agentic AI:</strong> Full agentic pipeline architecture with orchestration, tool registration, state management, error recovery, and human escalation paths; deployment patterns for the Mycroft investment intelligence and Madison marketing intelligence frameworks.</p></li></ul><div><hr></div><p><strong>Chapter 20: The Future of Prompt Engineering &#8212; Inference-Aware Scaling, Multimodal Prompting, and Beyond</strong></p><ul><li><p><strong>Core Claim:</strong> Prompt engineering is a field in rapid transition &#8212; inference-aware scaling, multimodal inputs, long-context architectures, and increasingly autonomous agents are reshaping what &#8220;a prompt&#8221; even means, requiring practitioners to build adaptive mental models rather than fixed playbooks.</p></li><li><p><strong>Logical Method:</strong> Extrapolative reasoning from current technical trajectories; distinguishes between near-term engineering predictions (high confidence) and long-term capability claims (appropriately hedged).</p></li><li><p><strong>Methodological Soundness:</strong> Grounded in published research on inference scaling (o1/o3 paradigm), multimodal LLMs, and agent benchmarking; speculative claims are explicitly flagged and bounded.</p></li><li><p><strong>Use of LLMs:</strong> Multimodal prompt experiments combining text, image, and structured data; long-context prompting strategies for document-scale reasoning tasks.</p></li><li><p><strong>Use of Agentic AI:</strong> Vision of fully autonomous prompt-engineering agents that iteratively refine their own system prompts based on output evaluation feedback &#8212; and the risks this introduces for alignment and interpretability.</p></li></ul><div><hr></div><h2>Appendices</h2><p><strong>Appendix A:</strong> Prompt Pattern Reference 
Card<br><strong>Appendix B:</strong> Sampling Parameter Cheat Sheet (Temperature, Top-K, Top-P)<br><strong>Appendix C:</strong> Fine-Tuning Decision Framework (SFT vs. RAG vs. LoRA vs. QLoRA)<br><strong>Appendix D:</strong> Glossary of Core Concepts<br><strong>Appendix E:</strong> Annotated Bibliography and Further Reading<br><strong>Appendix F:</strong> INFO 7375 Midterm Study Guide (Modules 1, 2, 3, and 6)</p><div><hr></div><p><em>&#8220;The question is never whether you can get the model to say something. The question is whether you understand why it said it &#8212; and what that means for the next ten outputs you haven&#8217;t seen yet.&#8221;</em> &#8212; Professor Nik Bear Brown</p>]]></content:encoded></item></channel></rss>