Challenge Design Guidelines

Quality rules and anti-patterns for building challenges that teach: challenges that don't hint, don't leak, and can't be bypassed without genuine understanding.

This page distills the design principles that separate good Kubeasy challenges from mediocre ones. Apply these rules at every layer of a challenge: descriptions, objectives, manifests, and policies.

1. Mystery preservation — everywhere

The learner must not know what's wrong before they investigate. This applies at every layer, not just the description field.

description and initialSituation

Describe what the learner will observe, not why it's happening.

# BAD — names the fix mechanism
initialSituation: |
  The job's activeDeadlineSeconds is set to 30, but the script takes 60 seconds.
  kubectl describe job shows "DeadlineExceeded".

# GOOD — observable symptoms only
initialSituation: |
  A nightly job has been consistently failing.
  The pod runs for a while, then gets terminated before finishing.
  Check the job events and status for clues.

Specific things to never include in initialSituation:

The name of the Kubernetes event or error reason (DeadlineExceeded, OOMKilled, Evicted)
The exact field that needs changing (activeDeadlineSeconds, nodeSelector, secretKeyRef)
The duration or threshold value that defines the bug (takes about 60 seconds, limit is 32Mi)
Anything that tells the learner where to look before they've looked

Validation titles and descriptions

Titles are shown in the UI and CLI output. A title that names the fix hands the solution to the learner before they've validated anything.

# BAD — tells the learner what they needed to do
- key: memory-fix
  title: "Memory Limit Set to 256Mi"

# GOOD — describes the outcome only
- key: stable-operation
  title: "Stable Operation"

Kyverno policy messages

Kyverno error messages are shown verbatim when a learner tries a blocked operation. They must not hint at the fix or name the correct approach.

# BAD — "fix the resource limits instead" tells them exactly what to do
message: "Container image must be preserved - fix the resource limits instead"

# BAD — "add persistent storage instead" names the solution
message: "Cannot change the container image - add persistent storage instead"

# BAD — "Add ingress rules instead" describes the correct fix
message: "Deleting the NetworkPolicy is not allowed. Add ingress rules instead."

# GOOD — describes only what's blocked
message: "Cannot change the container image"
message: "The deny-all NetworkPolicy must not be deleted or modified."
message: "Only 1 replica is allowed"

YAML comments in manifests and policies

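Learners can read YAML comments wherever the manifest source is visible to them. (Comments are not stored by the API server, so they do not appear in kubectl get -o yaml output, but the source files themselves should be treated as learner-readable.) Any comment that reveals the root cause, the broken field, or the fix is a spoiler.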

# BAD — direct spoiler
nodeSelector:
  disktype: ssd  # BUG: This label doesn't exist on any node

# BAD — indirect hint
activeDeadlineSeconds: 30  # TODO: increase this

# GOOD — no comment (preferred)
nodeSelector:
  disktype: ssd

# GOOD — structural comment only
# Kyverno: preserves the challenge image to prevent bypasses

2. Validation depth — enforce the learning artifact

A validation that checks only behavioral outcomes (pod is Ready, logs show success) can be passed without possessing the target knowledge. The strongest challenges add a spec validation that checks for the structural change the learner needs to make.

The question to ask

"Can a learner pass all validations without ever touching the thing this challenge is supposed to teach?"

If the answer is yes (for example, a log validation can be faked, or a status check still passes after the resource is recreated from scratch), you need a spec validation.

Examples

Challenge concept	Behavioral check only (weak)	With spec check (strong)
Wire credentials from a Secret	Pod is Ready + logs show "Connected"	+ env contains `valueFrom.secretKeyRef.name: <secret>`
Persistent storage (PVC)	Volume mounted at `/data` + data survives restart	+ volumes contains `persistentVolumeClaim: {}`
Toleration for a tainted node	Pod is scheduled + Running	+ spec.template.spec.tolerations contains `key: <taint-key>`
Liveness probe configured	Pod is Ready	+ `spec.template.spec.containers[0].livenessProbe` exists

Spec validation patterns

# Assert a Secret reference is used
- path: spec.template.spec.containers[0].env
  contains:
    valueFrom:
      secretKeyRef:
        name: database-credentials

# Assert a PVC-backed volume (not hostPath or emptyDir)
- path: spec.template.spec.volumes
  contains:
    persistentVolumeClaim: {}

# Assert a toleration is present
- path: spec.template.spec.tolerations
  contains:
    key: dedicated

# Assert a field exists (liveness probe)
- path: spec.template.spec.containers[0].livenessProbe
  exists: true

The contains operator matches any element in a list where all specified key-value pairs are present. contains: {persistentVolumeClaim: {}} matches any volume entry that has a persistentVolumeClaim key, regardless of its value.
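To make the matching rule concrete, here is a small sketch of two volume lists evaluated against contains: {persistentVolumeClaim: {}}. The volume, ConfigMap, and claim names are hypothetical and chosen only for illustration.

```yaml
# Matches: the second entry has a persistentVolumeClaim key,
# so at least one element of the list satisfies the check.
volumes:
  - name: config
    configMap:
      name: app-config        # hypothetical ConfigMap
  - name: data
    persistentVolumeClaim:
      claimName: app-data     # hypothetical claim; the value is not inspected
---
# Does not match: no entry has a persistentVolumeClaim key,
# so an emptyDir (or hostPath) workaround fails the spec check.
volumes:
  - name: data
    emptyDir: {}
```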

3. Bypass protection — the full picture

Kyverno policies lock the immutable frame of the challenge. Think through every bypass path, not just the obvious one.

Always protect: container image and command

The image defines what the app does. If a learner can swap it, they can trivially pass behavioral checks. The command (and args when the application logic lives there) defines what the container runs.

rules:
  - name: preserve-image
    validate:
      message: "Cannot change the container image"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: app
                  image: "busybox:1.36"

  - name: preserve-command
    validate:
      message: "Container command must be preserved"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: api
                  command: ["/bin/sh"]

If the entire application script lives in args (e.g., a shell one-liner), protect args the same way you protect command. A learner can rewrite args to emit the expected log string without making any meaningful change.
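A minimal sketch of such a rule, written in the same style as the preserve-command rule above. The container name and the script in args are hypothetical placeholders; in a real challenge they would need to match the manifest exactly.

```yaml
rules:
  - name: preserve-args
    validate:
      message: "Container args must be preserved"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: app   # hypothetical container name
                  # Hypothetical one-liner. Kyverno pattern strings treat
                  # characters such as * ? | & specially, so a script that
                  # contains them may need a different kind of check.
                  args: ["-c", "echo 'Backup complete'"]
```

Note that the message states only what is blocked, in line with the policy message rules in section 1.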

NetworkPolicy challenges — block UPDATE too

Blocking DELETE on a NetworkPolicy without blocking UPDATE/PATCH means a learner can patch it to add a blanket allow rule, bypassing the need to create a targeted policy.

operations:
  - DELETE
  - UPDATE   # required — patch is an UPDATE
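For context, one way to express the same intent as a complete rule is sketched below. It checks request.operation in a deny condition instead of using the operations list shown above, so treat it as an alternative formulation rather than the exact rule used by any existing challenge. The NetworkPolicy name deny-all and the namespace placeholder are assumptions.

```yaml
# Assumes the enclosing policy sets background: false, since the rule
# reads admission request data.
- name: protect-deny-all-networkpolicy
  match:
    any:
      - resources:
          kinds: ["NetworkPolicy"]
          names: ["deny-all"]               # hypothetical policy name
          namespaces: ["<challenge-slug>"]
  validate:
    message: "The deny-all NetworkPolicy must not be deleted or modified."
    deny:
      conditions:
        any:
          # kubectl patch, edit, and apply all arrive as UPDATE requests
          - key: "{{ request.operation }}"
            operator: AnyIn
            value: ["DELETE", "UPDATE"]
```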

RBAC challenges — block built-in ClusterRoles

For challenges requiring a custom Role with specific verbs, a learner can bypass the learning by binding to the built-in view ClusterRole, which often coincidentally satisfies the permission checks (list pods/configmaps allowed, secrets denied).

- name: block-builtin-role-binding
  match:
    resources:
      kinds: ["RoleBinding"]
      namespaces: ["<challenge-slug>"]
  validate:
    message: "Using built-in ClusterRoles (view/edit/admin/cluster-admin) is not allowed"
    deny:
      conditions:
        all:
          - key: "{{ request.object.roleRef.kind }}"
            operator: Equals
            value: ClusterRole
          - key: "{{ request.object.roleRef.name }}"
            operator: In
            value: ["view", "edit", "admin", "cluster-admin"]
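As an illustration of what the rule above is meant to separate, the sketch below shows one RoleBinding it would reject and one it would allow. All object names (app-view, app-reader, app-sa) are hypothetical.

```yaml
# Rejected: roleRef points at the built-in view ClusterRole.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-view
  namespace: "<challenge-slug>"
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: "<challenge-slug>"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
---
# Allowed: roleRef points at a namespaced Role the learner wrote,
# so the learner has to choose the verbs and resources themselves.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader
  namespace: "<challenge-slug>"
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: "<challenge-slug>"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: app-reader
```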

Delete-and-recreate bypass

If a learner can delete a protected resource and recreate it from scratch with a correct spec, they bypass the diagnostic phase entirely. Protect the resources that define the broken scenario:

Jobs, CronJobs: protect image + command on the Job resource, not just the CronJob
Deployments: protect image (already common) + container name
Policies themselves: if using kind: Policy (namespace-scoped), a learner can delete the policy before recreating the Deployment — prefer kind: ClusterPolicy with namespaces: scoping for high-stakes challenges
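A sketch of the cluster-scoped variant for a Job, assuming the image and container name used in the earlier preserve-image example. The policy name is hypothetical, and the enforcement settings shown are one common configuration, not necessarily what existing challenges use.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy              # cluster-scoped, so not deletable from the challenge namespace
metadata:
  name: challenge-slug-protect   # hypothetical policy name
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: preserve-job-image
      match:
        any:
          - resources:
              kinds: ["Job"]
              namespaces: ["<challenge-slug>"]   # scope to the challenge only
      validate:
        message: "Cannot change the container image"
        pattern:
          spec:
            template:
              spec:
                containers:
                  - name: app
                    image: "busybox:1.36"
```

Because the rule is evaluated on every matching Job in the namespace, a Job that is deleted and recreated with a different image is rejected in the same way as an in-place edit.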

What NOT to protect

Never protect the fields the learner needs to change. The solution space must stay open:

Challenge concept	Leave open
Resource limits	`resources.limits`, `resources.requests`
Probes	`livenessProbe`, `readinessProbe`
Env vars from Secrets	`env[].valueFrom`
RBAC objects	`Role`, `RoleBinding` (except built-in ClusterRole bindings)
NetworkPolicies (new allow rules)	Creating new NetworkPolicy resources
Tolerations	`spec.template.spec.tolerations`

4. Challenge type accuracy

The type field sets learner expectations about what they'll find.

Type	Starting state	Learner expectation
`fix`	Something is broken and deployed	Diagnose why it's broken, repair it
`operate`	Infrastructure is running, nothing is broken	Create or configure a missing resource
`improve`	Works but isn't production-ready	Harden it — probes, limits, security
`migrate`	Working but outdated setup	Transform to a new pattern

A challenge where nothing is broken and the task is simply "create a missing Service" should be operate, not fix. Learners arriving at a fix challenge will waste time looking for something that isn't broken.

5. Description accuracy

The description must not only avoid hints — it must also be factually accurate.

# BAD — describes intermittent failure, but the deny-all NetworkPolicy
# blocks 100% of traffic (total outage, not partial)
description: "Some requests reach the backend, others time out — it's not consistent."

# GOOD — matches the actual behavior
description: "The backend is completely unreachable. Every request times out."

Inaccurate descriptions mislead learners into looking for intermittent issues, race conditions, or load-related problems when the reality is deterministic.

6. Common bypass vectors — quick reference

Scenario	Bypass path	Mitigation
Log validation	Rewrite `args` to emit the expected string without changing anything else	Protect `args` in Kyverno; add `spec` validation
RBAC challenge	Bind to built-in `view` ClusterRole via RoleBinding	Block RoleBindings referencing built-in ClusterRoles
NetworkPolicy challenge	PATCH the deny-all policy to add a blanket allow	Add `UPDATE` to blocked operations
Persistent storage	Use `hostPath` instead of PVC	Add `spec` check: volumes contains `persistentVolumeClaim: {}`
Scheduling challenge	Remove the node taint instead of adding a toleration	Add `spec` check: tolerations contains the taint key
Any challenge	Delete and recreate the resource with correct spec	Protect Job/CronJob image+command; consider ClusterPolicy
RBAC challenge	Delete and recreate the Deployment with a different ServiceAccount	Protect `serviceAccountName` in Kyverno
Image-swap bypass	`kubectl set image` to a trivially passing image	Preserve image in Kyverno (standard — always do this)
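The serviceAccountName mitigation in the table can be sketched in the same pattern style as the image rule in section 3. The ServiceAccount name app-sa is a hypothetical placeholder, and the rule is assumed to sit in a policy that already matches the challenge's Deployment.

```yaml
rules:
  - name: preserve-service-account
    validate:
      message: "The ServiceAccount must be preserved"
      pattern:
        spec:
          template:
            spec:
              serviceAccountName: app-sa   # hypothetical ServiceAccount name
```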
