Challenge Design Guidelines
Quality rules and anti-patterns for building challenges that teach rather than hint or leak, and that can't be bypassed without genuine understanding.
This page distills the design principles that separate good Kubeasy challenges from mediocre ones. Apply these rules at every layer of a challenge: descriptions, objectives, manifests, policies.
1. Mystery preservation — everywhere
The learner must not know what's wrong before they investigate. This applies at every layer, not just the description field.
description and initialSituation
Describe what the learner will observe, not why it's happening.
# BAD — names the fix mechanism
initialSituation: |
  The job's activeDeadlineSeconds is set to 30, but the script takes 60 seconds.
  kubectl describe job shows "DeadlineExceeded".
# GOOD — observable symptoms only
initialSituation: |
  A nightly job has been consistently failing.
  The pod runs for a while, then gets terminated before finishing.
  Check the job events and status for clues.
Specific things to never include in initialSituation:
- The name of the Kubernetes event or error reason (DeadlineExceeded, OOMKilled, Evicted)
- The exact field that needs changing (activeDeadlineSeconds, nodeSelector, secretKeyRef)
- The duration or threshold value that defines the bug (takes about 60 seconds, limit is 32Mi)
- Anything that tells the learner where to look before they've looked
Validation titles and descriptions
Titles are shown in the UI and CLI output. A title that names the fix hands the solution to the learner before they've validated anything.
# BAD — tells the learner what they needed to do
- key: memory-fix
  title: "Memory Limit Set to 256Mi"
# GOOD — describes the outcome only
- key: stable-operation
  title: "Stable Operation"
Kyverno policy messages
Kyverno error messages are shown verbatim when a learner tries a blocked operation. They must not hint at the fix or name the correct approach.
# BAD — "fix the resource limits instead" tells them exactly what to do
message: "Container image must be preserved - fix the resource limits instead"
# BAD — "add persistent storage instead" names the solution
message: "Cannot change the container image - add persistent storage instead"
# BAD — "Add ingress rules instead" describes the correct fix
message: "Deleting the NetworkPolicy is not allowed. Add ingress rules instead."
# GOOD — describes only what's blocked
message: "Cannot change the container image"
message: "The deny-all NetworkPolicy must not be deleted or modified."
message: "Only 1 replica is allowed"
YAML comments in manifests and policies
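Treat every YAML comment in a manifest or policy as visible to learners. The API server does not store comments, so kubectl get -o yaml will not return them, but anyone who can read the source files sees them. Any comment that reveals the root cause, the broken field, or the fix is a spoiler.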
# BAD — direct spoiler
nodeSelector:
  disktype: ssd # BUG: This label doesn't exist on any node
# BAD — indirect hint
activeDeadlineSeconds: 30 # TODO: increase this
# GOOD — no comment (preferred)
nodeSelector:
  disktype: ssd
# GOOD — structural comment only
# Kyverno: preserves the challenge image to prevent bypasses
2. Validation depth — enforce the learning artifact
A validation that checks only behavioral outcomes (pod is Ready, logs show success) can be passed without possessing the target knowledge. The strongest challenges add a spec validation that checks for the structural change the learner needs to make.
The question to ask
"Can a learner pass all validations without ever touching the thing this challenge is supposed to teach?"
If the answer is yes — a log validation can be faked, a status check passes after recreating from scratch — you need a spec validation.
Examples
| Challenge concept | Behavioral check only (weak) | With spec check (strong) |
|---|---|---|
| Wire credentials from a Secret | Pod is Ready + logs show "Connected" | + env contains valueFrom.secretKeyRef.name: <secret> |
| Persistent storage (PVC) | Volume mounted at /data + data survives restart | + volumes contains persistentVolumeClaim: {} |
| Toleration for a tainted node | Pod is scheduled + Running | + spec.template.spec.tolerations contains key: <taint-key> |
| Liveness probe configured | Pod is Ready | + spec.template.spec.containers[0].livenessProbe exists |
Spec validation patterns
# Assert a Secret reference is used
- path: spec.template.spec.containers[0].env
  contains:
    valueFrom:
      secretKeyRef:
        name: database-credentials
# Assert a PVC-backed volume (not hostPath or emptyDir)
- path: spec.template.spec.volumes
  contains:
    persistentVolumeClaim: {}
# Assert a toleration is present
- path: spec.template.spec.tolerations
  contains:
    key: dedicated
# Assert a field exists (liveness probe)
- path: spec.template.spec.containers[0].livenessProbe
  exists: true
The contains operator matches any element in a list where all specified key-value pairs are present. contains: {persistentVolumeClaim: {}} matches any volume entry that has a persistentVolumeClaim key, regardless of its value.
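To make the matching rule concrete, here is an illustrative volumes list annotated with how contains: {persistentVolumeClaim: {}} would treat each entry, going only by the description above. The volume names, claim name, and host path are hypothetical.

```yaml
# Illustrative spec.template.spec.volumes list (all names are hypothetical).
volumes:
  # Matches: the entry has a persistentVolumeClaim key; its value is not compared.
  - name: data
    persistentVolumeClaim:
      claimName: app-data
  # Does not match: there is no persistentVolumeClaim key.
  - name: cache
    emptyDir: {}
  # Does not match: hostPath is a different volume source.
  - name: host-data
    hostPath:
      path: /mnt/data
```

Under that reading, one matching entry is enough for the check to pass, so a learner who adds a PVC-backed volume satisfies it even if other volume types remain in the list.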
3. Bypass protection — the full picture
Kyverno policies lock the immutable frame of the challenge. Think through every bypass path, not just the obvious one.
Always protect: container image and command
The image defines what the app does. If a learner can swap it, they can trivially pass behavioral checks.
The command (and args when the application logic lives there) defines what the container runs.
rules:
  - name: preserve-image
    validate:
      message: "Cannot change the container image"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: app
                  image: "busybox:1.36"
  - name: preserve-command
    validate:
      message: "Container command must be preserved"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: api
                  command: ["/bin/sh"]
If the entire application script lives in args (e.g., a shell one-liner), protect args the same way you protect command. A learner can rewrite args to emit the expected log string without making any meaningful change.
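A sketch of an args-protecting rule, mirroring the preserve-command rule above. The rule name, message wording, and args content are placeholders, not taken from a real challenge; the pattern would need to reproduce the challenge's actual args.

```yaml
rules:
  - name: preserve-args              # hypothetical rule name
    validate:
      message: "Container args must be preserved"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: api
                  # Placeholder: copy the args from the challenge manifest verbatim.
                  args: ["-c", "<application script>"]
```

As with the other policy messages, the message states only what is blocked and says nothing about the intended fix.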
NetworkPolicy challenges — block UPDATE too
Blocking DELETE on a NetworkPolicy without blocking UPDATE/PATCH means a learner can patch it to add a blanket allow rule, bypassing the need to create a targeted policy.
operations:
  - DELETE
  - UPDATE # required — patch is an UPDATE
RBAC challenges — block built-in ClusterRoles
For challenges requiring a custom Role with specific verbs, a learner can bypass the learning by binding to the built-in view ClusterRole, which often coincidentally satisfies the permission checks (list pods/configmaps allowed, secrets denied).
- name: block-builtin-role-binding
  match:
    resources:
      kinds: ["RoleBinding"]
      namespaces: ["<challenge-slug>"]
  validate:
    message: "Using built-in ClusterRoles (view/edit/admin) is not allowed"
    deny:
      conditions:
        all:
          - key: "{{ request.object.roleRef.kind }}"
            operator: Equals
            value: ClusterRole
          - key: "{{ request.object.roleRef.name }}"
            operator: In
            value: ["view", "edit", "admin", "cluster-admin"]
Delete-and-recreate bypass
If a learner can delete a protected resource and recreate it from scratch with a correct spec, they bypass the diagnostic phase entirely. Protect the resources that define the broken scenario:
- Jobs, CronJobs: protect image + command on the Job resource, not just the CronJob
- Deployments: protect image (already common) + container name
- Policies themselves: if using kind: Policy (namespace-scoped), a learner can delete the policy before recreating the Deployment. Prefer kind: ClusterPolicy with namespaces: scoping for high-stakes challenges
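As a sketch of the ClusterPolicy approach, the image rule from section 3 could be wrapped as follows. The policy name is hypothetical, <challenge-slug> is the same placeholder used elsewhere on this page, and the validationFailureAction field assumes a Kyverno version that accepts it at the policy level.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy                  # cluster-scoped rather than living in the challenge namespace
metadata:
  name: protect-challenge-frame      # hypothetical name
spec:
  validationFailureAction: Enforce   # reject violating requests instead of only auditing them
  rules:
    - name: preserve-image
      match:
        resources:
          kinds: ["Deployment"]
          namespaces: ["<challenge-slug>"]   # limits the rule to the challenge namespace
      validate:
        message: "Cannot change the container image"
        pattern:
          spec:
            template:
              spec:
                containers:
                  - name: app
                    image: "busybox:1.36"
```

Whether this actually stops a learner from removing the policy depends on the RBAC they are granted in the cluster, which this page does not specify.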
What NOT to protect
Never protect the fields the learner needs to change. The solution space must stay open:
| Challenge concept | Leave open |
|---|---|
| Resource limits | resources.limits, resources.requests |
| Probes | livenessProbe, readinessProbe |
| Env vars from Secrets | env[].valueFrom |
| RBAC objects | Role, RoleBinding (except built-in ClusterRole bindings) |
| NetworkPolicies (new allow rules) | Creating new NetworkPolicy resources |
| Tolerations | spec.template.spec.tolerations |
4. Challenge type accuracy
The type field sets learner expectations about what they'll find.
| Type | Starting state | Learner expectation |
|---|---|---|
| fix | Something is broken and deployed | Diagnose why it's broken, repair it |
| operate | Infrastructure is running, nothing is broken | Create or configure a missing resource |
| improve | Works but isn't production-ready | Harden it — probes, limits, security |
| migrate | Working but outdated setup | Transform to a new pattern |
A challenge where nothing is broken and the task is simply "create a missing Service" should be operate, not fix. Learners arriving at a fix challenge will waste time looking for something that isn't broken.
5. Description accuracy
The description must not only avoid hints — it must also be factually accurate.
# BAD — describes intermittent failure, but the deny-all NetworkPolicy
# blocks 100% of traffic (total outage, not partial)
description: "Some requests reach the backend, others time out — it's not consistent."
# GOOD — matches the actual behavior
description: "The backend is completely unreachable. Every request times out."
Inaccurate descriptions mislead learners into looking for intermittent issues, race conditions, or load-related problems when the reality is deterministic.
6. Common bypass vectors — quick reference
| Scenario | Bypass path | Mitigation |
|---|---|---|
| Log validation | Rewrite args to emit the expected string without changing anything else | Protect args in Kyverno; add spec validation |
| RBAC challenge | Bind to built-in view ClusterRole via RoleBinding | Block RoleBindings referencing built-in ClusterRoles |
| NetworkPolicy challenge | PATCH the deny-all policy to add a blanket allow | Add UPDATE to blocked operations |
| Persistent storage | Use hostPath instead of PVC | Add spec check: volumes contains persistentVolumeClaim: {} |
| Scheduling challenge | Remove the node taint instead of adding a toleration | Add spec check: tolerations contains the taint key |
| Any challenge | Delete and recreate the resource with correct spec | Protect Job/CronJob image+command; consider ClusterPolicy |
| RBAC challenge | Delete and recreate Deployment with a different SA | Protect serviceAccountName in Kyverno |
| Image-swap bypass | kubectl set image to a trivially passing image | Preserve image in Kyverno (standard — always do this) |
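The serviceAccountName mitigation in the table could look like the rule below, written in the same style as the other rules on this page. The rule name, message, and service account name are placeholders.

```yaml
- name: preserve-service-account       # hypothetical rule name
  match:
    resources:
      kinds: ["Deployment"]
      namespaces: ["<challenge-slug>"]
  validate:
    message: "The service account must be preserved"
    pattern:
      spec:
        template:
          spec:
            serviceAccountName: "app-sa"   # placeholder: the challenge's ServiceAccount
```

This keeps the Deployment tied to the ServiceAccount whose permissions the learner is expected to fix, while leaving Role and RoleBinding objects open to change.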