Challenge Design Guidelines

Quality rules and anti-patterns for building challenges that teach: challenges that do not hint, do not leak, and cannot be bypassed without genuine understanding.

This page distills the design principles that separate good Kubeasy challenges from mediocre ones. Apply these rules at every layer of a challenge: descriptions, objectives, manifests, policies.


1. Mystery preservation — everywhere

The learner must not know what's wrong before they investigate. This applies at every layer, not just the description field.

description and initialSituation

Describe what the learner will observe, not why it's happening.

# BAD — names the fix mechanism
initialSituation: |
  The job's activeDeadlineSeconds is set to 30, but the script takes 60 seconds.
  kubectl describe job shows "DeadlineExceeded".

# GOOD — observable symptoms only
initialSituation: |
  A nightly job has been consistently failing.
  The pod runs for a while, then gets terminated before finishing.
  Check the job events and status for clues.

Specific things to never include in initialSituation:

  • The name of the Kubernetes event or error reason (DeadlineExceeded, OOMKilled, Evicted)
  • The exact field that needs changing (activeDeadlineSeconds, nodeSelector, secretKeyRef)
  • The duration or threshold value that defines the bug (takes about 60 seconds, limit is 32Mi)
  • Anything that tells the learner where to look before they've looked

Validation titles and descriptions

Titles are shown in the UI and CLI output. A title that names the fix hands the solution to the learner before they've validated anything.

# BAD — tells the learner what they needed to do
- key: memory-fix
  title: "Memory Limit Set to 256Mi"

# GOOD — describes the outcome only
- key: stable-operation
  title: "Stable Operation"

Kyverno policy messages

Kyverno error messages are shown verbatim when a learner tries a blocked operation. They must not hint at the fix or name the correct approach.

# BAD — "fix the resource limits instead" tells them exactly what to do
message: "Container image must be preserved - fix the resource limits instead"

# BAD — "add persistent storage instead" names the solution
message: "Cannot change the container image - add persistent storage instead"

# BAD — "Add ingress rules instead" describes the correct fix
message: "Deleting the NetworkPolicy is not allowed. Add ingress rules instead."

# GOOD — describes only what's blocked
message: "Cannot change the container image"
message: "The deny-all NetworkPolicy must not be deleted or modified."
message: "Only 1 replica is allowed"

YAML comments in manifests and policies

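Treat every YAML comment as visible to the learner. The API server does not store comments, so kubectl get -o yaml will not return them, but anyone who can read the manifest files themselves sees them in full. Any comment that reveals the root cause, the broken field, or the fix is a spoiler.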

# BAD — direct spoiler
nodeSelector:
  disktype: ssd  # BUG: This label doesn't exist on any node

# BAD — indirect hint
activeDeadlineSeconds: 30  # TODO: increase this

# GOOD — no comment (preferred)
nodeSelector:
  disktype: ssd

# GOOD — structural comment only
# Kyverno: preserves the challenge image to prevent bypasses

2. Validation depth — enforce the learning artifact

A validation that checks only behavioral outcomes (pod is Ready, logs show success) can be passed without possessing the target knowledge. The strongest challenges add a spec validation that checks for the structural change the learner needs to make.

The question to ask

"Can a learner pass all validations without ever touching the thing this challenge is supposed to teach?"

If the answer is yes (for example, a log validation can be faked, or a status check still passes after the resource is recreated from scratch), you need a spec validation.

Examples

| Challenge concept | Behavioral check only (weak) | With spec check (strong) |
|---|---|---|
| Wire credentials from a Secret | Pod is Ready + logs show "Connected" | + env contains valueFrom.secretKeyRef.name: <secret> |
| Persistent storage (PVC) | Volume mounted at /data + data survives restart | + volumes contains persistentVolumeClaim: {} |
| Toleration for a tainted node | Pod is scheduled + Running | + spec.template.spec.tolerations contains key: <taint-key> |
| Liveness probe configured | Pod is Ready | + spec.template.spec.containers[0].livenessProbe exists |

Spec validation patterns

# Assert a Secret reference is used
- path: spec.template.spec.containers[0].env
  contains:
    valueFrom:
      secretKeyRef:
        name: database-credentials

# Assert a PVC-backed volume (not hostPath or emptyDir)
- path: spec.template.spec.volumes
  contains:
    persistentVolumeClaim: {}

# Assert a toleration is present
- path: spec.template.spec.tolerations
  contains:
    key: dedicated

# Assert a field exists (liveness probe)
- path: spec.template.spec.containers[0].livenessProbe
  exists: true

The contains operator matches any element in a list where all specified key-value pairs are present. contains: {persistentVolumeClaim: {}} matches any volume entry that has a persistentVolumeClaim key, regardless of its value.
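
To make the matching rule above concrete, here is a small sketch of a pod template volumes list. The volume names, claim name, and host path are hypothetical and only serve the illustration. Going by the semantics described above, the first entry would satisfy contains: {persistentVolumeClaim: {}}, while the other two would not.

```yaml
# Hypothetical Deployment fragment; all names are illustrative only.
spec:
  template:
    spec:
      volumes:
        # Would match: this entry has a persistentVolumeClaim key
        # (its value, here the claim name, is not compared).
        - name: data
          persistentVolumeClaim:
            claimName: app-data
        # Would not match: no persistentVolumeClaim key on these entries.
        - name: scratch
          emptyDir: {}
        - name: host-logs
          hostPath:
            path: /var/log
```

Since contains is described as matching any element, one matching entry should be enough for the check on spec.template.spec.volumes to pass. A list holding only the emptyDir and hostPath entries would fail it.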


3. Bypass protection — the full picture

Kyverno policies lock the immutable frame of the challenge. Think through every bypass path, not just the obvious one.

Always protect: container image and command

The image defines what the app does. If a learner can swap it, they can trivially pass behavioral checks. The command (and args when the application logic lives there) defines what the container runs.

rules:
  - name: preserve-image
    validate:
      message: "Cannot change the container image"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: app
                  image: "busybox:1.36"

  - name: preserve-command
    validate:
      message: "Container command must be preserved"
      pattern:
        spec:
          template:
            spec:
              containers:
                - name: app
                  command: ["/bin/sh"]

If the entire application script lives in args (e.g., a shell one-liner), protect args the same way you protect command. A learner can rewrite args to emit the expected log string without making any meaningful change.

NetworkPolicy challenges — block UPDATE too

Blocking DELETE on a NetworkPolicy without blocking UPDATE/PATCH means a learner can patch it to add a blanket allow rule, bypassing the need to create a targeted policy.

operations:
  - DELETE
  - UPDATE   # required — patch is an UPDATE
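
For context, the operations fragment above might sit inside a policy like the sketch below. It assumes a Kyverno version that supports operations under match resources and an empty deny block. The policy name, rule name, and the NetworkPolicy name deny-all are hypothetical, and <challenge-slug> is the same placeholder used elsewhere on this page. Field names such as validationFailureAction differ between Kyverno versions, so check the version you deploy.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: protect-deny-all-netpol        # hypothetical policy name
spec:
  validationFailureAction: Enforce     # reject the request rather than only auditing it
  background: false                    # rule depends on the admission operation
  rules:
    - name: block-netpol-delete-and-update
      match:
        any:
          - resources:
              kinds: ["NetworkPolicy"]
              names: ["deny-all"]                  # hypothetical resource name
              namespaces: ["<challenge-slug>"]
              operations:
                - DELETE
                - UPDATE                           # covers kubectl patch, edit, and apply
      validate:
        message: "The deny-all NetworkPolicy must not be deleted or modified."
        deny: {}                                   # no conditions: deny every matched request
```

CREATE is deliberately left out of the matched operations, so learners remain free to add new, targeted NetworkPolicy resources next to the protected one.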

RBAC challenges — block built-in ClusterRoles

For challenges requiring a custom Role with specific verbs, a learner can bypass the learning by binding to the built-in view ClusterRole, which often coincidentally satisfies the permission checks (list pods/configmaps allowed, secrets denied).

- name: block-builtin-role-binding
  match:
    resources:
      kinds: ["RoleBinding"]
      namespaces: ["<challenge-slug>"]
  validate:
    message: "Using built-in ClusterRoles (view/edit/admin) is not allowed"
    deny:
      conditions:
        all:
          - key: "{{ request.object.roleRef.kind }}"
            operator: Equals
            value: ClusterRole
          - key: "{{ request.object.roleRef.name }}"
            operator: In
            value: ["view", "edit", "admin", "cluster-admin"]
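
As an illustration of what the rule above is meant to separate, here are two RoleBindings. The binding, ServiceAccount, and Role names are hypothetical. The first references the built-in view ClusterRole and would be expected to trip the deny conditions. The second references a custom namespaced Role and would be expected to pass.

```yaml
# Expected to be rejected: roleRef points at a built-in ClusterRole.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-view                     # hypothetical
  namespace: "<challenge-slug>"
subjects:
  - kind: ServiceAccount
    name: app-sa                     # hypothetical
    namespace: "<challenge-slug>"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
---
# Expected to be admitted: roleRef points at a Role the learner wrote.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-pod-reader               # hypothetical
  namespace: "<challenge-slug>"
subjects:
  - kind: ServiceAccount
    name: app-sa                     # hypothetical
    namespace: "<challenge-slug>"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader                   # hypothetical custom Role
```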

Delete-and-recreate bypass

If a learner can delete a protected resource and recreate it from scratch with a correct spec, they bypass the diagnostic phase entirely. Protect the resources that define the broken scenario:

  • Jobs, CronJobs: protect image + command on the Job resource, not just the CronJob
  • Deployments: protect image (already common) + container name
  • Policies themselves: if using kind: Policy (namespace-scoped), a learner can delete the policy before recreating the Deployment — prefer kind: ClusterPolicy with namespaces: scoping for high-stakes challenges
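
A minimal sketch of the ClusterPolicy approach from the list above, applied to a Job. The policy name and container name are hypothetical, the image tag is borrowed from the earlier example on this page, and validationFailureAction may be spelled differently depending on the Kyverno version. Because the rule validates the Job itself, a Job that is deleted and recreated still has to carry the original image.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy                        # cluster-scoped: not stored in the challenge namespace
metadata:
  name: challenge-preserve-job-image       # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: preserve-job-image
      match:
        any:
          - resources:
              kinds: ["Job"]
              namespaces: ["<challenge-slug>"]   # scope the rule to one challenge
      validate:
        message: "Cannot change the container image"
        pattern:
          spec:
            template:
              spec:
                containers:
                  - name: worker                 # hypothetical container name
                    image: "busybox:1.36"
```

The same shape could be repeated for the command, following the preserve-command rule shown earlier.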

What NOT to protect

Never protect the fields the learner needs to change. The solution space must stay open:

| Challenge concept | Leave open |
|---|---|
| Resource limits | resources.limits, resources.requests |
| Probes | livenessProbe, readinessProbe |
| Env vars from Secrets | env[].valueFrom |
| RBAC objects | Role, RoleBinding (except built-in ClusterRole bindings) |
| NetworkPolicies (new allow rules) | Creating new NetworkPolicy resources |
| Tolerations | spec.template.spec.tolerations |

4. Challenge type accuracy

The type field sets learner expectations about what they'll find.

| Type | Starting state | Learner expectation |
|---|---|---|
| fix | Something is broken and deployed | Diagnose why it's broken, repair it |
| operate | Infrastructure is running, nothing is broken | Create or configure a missing resource |
| improve | Works but isn't production-ready | Harden it: probes, limits, security |
| migrate | Working but outdated setup | Transform to a new pattern |

A challenge where nothing is broken and the task is simply "create a missing Service" should be operate, not fix. Learners arriving at a fix challenge will waste time looking for something that isn't broken.


5. Description accuracy

The description must not only avoid hints — it must also be factually accurate.

# BAD — describes intermittent failure, but the deny-all NetworkPolicy
# blocks 100% of traffic (total outage, not partial)
description: "Some requests reach the backend, others time out — it's not consistent."

# GOOD — matches the actual behavior
description: "The backend is completely unreachable. Every request times out."

Inaccurate descriptions mislead learners into looking for intermittent issues, race conditions, or load-related problems when the reality is deterministic.


6. Common bypass vectors — quick reference

| Scenario | Bypass path | Mitigation |
|---|---|---|
| Log validation | Rewrite args to emit the expected string without changing anything else | Protect args in Kyverno; add spec validation |
| RBAC challenge | Bind to built-in view ClusterRole via RoleBinding | Block RoleBindings referencing built-in ClusterRoles |
| NetworkPolicy challenge | PATCH the deny-all policy to add a blanket allow | Add UPDATE to blocked operations |
| Persistent storage | Use hostPath instead of PVC | Add spec check: volumes contains persistentVolumeClaim: {} |
| Scheduling challenge | Remove the node taint instead of adding a toleration | Add spec check: tolerations contains the taint key |
| Any challenge | Delete and recreate the resource with correct spec | Protect Job/CronJob image+command; consider ClusterPolicy |
| RBAC challenge | Delete and recreate Deployment with a different SA | Protect serviceAccountName in Kyverno |
| Image-swap bypass | kubectl set image to a trivially passing image | Preserve image in Kyverno (standard; always do this) |
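
The serviceAccountName mitigation in the table could look roughly like the rule below, written in the same style as the image rule earlier on this page. This is a sketch, not a tested policy: the rule name and the ServiceAccount name app-sa are hypothetical, and the fragment is assumed to sit under spec.rules of a Kyverno policy.

```yaml
# Fragment for spec.rules of a Kyverno Policy or ClusterPolicy.
- name: preserve-service-account          # hypothetical rule name
  match:
    any:
      - resources:
          kinds: ["Deployment"]
          namespaces: ["<challenge-slug>"]
  validate:
    # States only what is blocked, without naming the intended fix.
    message: "Cannot change the ServiceAccount"
    pattern:
      spec:
        template:
          spec:
            serviceAccountName: "app-sa"  # hypothetical ServiceAccount name
```

With a pattern like this, a Deployment that omits serviceAccountName or sets a different value would be expected to fail validation, so a recreated Deployment still runs under the ServiceAccount whose permissions the challenge is about.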
