Easy
Binary Content Moderation
Classify 8 posts as safe or harmful. Clear signals: spam and hate speech versus everyday content.
8 posts, 2 actions, binary
Medium
Tiered Policy Enforcement
Apply a 5-level severity scale. Covers harsh criticism, repeat offenders, political content, and zero-tolerance violations.
10 posts, 5 actions, partial credit
Hard
Full Queue + Appeals
Appeals, gaming slang vs threats, crisis escalation, repeat offenders, second-appeal denials.
12 posts, 6 actions, context bonus
approve: Safe, policy-compliant content
warn: Minor first-time violation
remove: Clear policy violation
ban_temporary: Repeat or serious offense
ban_permanent: Zero-tolerance or 5+ violations
escalate: Ambiguous, appeals, or crisis
GET   /health                     Liveness check
GET   /tasks                      List all tasks
POST  /reset?task_id=task_easy    Start episode, get session_id
POST  /step?session_id=abc123     Submit action, get reward
GET   /state?session_id=abc123    Full state + score
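The endpoints above could be driven from a small client along these lines. This is a minimal sketch, not the project's own client: the base URL, the JSON body sent to /step ({"action": ...}), and the "session_id" response field name are assumptions, since the listing gives only methods, paths, and query parameters. The helper names (build_url, call, play_one_step) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the listing does not state a host or port.
BASE_URL = "http://localhost:8000"


def build_url(path, **params):
    """Join BASE_URL, an endpoint path, and optional query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def call(method, path, body=None, **params):
    """Send one request and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        build_url(path, **params),
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def play_one_step(task_id, action):
    """Start an episode, submit a single action, then read back the state."""
    reset = call("POST", "/reset", task_id=task_id)
    session_id = reset["session_id"]  # assumed response field name
    # Assumed request body shape; the listing does not document it.
    step = call("POST", "/step", body={"action": action}, session_id=session_id)
    state = call("GET", "/state", session_id=session_id)
    return step, state
```

With a server running, play_one_step("task_easy", "approve") would reset the easy task, submit one approve action, and return the step and state responses. Nothing is sent until one of the functions is called.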
Non-sparse: every post scores independently (0.0 - 1.0)
Partial credit: one severity level off scores ~0.65 instead of 0.0
Context bonus: +0.3 for history-dependent or context-dependent posts
Violation ID: correctly identifying the violation category earns a bonus
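The partial-credit rule can be illustrated with a short scoring sketch. It is an illustration under stated assumptions, not the environment's actual grader. It assumes the five severity actions are ordered as listed (approve through ban_permanent), that escalate sits off that ladder and only scores on an exact match, that the approximate 0.65 is used as an exact value, and that an episode score is the mean of per-post scores. The context and violation-category bonuses are left out because the text does not say how they combine with the base score.

```python
# Assumed severity ordering, taken from the order the actions are listed in.
SEVERITY = ["approve", "warn", "remove", "ban_temporary", "ban_permanent"]


def score_post(predicted, expected):
    """Score one post: 1.0 if exact, 0.65 if one severity level off, else 0.0."""
    if predicted == expected:
        return 1.0
    if predicted in SEVERITY and expected in SEVERITY:
        distance = abs(SEVERITY.index(predicted) - SEVERITY.index(expected))
        if distance == 1:
            return 0.65
    # Assumption: escalate mismatches and larger gaps earn nothing.
    return 0.0


def score_episode(pairs):
    """Average per-post scores (assumed aggregation); each post scores independently."""
    scores = [score_post(predicted, expected) for predicted, expected in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, choosing warn where remove was expected is one level off and scores 0.65 under these assumptions, while choosing approve where ban_permanent was expected scores 0.0.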