🛡️ Content Moderation Queue

OpenEnv Live v1.0.0

A realistic RL environment simulating Trust & Safety moderation. AI agents triage social media posts, handle appeals, detect crisis content, and apply graduated policy enforcement.

- 30 labeled posts
- 3 difficulty levels
- 6 action types
- 9 violation types
How It Works

1. Reset (POST /reset): start an episode and pick a difficulty.
2. Observe: read the post, user history, and context; keep the returned session_id.
3. Decide (POST /step): choose an action and, where applicable, a violation type.
4. Score (GET /state): get a reward from 0.0 to 1.0 plus the full state and running score.
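The reset/observe/decide/score loop above can be sketched as a small driver. Transport is abstracted out (pass in functions that wrap POST /reset and POST /step); note that the response field names `session_id`, `reward`, `done`, and `post` are assumptions about the payload shape, not documented guarantees.

```python
from typing import Any, Callable, Dict


def run_episode(
    reset: Callable[[], Dict[str, Any]],
    step: Callable[[str, str], Dict[str, Any]],
    policy: Callable[[Dict[str, Any]], str],
) -> float:
    """Drive one episode: reset, then act on each post until the queue is done.

    `reset` wraps POST /reset and `step` wraps POST /step; both return parsed
    JSON. Field names (session_id, reward, done, post) are assumed here.
    """
    obs = reset()
    session_id = obs["session_id"]
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)              # e.g. "approve", "remove", "escalate"
        result = step(session_id, action)
        total_reward += result.get("reward", 0.0)
        done = result.get("done", False)
        obs = result                      # next post becomes the new observation
    return total_reward
```

With an HTTP client, `reset` and `step` would be thin wrappers around the endpoints listed under API Endpoints below.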
Tasks

Easy: Binary Content Moderation
Classify 8 posts as safe or harmful. Clear signals: spam and hate speech vs. everyday content.
8 posts · 2 actions · Binary reward

Medium: Tiered Policy Enforcement
Apply a 5-level severity scale to cases such as harsh criticism, repeat offenders, political content, and zero-tolerance violations.
10 posts · 5 actions · Partial credit

Hard: Full Queue + Appeals
Handle appeals, gaming slang vs. genuine threats, crisis escalation, repeat offenders, and second-appeal denials.
12 posts · 6 actions · Context bonus
Action Space

- approve: safe, policy-compliant content
- warn: minor first-time violation
- remove: clear policy violation
- ban_temporary: repeat or serious offense
- ban_permanent: zero-tolerance violation or 5+ violations
- escalate: ambiguous content, appeals, or crisis
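For programmatic use, the six actions can be captured as an enum. This is an illustrative sketch, not the environment's own client code; only the action names are taken from the list above.

```python
from enum import Enum


class ModerationAction(str, Enum):
    """The six actions in the environment's action space."""
    APPROVE = "approve"
    WARN = "warn"
    REMOVE = "remove"
    BAN_TEMPORARY = "ban_temporary"
    BAN_PERMANENT = "ban_permanent"
    ESCALATE = "escalate"
```

Mixing in `str` lets the enum values be sent directly as query-parameter strings.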
API Endpoints

- GET /health: liveness check
- GET /tasks: list all tasks
- POST /reset?task_id=task_easy: start an episode, get a session_id
- POST /step?session_id=abc123: submit an action, get a reward
- GET /state?session_id=abc123: full state + score
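A tiny helper can assemble full request URLs for these endpoints. The base URL is an assumption (the docs do not state a host or port); the paths and query parameters come from the list above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumed host and port


def endpoint(path: str, **params: str) -> str:
    """Build a full request URL for one of the documented endpoints."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}{path}{query}"


# The five endpoints, as full URLs:
health = endpoint("/health")                         # GET
tasks = endpoint("/tasks")                           # GET
reset = endpoint("/reset", task_id="task_easy")      # POST
step = endpoint("/step", session_id="abc123")        # POST
state = endpoint("/state", session_id="abc123")      # GET
```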
Reward Design

- Non-sparse: every post scores independently (0.0 - 1.0)
- Partial credit: being one severity level off scores ~0.65 instead of 0.0
- Context bonus: +0.3 for history-dependent or context-dependent posts
- Violation ID: correctly identifying the violation category earns a bonus
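The partial-credit rule can be illustrated with a toy scorer over the five severity-ordered actions (escalate sits outside the severity scale and is omitted). The 0.65 value for one level off comes from the docs; the drop to 0.0 beyond one level is an assumption for illustration.

```python
# Actions ordered by increasing severity; "escalate" is not on this scale.
SEVERITY_ORDER = ["approve", "warn", "remove", "ban_temporary", "ban_permanent"]


def partial_credit(chosen: str, correct: str) -> float:
    """Toy version of the partial-credit rule: exact match scores 1.0,
    one severity level off scores 0.65, anything further scores 0.0.
    (0.65 is documented as ~0.65; the cliff to 0.0 is an assumption.)
    """
    distance = abs(SEVERITY_ORDER.index(chosen) - SEVERITY_ORDER.index(correct))
    if distance == 0:
        return 1.0
    if distance == 1:
        return 0.65
    return 0.0
```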
Baseline Scores

Meta Llama 3 8B Instruct (temperature=0, seed=42, reproducible)

- Easy: 0.500
- Medium: 0.533
- Hard: 0.423