🛡️

Content Moderation Queue

OpenEnv Live v1.0.0

A real-world RL environment simulating Trust & Safety moderation. AI agents triage social media posts, handle appeals, detect crisis content, and apply graduated policy enforcement.

Labeled Posts

Difficulty Levels

Action Types

Violation Types

How It Works

Reset

Start episode, pick difficulty

POST /reset

Observe

Read post, history, context

session_id

Decide

Choose action + violation

POST /step

Score

Get reward 0.0 - 1.0

GET /state
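The four steps above can be sketched as a small client loop. This is a minimal sketch, not the environment's official client: the `send` transport, the `policy` callable, the `reward` and `done` response fields, and the `action`/`violation` body keys are all assumptions made for illustration. Only the endpoint paths, `task_id=task_easy`, and `session_id` come from this page.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: (method, path, query params, JSON body) -> decoded JSON.
Transport = Callable[[str, str, Dict[str, str], Dict[str, Any]], Dict[str, Any]]
# Hypothetical policy: observation -> {"action": ..., "violation": ...}.
Policy = Callable[[Dict[str, Any]], Dict[str, str]]


def run_episode(send: Transport, policy: Policy, task_id: str = "task_easy",
                max_steps: int = 50) -> List[float]:
    """Run Reset -> Observe -> Decide -> Score and return the per-post rewards."""
    # Reset: start an episode and keep the session identifier.
    obs = send("POST", "/reset", {"task_id": task_id}, {})
    session_id = obs["session_id"]
    rewards: List[float] = []
    for _ in range(max_steps):
        # Observe + Decide: the policy reads the observation and picks an action.
        decision = policy(obs)
        # Score: submit the decision and record the reward for this post.
        obs = send("POST", "/step", {"session_id": session_id}, decision)
        rewards.append(float(obs["reward"]))  # assumed response field
        if obs.get("done"):  # assumed end-of-episode flag
            break
    return rewards
```

Injecting the transport keeps the loop independent of any particular HTTP library, so it can be exercised against a stub before pointing it at a live server.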

Tasks

Easy

Binary Content Moderation

Classify 8 posts as safe or harmful. The signals are clear: spam and hate speech vs. everyday content.

8 posts | 2 actions | Binary

Medium

Tiered Policy Enforcement

Apply a 5-level severity scale. Cases include harsh criticism, repeat offenders, political content, and zero-tolerance violations.

10 posts | 5 actions | Partial credit

Hard

Full Queue + Appeals

Appeals, gaming slang vs threats, crisis escalation, repeat offenders, second-appeal denials.

12 posts | 6 actions | Context bonus

Action Space

approve

Safe, policy-compliant content

warn

Minor first-time violation

remove

Clear policy violation

ban_temporary

Repeat or serious offense

ban_permanent

Zero-tolerance or 5+ violations

escalate

Ambiguous, appeals, or crisis
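On the agent side, the six actions above could be represented as an enumeration so that free-text model output is normalized before it is submitted. This is only a sketch: the `Action` class and `parse_action` helper are hypothetical names, and falling back to `escalate` for unrecognized output is a client-side assumption, not documented environment behavior.

```python
from enum import Enum


class Action(str, Enum):
    """The six action identifiers listed in the Action Space."""
    APPROVE = "approve"
    WARN = "warn"
    REMOVE = "remove"
    BAN_TEMPORARY = "ban_temporary"
    BAN_PERMANENT = "ban_permanent"
    ESCALATE = "escalate"


def parse_action(raw: str) -> Action:
    """Map raw model output to a valid action identifier."""
    cleaned = raw.strip().lower().replace(" ", "_").replace("-", "_")
    try:
        return Action(cleaned)
    except ValueError:
        # Assumption: anything unrecognized is treated as ambiguous and escalated.
        return Action.ESCALATE
```

For example, `parse_action(" Ban Temporary ")` yields `Action.BAN_TEMPORARY`, while an unknown string such as `"delete"` falls back to `Action.ESCALATE`.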

API Endpoints

GET /health | Liveness check

GET /tasks | List all tasks

POST /reset?task_id=task_easy | Start episode, get session_id

POST /step?session_id=abc123 | Submit action, get reward

GET /state?session_id=abc123 | Full state + score
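The endpoint URLs above could be assembled with a small helper such as the one below. It is a sketch under stated assumptions: `BASE_URL` is a placeholder (the page does not give a host or port) and `endpoint` is a hypothetical helper name. The paths and query parameter names are the ones listed above.

```python
from urllib.parse import urlencode

# Placeholder host: replace with wherever the environment is actually served.
BASE_URL = "http://localhost:8000"


def endpoint(path: str, **params: str) -> str:
    """Build a full URL for one of the listed endpoints, encoding query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


# The five calls from the endpoint list, with the example values shown there.
health_url = endpoint("/health")
tasks_url = endpoint("/tasks")
reset_url = endpoint("/reset", task_id="task_easy")
step_url = endpoint("/step", session_id="abc123")
state_url = endpoint("/state", session_id="abc123")
```

Using `urlencode` rather than string concatenation keeps session identifiers safe to embed even if they contain characters that need escaping.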

Reward Design

Non-sparse: every post scores independently (0.0 - 1.0)

Partial credit: one severity level off scores ~0.65 instead of 0.0

Context bonus: +0.3 for history-dependent or context-dependent posts

Violation ID: correctly identifying the violation category earns a bonus
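The partial-credit rule can be illustrated with a short sketch. It is not the environment's actual grader. It assumes the five severity levels follow the order of the Action Space list, that an exact match scores 1.0, and that anything two or more levels off scores 0.0. Only the approximate 0.65 for a one-level miss is stated on this page. The context bonus and the violation-ID bonus are left out because how they combine with the base score is not specified.

```python
# Assumed severity ladder, lowest to highest; "escalate" sits outside it.
SEVERITY = ["approve", "warn", "remove", "ban_temporary", "ban_permanent"]


def severity_score(predicted: str, expected: str) -> float:
    """Illustrative base score for one post under the partial-credit rule."""
    if predicted == expected:
        return 1.0  # assumed score for an exact match
    if predicted not in SEVERITY or expected not in SEVERITY:
        return 0.0  # e.g. "escalate" compared against a ladder action
    distance = abs(SEVERITY.index(predicted) - SEVERITY.index(expected))
    # One level off earns roughly 0.65; larger misses are assumed to earn nothing.
    return 0.65 if distance == 1 else 0.0
```

Under these assumptions, answering `warn` when `remove` was expected scores 0.65, while answering `approve` when `ban_permanent` was expected scores 0.0.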

Baseline Scores

Meta Llama 3 8B Instruct

temperature=0 | seed=42 | reproducible

Easy

0.500

Medium

0.533

Hard

0.423
