RL Environment as a Service — Robotics Center of Silicon Valley

What this is

A complete RL research environment: simulation, real hardware, and the pipeline between them

The SVRC RL Environment is not just a simulator and not just a physical lab. It is both, connected by a standardized pipeline that lets you train policies in simulation and deploy them on real robots with minimal transfer effort. We provide URDF files, pre-built simulation environments, benchmark tasks with defined metrics, reward function libraries, and access to real hardware for validation.

The goal is to eliminate the months of setup work that typically precedes useful RL research. Instead of building your own simulation environment, calibrating a real robot, and designing a sim-to-real pipeline from scratch, you start with a working system and focus on your actual research question.

Simulators

Supported simulation environments

SVRC provides tested, validated simulation environments for three major simulators. Each includes URDF/MJCF robot models, task scenes, and example training scripts.

MuJoCo

Best for: fast iteration, contact-rich manipulation, academic research. MuJoCo is the most widely used simulator in robot learning research. SVRC provides MJCF models for OpenArm 1, DK1, and Unitree G1/Go2 with calibrated dynamics parameters matched to real hardware. Benchmark tasks run at 1000+ FPS on a single GPU. All 10 canonical tasks have MuJoCo implementations.

NVIDIA IsaacSim

Best for: GPU-parallel training, photorealistic rendering, large-scale RL. IsaacSim enables thousands of parallel environment instances for massive data generation. SVRC provides USD robot assets, IsaacGym task implementations, and domain randomization configurations. Recommended for teams with NVIDIA GPU clusters who need high throughput.

Genesis

Best for: differentiable simulation, gradient-based optimization, novel research directions. Genesis provides differentiable physics, enabling gradient-based policy optimization and system identification. SVRC provides Genesis-compatible robot models and task environments. This is the newest supported simulator — ideal for researchers exploring differentiable robotics.

Robot Models

Available URDF and simulation models

Every SVRC hardware platform has validated simulation models. These are not generic URDFs — they are calibrated to match the dynamics and kinematics of real SVRC hardware, with documented deviation metrics.

OpenArm 1

URDF, MJCF, and USD formats. 6-DOF arm with gripper. Calibrated joint friction, damping, and inertia parameters. Sim-to-real position error under 3mm at end-effector after calibration. Includes 3 gripper variants (parallel jaw, soft finger, suction). Available in all three simulators. This is the primary platform for most RL research at SVRC.

DK1 Bimanual

Dual-arm URDF with shared workspace configuration. 12-DOF (6 per arm) with coordinated control interface. MuJoCo and IsaacSim models available. Bimanual task environments include handover, cooperative lifting, and two-handed assembly. Sim-to-real coordination timing error under 15ms.

Unitree G1 Humanoid

Full-body URDF with 23+ DOF. MuJoCo and IsaacSim models. Includes locomotion, loco-manipulation, and standing manipulation task environments. Walking gait parameters matched to real G1 hardware. Contact models for hand manipulation. Recommended for whole-body control and humanoid research.

Unitree Go2 Quadruped

12-DOF quadruped URDF. MuJoCo and IsaacSim models with terrain generation. Locomotion, navigation, and mobile inspection task environments. Gait parameters validated against real Go2 on flat terrain, stairs, and slopes. Useful for navigation RL and mobile manipulation when combined with a mounted arm.

Benchmarks

10 canonical manipulation benchmark tasks

SVRC defines 10 benchmark tasks with standardized success metrics, reset procedures, and difficulty ratings. Use these to evaluate your policies against a common baseline and compare across methods.

Task	Robot	Difficulty	Success Metric	Simulators
Reach target	OpenArm 1	Easy	End-effector within 5mm of target	MuJoCo, IsaacSim, Genesis
Pick and place (single)	OpenArm 1	Easy	Object in target zone, upright	MuJoCo, IsaacSim, Genesis
Pick and place (cluttered)	OpenArm 1	Medium	Target object placed without disturbing others	MuJoCo, IsaacSim
Peg insertion	OpenArm 1	Medium	Peg fully inserted, force under 10N	MuJoCo, IsaacSim
Stack blocks (3 high)	OpenArm 1	Medium	3 blocks stacked, stable for 2 seconds	MuJoCo, IsaacSim, Genesis
Cable routing	DK1 Bimanual	Hard	Cable follows target path within 10mm	MuJoCo
Bimanual handover	DK1 Bimanual	Medium	Object transferred without drop	MuJoCo, IsaacSim
Lid opening	OpenArm 1	Hard	Container lid open more than 90 degrees	MuJoCo
Tool use (spatula flip)	OpenArm 1	Hard	Object flipped using tool, landed in zone	MuJoCo
Humanoid tabletop sort	Unitree G1	Hard	3 objects sorted into correct bins while standing	IsaacSim

Each task includes: environment definition, reset procedure, reward function, success metric, and baseline policy performance. Submit your results to the SVRC Benchmark Leaderboard.
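
As a rough illustration of how success metrics like those in the table might be checked in code, the sketch below implements two of them in plain Python. The function names and arguments are hypothetical and not taken from the SVRC codebase; only the thresholds (5mm, 10N) come from the table above.

```python
import math

REACH_TOLERANCE_M = 0.005      # 5mm, from the benchmark table
MAX_INSERTION_FORCE_N = 10.0   # 10N, from the benchmark table


def reach_success(ee_pos, target_pos, tol=REACH_TOLERANCE_M):
    """True when the end-effector is within `tol` metres of the target."""
    return math.dist(ee_pos, target_pos) <= tol


def peg_insertion_success(insertion_depth, hole_depth, peak_force_n,
                          max_force_n=MAX_INSERTION_FORCE_N):
    """True when the peg is fully inserted and peak force stayed under the limit."""
    fully_inserted = insertion_depth >= hole_depth
    return fully_inserted and peak_force_n < max_force_n
```

Keeping each metric as a small pure function makes it easy to apply the same check to simulated and real-world episode logs.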

Rewards

Reward function library

Designing reward functions is one of the hardest parts of RL. SVRC provides pre-built, tested reward functions for common manipulation objectives. Use them as-is or as starting points for custom reward design.

Reaching rewards

Dense distance-based rewards for end-effector positioning. Options: L2 distance, exponential decay, and staged rewards with bonus at threshold. Includes orientation-aware variants for tasks where approach angle matters. All tested and tuned for stable PPO and SAC training on OpenArm environments.
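
The three reward shapes named above can be sketched in a few lines. This is a minimal illustration, not the SVRC library's implementation: the function names, the decay scale, the threshold, and the bonus value are all assumptions chosen for readability.

```python
import math


def l2_reward(ee_pos, target_pos):
    """Negative Euclidean distance: 0 at the target, more negative farther away."""
    return -math.dist(ee_pos, target_pos)


def exp_reward(ee_pos, target_pos, scale=10.0):
    """Exponential decay in (0, 1]: equals 1 exactly at the target."""
    return math.exp(-scale * math.dist(ee_pos, target_pos))


def staged_reward(ee_pos, target_pos, threshold=0.005, bonus=1.0):
    """Dense L2 term plus a bonus on every step spent inside `threshold` metres."""
    d = math.dist(ee_pos, target_pos)
    return -d + (bonus if d <= threshold else 0.0)
```

The L2 form gives a constant-magnitude gradient everywhere, the exponential form is bounded and concentrates signal near the target, and the staged form adds a sparse success signal on top of the dense term.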

Grasp and manipulation rewards

Composite rewards for grasping: approach phase (distance to pre-grasp), contact phase (grip force in target range), lift phase (height above table), and success bonus. Includes penalties for excessive force, dropped objects, and workspace boundary violations. Tuned for both parallel-jaw and soft-finger grippers.
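
A composite reward of this kind might be structured as follows. The sketch assumes a single per-timestep function; every weight, force band, and height below is a placeholder for illustration rather than a tuned SVRC value.

```python
def grasp_reward(dist_to_pregrasp, grip_force, lift_height, dropped,
                 force_range=(2.0, 8.0), max_force=15.0, target_height=0.10):
    """Sum of phase rewards and penalties for one timestep (illustrative weights)."""
    reward = -dist_to_pregrasp                       # approach: closer is better
    lo, hi = force_range
    if lo <= grip_force <= hi:                       # contact: force inside target band
        reward += 0.5
    if grip_force > max_force:                       # penalty: excessive force
        reward -= 1.0
    # lift: fraction of the target height reached, clipped to [0, 1]
    reward += max(0.0, min(lift_height, target_height)) / target_height
    if lift_height >= target_height:                 # success bonus
        reward += 5.0
    if dropped:                                      # penalty: dropped object
        reward -= 2.0
    return reward
```

Summing independent terms keeps each phase easy to inspect and reweight when a policy gets stuck in one phase.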

Contact-rich task rewards

Specialized rewards for insertion, assembly, and force-sensitive tasks. Includes force magnitude tracking, torque alignment rewards, and tactile contact area metrics. These are the hardest rewards to design from scratch — our library encodes lessons learned from hundreds of training runs on real contact tasks.

Safety and constraint rewards

Penalty terms for joint limit violations, self-collision, excessive velocity, and workspace boundary crossings. Essential for sim-to-real transfer — policies that learn unsafe behaviors in simulation will damage real hardware. Our safety rewards are calibrated to the actual joint limits and velocity constraints of SVRC hardware.
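
One common way to express such penalty terms is a soft margin around each limit. The sketch below is an assumption about the general shape, not the SVRC implementation; the margin, weight, and any limits passed in are hypothetical and are not the actual SVRC hardware limits.

```python
def safety_penalty(joint_pos, joint_vel, pos_limits, vel_limit,
                   margin=0.05, weight=1.0):
    """Return a non-positive penalty for joints near or past limits, or moving too fast."""
    penalty = 0.0
    for q, dq, (lo, hi) in zip(joint_pos, joint_vel, pos_limits):
        # Position: penalise how far the joint intrudes into the margin band.
        penalty += max(0.0, (lo + margin) - q) + max(0.0, q - (hi - margin))
        # Velocity: penalise only the portion of speed above the cap.
        penalty += max(0.0, abs(dq) - vel_limit)
    return -weight * penalty
```

Because the penalty starts inside the margin rather than at the hard limit, the policy is discouraged before a violation would actually occur.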

Sim-to-Real

The sim-to-real pipeline

Training in simulation is only useful if the policy transfers to real hardware. Here is the SVRC sim-to-real pipeline, from training in simulation to real-world deployment on OpenArm 1:

1. Train in simulation

Use SVRC-provided MuJoCo, IsaacSim, or Genesis environments with calibrated robot models. Apply domain randomization (friction, mass, visual appearance, lighting) using our tested randomization ranges. Train until convergence on the sim benchmark — we provide target success rates for each task.
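
Domain randomization typically means resampling physics and visual parameters at each episode reset. The following sketch shows the idea with the standard library only; the parameter names and ranges are placeholders, not SVRC's tested randomization ranges.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float
    mass_scale: float
    light_intensity: float


# Placeholder ranges for illustration only; not SVRC's tested values.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
}


def sample_params(rng):
    """Draw one set of parameters uniformly from RANGES."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)        # seeded so a training run is reproducible
params = sample_params(rng)   # call once per episode reset
```

Using a dedicated seeded generator also makes it straightforward to reserve a separate set of seeds for held-out evaluation.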

2. Evaluate in sim

Run the trained policy on 100 evaluation episodes with held-out randomization seeds. Record success rate, completion time, and force profiles. Compare against SVRC baselines for the task. If sim performance is below baseline, iterate on the policy before moving to real hardware — real-world testing is expensive.
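
With 100 episodes, a measured success rate carries noticeable sampling uncertainty. One way to judge whether a result is meaningfully below a baseline is a Wilson score interval; the sketch below is a generic statistical helper and is not part of the SVRC evaluation scripts.

```python
import math


def wilson_interval(successes, episodes, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    p = successes / episodes
    denom = 1 + z**2 / episodes
    center = (p + z**2 / (2 * episodes)) / denom
    half = z * math.sqrt(p * (1 - p) / episodes + z**2 / (4 * episodes**2)) / denom
    return center - half, center + half


low, high = wilson_interval(successes=90, episodes=100)
print(f"success rate 0.90, interval [{low:.3f}, {high:.3f}]")
```

If the baseline success rate falls inside the interval, 100 episodes alone may not be enough to conclude that the policy is worse.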

3. Deploy to real hardware

Export the policy to a ROS2-compatible inference node. SVRC provides deployment scripts that handle observation preprocessing, action scaling, and safety constraints (joint limits, velocity caps). Run the policy on a real OpenArm 1 with a human operator ready at the e-stop. Start with a slow execution speed and increase it as confidence grows.
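
To make the action scaling and safety constraints concrete, here is a minimal sketch. It assumes the policy outputs normalised joint-position actions in [-1, 1]; the function names, the `speed_scale` parameter, and that control interface are assumptions for illustration, not the behaviour of the SVRC deployment scripts.

```python
def scale_action(action, low, high):
    """Map a policy output in [-1, 1] to the joint range [low, high]."""
    a = max(-1.0, min(1.0, action))
    return low + (a + 1.0) * 0.5 * (high - low)


def safe_joint_targets(current, actions, limits, max_vel, dt, speed_scale=0.25):
    """Turn normalised actions into joint targets that respect limits and a velocity cap."""
    targets = []
    max_step = max_vel * speed_scale * dt   # largest allowed move in one control tick
    for q, a, (lo, hi) in zip(current, actions, limits):
        desired = scale_action(a, lo, hi)
        step = max(-max_step, min(max_step, desired - q))   # velocity cap
        targets.append(max(lo, min(hi, q + step)))          # joint-limit clamp
    return targets
```

Starting with a small `speed_scale` and raising it gradually is one way to implement the slow-start advice above.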

4. Measure the gap and iterate

Compare real-world success rate to simulation success rate. The gap is your sim-to-real transfer loss. Common sources: friction model mismatch (fix with system identification), visual domain gap (fix with real-world fine-tuning data), and timing differences (fix with action delay modeling). SVRC provides diagnostic tools to identify the dominant gap source for your task.
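
Action delay modeling, mentioned above as a fix for timing differences, can be approximated in simulation by holding each action back for a fixed number of control steps. The class below is a hypothetical helper written for illustration; it is not one of the SVRC diagnostic tools.

```python
from collections import deque


class ActionDelay:
    """Return the action that was issued `delay_steps` control ticks ago."""

    def __init__(self, delay_steps, default_action):
        # Pre-fill so the first `delay_steps` outputs are the default action.
        self._queue = deque([default_action] * delay_steps)

    def __call__(self, action):
        self._queue.append(action)
        return self._queue.popleft()
```

For example, `delay = ActionDelay(2, default_action=0.0)` followed by `delay(a)` on every step feeds the simulator the action from two steps earlier. Randomizing `delay_steps` per episode is a natural extension.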

Access

Access tiers

Community (Free)

Includes: URDF/MJCF files for all SVRC robots, MuJoCo task environments for all 10 benchmarks, basic reward functions, and documentation.
Best for: students, individual researchers, and anyone getting started with RL on SVRC platforms.
Access: download from GitHub, no registration required. Community support is available via the Forum.

Research License (Free for academics)

Includes: everything in Community, plus IsaacSim and Genesis environments, the full reward function library, baseline policy weights, evaluation scripts, and benchmark submission access.
Best for: university research labs and graduate students publishing on RL.
Access: request with a .edu email; requests are approved within 2 business days.

Enterprise (Custom pricing)

Includes: everything in Research, plus custom task environment development, real hardware validation sessions at SVRC facilities, dedicated engineering support, private benchmark instances, and integration with the SVRC Data Platform for experiment tracking.
Best for: companies building products with RL-trained policies.
Access: contact us for scoping and pricing.

Getting Started

Your first RL training run in 15 minutes

Here is the fastest path from zero to a trained policy on the SVRC RL environment:

Prerequisites: Python 3.10+ and MuJoCo 3.0+ (pip install mujoco). A GPU is optional but recommended for training speed.

Step 1: Clone the environment repository.

git clone https://github.com/svrc-robotics/rl-environments.git
cd rl-environments
pip install -e .

Step 2: Run the reach task with a random policy.

python examples/reach_random.py --render

This loads the OpenArm 1 MuJoCo model, runs 100 episodes with random actions, and renders the visualization. You should see the arm moving randomly; with random actions it will rarely reach the target positions.

Step 3: Train a policy with PPO.

python train.py --task reach --algo ppo --steps 500000

This trains a PPO policy on the reach task for 500K steps. On a single GPU, this takes approximately 10 minutes. The trained policy checkpoint is saved to checkpoints/reach_ppo/.

Step 4: Evaluate the trained policy.

python evaluate.py --task reach --checkpoint checkpoints/reach_ppo/best.pt --episodes 100 --render

This runs 100 evaluation episodes and reports success rate, mean completion time, and position error. The SVRC baseline for the reach task is 97% success rate — your PPO policy should reach 90%+ after 500K steps.

From here, try harder tasks (pick_place, peg_insertion, stack_blocks), different algorithms (SAC, TD3), or deploy to real hardware following the sim-to-real guide on the Developer Wiki.

Platform Integration

Integration with the SVRC data pipeline

Data Platform for experiment tracking

Log training runs, evaluation metrics, and policy checkpoints to the SVRC Data Platform. Compare RL-trained policies against imitation learning baselines. Track sim performance and real-world performance side by side. Enterprise tier includes automated regression alerts.

Teleoperation data for RL bootstrapping

Use SVRC Data Services to collect teleoperation demonstrations, then use them to bootstrap RL training via demo-augmented RL or offline RL. Combining human demonstrations with RL fine-tuning often outperforms pure RL by 20-40% on contact-rich tasks.
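
One simple form of demo-augmented RL mixes demonstration transitions into every training batch of an off-policy learner. The sketch below shows only that sampling step; the function name, the buffer layout (plain lists of transitions), and the 25% demo fraction are assumptions for illustration, not an SVRC API.

```python
import random


def sample_mixed_batch(demo_buffer, agent_buffer, batch_size,
                       demo_fraction=0.25, rng=random):
    """Sample a batch with a fixed share of demonstration transitions."""
    n_demo = min(len(demo_buffer), round(batch_size * demo_fraction))
    n_agent = min(len(agent_buffer), batch_size - n_demo)
    # Sampling is without replacement within each buffer.
    return rng.sample(demo_buffer, n_demo) + rng.sample(agent_buffer, n_agent)
```

Keeping the demo share fixed prevents the demonstrations from being drowned out as the agent's own replay buffer grows.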

Open datasets for pre-training

Pre-train visual encoders or policy backbones on SVRC open datasets before RL fine-tuning. Pre-trained representations reduce RL sample complexity by 3-5x on our benchmarks. Compatible with LeRobot and Open X-Embodiment data loaders.

FAQ

8 questions about the RL environment

1. Do I need real hardware to use the RL environment?

No. The simulation environments (MuJoCo, IsaacSim, Genesis) run entirely in software. You only need real hardware when you want to validate sim-to-real transfer. The Community and Research tiers are fully software-based.

2. How accurate are the simulation models?

OpenArm 1 sim-to-real position error is under 3mm at the end-effector after calibration. Joint dynamics (friction, damping) are identified from real hardware and validated quarterly. Contact physics are approximate: expect a 10-20% performance gap on contact-rich tasks, which is why we recommend real-world fine-tuning for deployment.

3. Can I use my own robot in the SVRC RL environment?

Yes, if you have a URDF file. The environment framework supports arbitrary URDF models. However, benchmark results and baselines are only provided for SVRC hardware. Contact us if you need help creating or validating a URDF for your robot.

4. What RL algorithms do you support?

The environment is algorithm-agnostic — it provides a standard Gymnasium interface. Our example scripts include PPO, SAC, and TD3 implementations. You can use any RL library that supports Gymnasium: Stable-Baselines3, CleanRL, RLlib, or your own implementation.
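
For readers new to the Gymnasium interface, the episode loop looks roughly like this. To keep the sketch runnable without any installation, it uses a toy stand-in class that mimics the Gymnasium reset/step return values (in simplified form); `ToyReachEnv` and `rollout` are hypothetical names and not SVRC environments. With a real environment, the same `rollout` function would apply unchanged.

```python
import random


class ToyReachEnv:
    """Stand-in with Gymnasium-style reset/step return values (not an SVRC environment)."""

    def __init__(self, horizon=50):
        self.horizon = horizon

    def reset(self, seed=None):
        self._rng = random.Random(seed)
        self._t = 0
        self._pos = self._rng.uniform(-1.0, 1.0)
        return self._pos, {}                         # (observation, info)

    def step(self, action):
        self._t += 1
        self._pos += max(-0.1, min(0.1, action))     # clip the per-step move
        terminated = abs(self._pos) < 0.05           # reached the origin
        truncated = self._t >= self.horizon          # time limit hit
        return self._pos, -abs(self._pos), terminated, truncated, {}


def rollout(env, policy, seed=None):
    """Run one episode; return (total_reward, terminated)."""
    obs, _info = env.reset(seed=seed)
    total, terminated, truncated = 0.0, False, False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
    return total, terminated


total, success = rollout(ToyReachEnv(), policy=lambda obs: -obs, seed=0)
```

The split between `terminated` (task ended) and `truncated` (time limit) matters for value bootstrapping, which is why the loop tracks both.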

5. How do I submit benchmark results?

Run the evaluation script with the --submit flag. It records your policy checkpoint hash, evaluation metrics, and environment configuration, then submits to the SVRC Benchmark Leaderboard. Research License or Enterprise access required for submission.

6. Can I use the environment for a university course?

Yes. The Community tier is free and sufficient for most course assignments. For courses that need IsaacSim environments or benchmark submission access, request a Research License with your faculty email. See the educator page for course integration guidance.

7. What GPU do I need?

MuJoCo runs on CPU and is fast enough for most training tasks. IsaacSim requires an NVIDIA GPU (RTX 3070 or better recommended). Genesis runs on both CPU and GPU. For the getting-started example above, no GPU is needed — CPU training on the reach task converges in 10 minutes.

8. How does this relate to SVRC Data Services?

The RL Environment provides simulation and real-hardware environments for training. Data Services provides human-collected teleoperation data. The two complement each other: use teleoperation data to bootstrap RL training, then use RL to fine-tune beyond human performance. The approach page explains our philosophy on combining data sources.