Robotics Glossary

60+ terms covering imitation learning, VLA models, teleoperation, kinematics, and embodied AI — written for researchers, engineers, and enterprise teams.

65 terms · A–Z organized · Updated 2026

A

ACT (Action Chunking with Transformers)

ACT is an imitation learning algorithm introduced by Tony Zhao et al. (2023) that trains a transformer-based policy to predict a fixed-length chunk of future actions rather than a single action at each timestep. By predicting action sequences in one shot, ACT reduces the compounding error typical of step-by-step behavioral cloning and produces temporally consistent motion. The architecture encodes RGB observations and proprioceptive state through a CVAE-style encoder and decodes action chunks using a transformer. ACT was demonstrated on the ALOHA bimanual platform, achieving strong performance on tasks such as opening a bag and transferring eggs. See also: Action Chunking (deep dive).

Policy · Transformer · Imitation Learning
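The chunking idea can be sketched in a few lines: the policy is queried once per chunk, and the whole chunk is executed before re-querying. Everything below (the toy environment, the constant policy) is an illustrative stand-in, not the ACT architecture itself.

```python
import numpy as np

class ToyEnv:
    """Minimal 1-D environment: state moves by the commanded delta."""
    def __init__(self):
        self.state = 0.0
    def reset(self):
        self.state = 0.0
        return self.state
    def step(self, action):
        self.state += float(action)
        return self.state

def chunked_rollout(policy, env, n_chunks=3):
    """Query the policy once per chunk and execute the whole chunk
    open-loop before re-querying -- the core idea behind ACT."""
    obs = env.reset()
    for _ in range(n_chunks):
        chunk = policy(obs)            # shape: (chunk_size,) actions
        for a in chunk:
            obs = env.step(a)
    return obs

# Hypothetical "policy" that always commands +0.1 per step in chunks of 4.
constant_policy = lambda obs: np.full(4, 0.1)
final = chunked_rollout(constant_policy, ToyEnv())
```

In the real system, ACT additionally overlaps successive chunks with temporal ensembling to smooth transitions between chunk boundaries.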

Action Space

The action space is the complete set of outputs a robot policy can produce at each timestep. For a robot arm it typically includes joint positions, joint velocities, or end-effector poses (Cartesian position + quaternion); for a mobile robot it includes wheel velocities or steering commands. Action spaces are described as either discrete (a finite menu of actions) or continuous (real-valued vectors). The dimensionality and representation of the action space strongly influence how easy it is to train a stable policy: end-effector delta-pose spaces are often easier for imitation learning, while joint-torque spaces give finer force control but require more careful normalization.

Policy · Control
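A minimal sketch of a continuous action space as bounds arrays, with clipping to keep policy outputs valid. The specific bounds (5 cm translation and 0.2 rad rotation per step) are illustrative assumptions, not a standard.

```python
import numpy as np

# Bounds for a 7-D continuous action: end-effector xyz delta (m),
# rpy delta (rad), and a normalized gripper command in [0, 1].
action_low  = np.array([-0.05, -0.05, -0.05, -0.2, -0.2, -0.2, 0.0])
action_high = np.array([ 0.05,  0.05,  0.05,  0.2,  0.2,  0.2, 1.0])

def clip_action(a):
    """Policies often emit unbounded outputs; clipping (or tanh
    squashing) keeps commands inside the valid action space."""
    return np.clip(a, action_low, action_high)

clipped = clip_action(np.ones(7))
```

Libraries like gymnasium express the same idea with a `Box(low, high)` space object.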

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation)

ALOHA is an open-source bimanual teleoperation system developed at Stanford, consisting of two ViperX 300 follower arms and two WidowX 250 leader arms mounted on a shared frame, with a wrist camera on each follower arm. It was designed to collect high-quality demonstration data at low cost — the original build is under $20,000 — and underpins the ACT policy experiments. Mobile ALOHA extends the platform with a wheeled base, enabling whole-body loco-manipulation tasks such as cooking and cleaning. ALOHA datasets are publicly available and have become a de facto benchmark for bimanual manipulation research. Learn more at SVRC Data Services.

Hardware · Teleoperation · Bimanual

AMR (Autonomous Mobile Robot)

An autonomous mobile robot navigates through its environment without fixed tracks or human guidance, using onboard sensors (LiDAR, cameras, IMU) combined with SLAM, path-planning, and obstacle-avoidance algorithms. Unlike AGVs (automated guided vehicles) that follow magnetic strips, AMRs build and update a map in real time and re-route dynamically around people and objects. Modern warehouse AMRs from companies like Locus Robotics, MiR, and 6 River Systems have driven broad adoption in logistics. AMRs are often combined with manipulator arms to create mobile manipulators capable of pick-and-place at scale.

Mobile Robotics · Navigation · SLAM

B

Behavioral Cloning (BC)

Behavioral cloning is the simplest form of imitation learning: a supervised regression problem where the policy is trained to mimic expert demonstrations by minimizing the prediction error between the policy's output and the expert's action at each observed state. BC is easy to implement and scales well with data, but suffers from distributional shift — because it never receives corrective feedback, small errors cause the robot to visit states not present in the training data, which can cascade into task failure. Techniques like DAgger (Dataset Aggregation) and GAIL were developed specifically to address BC's compounding-error problem.

Imitation Learning · Supervised Learning
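The supervised-regression view can be made concrete with a toy linear policy trained by gradient descent on synthetic "expert" data. All names and dimensions here are illustrative, not part of any real BC implementation.

```python
import numpy as np

def bc_mse_loss(pred_actions, expert_actions):
    """Behavioral cloning objective: mean squared error between
    predicted and expert actions over the demonstration set."""
    return float(np.mean((pred_actions - expert_actions) ** 2))

# Toy setup: a linear "expert" a = W_true @ s and a linear policy
# trained by plain gradient descent.
rng = np.random.default_rng(0)
states  = rng.normal(size=(64, 4))      # 64 demo states, 4-D observations
W_true  = rng.normal(size=(2, 4))       # hypothetical expert weights
actions = states @ W_true.T             # expert actions (2-D)

W = np.zeros((2, 4))
for _ in range(500):
    pred = states @ W.T
    grad = 2.0 * (pred - actions).T @ states / len(states)  # dMSE/dW
    W -= 0.1 * grad

final_loss = bc_mse_loss(states @ W.T, actions)
```

The loop converges on this toy problem; the compounding-error issue described above appears only once the trained policy is rolled out and drifts into states outside the demonstration distribution.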

Bimanual Manipulation

Bimanual manipulation refers to tasks that require two robot arms working in coordination, analogous to how humans use both hands simultaneously. Examples include folding laundry, tying knots, opening jars, and assembling parts that must be stabilized by one hand while the other performs fine operations. Bimanual tasks are substantially harder than single-arm tasks because the policy must coordinate two high-dimensional action streams while respecting physical constraints between the arms. The ALOHA platform was purpose-built for collecting bimanual demonstrations, and ACT is among the leading policies for bimanual control.

Manipulation · Hardware

BOM (Bill of Materials)

In robotics hardware, the BOM lists every component, subassembly, part number, quantity, and unit cost required to build a system. Accurate BOMs are critical for production scale-up, procurement, supply-chain risk management, and cost modeling. For open-source robot platforms such as OpenArm or ALOHA, a published BOM allows external teams to reproduce the hardware without proprietary dependencies. Enterprise teams evaluating robot deployment often request a BOM to benchmark total cost of ownership against lease or robot-as-a-service alternatives — compare SVRC leasing options.

Hardware · Manufacturing

C

Cartesian Space (Task Space)

Cartesian space (also called task space or operational space) describes a robot's configuration in terms of the position and orientation of its end-effector relative to a world or base frame, typically expressed as (x, y, z, roll, pitch, yaw) or (x, y, z, quaternion). Controlling a robot in Cartesian space is often more intuitive for imitation learning because human demonstrations map naturally to end-effector trajectories. The transformation from joint space to Cartesian space is called forward kinematics; the inverse is inverse kinematics.

Kinematics · Control
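A pose given as (x, y, z, roll, pitch, yaw) is often converted to a 4x4 homogeneous transform for downstream kinematics. A sketch assuming the common ZYX (yaw-pitch-roll) Euler convention; conventions differ across robot vendors, so verify before reusing.

```python
import numpy as np

def pose_to_matrix(x, y, z, roll, pitch, yaw):
    """(x, y, z, roll, pitch, yaw) -> 4x4 homogeneous transform,
    using the ZYX Euler convention."""
    cr, sr = np.cos(roll),  np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw),   np.sin(yaw)
    R = np.array([
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ])
    T = np.eye(4)
    T[:3, :3] = R        # rotation part
    T[:3, 3] = [x, y, z] # translation part
    return T
```

The quaternion form (x, y, z, qw, qx, qy, qz) avoids the gimbal-lock ambiguities of Euler angles and is preferred for learned policies.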

Co-training

Co-training in robotics refers to training a single policy on data from multiple robot embodiments, tasks, or environments simultaneously. The hypothesis is that diverse data sources teach the policy robust visual and behavioral representations that transfer better to new settings. The Open X-Embodiment dataset was assembled specifically to enable co-training across 22 robot embodiments. Large foundation models like RT-2 and OpenVLA rely on co-training with internet-scale vision-language data alongside robot demonstration data to bootstrap generalization.

Training · Generalization · Foundation Model

Contact-rich Manipulation

Contact-rich manipulation tasks are those where purposeful, sustained contact between the robot and environment is essential to task success — such as peg-in-hole insertion, screwing bolts, folding fabric, or kneading dough. These tasks are challenging because small positional errors produce large force spikes, and stiff position controllers can damage parts or destabilize the robot. Successful approaches combine compliant control (impedance or admittance control), force-torque sensing, and learned policies that anticipate and exploit contact.

Manipulation · Control · Force Sensing

Continuous Control

Continuous control refers to robot policies that output real-valued action vectors (e.g., joint torques, velocities, or Cartesian deltas) rather than selecting from a discrete set of actions. Most physical robot manipulation tasks require continuous control because smooth, precise motion cannot be adequately represented by a finite action menu. Standard deep RL algorithms for continuous control include DDPG, TD3, and SAC; for imitation learning, behavioral cloning and Diffusion Policy are commonly used in continuous action spaces.

Control · Reinforcement Learning

D

Data Augmentation (for robotics)

Data augmentation in robot learning applies random transformations to training observations to improve policy robustness without collecting additional demonstrations. Common image augmentations include random cropping, color jitter, Gaussian blur, and cutout. More sophisticated augmentations overlay distracting backgrounds, change lighting conditions, or inject sensor noise to prevent overfitting to specific visual features in the training environment. Some approaches augment actions as well — for example, adding noise to joint trajectories to teach the policy to recover from perturbations. Augmentation is especially important when training data is expensive (each demonstration requires human operator time).

Training · Robustness · Data
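Two of the augmentations above, random crop and brightness/contrast jitter, sketched with plain NumPy; real pipelines typically use torchvision or similar, and the parameter values here are illustrative.

```python
import numpy as np

def random_crop(img, out_h, out_w, rng):
    """Random spatial crop, one of the most effective image
    augmentations for visuomotor policies."""
    h, w = img.shape[:2]
    top  = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return img[top:top + out_h, left:left + out_w]

def color_jitter(img, rng, strength=0.1):
    """Random brightness/contrast jitter on a float image in [0, 1]."""
    brightness = rng.uniform(-strength, strength)
    contrast   = rng.uniform(1.0 - strength, 1.0 + strength)
    return np.clip((img - 0.5) * contrast + 0.5 + brightness, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((96, 96, 3))                      # fake camera frame
aug = color_jitter(random_crop(img, 84, 84, rng), rng)
```

Applying a fresh random transform every epoch effectively multiplies the visual diversity of a fixed demonstration set.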

Degrees of Freedom (DOF)

Degrees of freedom describes the number of independent parameters needed to specify the configuration of a mechanical system. A robot arm with six revolute joints has 6 DOF — enough to position and orient its end-effector arbitrarily within its reachable workspace (barring singularities). A 7-DOF arm adds one redundant joint that allows null-space optimization for obstacle avoidance or comfort poses. Human arms have roughly 7 DOF at the shoulder-elbow-wrist chain, making 7-DOF robots natural choices for anthropomorphic manipulation. Mobile bases add 2–3 DOF; full humanoids exceed 30 DOF.

Kinematics · Hardware

Demonstration

A demonstration (also called a trajectory or episode in imitation learning contexts) is a recorded sequence of observations and actions provided by a human or expert controller that illustrates how to perform a task. Demonstrations are the primary data source for behavioral cloning and other imitation learning algorithms. They can be collected via teleoperation, kinesthetic teaching, or motion capture. Data quality — smooth motion, consistent task execution, adequate coverage of the task's state space — matters as much as quantity for downstream policy performance. SVRC collects production-quality robot demonstrations through our data services.

Data · Imitation Learning

Diffusion Policy

Diffusion Policy, introduced by Chi et al. (2023), formulates robot action generation as a denoising diffusion process — the same class of generative models used in image generation. At inference time, the policy iteratively refines a sample of Gaussian noise into a sequence of actions conditioned on the current observation using a learned score network (typically a CNN or transformer). Compared to deterministic behavioral cloning, Diffusion Policy naturally represents multimodal action distributions (multiple valid ways to perform a task) and achieves state-of-the-art results on contact-rich manipulation benchmarks. See the detailed article.

Policy · Generative Model · Imitation Learning
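The inference loop has a simple shape: start from Gaussian noise and iteratively denoise. The sketch below replaces the learned score network with a hand-made toy that contracts samples toward a fixed action, purely to show the control flow, not the real model.

```python
import numpy as np

def diffusion_style_inference(denoise_step, action_dim=2, n_steps=50, seed=0):
    """Shape of Diffusion Policy inference: begin with Gaussian noise
    and repeatedly apply a denoising step. `denoise_step` stands in
    for the learned score network conditioned on the observation."""
    a = np.random.default_rng(seed).normal(size=action_dim)  # pure noise
    for k in range(n_steps):
        a = denoise_step(a, k)
    return a

# Toy "denoiser" that contracts samples toward one expert action.
# A trained network would instead capture *multiple* valid modes.
expert = np.array([0.3, -0.7])
toy_denoiser = lambda a, k: a + 0.2 * (expert - a)
action = diffusion_style_inference(toy_denoiser)
```

Multimodality arises because different noise seeds can converge to different valid actions under a trained score network, something a deterministic regressor cannot represent.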

Dexterous Manipulation

Dexterous manipulation refers to fine, multi-fingered manipulation tasks that exploit the full kinematic and sensory capabilities of a robotic hand — in-hand regrasping, rolling objects across fingertips, card dealing, surgical suturing, and similar tasks. Dexterity requires high-DOF end-effectors (multi-fingered hands with many actuated joints), dense tactile sensing, and policies capable of reasoning about complex contact geometry. Reinforcement learning trained in simulation (e.g., OpenAI's Dactyl) and recent diffusion-based policies have pushed the frontier, but dexterous manipulation at human-level reliability remains an open research problem.

Manipulation · Hardware · Research Frontier

E

Embodied AI

Embodied AI refers to artificial intelligence systems that perceive and act through a physical body situated in the real world, rather than operating purely on text or images in isolation. The embodiment hypothesis holds that true intelligence requires sensorimotor grounding — learning through interaction, not just pattern matching on static datasets. In practice, embodied AI research encompasses robot learning, VLA models, sim-to-real transfer, and physical foundation models. Companies like Google DeepMind (RT series), Physical Intelligence (pi0), and NVIDIA (GR00T) are the primary industrial drivers. SVRC's own data platform is built for embodied AI data workflows.

Foundation Model · Physical AI

End-Effector

The end-effector is the device at the distal end of a robot arm that directly interacts with the environment. It may be a parallel-jaw gripper, a suction cup, a multi-finger hand, a welding torch, a paint nozzle, or any task-specific tool. The end-effector's pose — its position and orientation in space — is the primary control output for most manipulation policies. The tool center point (TCP) is the reference point on the end-effector used for Cartesian control. Choosing the right end-effector is a critical deployment decision: grippers optimized for one object class (e.g., rigid boxes) may fail on soft or irregular items. Browse SVRC hardware options.

Hardware · Manipulation

Episode

An episode is a single, complete attempt at a task — from the initial state to either task success, failure, or a timeout. In reinforcement learning, the agent interacts with the environment for one episode, accumulates rewards, and then the environment is reset. In imitation learning, each recorded demonstration constitutes one episode. Episodes are the fundamental unit of robot learning datasets: a dataset of 1,000 episodes contains 1,000 task attempts with associated observations, actions, and outcomes. Episode length, reset conditions, and success criteria must be precisely defined to ensure consistent data collection.

Data · Reinforcement Learning · Imitation Learning

Extrinsics (camera)

Camera extrinsics define the position and orientation (6-DOF pose) of a camera relative to a reference frame — typically the robot base or end-effector. Together with intrinsic parameters (focal length, principal point, lens distortion), extrinsics allow projecting 3D world points onto the image plane and, conversely, lifting 2D detections into 3D space. Accurate extrinsic calibration is critical for visuomotor policies that must map visual observations to robot actions in a consistent coordinate frame. Eye-in-hand (wrist-mounted) cameras require re-calibration when the end-effector or camera is replaced.

Perception · Calibration
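Projecting a world point into pixels combines extrinsics and intrinsics. A minimal sketch ignoring lens distortion; the intrinsic values below are illustrative.

```python
import numpy as np

def project_point(p_world, T_cam_world, K):
    """Project a 3-D world point to pixel coordinates. T_cam_world is
    the 4x4 extrinsic transform (world -> camera frame); K is the 3x3
    intrinsic matrix. Lens distortion is ignored in this sketch."""
    p_cam = (T_cam_world @ np.append(p_world, 1.0))[:3]  # into camera frame
    uvw = K @ p_cam                                      # perspective projection
    return uvw[:2] / uvw[2]                              # normalize by depth

# Illustrative intrinsics: 600 px focal length, 640x480 image center.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)                                  # camera at world origin
uv = project_point(np.array([0.0, 0.0, 1.0]), T, K)
```

A point on the optical axis projects to the principal point, which is a quick sanity check after recalibrating extrinsics.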

F

Force Torque Sensor (FT Sensor)

A force-torque sensor measures the six-axis wrench (three forces Fx, Fy, Fz and three torques Tx, Ty, Tz) applied at a robot's wrist or end-effector. FT sensors are essential for contact-rich and assembly tasks where pure position control would either miss contacts or apply excessive force. They enable impedance and admittance control loops, detect slip and collision, and provide rich sensory inputs for learned policies. High-precision FT sensors from ATI and Robotiq are standard in research labs; MEMS-based low-cost sensors are increasingly viable for production deployments.

Hardware · Sensing · Control

Foundation Model (robotics)

A foundation model is a large neural network pretrained on broad, diverse data that can be adapted to many downstream tasks via fine-tuning or prompting. In robotics, foundation models are typically large vision-language models (VLMs) extended with action outputs to form VLAs, or large visuomotor policies trained on cross-embodiment datasets. Examples include RT-2 (Google DeepMind), OpenVLA, Octo, and pi0 (Physical Intelligence). Foundation models for robotics are appealing because they can leverage internet-scale pretraining, support language conditioning, and generalize across tasks without per-task retraining from scratch. See SVRC model catalog.

VLA · Pretraining · Generalization

Forward Kinematics (FK)

Forward kinematics computes the end-effector's pose in Cartesian space given the robot's joint angles (or displacements for prismatic joints). For a serial chain robot, FK is computed by multiplying a sequence of homogeneous transformation matrices (one per joint), typically derived from Denavit-Hartenberg (DH) parameters or a URDF description. FK always has a unique solution — given joint angles, there is exactly one end-effector pose — unlike the inverse problem (IK), which may have zero, one, or many solutions. FK is used in simulation, collision checking, visualization, and real-time robot state monitoring.

Kinematics · Control
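For a planar 2-link arm the FK equations fit on two lines, and the solution is unique for any joint angles, as the entry notes. Link lengths here are arbitrary illustrative values.

```python
import numpy as np

def fk_planar_2link(q1, q2, l1=0.3, l2=0.25):
    """Forward kinematics of a planar 2-link arm: joint angles map to
    exactly one end-effector (x, y)."""
    x = l1 * np.cos(q1) + l2 * np.cos(q1 + q2)
    y = l1 * np.sin(q1) + l2 * np.sin(q1 + q2)
    return np.array([x, y])

straight = fk_planar_2link(0.0, 0.0)    # arm stretched along +x
```

For a full 6-DOF arm the same idea becomes a product of 4x4 homogeneous transforms, one per joint, typically generated from a URDF.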

G

Generalization (robot policy)

Generalization measures how well a robot policy performs on objects, scenes, or tasks it has not seen during training. It is the central challenge of robot learning: a policy that memorizes training demonstrations but fails on novel instances has no practical value. Researchers distinguish object generalization (new instances of known categories), category generalization (entirely new object classes), and task generalization (new instruction phrasings or goal configurations). Improving generalization typically requires larger and more diverse training data, co-training with internet data, domain randomization in simulation, and foundation model priors.

Policy · Research Frontier

Grasp Pose

A grasp pose specifies the 6-DOF position and orientation of a robot hand or gripper relative to an object such that the gripper can close and securely hold the object. Grasp pose estimation is typically done from depth or point-cloud data using analytic methods (e.g., antipodal grasp sampling) or learned detectors such as GraspNet-1Billion, GQ-CNN, or AnyGrasp. A valid grasp pose must be reachable by the robot, collision-free during approach, and stable under the expected task loads. Grasp quality metrics include force-closure, contact stability, and task-specific wrench resistance.

Manipulation · Perception
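A simplified antipodal check, in the spirit of the analytic methods mentioned above: the axis between the two contact points must lie inside both friction cones. The friction coefficient and contact points below are illustrative.

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, friction_coef=0.5):
    """Simplified antipodal-grasp test. Normals point outward from the
    object surface; the grasp axis must oppose each normal within the
    friction cone half-angle arctan(mu)."""
    axis = (p2 - p1) / np.linalg.norm(p2 - p1)
    half_angle = np.arctan(friction_coef)
    ang1 = np.arccos(np.clip(np.dot(-axis, n1), -1.0, 1.0))
    ang2 = np.arccos(np.clip(np.dot(axis, n2), -1.0, 1.0))
    return bool(ang1 <= half_angle and ang2 <= half_angle)

# Opposite faces of a box: a textbook antipodal grasp.
good = is_antipodal(np.array([0.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0]),
                    np.array([0.1, 0.0, 0.0]), np.array([ 1.0, 0.0, 0.0]))
# Perpendicular contact normal: no force closure.
bad = is_antipodal(np.array([0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]),
                   np.array([0.1, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))
```

Learned detectors score thousands of such candidate poses per scene and then filter by reachability and collision checks.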

Gripper

A gripper is the most common class of robot end-effector, designed to grasp and hold objects. Parallel-jaw grippers are the simplest and most widely used, with two opposing fingers driven by a motor or pneumatics. Suction grippers use vacuum to pick smooth, flat surfaces. Soft grippers use compliant materials (silicone, fabric) to conform around irregular objects. Multi-fingered hands (3–5 fingers) enable dexterous manipulation but are harder to control and more expensive. Gripper selection depends critically on object geometry, surface properties, required payload, and whether in-hand reorientation is needed.

Hardware · End-Effector

H

HDF5 (Hierarchical Data Format v5)

HDF5 is a binary file format and library for storing and accessing large, structured scientific datasets efficiently. In robotics, HDF5 is a standard container for robot demonstration datasets: a single file stores synchronized camera images, joint angles, gripper states, force readings, and metadata in hierarchical groups, with chunked I/O enabling fast random access during training. The ALOHA ecosystem uses HDF5 natively, and LeRobot can ingest HDF5 recordings, converting them into its own Parquet-plus-video dataset format. The alternative Zarr format offers cloud-native chunked storage with better support for concurrent writes. SVRC's data collection pipelines output HDF5 by default.

Data · Storage · Engineering
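A sketch of writing and reading one episode with h5py, using an ALOHA-style layout; the group and dataset names are illustrative, not a fixed standard.

```python
import os
import tempfile

import h5py
import numpy as np

# Write one episode: proprioception, one camera stream, actions.
path = os.path.join(tempfile.mkdtemp(), "episode_0.hdf5")
with h5py.File(path, "w") as f:
    obs = f.create_group("observations")
    obs.create_dataset("qpos", data=np.zeros((100, 14), dtype=np.float32))
    obs.create_dataset(
        "images/top",
        data=np.zeros((100, 48, 64, 3), dtype=np.uint8),
        chunks=(1, 48, 64, 3),          # one frame per chunk
        compression="gzip",
    )
    f.create_dataset("action", data=np.zeros((100, 14), dtype=np.float32))

# Chunked storage lets a dataloader fetch a single frame without
# reading the rest of the file.
with h5py.File(path, "r") as f:
    frame = f["observations/images/top"][42]
```

Choosing chunk shapes to match the training access pattern (one frame per read) is the main knob for dataloader throughput.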

Humanoid Robot

A humanoid robot has a body structure broadly similar to a human — typically a torso, two legs, two arms, and a head — enabling it to operate in environments designed for people and to use human tools. Notable humanoids include Boston Dynamics Atlas, Agility Robotics Digit, Figure 01, and Tesla Optimus. Humanoids present extreme engineering challenges: bipedal locomotion requires real-time balance control, and coordinating 30+ DOF for loco-manipulation tasks demands whole-body control. Despite this complexity, humanoids are attracting enormous investment because their form factor generalizes across diverse workplaces without infrastructure changes.

Hardware · Locomotion · Bimanual

Human-Robot Interaction (HRI)

Human-robot interaction is an interdisciplinary field studying how people and robots communicate, collaborate, and share physical space effectively and safely. HRI research spans safety standards (ISO/TS 15066 for collaborative robots), user interface design for teleoperation, natural language instruction, legible robot motion (making robot intent readable to bystanders), and social robotics (using gaze, gesture, and speech for non-verbal communication). In industrial co-bot deployments, HRI directly determines whether workers accept and effectively use robots alongside them. Good HRI design reduces accidents, improves throughput, and lowers training burden on the human side.

Safety · Collaboration

I

Imitation Learning (IL)

Imitation learning is a family of machine learning methods that train robot policies from human demonstrations rather than from engineered reward functions. The simplest form is behavioral cloning (supervised regression on state-action pairs). More advanced variants — DAgger (iterative correction), GAIL (adversarial imitation), and IRL (recovering a reward function) — address the distributional shift and reward specification problems that plague pure BC. IL has become the dominant paradigm for teaching dexterous manipulation because reward engineering for complex manipulation is extremely difficult, whereas collecting human demonstrations is tractable at scale via teleoperation. See the full deep-dive article.

Core Concept · Policy · Data

Inverse Kinematics (IK)

Inverse kinematics solves for the joint angles that place a robot's end-effector at a desired Cartesian pose. Unlike forward kinematics, IK may have zero, one, or infinitely many solutions depending on the robot's kinematic structure and the target pose. Analytic IK solvers exist for standard 6-DOF configurations; numerical methods (Jacobian pseudo-inverse, Newton-Raphson, optimization-based) handle arbitrary geometries and redundant robots. IK is used in motion planning, teleoperation mapping (converting operator hand pose to joint commands), and any Cartesian-space controller. Libraries like KDL, IKFast, and TRAC-IK are commonly used in ROS environments.

Kinematics · Control · Planning
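The Jacobian pseudo-inverse method mentioned above, sketched for a planar 2-link arm. Link lengths, step size, and tolerances are illustrative.

```python
import numpy as np

L1, L2 = 0.3, 0.25                     # illustrative link lengths

def fk(q):
    """End-effector (x, y) of a planar 2-link arm."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Analytic Jacobian d(fk)/dq for the same arm."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def ik_pinv(target, q0, iters=200, step=0.5, tol=1e-9):
    """Numerical IK: iterate q += step * pinv(J) @ (target - fk(q))."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        q += step * np.linalg.pinv(jacobian(q)) @ err
    return q

target = fk(np.array([0.4, 0.7]))      # a known-reachable pose
q_sol = ik_pinv(target, q0=[0.1, 0.3])
```

The pseudo-inverse degrades near singularities (here, a fully straightened arm); production solvers add damping or switch to optimization-based formulations.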

Isaac Sim

NVIDIA Isaac Sim is a robotics simulation platform built on the Omniverse USD framework, providing high-fidelity physics (via PhysX 5), photo-realistic rendering (via RTX path tracing), and ROS 2 integration out of the box. It is purpose-built for generating synthetic training data, testing robot policies, and sim-to-real transfer research. Isaac Sim supports domain randomization of textures, lighting, and object poses at scale, and integrates with NVIDIA's Isaac Lab reinforcement learning framework. Its GPU-accelerated physics allows training RL policies with thousands of parallel simulation instances. Learn more at the SVRC Isaac Sim resource page.

Simulation · Synthetic Data · Tool

J

Joint Space (Configuration Space)

Joint space (also called configuration space or C-space) is the space of all possible joint angle vectors for a robot. A point in joint space uniquely specifies the robot's complete configuration. Motion planning algorithms like RRT and PRM work in joint space to find collision-free paths between configurations, since collision checking is more straightforward there than in Cartesian space. Many RL policies output joint positions or velocities directly in joint space, while imitation learning policies often operate in Cartesian space for easier human-demonstrator alignment. See the joint space article.

Kinematics · Planning

Joint Torque

Joint torque is the rotational force applied by a motor at a robot joint, measured in Newton-meters (Nm). Torque-controlled robots (as opposed to position-controlled ones) can regulate contact forces directly, enabling compliant behaviors such as yielding when pushed and precisely controlling assembly forces. Torque sensing at each joint is a key feature of collaborative robots (cobots) like the Franka Panda, Universal Robots UR series, and KUKA iiwa, enabling safe human-robot collaboration and whole-body compliant control. Learned policies that output joint torques rather than positions require careful training to avoid unstable oscillations.

Control · Hardware · Force

K

Kinematic Chain

A kinematic chain is a series of rigid body links connected by joints that together form a robot's mechanical structure. An open chain (serial robot arm) has one free end (the end-effector), making FK straightforward. A closed chain (parallel robot, hexapod) has multiple loops that provide higher stiffness and speed but require more complex kinematics. The kinematic chain determines the robot's workspace, singularities, and the Jacobian matrix used for Cartesian control. URDF files describe kinematic chains as a tree of links and joints for simulation and control software.

Kinematics · Mechanics

Kinesthetic Teaching

Kinesthetic teaching (also called lead-through teaching or direct guidance) is a method of robot programming where a human physically grasps the robot arm and moves it through the desired motion path while the robot records the trajectory. It requires the robot to be backdrivable (low joint friction and compliance) so the operator can move it with minimal effort. Kinesthetic teaching is intuitive and requires no external hardware, but it is limited to tasks the operator can physically demonstrate, and it produces only proprioceptive data (no wrist camera observations) unless cameras are co-recorded. Gravity compensation mode on torque-controlled robots like the Franka Panda makes kinesthetic teaching practical.

Data Collection · Imitation Learning

L

Language-conditioned Policy

A language-conditioned policy takes a natural language instruction (e.g., "pick up the red cup and place it on the tray") as an additional input alongside visual observations, enabling a single policy network to perform multiple tasks selected at runtime without retraining. Language conditioning is typically implemented by encoding instructions with a pretrained language model (CLIP, T5, PaLM) and fusing the resulting embedding with image features. VLA models such as RT-2, OpenVLA, and pi0 are language-conditioned by design. This approach reduces the need to train separate policies per task and supports zero-shot generalization to novel instruction phrasings.

VLA · Foundation Model · Generalization

Latent Space

A latent space is a compressed, lower-dimensional representation of data learned by a neural network — the output of an encoder that captures the most task-relevant features of an observation. In robot learning, latent spaces are used in VAEs (variational autoencoders) for learning structured representations of visual scenes, in world models for predicting future states, and in CVAE-based policies (like ACT) for encoding multimodal action distributions. A well-structured latent space places semantically similar observations close together, enabling interpolation, planning, and data augmentation in the latent domain rather than in raw pixel space.

Representation Learning · Policy

LeRobot

LeRobot is Hugging Face's open-source library for robot learning, providing standardized implementations of imitation learning algorithms (ACT, Diffusion Policy, TD-MPC), a unified dataset format, visualization tools, and pretrained model weights. It aims to lower the barrier to entry for robot learning research by providing a single cohesive framework analogous to what Transformers did for NLP. LeRobot integrates with the Hugging Face Hub for dataset and model sharing, and supports both simulated (gymnasium-robotics, MuJoCo) and physical robot environments. The companion SO-100 low-cost robot kit was released alongside it.

Tool · Open Source · Imitation Learning

LeRobot HF Dataset

The LeRobot dataset format is a standardized schema for robot demonstration data hosted on the Hugging Face Hub. Each dataset consists of Parquet files (for scalar timeseries: joint positions, actions, rewards, done flags) plus compressed MP4 video chunks for camera streams, all indexed by episode and frame. A meta/info.json file describes camera names, robot type, fps, and data statistics used for normalization. This format allows any LeRobot-compatible algorithm to load any published dataset with a single line of code, enabling rapid cross-dataset experimentation. Dozens of manipulation and mobile manipulation datasets are already published in this format.

Data · Standard · Open Source

M

Manipulation

Manipulation refers to purposeful physical interaction with objects — picking, placing, assembling, folding, inserting, pouring, and similar tasks. Robot manipulation is one of the most active research areas in embodied AI, because even simple everyday tasks (loading a dishwasher, opening a package) require rich perception, precise motor control, and robust grasp planning. Manipulation difficulty scales from simple pick-and-place with known objects in fixed setups, through contact-rich assembly, to fully dexterous in-hand reorientation with novel objects in unstructured scenes. SVRC's data services specialize in collecting manipulation demonstrations for training and evaluation.

Core Concept · Task

MoveIt

MoveIt is the most widely used open-source motion planning framework for robot arms, originally developed at Willow Garage and now maintained by PickNik Robotics. MoveIt 2 runs on ROS 2 and provides planners (OMPL, CHOMP, PILZ), Cartesian trajectory planning, collision checking against MoveIt's planning scene, kinematics plugins (KDL, IKFast, TRAC-IK), and grasp planning integration. It is the standard middleware layer between a robot learning policy (which outputs desired end-effector poses or waypoints) and the low-level joint controller that executes smooth, collision-free trajectories on the physical robot.

Tool · Planning · ROS

Multi-task Learning

Multi-task learning trains a single policy on demonstrations from multiple distinct tasks simultaneously, with the expectation that shared representations learned across tasks improve performance on each individual task and enable generalization to new tasks. In robotics, this often means training on hundreds of tasks with varied objects, goals, and environments. The key challenge is balancing the gradient contributions of different tasks (gradient interference) and ensuring the policy can distinguish between tasks at inference time — typically via language conditioning or one-hot task identifiers. Multi-task policies are a prerequisite for general-purpose robotic assistants.

Policy · Generalization · Training

N

Neural Policy

A neural policy is a robot control policy parameterized by a neural network that maps observations (images, proprioception, language) directly to actions (joint positions, Cartesian deltas, gripper commands). In contrast to classical motion planning pipelines, neural policies learn the mapping end-to-end from data without hand-engineered intermediate representations. Modern neural policies use convolutional encoders for vision, transformers for sequence modeling, and architectures like ACT, Diffusion Policy, or VLA backbones for action generation. A key property of neural policies is that they can be trained from demonstrations or reward signals, enabling them to handle tasks too complex for hand-coded controllers.

Policy · Deep Learning

Non-prehensile Manipulation

Non-prehensile manipulation refers to manipulating objects without grasping them — instead using pushing, rolling, pivoting, flipping, tilting, or other contact strategies that leverage gravity and surface friction. For example, pushing a box across a table to position it, or nudging a peg upright before grasping it. Non-prehensile strategies can move objects into graspable configurations, reposition items too large to grasp, or work in cluttered scenes where a grasp approach is infeasible. Planning non-prehensile actions requires modeling quasi-static or dynamic object mechanics and contact physics, making it an active research topic at the intersection of manipulation and motion planning.

Manipulation · Planning

O

Observation Space

The observation space defines all sensor inputs available to the robot policy at each timestep. Common modalities include RGB images from wrist or overhead cameras, depth maps from structured-light or stereo sensors, proprioceptive state (joint positions, velocities, torques), gripper state, end-effector pose, tactile readings, and task-specification inputs such as language embeddings or goal images. The observation space design profoundly affects policy performance and generalization: richer observations carry more information but increase model complexity, training time, and the risk of overfitting to irrelevant visual features.

PerceptionPolicy

Open-loop Control

Open-loop control executes a pre-planned trajectory without using sensor feedback during execution — the robot simply follows the commanded positions or velocities regardless of what actually happens. This is appropriate for highly repeatable tasks in controlled environments, such as CNC machining or pick-and-place on a fixed conveyor. Open-loop control is fast and simple but fails when disturbances occur, because no corrective action is taken. In contrast, closed-loop (feedback) control continuously compares the actual state to the desired state and applies corrective commands, making it far more robust for robot learning in variable environments.
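The difference shows up clearly in a toy 1-D tracking loop (a minimal sketch; the `simulate` function and its gains are invented for illustration):

```python
def simulate(kp, disturbance, steps=50, target=1.0, dt=0.1):
    """1-D position tracking toward `target`.

    kp > 0: closed-loop proportional feedback on the measured error.
    kp = 0: open-loop — a fixed pre-planned velocity, no correction.
    A constant disturbance acts on the system either way.
    """
    x = 0.0
    for _ in range(steps):
        if kp > 0.0:
            v = kp * (target - x)              # feedback: react to actual state
        else:
            v = (target - 0.0) / (steps * dt)  # pre-planned nominal velocity
        x += (v + disturbance) * dt            # disturbance the open loop never sees
    return x
```

With a disturbance present, the open-loop run ends far from the target while the closed-loop run converges close to it — the corrective action is the whole difference.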

Control

Open X-Embodiment

Open X-Embodiment (OXE) is a large-scale robot demonstration dataset assembled by Google DeepMind and 33 research institutions, comprising over 1 million robot episodes from 22 different robot embodiments, covering 527 distinct skills. It was created to enable co-training across embodiments — the hypothesis being that diverse robot experience teaches richer manipulation representations than single-robot datasets alone. RT-X, the model trained on OXE, demonstrated positive transfer across embodiments and improved performance on held-out tasks compared to single-embodiment baselines. OXE data is publicly available and has catalyzed a wave of cross-embodiment robotics research.


DatasetFoundation ModelMulti-embodiment

P

Payload

Payload is the maximum mass (including the weight of any end-effector and tooling) that a robot arm can carry while maintaining its rated positional accuracy and dynamic performance. Payload specifications typically range from under 1 kg for collaborative research robots (WidowX 250: 250 g) to 500+ kg for large industrial arms. Critically, rated payload is specified under particular conditions: many vendors publish payload diagrams showing how capacity varies with the load's center-of-mass offset and the arm's posture, so a conservatively rated arm can often handle more in favorable configurations. Exceeding payload limits degrades accuracy, accelerates wear, and can trigger safety faults or physical damage. SVRC's hardware catalog lists payload for each robot.

HardwareSpecs

Policy (robot)

In robot learning, a policy (denoted π) is a function that maps observations to actions: π(o) → a. The policy is the learned "brain" of the robot that determines what to do at every timestep given what it perceives. Policies can be represented as neural networks (neural policies), decision trees, Gaussian processes, or lookup tables. They can be deterministic (one action per observation) or stochastic (a distribution over actions). Policy quality is measured by task success rate across diverse conditions, not just on training demonstrations. The core challenge of robot learning is training policies that generalize reliably beyond their training distribution.
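A minimal sketch of the deterministic and stochastic flavors, using a hand-written goal-reaching policy rather than a trained network (all names here are illustrative):

```python
import random

def scripted_policy(obs):
    """Deterministic policy pi(o) -> a: move the end-effector toward the
    goal at a fixed gain. obs = (ee_xy, goal_xy), action = velocity tuple."""
    ee_xy, goal_xy = obs
    return tuple(0.5 * (g - e) for e, g in zip(ee_xy, goal_xy))

def stochastic_policy(obs, sigma=0.01, rng=random):
    """Stochastic policy: a sample from pi(a | o), here the deterministic
    action plus Gaussian exploration noise."""
    return tuple(a + rng.gauss(0.0, sigma) for a in scripted_policy(obs))
```

A learned neural policy has the same interface — observation in, action out — which is why scripted, classical, and learned controllers can be swapped behind it.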

Core ConceptDeep Learning

Policy Rollout

A policy rollout is a single episode of executing a trained policy on the robot (or in simulation) from an initial state to task completion or timeout. Rollouts are used to evaluate policy performance, collect new data for further training (as in DAgger or RL fine-tuning), and debug failure modes. The number of rollouts needed for reliable performance estimation depends on task variability — high-variance tasks may require 50+ rollouts to get a stable success rate estimate. In research, rollouts are often categorized by initial condition (in-distribution vs. out-of-distribution objects/scenes) to characterize generalization.

EvaluationPolicy

Pre-training

Pre-training is the phase of model development in which a neural network is trained on a large, diverse dataset before task-specific fine-tuning. For robotics foundation models, pre-training may occur on internet-scale vision-language data (images, video, text), cross-embodiment robot datasets (Open X-Embodiment), synthetic simulation data, or a combination. The pre-trained model learns rich general representations of objects, actions, and concepts that transfer to downstream robot tasks with far fewer demonstrations than training from scratch. Pre-training is the mechanism behind the success of VLA models such as RT-2, which benefits from both robot-data and internet-scale pre-training.

Foundation ModelTrainingTransfer Learning

Q

Q-function (Action-Value Function)

The Q-function Q(s, a) estimates the expected cumulative discounted reward an agent will receive by taking action a in state s and then following a given policy thereafter. Q-functions are central to reinforcement learning algorithms such as DQN (discrete actions) and SAC, TD3, and DDPG (continuous actions). In robot RL, learning accurate Q-functions for long-horizon manipulation tasks is challenging because rewards are sparse and the state-action space is high-dimensional. Recent work in offline RL (IQL, CQL) uses Q-functions to extract policies from fixed datasets without online interaction, bridging the gap between imitation learning and RL.
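The tabular Q-learning update that underlies these methods fits in a few lines (a textbook sketch, not a robot-scale implementation — continuous-action methods like SAC replace the table with neural networks):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    On terminal transitions the bootstrap term is dropped."""
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```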

Reinforcement LearningValue Function

Quasi-static Manipulation

Quasi-static manipulation assumes that motion is slow enough that inertial and dynamic forces are negligible — the system is effectively in static equilibrium at each instant. This simplification enables tractable contact mechanics modeling for planning pushing, sliding, pivoting, and in-hand regrasping actions. Many robot manipulation benchmarks (including most tabletop pick-and-place tasks) operate in the quasi-static regime. When tasks involve fast throws, dynamic catches, or high-speed assembly, quasi-static assumptions break down and full rigid-body dynamics with contact simulation (e.g., MuJoCo, Isaac Sim) are required.

ManipulationMechanics

R

Real-to-Sim Transfer

Real-to-sim transfer (the complement of sim-to-real) involves constructing or calibrating a simulation to match the real world as closely as possible — essentially building a digital twin of real conditions. This is used to replay real failure cases in simulation, generate additional synthetic training data matched to real sensor characteristics, and test policy updates safely before deployment. Techniques include photogrammetric scene reconstruction, physics parameter identification (system identification), and neural rendering methods (NeRF, 3D Gaussian Splatting) to match camera appearance. Accurate real-to-sim pipelines dramatically reduce the number of physical experiments needed for policy iteration.

SimulationDigital TwinData

Reach

Reach is the maximum distance from a robot arm's base to any point its end-effector can access within its workspace. For a serial arm with unrestricted joints, maximum reach is approximately the sum of its link lengths, attained with the arm fully extended in a straight line. Effective reach in a deployment is smaller — accounting for joint limits, self-collision avoidance, and the need to approach objects from multiple orientations. Reach determines which workstation layouts and object placements are feasible. When selecting robots for a task, engineers must confirm that the required workspace (including all approach directions for grasping) falls within the robot's reachable envelope at acceptable accuracy.
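For a planar 2-link arm this relationship can be checked numerically from the forward kinematics (the link lengths below are arbitrary illustrative values):

```python
import math

def fk_2link(theta1, theta2, l1=0.3, l2=0.25):
    """Forward kinematics of a planar 2-link arm: end-effector (x, y)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

def max_reach(l1=0.3, l2=0.25, n=60):
    """Sampled estimate of maximum reach over a grid of joint angles;
    the maximum occurs at theta2 = 0 (arm straight) and equals l1 + l2."""
    best = 0.0
    for i in range(n):
        for j in range(n):
            t1 = 2 * math.pi * i / n
            t2 = 2 * math.pi * j / n
            best = max(best, math.hypot(*fk_2link(t1, t2, l1, l2)))
    return best
```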

HardwareSpecsKinematics

Replay Buffer

A replay buffer (or experience replay memory) is a dataset of past (state, action, reward, next state, done) transitions collected by an RL agent during environment interaction. At each training step, random mini-batches are sampled from the buffer to train the value function or policy, breaking temporal correlations that would destabilize gradient updates. In offline RL and robot learning, the replay buffer is replaced by a fixed dataset of human demonstrations or previously collected rollouts. Prioritized experience replay weights sampling by temporal-difference error to focus training on informative transitions.
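A minimal uniform replay buffer needs only a bounded queue and a random sampler (a sketch without prioritization; class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (s, a, r, s_next, done) transitions.
    When full, the oldest transitions are discarded automatically."""

    def __init__(self, capacity, seed=0):
        self.buf = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniform random mini-batch; ignoring time order is what breaks
        the temporal correlation between consecutive steps."""
        return self.rng.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)
```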

Reinforcement LearningData

Reward Function

The reward function defines the learning objective for a reinforcement learning agent: it assigns a scalar reward signal r(s, a, s') to each (state, action, next-state) transition, telling the agent how good or bad its actions are. Reward function design is one of the hardest parts of applying RL to robotics: sparse rewards (1 on success, 0 otherwise) are clean but lead to slow learning; dense rewards (e.g., negative distance to goal) guide learning but can be gamed in unexpected ways (reward hacking). Alternatives include reward learning from demonstrations (IRL, RLHF), task-specific simulation metrics, and learned preference models. Imitation learning sidesteps the reward design problem entirely by learning directly from demonstrations.
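The sparse/dense trade-off is easy to state in code (toy reward functions for a reaching task, with positions as 3-D tuples; the tolerance is an arbitrary example value):

```python
import math

def sparse_reward(ee_pos, goal, tol=0.02):
    """1 on success (end-effector within tol of the goal), else 0.
    Unambiguous, but gives no gradient until the task is nearly solved."""
    return 1.0 if math.dist(ee_pos, goal) < tol else 0.0

def dense_reward(ee_pos, goal):
    """Negative distance to goal: informative at every step, but a policy
    can exploit its shape (e.g. hover near the goal without finishing)."""
    return -math.dist(ee_pos, goal)
```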

Reinforcement LearningCore Concept

S

Sim-to-Real Transfer

Sim-to-real transfer is the process of training a robot policy entirely or primarily in simulation and then deploying it on a physical robot, with the goal that the policy works without (or with minimal) additional real-world data. The core challenge is the reality gap — differences in physics fidelity, visual appearance, sensor noise, and unmodeled dynamics between simulation and the real world. Key mitigation techniques include domain randomization (randomizing simulation parameters during training), system identification (calibrating simulation to match real hardware), and adaptive fine-tuning on small amounts of real data. See the detailed article.
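Domain randomization, the most common mitigation, amounts to drawing a fresh simulation configuration each training episode. A sketch with made-up parameter names and ranges (real values are tuned per robot and simulator):

```python
import random

# Illustrative randomization ranges — not from any specific system.
RANDOMIZATION = {
    "friction":        (0.5, 1.5),   # multiplier on nominal surface friction
    "object_mass_kg":  (0.05, 0.5),
    "motor_strength":  (0.8, 1.2),   # multiplier on commanded torque
    "camera_shift_px": (-8, 8),      # integer pixel offset of the camera image
    "light_intensity": (0.6, 1.4),
}

def sample_sim_params(rng=random):
    """Draw one randomized simulation configuration (per episode).
    Integer ranges are sampled discretely, float ranges uniformly."""
    params = {}
    for name, (lo, hi) in RANDOMIZATION.items():
        if isinstance(lo, int) and isinstance(hi, int):
            params[name] = rng.randint(lo, hi)
        else:
            params[name] = rng.uniform(lo, hi)
    return params
```

A policy trained across many such draws cannot overfit to any single physics configuration, which is what makes it more likely to tolerate the real one.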

Transfer LearningSimulationDeployment

State Space

The state space is the complete set of configurations a robot and its environment can be in. In RL, the Markov state s encodes all information needed to predict future rewards and state transitions — ideally a complete description of the world. In practice, the agent only has access to partial observations (images, joint angles) that may not fully capture the state (e.g., occluded objects, unknown physics parameters). Designing an observation space that approximates the Markov state well while remaining computationally tractable is a key challenge in robot learning system design.

Reinforcement LearningControl

Surgical Robotics

Surgical robotics applies robot systems to medical procedures, most famously via Intuitive Surgical's da Vinci platform for minimally invasive laparoscopic surgery. Surgical robots provide motion scaling (translating large operator movements to sub-millimeter instrument motion), tremor filtration, and enhanced visualization inside the patient. Emerging research explores autonomous surgical subtasks (suturing, tissue retraction), AI-assisted guidance, and tele-surgery over low-latency 5G links. Regulatory approval (FDA 510(k) or PMA for the US) adds substantial validation burden. Surgical robotics sits at the intersection of teleoperation, HRI, and contact-rich manipulation.

MedicalTeleoperationApplication

T

Task-parameterized Learning

Task-parameterized learning encodes demonstrations relative to multiple coordinate frames or task parameters (e.g., the object's pose, a target location, an obstacle frame) rather than in a fixed world frame. When executing, the policy adapts automatically to new object and target configurations without retraining, because it has learned motion relative to task-relevant references. Task-parameterized Gaussian Mixture Models (TP-GMM) and kernelized movement primitives are classical implementations. This approach provides strong geometric generalization for structured pick-and-place tasks, though it requires task frames to be identified and tracked at runtime.

Imitation LearningGeneralizationPolicy

Teleoperation

Teleoperation is the remote control of a robot by a human operator, used both for direct task execution (surgical robots, space robotics, bomb disposal) and as the primary method for collecting high-quality imitation learning demonstrations. In robot learning, a common setup uses a leader-follower architecture: the operator moves a lightweight leader arm and the robot (follower) tracks the leader in real time. VR-based teleoperation systems (using hand tracking or controllers) are increasingly popular as they are more ergonomic and allow higher data throughput. SVRC provides professional teleoperation data collection services for enterprise robot learning teams.

Data CollectionImitation LearningHardware

Trajectory

A trajectory is a time-parameterized sequence of robot states (joint angles or Cartesian poses) that describes how the robot moves from a start configuration to a goal. Trajectories can be generated by motion planners (planning a collision-free path then time-parameterizing it for smooth execution), by teleoperation recording (capturing the operator's motion at a fixed frequency), or predicted directly by a neural policy. Trajectory smoothness and velocity continuity are important for physical robot safety — abrupt discontinuities cause mechanical stress and can trigger safety stops. Trajectory representations include splines, dynamic movement primitives (DMPs), and discrete waypoint sequences.
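One classic smooth time-parameterization is the minimum-jerk profile, which has zero velocity and acceleration at both endpoints (a sketch; the 0 to 1.57 rad move and its 2 s duration are arbitrary example values):

```python
def min_jerk(q0, q1, t, T):
    """Minimum-jerk interpolation from q0 to q1 over duration T:
    q(t) = q0 + (q1 - q0) * (10 s^3 - 15 s^4 + 6 s^5), with s = t / T.
    Velocity and acceleration vanish at both endpoints, giving the
    smooth starts and stops physical robots need."""
    s = min(max(t / T, 0.0), 1.0)
    return q0 + (q1 - q0) * (10 * s**3 - 15 * s**4 + 6 * s**5)

# Sample a 2-second joint move from 0 to 1.57 rad at 100 Hz.
trajectory = [min_jerk(0.0, 1.57, t * 0.01, 2.0) for t in range(201)]
```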

PlanningControlData

Transfer Learning

Transfer learning in robotics involves taking a model pretrained on one domain (e.g., internet vision-language data, simulation, or a different robot) and adapting it to a target task or robot with limited additional data. Fine-tuning the final layers of a pretrained backbone on robot demonstration data is the most common approach; full fine-tuning of all weights is used when sufficient robot data is available. Transfer learning is the mechanism that makes foundation models practical for robotics — the alternative of training from scratch on robot data alone would require millions of demonstrations. See also pre-training, sim-to-real transfer.

Foundation ModelTraining

U

URDF (Unified Robot Description Format)

URDF is an XML-based file format that describes a robot's kinematic and dynamic properties: links (rigid bodies with mass, inertia, and visual/collision meshes) and joints (the connections between links, with type, axis, limits, and damping parameters). URDF is the standard robot description format in ROS and is supported by all major simulation platforms (Isaac Sim, MuJoCo, Gazebo, PyBullet). It enables loading the robot's kinematics into motion planners like MoveIt, visualizing the robot in RViz, and instantiating physics simulation models. XACRO (XML macro language) is commonly used to parameterize and modularize URDF files for complex robots. OpenArm and most SVRC hardware have publicly available URDF models.
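A minimal hand-written URDF and the few lines of standard-library parsing needed to read its links and joints (real robot URDFs add inertial properties, meshes, and many more joints):

```python
import xml.etree.ElementTree as ET

# Illustrative two-link URDF: one revolute joint connecting two links.
URDF = """<?xml version="1.0"?>
<robot name="two_link">
  <link name="base_link"/>
  <link name="upper_arm"/>
  <joint name="shoulder" type="revolute">
    <parent link="base_link"/>
    <child link="upper_arm"/>
    <axis xyz="0 0 1"/>
    <limit lower="-1.57" upper="1.57" effort="10" velocity="1.0"/>
  </joint>
</robot>"""

root = ET.fromstring(URDF)
links = [l.get("name") for l in root.findall("link")]
joints = {j.get("name"): j.get("type") for j in root.findall("joint")}
```

Simulators and planners consume the same structure: the link/joint tree is exactly what a kinematics library needs to build the robot model.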

ToolStandardSimulation

V

VLA (Vision-Language-Action Model)

A Vision-Language-Action model is a neural network that jointly processes visual observations (RGB images), natural language instructions, and robot proprioception to produce action outputs. VLAs extend large vision-language models (VLMs such as PaLM-E, LLaVA, or Gemini) by adding an action head — training the model to output robot joint positions or end-effector deltas alongside its language predictions. Notable VLAs include RT-2 (tokenizes actions as text tokens and fine-tunes a VLM), OpenVLA (open-source, 7B-parameter, trained on Open X-Embodiment), and π0 (Physical Intelligence's flow-matching VLA). See the VLA and VLM article and the SVRC model catalog.

Foundation ModelLanguageCore Concept

ViperX

ViperX is a series of robot arms manufactured by Trossen Robotics, widely used in academic robot learning research due to their low cost, ROS support, and compatibility with the DYNAMIXEL servo ecosystem. The 5-DOF ViperX 300 (750 mm reach) and the 6-DOF ViperX 300 S are among the most common research arms found in imitation learning setups and are the follower arms in the original ALOHA system. ViperX arms have modest payload (~750 g) and accuracy compared to industrial robots but offer an accessible entry point for manipulation research. Browse SVRC's hardware store for availability.

HardwareResearch Robot

Visual Servoing

Visual servoing uses camera feedback in a closed-loop controller to guide a robot toward a goal defined in image space (Image-Based Visual Servoing, IBVS) or 3D space estimated from images (Position-Based Visual Servoing, PBVS). In IBVS, the controller minimizes the error between detected image features (keypoints, object bounding boxes) and their desired positions in the image plane, without explicitly computing 3D poses. Visual servoing is attractive because it directly compensates for calibration errors and camera-robot misalignment. Modern deep learning variants train neural networks to output servoing velocity commands directly from raw images, enabling robust alignment to novel objects.
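The core IBVS idea — exponential decay of image-feature error — reduces to a proportional law (a sketch that skips the interaction matrix and treats the features as directly actuated; function names and gains are illustrative):

```python
def ibvs_velocity(features, desired, lam=0.5):
    """Simplest image-based servoing law: v = -lambda * (s - s*),
    driving the image-feature error to zero. Full IBVS maps this error
    through the interaction matrix to get camera/robot velocities."""
    return [-lam * (s - sd) for s, sd in zip(features, desired)]

# Closed-loop iteration: features converge toward the desired pixels.
s = [120.0, 80.0]        # current feature (pixel) coordinates
s_star = [160.0, 100.0]  # desired feature coordinates
for _ in range(40):
    v = ibvs_velocity(s, s_star)
    s = [si + vi for si, vi in zip(s, v)]  # features follow commanded velocity
```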

ControlPerceptionClosed-loop

W

Waypoint

A waypoint is an intermediate configuration (joint angles or Cartesian pose) that a robot's trajectory must pass through on the way from start to goal. Waypoints allow programmers and planners to guide the robot's path through specific poses — for example, to avoid an obstacle, approach an object from a safe direction, or sequence through a multi-step assembly procedure. In robot learning, high-level policies sometimes output waypoints that a lower-level motion planner interpolates into smooth joint trajectories, combining the generalization benefits of learned policies with the safety guarantees of classical planning.
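Densifying a waypoint list into a path a low-level controller can track can be sketched with linear interpolation (real stacks use splines for velocity continuity; the function name is illustrative):

```python
def interpolate_waypoints(waypoints, steps_per_segment=10):
    """Linearly interpolate through a list of joint-space waypoints,
    producing a dense sequence of configurations. Each segment
    contributes steps_per_segment points; the final waypoint is appended
    so the path ends exactly at the goal."""
    path = []
    for a, b in zip(waypoints, waypoints[1:]):
        for k in range(steps_per_segment):
            t = k / steps_per_segment
            path.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    path.append(tuple(waypoints[-1]))
    return path
```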

PlanningTrajectory

Whole-body Control (WBC)

Whole-body control coordinates all joints of a legged or humanoid robot simultaneously to satisfy multiple competing objectives — maintaining balance, tracking end-effector targets, avoiding joint limits, and managing contact forces — solved as a real-time constrained optimization problem (typically a QP). WBC is essential for humanoids and legged manipulators because the base is not fixed: arm motion shifts the center of mass and must be compensated by leg and torso adjustments. WBC frameworks like Drake, Pinocchio, and OCS2 are commonly used in humanoid research. The Mobile ALOHA platform and Boston Dynamics Atlas rely on whole-body controllers for loco-manipulation. See WBC article.

ControlHumanoidLocomotion

Workspace

A robot's workspace is the set of all positions (and orientations) that the end-effector can reach given the robot's kinematic structure and joint limits. The reachable workspace is all positions the end-effector can reach in at least one orientation; the dexterous workspace is the smaller subset reachable in every orientation — the most useful region for manipulation tasks requiring arbitrary approach angles. Workspace analysis informs cell layout (how far apart robots and parts should be), robot selection (matching reach to task layout), and motion planning (identifying singularity-free paths across the workspace).

KinematicsHardwarePlanning

Z

Zarr (data format)

Zarr is an open-source format for storing n-dimensional arrays in chunked, compressed form, designed for cloud-native and parallel I/O workloads. In robotics, Zarr is used to store large robot demonstration datasets (images, joint states, actions) in a format that can be read efficiently from object storage (S3, GCS) without downloading entire files. Unlike HDF5, Zarr supports concurrent writes, making it suitable for distributed data collection pipelines. Zarr v3 standardized the format and added support for sharding (combining many small chunks into fewer large files), which improves cloud storage efficiency. Projects like LeRobot and several autonomous vehicle datasets have adopted Zarr for large-scale dataset hosting.
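The core idea — independently compressed chunks with random access — can be sketched with the standard library alone (real Zarr adds n-dimensional chunk grids, configurable codecs, and a JSON metadata layer; the function names here are illustrative):

```python
import pickle
import zlib

def chunk_and_compress(values, chunk_size):
    """Zarr-style storage sketch: split a 1-D sequence into fixed-size
    chunks and compress each one independently, so any chunk can be
    read (or rewritten) without touching the others."""
    chunks = {}
    for i in range(0, len(values), chunk_size):
        key = i // chunk_size
        chunks[key] = zlib.compress(pickle.dumps(values[i:i + chunk_size]))
    return chunks

def read_chunk(chunks, key):
    """Random access: decompress only the requested chunk."""
    return pickle.loads(zlib.decompress(chunks[key]))
```

This per-chunk independence is what lets a training job on object storage fetch one camera frame range from a multi-terabyte dataset without downloading the rest.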

DataStorageEngineering

Zero-shot Generalization

Zero-shot generalization is the ability of a trained policy to successfully perform on tasks, objects, or environments it has never explicitly seen during training, without any additional fine-tuning or demonstrations. True zero-shot transfer is a major goal of robot foundation models — a policy that generalizes zero-shot to novel household objects or new language instructions would dramatically reduce the data collection burden. Current VLA models show promising zero-shot language generalization (understanding novel phrasings of known task types) but still struggle with truly novel object categories or completely new manipulation skills. Improving zero-shot performance is the central motivation for scaling robot datasets and model sizes. See also Zero-shot Transfer article.

GeneralizationFoundation ModelResearch Frontier


Need Robot Data for Your Learning Project?

We collect high-quality, learning-ready demonstrations for imitation learning and RL — from tabletop manipulation to mobile bimanual tasks.