- Developed the web UI for displaying the captured data
- Completed the end-to-end data pipeline from device to cloud to annotation (see the sketch after this list)
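
Below is a minimal sketch of what that device-to-cloud-to-annotation flow could look like. The names (`CaptureSession`, `upload_session`, `enqueue_for_annotation`), the JSON-per-session layout, and the in-memory queue are illustrative assumptions, not the deployed implementation.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class CaptureSession:
    session_id: str
    device_id: str
    started_at: float      # Unix time when recording began
    modalities: list[str]  # e.g. ["rgb", "audio", "tactile", "hand_pose"]


def upload_session(session: CaptureSession, outbox: Path) -> Path:
    """Device side: serialize a finished session into the upload outbox."""
    outbox.mkdir(parents=True, exist_ok=True)
    path = outbox / f"{session.session_id}.json"
    path.write_text(json.dumps(asdict(session)))
    return path


def enqueue_for_annotation(uploaded: Path, queue: list[str]) -> None:
    """Cloud side: register an uploaded session so the web UI can surface
    it to annotators."""
    queue.append(uploaded.stem)


annotation_queue: list[str] = []
session = CaptureSession(
    session_id=str(uuid.uuid4()),
    device_id="headset-01",
    started_at=time.time(),
    modalities=["rgb", "audio", "tactile", "hand_pose"],
)
enqueue_for_annotation(upload_session(session, Path("outbox")), annotation_queue)
print(annotation_queue)  # -> ["<session uuid>"]
```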
The development of humanoids and general-purpose robots is constrained by limited access to large-scale data. To build models that truly generalize, we need to construct a diverse dataset of task demonstrations [1]. Just as LLMs achieved breakthroughs via internet-scale data, complex humanoid systems will require a similar scale of data.
The current industry processes for collecting training data for humanoid platforms are teleoperation, puppeting, and simulation. Teleoperation and puppeting require shipping and deploying expensive hardware to each new environment, driving up costs. Simulation avoids hardware deployment but demands the time-consuming creation of high-fidelity virtual assets, limiting the diversity and realism of the data; this holds for both human-controlled and reinforcement-learning simulation environments.
As a result, teleoperation/puppeting and simulation sit at opposite ends of the cost versus data-diversity spectrum: teleoperation and puppeting yield rich, diverse datasets but are bottlenecked by deployment complexity and expense, whereas simulations are easy to distribute yet struggle to capture the full breadth of real-world variation.
At their core, humans rely on a combination of vision, spatial awareness, audio, and tactile feedback to perform manipulation tasks. We therefore propose a head-mounted sensor rig that captures all of these modalities while a human completes their work. It will be paired with gloves that collect tactile data along with hand pose estimates.
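
One way to make the modality list concrete is a single time-synchronized record per capture tick. The field names, shapes, and sample rates below are assumptions for illustration, not the rig's actual on-device schema.

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class MultimodalFrame:
    """One capture tick with every sensor sampled against a shared clock."""
    timestamp_ns: int                   # shared capture clock across sensors
    rgb: np.ndarray                     # (H, W, 3) uint8 headset camera image
    head_pose: np.ndarray               # (6,) x, y, z, roll, pitch, yaw
    audio: np.ndarray                   # mono PCM samples for this tick
    tactile: dict[str, np.ndarray]      # glove -> (num_taxels,) pressures
    hand_pose: dict[str, np.ndarray]    # glove -> (num_joints,) joint angles


frame = MultimodalFrame(
    timestamp_ns=0,
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    head_pose=np.zeros(6),
    audio=np.zeros(1600, dtype=np.float32),  # 0.1 s at an assumed 16 kHz
    tactile={"left": np.zeros(32), "right": np.zeros(32)},
    hand_pose={"left": np.zeros(21), "right": np.zeros(21)},
)
```

Keying every sensor to one `timestamp_ns` keeps the downstream annotation step simple: anything that knows a time window can find the frames it covers.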
As participants perform their tasks (cleaning, manufacturing, folding clothes, etc.), the headset records what they see while the gloves record tactile and kinematic hand data. We further augment this data by incentivizing participants to dictate their current task. It is worth underscoring the uniqueness of collecting audio annotations and tactile sensing data at this scale: there are currently no large-scale datasets that pair tactile data with task annotations [1].
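
The dictated annotations become useful once they are aligned with the sensor streams. A minimal sketch of that alignment, assuming the speech-to-text stage emits `(start_ns, end_ns, text)` utterances against the same clock as the frames:

```python
from bisect import bisect_left, bisect_right


def label_frames(frame_times_ns: list[int],
                 utterance: tuple[int, int, str]) -> dict[int, str]:
    """Map every frame timestamp inside the utterance window to its text.

    Assumes frame_times_ns is sorted and shares the utterance's clock.
    """
    start_ns, end_ns, text = utterance
    lo = bisect_left(frame_times_ns, start_ns)
    hi = bisect_right(frame_times_ns, end_ns)
    return {t: text for t in frame_times_ns[lo:hi]}


frames = [0, 33_000_000, 66_000_000, 99_000_000]  # ~30 Hz timestamps
print(label_frames(frames, (30_000_000, 70_000_000, "folding a shirt")))
# -> {33000000: 'folding a shirt', 66000000: 'folding a shirt'}
```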