# LeRobot Episode Scoring Toolkit

A lightweight toolkit for quantitatively scoring LeRobot episodes: it evaluates and filters LeRobot episode datasets based on multiple quality dimensions. It combines classic computer-vision heuristics (blur/exposure tests, kinematic smoothness, collision spikes) with optional Gemini-powered vision-language checks to give each episode a 0–1 score across multiple quality dimensions.

Use this toolkit to:

- **Automatically score** robot demonstration episodes on visual clarity, motion smoothness, collision detection, and more
- **Filter** low-quality episodes to improve downstream training performance
- **Train and compare** baseline vs. filtered dataset models
- **Visualize** score distributions and identify problematic episodes

## ✨ Features

| Dimension | Function | What it measures |
|---|---|---|
| Visual clarity | `score_visual_clarity` | Blur, over/under-exposure, low-light frames |
| Smoothness | `score_smoothness` | 2nd derivative of joint angles |
| Path efficiency | `score_path_efficiency` | Ratio of straight-line vs. actual joint-space path |
| Collision / spikes | `score_collision` | Sudden acceleration outliers (proxy for contacts) |
| Joint stability (final 2 s) | `score_joint_stability` | Stillness at the goal pose |
| Gripper consistency | `score_gripper_consistency` | Binary "closed vs. holding" agreement |
| Actuator saturation | `score_actuator_saturation` | Difference between commanded actions and achieved states |
| Task success (VLM) | `score_task_success` (via `VLMInterface`) | Gemini grades whether the desired behavior happened |
| Runtime penalty / outliers | `score_runtime` + `build_time_stats`, `is_time_outlier` | Episode length vs. nominal / Tukey-IQR / Z-score fences |

## ⚙️ Installation

### Prerequisites

- Python 3.8 or higher
- `pip` package manager

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/RoboticsData/score_lerobot_episodes.git
   cd score_lerobot_episodes
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up API keys (optional). Only required if using VLM-based scoring with Gemini:

   ```bash
   export GOOGLE_API_KEY="your-api-key-here"
   ```

   **Note:** The free-tier rate limits of the Gemini API are fairly restrictive and might need to be upgraded depending on episode length. Check the Gemini API rate limits for more info.

## 🚀 Quick Start

Score a dataset and save results:

```bash
python score_dataset.py \
  --repo_id lerobot/aloha_static_pro_pencil \
  --output ./output/lerobot/aloha_static_pro_pencil \
  --threshold 0.5
```

This will:

1. Download and load the dataset from HuggingFace
2. Score each episode across multiple quality dimensions
3. Save scores to the output path
4. Keep only episodes with an aggregate score >= 0.5
5. Save the filtered dataset to the output directory

## 📖 Usage

### Command-line Arguments

**Required:**

- `--repo_id`: HuggingFace repository ID for the dataset (e.g., `username/dataset-name`)

**Optional:**

- `--root`: Local path to dataset root (default: downloads from HuggingFace Hub)
- `--output`: Output directory for filtered dataset (default: `None`, no filtering)
- `--threshold`: Minimum aggregate score to keep episodes (default: 0.5, range: 0.0–1.0)
- `--nominal`: Expected episode duration in seconds (used for runtime scoring)
- `--vision_type`: Vision scoring method, choices: `opencv` (default), `vlm_gemini`
- `--policy_name`: Policy type for training (default: `act`)
- `--overwrite`: Overwrite existing filtered dataset (default: `True`)
- `--overwrite_checkpoint`: Overwrite existing training checkpoints (default: `False`)
- `--train-baseline`: Train model on unfiltered dataset (default: `False`)
- `--train-filtered`: Train model on filtered dataset (default: `False`)
- `--plot`: Display score distribution plots in terminal (default: `False`)

### Examples

**1. Basic scoring (no filtering)**

```bash
python score_dataset.py --repo_id username/my-robot-dataset
```

**2. Score and filter a dataset**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --output ./output/username/my-robot-dataset \
  --threshold 0.6
```

**3. Score with VLM-based vision analysis**

```bash
export GOOGLE_API_KEY="your-key"
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --vision_type vlm_gemini \
  --output ./filtered_data
```

**4. Score, filter, and train both baseline and filtered models**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --output ./output/username/my-robot-dataset \
  --threshold 0.5 \
  --train-baseline True \
  --train-filtered True \
  --policy_name act
```

**5. Visualize distributions**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --threshold 0.7 \
  --plot True
```

**6. Use a local dataset instead of downloading**

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --root /path/to/local/dataset \
  --output ./filtered_output
```

## 📁 Output Format

### JSON Scores File

Saved to `results/{repo_id}_scores.json`:

```json
[
  {
    "episode_id": 0,
    "camera_type": "camera_0",
    "video_path": "/path/to/video.mp4",
    "aggregate_score": 0.752,
    "per_attribute_scores": {
      "visual_clarity": 0.85,
      "smoothness": 0.78,
      "collision": 0.92,
      "runtime": 0.65
    }
  },
  ...
]
```

### Console Output

Displays a formatted table showing scores for each episode:

```
Episode scores (0–1 scale)
─────────────────────────────────────────────────────────────────
Episode  Camera    visual_clarity  smoothness  collision  runtime  Aggregate  Status
0        camera_0  0.850           0.780       0.920      0.650    0.752      GOOD
1        camera_1  0.420           0.650       0.710      0.580    0.590      BAD
...
─────────────────────────────────────────────────────────────────
Average aggregate over 20 videos: 0.671
Percentage of episodes removed: 0.25, total: 5
```

### Filtered Dataset

When using `--output`, a new filtered dataset is created with only episodes scoring above the threshold, maintaining the original LeRobot dataset structure.

## 📂 Repository Structure

```
score_lerobot_episodes/
├── score_dataset.py   # Main scoring script
├── data.py            # Dataset loading and filtering utilities
├── vlm.py             # Vision-Language Model interface (Gemini)
├── train.py           # Training pipeline integration
├── evaluation.py      # Evaluation utilities
├── corrupt.py         # Data corruption tools for robustness testing
├── ui.py              # Streamlit web interface (if available)
├── requirements.txt   # Python dependencies
├── README.md          # This file
├── CONTRIBUTING.md    # Contribution guidelines
├── LICENSE            # Apache 2.0 license
├── results/           # Generated score JSON files
├── output/            # Filtered datasets
└── checkpoints/       # Training checkpoints
```

## 🤖 Training and Evaluation

The toolkit integrates with LeRobot's training pipeline to compare baseline vs. filtered dataset performance.
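Before training, it can be useful to inspect which episodes a given threshold would keep. The per-episode scores JSON described under Output Format can be post-processed directly; the sketch below is illustrative, not part of the toolkit (`split_by_threshold` and the inline sample entries are hypothetical — in practice you would `json.load()` the `results/{repo_id}_scores.json` file):

```python
def split_by_threshold(entries, threshold=0.5):
    """Partition score entries (in the JSON format above) into kept/dropped IDs."""
    kept = [e["episode_id"] for e in entries if e["aggregate_score"] >= threshold]
    dropped = [e["episode_id"] for e in entries if e["aggregate_score"] < threshold]
    return kept, dropped

# Sample entries mirroring the values in the console-output example;
# real runs would load these from results/{repo_id}_scores.json.
entries = [
    {"episode_id": 0, "camera_type": "camera_0", "aggregate_score": 0.752},
    {"episode_id": 1, "camera_type": "camera_1", "aggregate_score": 0.590},
]
kept, dropped = split_by_threshold(entries, threshold=0.6)
print("kept:", kept, "dropped:", dropped)  # kept: [0] dropped: [1]
```

This mirrors the keep-if-`aggregate_score >= --threshold` rule the CLI applies when `--output` is set.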
### Training Workflow

1. **Baseline training:** Train on the original unfiltered dataset

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --train-baseline True
   ```

2. **Filtered training:** Train on the quality-filtered dataset

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --output ./filtered_data \
     --threshold 0.6 \
     --train-filtered True
   ```

3. **Compare both:** Run both training pipelines in one command

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --output ./filtered_data \
     --train-baseline True \
     --train-filtered True
   ```

### Training Configuration

- Default policy: ACT (Action Chunking Transformer)
- Default steps: 10,000
- Batch size: 4
- Checkpoints saved to `./checkpoints/{job_name}/`
- WandB logging enabled by default

You can customize training parameters by modifying `train.py`.

## 🔧 Troubleshooting

### Common Issues

**1. `ModuleNotFoundError: No module named 'google.generativeai'`**

Solution:
- Install dependencies with `pip install -r requirements.txt`
- If using VLM scoring, ensure `google-generativeai` is installed

**2. API rate limit errors with Gemini**

Solution: The free tier has restrictive limits. Consider:
- Using `--vision_type opencv` instead
- Upgrading to a paid Gemini API tier
- Processing smaller batches

**3. All episodes filtered out**

Error: `ValueError: All episodes filtered out, decrease threshold to fix this`

Solution: Lower the `--threshold` value (e.g., from 0.5 to 0.3)

**4. Dataset not found**

Solution:
- Verify the `--repo_id` is correct
- Check internet connection for HuggingFace Hub access
- Use `--root` to specify a local dataset path

**5. Out of memory during training**

Solution: Reduce `batch_size` in `train.py:44` or use a smaller model

**6. Permission errors when overwriting**

Solution: Use `--overwrite True` or manually delete the output directory

## 🤝 Contributing

We welcome contributions!
Please see CONTRIBUTING.md for guidelines on:

- Setting up a development environment
- Code style and conventions
- Submitting pull requests
- Reporting issues

### Quick Contribution Steps

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## ⭐ Star History

## 📄 License

LeRobot Episode Scoring Toolkit is distributed under the Apache 2.0 License. See LICENSE for more information.

## 📧 Support