Upload folder using huggingface_hub

Files changed:
- README.md +65 -116
- trainer_state.json +0 -9
- training_args.bin +2 -2

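For context, a commit like this one can be produced with the `huggingface_hub` client's `upload_folder` helper. The sketch below is illustrative only: the local folder path is a placeholder, and the repo id is taken from the usage example in the README further down.

```python
# Illustrative sketch of pushing a checkpoint folder with huggingface_hub.
# The local path is a placeholder, not something recorded in this commit.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./checkpoint-6000",          # local checkpoint directory (placeholder)
    repo_id="cagataydev/gr00t-fruit-6k",      # repo id as used in the README example below
    repo_type="model",
    commit_message="Upload folder using huggingface_hub",
)
```
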
README.md CHANGED
@@ -1,144 +1,93 @@

-- **Training Steps**: 6,000 (optimal checkpoint based on evaluation analysis)
-- **Task**: Fruit manipulation and handling using robotic arm
-- **Dataset**: Wholettheducksout dataset with single-arm configuration
-- **Model Size**: ~7.58 GB total model weights
-- **Training Configuration**: Single-arm embodiment with front and wrist camera setup
-- **Action Dimension**: 32 (single arm + gripper control)
-- **Vision Input**: Dual-camera setup (640x480 resolution, 30 FPS)
-- **Action Horizon**: 16 timesteps
-- **Diffusion Steps**: 4 inference timesteps
-- **Diffusion Model**: 16 layers, 32 attention heads
-- **Hidden Size**: 1024
-- **Cross-attention Dimension**: 2048
-- **Backbone Embedding**: 2048 dimensions
-- **State/Action Encoding**: Multi-layer projections
-- **Eagle Model**: Qwen3.1-7B + SigLIP-400M hybrid architecture
-- **Visual Layers**: 27 encoder layers with self-attention
-- **Language Model**: 12 layers with RMSNorm and SwiGLU activation
-- **Flash Attention**: Enabled for efficient processing
 ```
-├── optimizer.pt                  # Optimizer state
-├── scheduler.pt                  # Learning rate scheduler
-├── rng_state.pth                 # Random number generator state
-└── experiment_cfg/
-    └── metadata.json             # Embodiment and modality configuration
-
-tensorboard_logs/
-└── Oct17_23-05-33_ip-172-31-3-77/
-    └── events.out.tfevents.*     # TensorBoard training metrics
 ```
-Based on comprehensive evaluation, the 6K checkpoint demonstrates superior generalization compared to longer-trained models (200K steps). Key findings:
-### Why 6K Steps is Optimal
-### Training Loss Evolution
-- **Initial Rapid Learning** (0-2K steps): Loss dropped from 0.778 to ~0.2
-- **Steady Refinement** (2K-5K steps): Gradual improvement to ~0.05
-- **Fine-tuning Phase** (5K-6K steps): Final optimization to 0.036
 ## Usage
 ```python
 ```
-To view the training metrics:
-tensorboard --logdir=tensorboard_logs/
-```
-- Training loss curves
-- Learning rate schedules
-- Gradient norms
-- Step-by-step training progress
-- **Superior test performance** vs 200K step model
-- **Efficient training curve** with clear convergence
-- **Stable gradient norms** throughout training
-- **Optimal stopping point** identified at 6K steps
-- **Action Space**: 6D (5D single arm + 1D gripper)
-- **Vision Modalities**: Front camera + wrist camera
-- **Control Frequency**: 30 Hz
-- **Planning Horizon**: 16 timesteps
-## Technical Specifications
-- **Model Type**: `gr00t_n1_5`
-- **Compute Type**: `bfloat16`
-- **Model Precision**: `float32`
-- **Flash Attention**: Enabled
-- **Vision Tuning**: Enabled (backbone frozen, visual layers tuned)
-- **Diffusion Inference**: 4 timesteps with noise scheduling
-## Performance Characteristics
-This 6K checkpoint represents the optimal balance between:
-- **Learning Capability**: Sufficient training to master fruit manipulation
-- **Generalization**: Avoids overfitting to specific training examples
-- **Computational Efficiency**: Minimal training time for maximum performance
-- **Deployment Readiness**: Stable, production-ready model weights
-## Citation
-If you use this model in your research, please consider citing the original GR00T paper and this fine-tuned variant.
----
-**Training Date**: October 2024
-**Optimization**: 6K steps identified as optimal through systematic evaluation

+---
+license: apache-2.0
+tags:
+- robotics
+- embodied-ai
+- fruit-manipulation
+- gr00t
+- nvidia
+- pytorch
+- fine-tuned
+datasets:
+- aaronsu11/so101_fruit
+library_name: transformers
+pipeline_tag: robotics
+base_model: nvidia/GR00T-N1.5-3B
+model_type: gr00t
+language:
+- en
+---
+# GR00T Fruit Manipulation Model
+## Model Description
+This is a GR00T model fine-tuned for fruit manipulation tasks. The model has been trained for 6,000 steps on fruit handling and manipulation scenarios.
+## Training Details
+- **Model Architecture**: GR00T-N1.5-3B
+- **Training Steps**: 6,000
+- **Training Duration**: ~2 hours
+- **Batch Size**: 32
+- **Data Configuration**: so100_dualcam
+- **Embodiment**: New embodiment configuration
+## Dataset
+This model was trained using the **so101_fruit** dataset, which contains fruit manipulation demonstrations.
+**Original Dataset Source**: [https://huggingface.co/datasets/aaronsu11/so101_fruit](https://huggingface.co/datasets/aaronsu11/so101_fruit)
+Please cite the original dataset when using this model:
 ```
+@dataset{aaronsu11_so101_fruit,
+  title={SO101 Fruit Dataset},
+  author={aaronsu11},
+  url={https://huggingface.co/datasets/aaronsu11/so101_fruit},
+  year={2024}
+}
 ```
+## Capabilities
+This model is designed for:
+- Fruit handling and manipulation tasks
+- Object grasping and placement
+- Robotic manipulation in kitchen/food preparation scenarios
 ## Usage
+Load the model using the standard GR00T inference pipeline:
 ```python
+# Example usage with GR00T inference
+from gr00t_inference import GR00TModel
+model = GR00TModel.from_pretrained("cagataydev/gr00t-fruit-6k")
+# Use for fruit manipulation tasks
 ```

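Independently of the wrapper used in the snippet above, the checkpoint files can also be fetched as plain files with `huggingface_hub` and handed to the GR00T tooling as a local path. A minimal sketch, assuming only the repo id from the example above:

```python
# Minimal sketch: download the checkpoint files locally and point the GR00T
# tooling at the resulting directory. Only the repo id is taken from this repository.
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(repo_id="cagataydev/gr00t-fruit-6k", repo_type="model")
print(checkpoint_dir)  # local directory containing config.json and the safetensors shards
```
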
+## Model Files
+The repository contains:
+- `model-00001-of-00002.safetensors` & `model-00002-of-00002.safetensors`: Model weights
+- `config.json`: Model configuration
+- `model.safetensors.index.json`: Model index
+- `trainer_state.json`: Training state information
+- `training_args.bin`: Training arguments

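The index file maps each tensor name to one of the two shards. A small sketch for inspecting that mapping, assuming the files listed above have been downloaded to the working directory:

```python
# Sketch: inspect how tensors are split across the two safetensors shards using the
# index file listed above. Assumes the files are in the current working directory.
import json
from safetensors import safe_open

with open("model.safetensors.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]            # tensor name -> shard file name
shards = sorted(set(weight_map.values()))
print(f"{len(weight_map)} tensors across {len(shards)} shards: {shards}")

# Peek at a few tensor shapes in the first shard without loading the full model.
with safe_open(shards[0], framework="pt") as shard:
    for name in list(shard.keys())[:5]:
        print(name, shard.get_tensor(name).shape)
```
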
+## Training Infrastructure
+- **Platform**: Ubuntu
+- **Compute**: Single GPU
+- **Framework**: GR00T training pipeline
+- **Checkpoints**: Saved every 2,000 steps
+## License
+Please refer to the original dataset license and GR00T model license for usage terms.
+## Acknowledgments
+Special thanks to the creators of the original SO101 Fruit dataset for providing high-quality training data for robotic manipulation research.

trainer_state.json CHANGED
@@ -4208,15 +4208,6 @@
       "learning_rate": 7.594339912486703e-12,
       "loss": 0.0155,
       "step": 6000
-    },
-    {
-      "epoch": 4.87012987012987,
-      "step": 6000,
-      "total_flos": 0.0,
-      "train_loss": 0.036353461609532435,
-      "train_runtime": 6781.7376,
-      "train_samples_per_second": 28.311,
-      "train_steps_per_second": 0.885
     }
   ],
   "logging_steps": 10,

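The removed block appears to be the end-of-training summary entry in `log_history`; its throughput numbers are consistent with the README's 6,000 steps and batch size of 32, as a quick check shows (all values copied from the diff above):

```python
# Consistency check of the removed summary entry against the README's reported
# step count (6,000) and batch size (32). Values are copied from the diff above.
steps = 6000
batch_size = 32
train_runtime_s = 6781.7376

print(round(steps / train_runtime_s, 3))               # ~0.885 steps/s  -> "train_steps_per_second"
print(round(steps * batch_size / train_runtime_s, 3))  # ~28.311 samples/s -> "train_samples_per_second"
print(round(train_runtime_s / 3600, 2))                # ~1.88 h, matching the "~2 hours" training duration
```
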
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:8e1039c945ace050bc2c045345c6c9addb640349becd4812d2abb5daf8f02feb
+size 129

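`training_args.bin` is tracked with Git LFS, so the diff above is over the three-line pointer file (spec version, object sha256, and size in bytes) rather than the binary itself; the old pointer's hash and size are truncated in this view. A small sketch, assuming the real binary and the new pointer text are both available locally under hypothetical filenames, for checking that a download matches the pointer:

```python
# Sketch (hypothetical local filenames): verify a downloaded training_args.bin against
# the Git LFS pointer shown above by re-hashing it and comparing sizes.
import hashlib
import os

with open("training_args.bin.pointer") as f:          # the 3-line pointer text (hypothetical file)
    pointer = dict(line.split(" ", 1) for line in f.read().splitlines() if " " in line)

expected_oid = pointer["oid"].removeprefix("sha256:")
expected_size = int(pointer["size"])

actual_oid = hashlib.sha256(open("training_args.bin", "rb").read()).hexdigest()
assert actual_oid == expected_oid, "sha256 mismatch"
assert os.path.getsize("training_args.bin") == expected_size, "size mismatch"
print("training_args.bin matches the LFS pointer")
```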