Capstone Project: Autonomous Humanoid VLA System
Duration: Week 13 (Final week) Weight: 30% of final grade Due Date: End of Week 13
Project Overview
Build a complete autonomous humanoid robot system that integrates voice commands, LLM-based planning, multi-modal perception, navigation, manipulation, and balance control into a unified VLA (Vision-Language-Action) pipeline.
Learning Objectives
By completing this capstone, you will demonstrate:
- Integration: Combining all modules (ROS 2, Gazebo/Isaac Sim, VSLAM, Nav2, VLA) into one system
- System Design: Architecting complex multi-component robotics systems
- Problem Solving: Debugging and optimizing end-to-end pipelines
- Documentation: Creating comprehensive technical documentation
- Presentation: Demonstrating system capabilities effectively
Project Requirements
Functional Requirements
-
Voice Interface:
- Accept natural language commands via microphone
- Transcribe speech using OpenAI Whisper
- Handle multiple languages (at least 2)
- Process commands in real-time (< 3 seconds latency)
-
LLM Planning:
- Decompose complex commands into action sequences
- Handle ambiguous references ("the cup" → resolve via vision)
- Replan on action failures
- Maintain context across multiple commands
-
Multi-Modal Perception:
- Object detection via camera
- Gesture recognition (pointing, waving)
- Fuse speech + vision + gesture for robust understanding
- Resolve ambiguous commands using multi-modal context
-
Navigation:
- VSLAM-based localization
- Nav2 path planning and execution
- Obstacle avoidance
- Goal reaching with < 0.5m accuracy
-
Manipulation:
- Pick up objects (cups, books, tools)
- Place objects at specified locations
- Handle different object sizes and shapes
- Maintain balance during manipulation
-
System Integration:
- All components communicate via ROS 2
- End-to-end pipeline: Voice → Plan → Execute → Feedback
- Error handling and recovery
- Graceful degradation (continue with reduced capabilities if component fails)
Technical Requirements
-
Code Quality:
- Clean, modular architecture
- Well-documented code (docstrings, comments)
- Follows ROS 2 conventions
- Proper error handling
-
ROS 2 Integration:
- All components as ROS 2 nodes
- Proper topic/service/action usage
- Launch files for easy execution
- Parameter configuration files
-
Simulation:
- Works in Gazebo or Isaac Sim
- Realistic physics and sensors
- Reproducible test scenarios
-
Documentation:
- Complete setup guide
- System architecture diagram
- API documentation
- Troubleshooting guide
Deliverables
1. GitHub Repository
Repository structure:
humanoid-vla-system/
├── README.md # Project overview and setup
├── docs/
│ ├── architecture.md # System architecture
│ ├── api.md # API documentation
│ ├── setup.md # Detailed setup guide
│ └── troubleshooting.md # Common issues and solutions
├── src/
│ ├── voice_commands/ # Whisper + command processing
│ ├── llm_planner/ # LLM task planning
│ ├── multimodal_fusion/ # Multi-modal integration
│ ├── action_executor/ # ROS 2 action execution
│ ├── kinematics/ # FK/IK and balance control
│ └── navigation/ # Nav2 integration
├── launch/
│ ├── complete_system.launch.py # Main launch file
│ ├── simulation.launch.py # Gazebo/Isaac Sim
│ └── hardware.launch.py # Physical robot
├── config/
│ ├── llm_prompts.yaml # LLM prompt templates
│ ├── action_mappings.yaml # Action mappings
│ └── robot_params.yaml # Robot parameters
├── worlds/
│ └── capstone_world.world # Simulation world
└── demo_video.mp4 # System demonstration
2. Demo Video (7 to 10 minutes)
Video must show:
-
System Overview (1 minute):
- Architecture diagram
- Component overview
- Setup demonstration
-
Voice Interface (1 minute):
- Microphone input
- Whisper transcription
- Command processing
-
LLM Planning (1 minute):
- Complex command input
- LLM task decomposition
- Plan visualization
-
Multi-Modal Fusion (1 minute):
- Ambiguous command ("pick up the cup")
- Vision object detection
- Gesture recognition
- Fused command resolution
-
Navigation (2 minutes):
- VSLAM localization visualization
- Nav2 path planning
- Obstacle avoidance
- Goal reaching
-
Manipulation (2 minutes):
- Pick up object
- Place object
- Balance maintenance
-
End-to-End Demo (2 minutes):
- Complete task: "Pick up the red cup and bring it to me"
- Show all components working together
- Error handling demonstration
-
Conclusion (1 minute):
- System capabilities summary
- Future improvements
- Lessons learned
Video quality requirements:
- Clear audio (narrate what's happening)
- High resolution (1080p minimum)
- Screen recordings for code/visualizations
- Real-time robot footage (simulation or hardware)
3. Technical Documentation
README.md must include:
- Project description
- Quick start guide
- System requirements
- Installation instructions
- Usage examples
- Demo video link
Architecture documentation must include:
- System architecture diagram
- Component descriptions
- Data flow diagrams
- Topic/service/action documentation
- Integration points
API documentation must include:
- All ROS 2 topics, services, actions
- Node interfaces
- Parameter descriptions
- Message formats
Evaluation Criteria
Functionality (40%)
| Criterion | Excellent (90 to 100%) | Good (75 to 89%) | Needs Work (less than 75%) |
|---|---|---|---|
| Voice Recognition | High accuracy (>90%), handles noise, multi-language | Good accuracy (>80%), basic noise handling | Low accuracy, frequent errors |
| LLM Planning | Sophisticated prompts, robust planning, handles ambiguity | Basic prompts, simple planning | Poor planning, frequent failures |
| Multi-Modal Fusion | Seamless integration, robust fusion | Basic integration, some gaps | Poor integration |
| Navigation | Accurate (less than 0.3m error), smooth paths | Good accuracy (less than 0.5m), mostly smooth | Poor accuracy, jerky motion |
| Manipulation | Successful pick/place, maintains balance | Mostly successful, occasional failures | Frequent failures |
| Integration | All components work together seamlessly | Most components integrated | Poor integration, frequent issues |
Code Quality (20%)
- Architecture: Modular, well-organized, follows ROS 2 conventions
- Documentation: Comprehensive docstrings, comments, README
- Error Handling: Robust error handling and recovery
- Testing: Unit tests or validation scripts
Documentation (20%)
- Completeness: All required documentation present
- Clarity: Easy to understand and follow
- Examples: Code examples and usage scenarios
- Troubleshooting: Common issues documented
Demo Presentation (20%)
- Video Quality: Professional, clear, well-edited
- Coverage: Shows all required features
- Explanation: Clear narration of what's happening
- Demonstration: Successfully demonstrates system capabilities
Grading Rubric
| Component | Weight | Excellent | Good | Needs Work |
|---|---|---|---|---|
| Functionality | 40% | All features working, robust | Most features working | Missing features |
| Code Quality | 20% | Clean, documented, modular | Readable, some docs | Poor structure |
| Documentation | 20% | Complete, clear, examples | Basic, mostly clear | Incomplete |
| Demo | 20% | Professional, comprehensive | Good, covers main features | Unclear, missing features |
Total: 100 points
Submission Guidelines
Submission Format
-
GitHub Repository:
- Public repository (or provide access)
- Link submitted via course LMS
- Repository must be complete at submission time
-
Demo Video:
- Upload to YouTube (unlisted) or similar platform
- Link submitted via course LMS
- Video must be accessible for grading
-
Documentation:
- All documentation in repository
/docsfolder - README.md in repository root
- PDF export optional (not required)
- All documentation in repository
Submission Checklist
- GitHub repository created and accessible
- All code committed and pushed
- README.md complete
- Architecture documentation complete
- API documentation complete
- Demo video uploaded and accessible
- Video link submitted via LMS
- Repository link submitted via LMS
- All team members listed (if group project)
Late Policy
- On time: Full credit
- 1 day late: -10%
- 2 days late: -20%
- 3 days late: -30%
- >3 days late: Not accepted (unless extenuating circumstances)
Example Capstone Scenarios
Scenario 1: "Pick Up and Deliver"
Command: "Pick up the red cup from the table and bring it to me"
Expected behavior:
- Whisper transcribes command
- LLM decomposes: ["look_at red cup", "navigate_to table", "pick_up red cup", "navigate_to human", "place red cup"]
- Vision identifies red cup on table
- Navigation moves robot to table
- Manipulation picks up cup
- Navigation moves to human
- Manipulation places cup
- System confirms completion
Scenario 2: "Multi-Object Task"
Command: "Organize the room: put books on the shelf and cups on the table"
Expected behavior:
- LLM plans: Multiple sub-tasks for books and cups
- Vision detects all books and cups
- Navigation + manipulation executes each sub-task
- System tracks progress and reports completion
Scenario 3: "Ambiguous Command Resolution"
Command: "Pick up that cup" (while pointing)
Expected behavior:
- Speech: "Pick up that cup"
- Gesture: Pointing detected
- Vision: Multiple cups detected
- Fusion: Matches pointing direction to cup in vision
- Execution: Picks up correct cup
Getting Help
Resources:
- Course Materials: All modules and chapters
- ROS 2 Documentation: docs.ros.org
- AI Book Assistant: Trained on course content
- Office Hours: TA support (see schedule)
- Peer Discussion: Course forum (if available)
Common Issues:
- See troubleshooting guides in each module
- Check GitHub Issues in component repositories
- Review ROS 2 Answers forum
Success Tips
- Start Early: Capstone requires significant integration work
- Test Incrementally: Test each component before integrating
- Use Simulation: Develop and test in simulation first
- Document as You Go: Don't wait until the end
- Create Demo Scenarios: Plan your demo video early
- Ask for Help: Use office hours and forums
Next Steps After Capstone
Possible extensions:
- Deploy to physical hardware (Unitree G1, etc.)
- Add more manipulation capabilities
- Implement learning from demonstration
- Add emotion recognition and expression
- Extend to multi-robot coordination
Career Applications:
- Robotics engineer positions
- AI/ML engineer roles
- Research opportunities
- Startup projects
Congratulations!
Completing this capstone demonstrates mastery of:
- ✅ ROS 2 robotics development
- ✅ Simulation (Gazebo, Isaac Sim)
- ✅ AI integration (LLMs, vision)
- ✅ System integration
- ✅ Technical documentation
You've built a complete autonomous humanoid robot system!
Return to: Module 4 Introduction →