Skip to main content

Capstone Project: Autonomous Humanoid VLA System

Duration: Week 13 (Final week) Weight: 30% of final grade Due Date: End of Week 13

Project Overview

Build a complete autonomous humanoid robot system that integrates voice commands, LLM-based planning, multi-modal perception, navigation, manipulation, and balance control into a unified VLA (Vision-Language-Action) pipeline.

Learning Objectives

By completing this capstone, you will demonstrate:

  • Integration: Combining all modules (ROS 2, Gazebo/Isaac Sim, VSLAM, Nav2, VLA) into one system
  • System Design: Architecting complex multi-component robotics systems
  • Problem Solving: Debugging and optimizing end-to-end pipelines
  • Documentation: Creating comprehensive technical documentation
  • Presentation: Demonstrating system capabilities effectively

Project Requirements

Functional Requirements

  1. Voice Interface:

    • Accept natural language commands via microphone
    • Transcribe speech using OpenAI Whisper
    • Handle multiple languages (at least 2)
    • Process commands in real-time (< 3 seconds latency)
  2. LLM Planning:

    • Decompose complex commands into action sequences
    • Handle ambiguous references ("the cup" → resolve via vision)
    • Replan on action failures
    • Maintain context across multiple commands
  3. Multi-Modal Perception:

    • Object detection via camera
    • Gesture recognition (pointing, waving)
    • Fuse speech + vision + gesture for robust understanding
    • Resolve ambiguous commands using multi-modal context
  4. Navigation:

    • VSLAM-based localization
    • Nav2 path planning and execution
    • Obstacle avoidance
    • Goal reaching with < 0.5m accuracy
  5. Manipulation:

    • Pick up objects (cups, books, tools)
    • Place objects at specified locations
    • Handle different object sizes and shapes
    • Maintain balance during manipulation
  6. System Integration:

    • All components communicate via ROS 2
    • End-to-end pipeline: Voice → Plan → Execute → Feedback
    • Error handling and recovery
    • Graceful degradation (continue with reduced capabilities if component fails)

Technical Requirements

  1. Code Quality:

    • Clean, modular architecture
    • Well-documented code (docstrings, comments)
    • Follows ROS 2 conventions
    • Proper error handling
  2. ROS 2 Integration:

    • All components as ROS 2 nodes
    • Proper topic/service/action usage
    • Launch files for easy execution
    • Parameter configuration files
  3. Simulation:

    • Works in Gazebo or Isaac Sim
    • Realistic physics and sensors
    • Reproducible test scenarios
  4. Documentation:

    • Complete setup guide
    • System architecture diagram
    • API documentation
    • Troubleshooting guide

Deliverables

1. GitHub Repository

Repository structure:

humanoid-vla-system/
├── README.md # Project overview and setup
├── docs/
│ ├── architecture.md # System architecture
│ ├── api.md # API documentation
│ ├── setup.md # Detailed setup guide
│ └── troubleshooting.md # Common issues and solutions
├── src/
│ ├── voice_commands/ # Whisper + command processing
│ ├── llm_planner/ # LLM task planning
│ ├── multimodal_fusion/ # Multi-modal integration
│ ├── action_executor/ # ROS 2 action execution
│ ├── kinematics/ # FK/IK and balance control
│ └── navigation/ # Nav2 integration
├── launch/
│ ├── complete_system.launch.py # Main launch file
│ ├── simulation.launch.py # Gazebo/Isaac Sim
│ └── hardware.launch.py # Physical robot
├── config/
│ ├── llm_prompts.yaml # LLM prompt templates
│ ├── action_mappings.yaml # Action mappings
│ └── robot_params.yaml # Robot parameters
├── worlds/
│ └── capstone_world.world # Simulation world
└── demo_video.mp4 # System demonstration

2. Demo Video (7 to 10 minutes)

Video must show:

  1. System Overview (1 minute):

    • Architecture diagram
    • Component overview
    • Setup demonstration
  2. Voice Interface (1 minute):

    • Microphone input
    • Whisper transcription
    • Command processing
  3. LLM Planning (1 minute):

    • Complex command input
    • LLM task decomposition
    • Plan visualization
  4. Multi-Modal Fusion (1 minute):

    • Ambiguous command ("pick up the cup")
    • Vision object detection
    • Gesture recognition
    • Fused command resolution
  5. Navigation (2 minutes):

    • VSLAM localization visualization
    • Nav2 path planning
    • Obstacle avoidance
    • Goal reaching
  6. Manipulation (2 minutes):

    • Pick up object
    • Place object
    • Balance maintenance
  7. End-to-End Demo (2 minutes):

    • Complete task: "Pick up the red cup and bring it to me"
    • Show all components working together
    • Error handling demonstration
  8. Conclusion (1 minute):

    • System capabilities summary
    • Future improvements
    • Lessons learned

Video quality requirements:

  • Clear audio (narrate what's happening)
  • High resolution (1080p minimum)
  • Screen recordings for code/visualizations
  • Real-time robot footage (simulation or hardware)

3. Technical Documentation

README.md must include:

  • Project description
  • Quick start guide
  • System requirements
  • Installation instructions
  • Usage examples
  • Demo video link

Architecture documentation must include:

  • System architecture diagram
  • Component descriptions
  • Data flow diagrams
  • Topic/service/action documentation
  • Integration points

API documentation must include:

  • All ROS 2 topics, services, actions
  • Node interfaces
  • Parameter descriptions
  • Message formats

Evaluation Criteria

Functionality (40%)

CriterionExcellent (90 to 100%)Good (75 to 89%)Needs Work (less than 75%)
Voice RecognitionHigh accuracy (>90%), handles noise, multi-languageGood accuracy (>80%), basic noise handlingLow accuracy, frequent errors
LLM PlanningSophisticated prompts, robust planning, handles ambiguityBasic prompts, simple planningPoor planning, frequent failures
Multi-Modal FusionSeamless integration, robust fusionBasic integration, some gapsPoor integration
NavigationAccurate (less than 0.3m error), smooth pathsGood accuracy (less than 0.5m), mostly smoothPoor accuracy, jerky motion
ManipulationSuccessful pick/place, maintains balanceMostly successful, occasional failuresFrequent failures
IntegrationAll components work together seamlesslyMost components integratedPoor integration, frequent issues

Code Quality (20%)

  • Architecture: Modular, well-organized, follows ROS 2 conventions
  • Documentation: Comprehensive docstrings, comments, README
  • Error Handling: Robust error handling and recovery
  • Testing: Unit tests or validation scripts

Documentation (20%)

  • Completeness: All required documentation present
  • Clarity: Easy to understand and follow
  • Examples: Code examples and usage scenarios
  • Troubleshooting: Common issues documented

Demo Presentation (20%)

  • Video Quality: Professional, clear, well-edited
  • Coverage: Shows all required features
  • Explanation: Clear narration of what's happening
  • Demonstration: Successfully demonstrates system capabilities

Grading Rubric

ComponentWeightExcellentGoodNeeds Work
Functionality40%All features working, robustMost features workingMissing features
Code Quality20%Clean, documented, modularReadable, some docsPoor structure
Documentation20%Complete, clear, examplesBasic, mostly clearIncomplete
Demo20%Professional, comprehensiveGood, covers main featuresUnclear, missing features

Total: 100 points

Submission Guidelines

Submission Format

  1. GitHub Repository:

    • Public repository (or provide access)
    • Link submitted via course LMS
    • Repository must be complete at submission time
  2. Demo Video:

    • Upload to YouTube (unlisted) or similar platform
    • Link submitted via course LMS
    • Video must be accessible for grading
  3. Documentation:

    • All documentation in repository /docs folder
    • README.md in repository root
    • PDF export optional (not required)

Submission Checklist

  • GitHub repository created and accessible
  • All code committed and pushed
  • README.md complete
  • Architecture documentation complete
  • API documentation complete
  • Demo video uploaded and accessible
  • Video link submitted via LMS
  • Repository link submitted via LMS
  • All team members listed (if group project)

Late Policy

  • On time: Full credit
  • 1 day late: -10%
  • 2 days late: -20%
  • 3 days late: -30%
  • >3 days late: Not accepted (unless extenuating circumstances)

Example Capstone Scenarios

Scenario 1: "Pick Up and Deliver"

Command: "Pick up the red cup from the table and bring it to me"

Expected behavior:

  1. Whisper transcribes command
  2. LLM decomposes: ["look_at red cup", "navigate_to table", "pick_up red cup", "navigate_to human", "place red cup"]
  3. Vision identifies red cup on table
  4. Navigation moves robot to table
  5. Manipulation picks up cup
  6. Navigation moves to human
  7. Manipulation places cup
  8. System confirms completion

Scenario 2: "Multi-Object Task"

Command: "Organize the room: put books on the shelf and cups on the table"

Expected behavior:

  1. LLM plans: Multiple sub-tasks for books and cups
  2. Vision detects all books and cups
  3. Navigation + manipulation executes each sub-task
  4. System tracks progress and reports completion

Scenario 3: "Ambiguous Command Resolution"

Command: "Pick up that cup" (while pointing)

Expected behavior:

  1. Speech: "Pick up that cup"
  2. Gesture: Pointing detected
  3. Vision: Multiple cups detected
  4. Fusion: Matches pointing direction to cup in vision
  5. Execution: Picks up correct cup

Getting Help

Resources:

  • Course Materials: All modules and chapters
  • ROS 2 Documentation: docs.ros.org
  • AI Book Assistant: Trained on course content
  • Office Hours: TA support (see schedule)
  • Peer Discussion: Course forum (if available)

Common Issues:

  • See troubleshooting guides in each module
  • Check GitHub Issues in component repositories
  • Review ROS 2 Answers forum

Success Tips

  1. Start Early: Capstone requires significant integration work
  2. Test Incrementally: Test each component before integrating
  3. Use Simulation: Develop and test in simulation first
  4. Document as You Go: Don't wait until the end
  5. Create Demo Scenarios: Plan your demo video early
  6. Ask for Help: Use office hours and forums

Next Steps After Capstone

Possible extensions:

  • Deploy to physical hardware (Unitree G1, etc.)
  • Add more manipulation capabilities
  • Implement learning from demonstration
  • Add emotion recognition and expression
  • Extend to multi-robot coordination

Career Applications:

  • Robotics engineer positions
  • AI/ML engineer roles
  • Research opportunities
  • Startup projects

Congratulations!

Completing this capstone demonstrates mastery of:

  • ✅ ROS 2 robotics development
  • ✅ Simulation (Gazebo, Isaac Sim)
  • ✅ AI integration (LLMs, vision)
  • ✅ System integration
  • ✅ Technical documentation

You've built a complete autonomous humanoid robot system!


Return to: Module 4 Introduction →