Overview & Contributions
This course project focused on solving a humanoid robotics grand challenge: our group chose the synchronization of multimodal behaviors. Our task was to create a simulated humanoid that can interact freely with a human user in VR. We built the environment in Unreal Engine 5 and designed our humanoid with MetaHuman.

My roles on this project were:
  • Unreal Engine 5 environment design
  • MetaHuman design
  • Integrating GPT-3 with our MetaHuman for speech capabilities
  • MetaHuman motion targeting in Unreal Engine 5
  • Abstracting room attributes to avoid a hard-coded solution
More details about the project are below.
Introduction
One of the biggest challenges in humanoid robot development today is conquering the Uncanny Valley: the phenomenon in which our affinity for a robot drops sharply as it becomes more and more human-like. Before tackling that problem, we must first solve the synchronization of the robot's multimodal behaviors, since coordinated speech, motion, and expression are essential for information exchange between robot and user. To address this grand challenge, we propose a solution built primarily on neural networks and tailored to different interaction scenarios. The method combines GPT-3 (the third-generation Generative Pre-trained Transformer), predefined keyword-triggered commands, and Mixamo, a library of motion-captured animations. Our final result lets the user interact with the humanoid in Unreal Engine 5, give it commands, and watch it execute them.
Methods
Our method consists of six sub-steps, each requiring a different technique or piece of software. The first step is setting up the environment and acquiring the coordinates of its items. The second is training a neural network to drive the humanoid robot's walking motion. The third is using predefined action scripts to simulate picking up a specified item. The fourth is combining walking with the pick-up action so the whole sequence plays smoothly. The fifth is reconstructing the entire action in a VR environment, where the user (a human) can clearly observe the robot's motion. The final step is integrating GPT-3, which allows the robot to interact verbally with the user.

  • The environment used for the project is AI2THOR, an open-source virtual environment for AI agents that simulates realistic 3D scenes. AI2THOR records the coordinates of each item, including the item to be picked up, and these coordinates are then imported into Unreal Engine 5 for further processing (a sketch of this export step appears after this list). The combination of these two tools yields a highly detailed and realistic virtual environment for the robot to operate in.

  • To generate movement variants for the robot's walking, a neural network is used. It is trained on data from the DeepMimic database of motion-capture reference clips and learns to produce variations in walking patterns that are both efficient and safe for the robot to perform (see the training sketch after this list).

  • To simulate picking up an item, predefined action scripts are used. These scripts dictate the movements the robot must perform to successfully grasp an object. The walking and picking actions are then blended for smooth execution, so the robot can move around the environment with ease and complete its tasks efficiently (a sketch of the script structure also follows this list).

  • The entire process is reconstructed in a virtual reality (VR) environment, giving the user a highly immersive, interactive view of the robot in action, including the ability to speak with it. To enable verbal interaction, GPT-3 is integrated into the system: the language model generates responses to user queries, allowing a seamless and natural exchange between user and robot (a minimal sketch of this hook closes the list below). Together, these technologies create a realistic simulation environment well suited to testing and training robotic systems.
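
As referenced in the first bullet, the item coordinates come from AI2THOR's Python API. Below is a minimal sketch of that export step; the scene name and output path are illustrative placeholders, not taken from our project files.

```python
from ai2thor.controller import Controller
import json

controller = Controller(scene="FloorPlan1")  # hypothetical kitchen scene
event = controller.step(action="Pass")       # no-op step, just to fetch metadata

# Collect world-space coordinates for every pickupable item in the scene.
items = {
    obj["objectId"]: obj["position"]         # {"x": ..., "y": ..., "z": ...}
    for obj in event.metadata["objects"]
    if obj["pickupable"]
}

# Dump to JSON so the coordinates can be re-imported into Unreal Engine 5.
with open("item_coordinates.json", "w") as f:
    json.dump(items, f, indent=2)

controller.stop()
```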
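For the walking network, the core idea is learning to predict the next pose frame from the current one so the model can generate new walking variants. The PyTorch sketch below illustrates that idea only; the pose dimension, file name, and architecture are assumptions for illustration, not our exact training configuration.

```python
import torch
import torch.nn as nn

POSE_DIM = 36  # assumed: joint rotations flattened into one vector per frame

class WalkNet(nn.Module):
    """Small MLP that predicts the next pose frame from the current one."""
    def __init__(self, pose_dim: int = POSE_DIM, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, pose):
        return self.net(pose)

model = WalkNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# clips: (num_clips, num_frames, POSE_DIM) tensor of reference walking frames,
# assumed to have been exported from the DeepMimic motion-capture clips.
clips = torch.load("walk_clips.pt")  # hypothetical export file

for epoch in range(100):
    pred = model(clips[:, :-1].reshape(-1, POSE_DIM))   # current frames
    target = clips[:, 1:].reshape(-1, POSE_DIM)         # next frames
    loss = loss_fn(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```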
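The predefined action scripts are essentially ordered clip sequences keyed to a command. The sketch below shows that structure with a stub humanoid interface; the clip names mimic Mixamo animation titles and the methods are illustrative, since the real project wires these up inside Unreal Engine 5.

```python
# Ordered clips for the pick-up command; the first phase is handled by the
# learned walking network, the rest by predefined Mixamo-style clips.
PICKUP_SCRIPT = ["Walking", "Picking Up", "Standing Idle"]

class Humanoid:
    """Stub interface standing in for the Unreal Engine 5 character."""
    def walk_to(self, position):
        print(f"walking to {position}")       # stub: locomotion network drives this

    def play_animation(self, clip):
        print(f"playing clip: {clip}")        # stub: animation system plays the clip

def execute_pickup(humanoid: Humanoid, item_position: dict) -> None:
    """Walk to the item, then play the predefined pick-up clips in order."""
    humanoid.walk_to(item_position)           # learned walking (step two)
    for clip in PICKUP_SCRIPT[1:]:            # scripted pick-up (step three)
        humanoid.play_animation(clip)

execute_pickup(Humanoid(), {"x": 1.2, "y": 0.0, "z": -0.5})
```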
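Finally, the verbal interaction runs through GPT-3. Below is a minimal sketch of the request/response loop using the legacy openai-python (pre-1.0) Completion API that GPT-3 shipped with; the keyword list and prompt format are illustrative assumptions.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

COMMAND_KEYWORDS = {"pick up", "walk to", "stop"}  # illustrative keywords

def respond(user_text: str) -> str:
    # If the utterance contains a predefined keyword, the matching action
    # script is triggered instead of a free-form GPT-3 reply.
    if any(kw in user_text.lower() for kw in COMMAND_KEYWORDS):
        return f"[command detected: {user_text}]"
    completion = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"User: {user_text}\nRobot:",
        max_tokens=64,
        temperature=0.7,
    )
    return completion.choices[0].text.strip()
```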
Results
Our Blueprint lets the user chat with and command our humanoid in Unreal Engine 5, with responses quick and accurate enough that interacting with the robot feels comfortable. The GPT-3 integration returns a prompt reply to user input, though response time varies with the complexity of the question. Keyword-sensitive commands allow accurate execution of predefined actions, and the robot can dynamically adjust its path to avoid obstacles. Each action is linked to a specific Mixamo-trained action network for precise execution. However, the accuracy of GPT-3's responses to commands can vary.

Demonstration of the trained DeepMimic network


Linked MetaHuman body motion and facial motion matching speech


Discussion
Overall, our approach offers a significant step forward in humanoid robot development, addressing both the synchronization of the robot's multimodal behaviors and the generation of movement variants, and providing a realistic, immersive environment for testing and training robotic systems. Further research is needed to address the limitations of our approach, particularly the variable accuracy of GPT-3's responses to user commands. Future work could also explore more sophisticated language models that produce more accurate responses to user queries.