In the continuing evolution of human-robot interaction, a central goal of research is to enable robots to understand instructions given in natural language. Most of today's real-world robotic systems are either designed to solve a particular task in a specific way (e.g., robot vacuums) or use specialized controllers that require expertise and training to operate (e.g., manufacturing robots). For robots to be useful companions in our daily lives, a layperson – without prior training – must be able to tell the robot in natural human language what they want it to do. And the robot should then do it.
Achieving this vision requires robots to "reason" about sentence structure and, more importantly, about how words and sentences correspond to objects and places in the world. The robot must "infer" what changes in the environment would satisfy the user's goal and determine which sequence of actions will achieve them. Recently, a team of Cornell researchers set out to improve human-robot interaction through natural language – and their efforts proved victorious.
The team – including Yoav Artzi, Associate Professor of Computer Science at Cornell Tech, and Valts Blukis, a fifth-year doctoral candidate in computer science, together with Chris Paxton, Dieter Fox, and Animesh Garg at NVIDIA – won first place in the ALFRED Challenge (Action Learning From Realistic Environments and Directives) at the 2021 Embodied AI Workshop at CVPR (the Conference on Computer Vision and Pattern Recognition). Teams were invited to compete on "embodied visual tasks that require the grounding of language to actions in real contexts." ALFRED, as explained in this video, offers "a new benchmark to learn a mapping from natural language instructions and egocentric vision to action sequences for household chores."
With their award-winning machine learning solution, Artzi, Blukis, and their teammates made further progress in bridging these disparate computing domains and in improving the ability of robots to interact with humans and serve as useful companions in our daily lives. Their work will be showcased at the 2021 Conference on Robot Learning, which brings together the world's leading researchers at the intersection of machine learning and robotics.
Embodied Artificial Intelligence (AI) sits at the interface of three areas of inquiry: vision, robotics, and natural language processing. At the intersection of these domains, researchers and designers face challenges such as partial observability, continuous state spaces, and irreversible actions for language-guided agents in visual environments. Existing datasets do not capture these kinds of instructions and operations, so the workshop competition aims to encourage further development of embodied vision-and-language research.
David LaRocca is a communications specialist at Cornell Ann S. Bowers College of Computing and Information Science.