Robotics and AI in 2024 and beyond — part 1

Eugenio Culurciello
5 min read · Feb 13, 2024
a robot generated by a robot (DALL-E2)

Robots, such as the ones populating sci-fi movies, are not yet part of our lives. We do not have robots that can cook, dust our surfaces, fold our clothes, do the laundry, or clean the house, nor do we have robot brains that can drive our cars in any environmental conditions. The reason is simply that we have not been able to create an artificial robot brain, or “controller”, that can deal with the multi-faceted complexity of our world.

In this article we will explore ideas for creating a robot brain and for training it.

Physical world multi-modal learning

Robots need a brain that can understand the world in its entirety, based on a complex knowledge graph. The complexity in robotics is that both an understanding of the world and a connection to goal-oriented applications are required for a robot to be effective.

For a long time, we have tried to create robot brains by patching together separate modules and algorithms, and this approach has not worked. Learning to recognize objects, to grasp them, or to move in space all have the same roots: an understanding of the three-dimensionality of space and of the relationship between space and the visual perception of objects. At the same time, planning motion or a sequence of process steps requires a basic understanding of the impact of the robot’s actions in that same space and on those same objects.

What is required is a common model that can learn from multi-modal experiences. A robot will have a set of motion capabilities (a vector of actions) and a set of sensors, for example vision, audio, and proprioception. Learning needs to correlate all these modalities with the flow of actions and changes in the environment. This means that we need a robot brain capable of evaluating a sequence of sensory inputs and providing a sequence of outputs. A unified model will be able to understand space and objects, move, and manipulate objects.
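To make the “sequence of sensory inputs in, sequence of outputs out” idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names Observation, UnifiedModel, and rollout are hypothetical, and the “policy” is a trivial running average standing in for a learned network. The only point is the interface: all modalities are fused into one vector, and the model keeps a history so each action depends on the whole sequence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    """One time step of sensing (hypothetical layout, all features as floats)."""
    vision: List[float]
    audio: List[float]
    proprioception: List[float]

    def fused(self) -> List[float]:
        # Concatenate all modalities into a single feature vector.
        return self.vision + self.audio + self.proprioception


@dataclass
class UnifiedModel:
    """Placeholder for a learned model: sensory history in, action vector out."""
    action_dim: int
    history: List[List[float]] = field(default_factory=list)

    def step(self, obs: Observation) -> List[float]:
        self.history.append(obs.fused())
        n = len(self.history)
        # Stand-in policy (an assumption, not a real controller):
        # average every feature over the history, keep the first action_dim.
        mean = [sum(column) / n for column in zip(*self.history)]
        return mean[: self.action_dim]


def rollout(model: UnifiedModel, observations: List[Observation]) -> List[List[float]]:
    """Feed a sequence of observations, collect the sequence of actions."""
    return [model.step(obs) for obs in observations]
```

A real robot brain would replace the averaging step with a trained sequence model, but the shape of the problem stays the same.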

A robot brain needs to learn a knowledge graph of the physical world. What is a knowledge graph? It is a system that learns real-world “concepts” and can inter-link them based on their semantic and physical meaning and relationships. For example, the concept “cat” in our brain is created by seeing cats, hearing cats, and interacting with cats. We can link words like Egyptian or cat or Mann because these concepts are linked together in our knowledge. We need to build the same knowledge graph into a robot brain, so that it can understand the world and all the relationships between its components.
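As a rough illustration of what “inter-linking concepts” could mean in code, here is a tiny hand-built graph in Python. This is only a sketch under strong assumptions: the class KnowledgeGraph, the relation labels, and the example links are all made up for this post, and a real robot brain would have to learn such links from experience rather than have them typed in.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Toy concept graph: nodes are concepts, edges are labeled relations."""

    def __init__(self):
        self.edges = defaultdict(set)

    def link(self, a, b, relation):
        # Store the relation in both directions so concepts recall each other.
        self.edges[a].add((relation, b))
        self.edges[b].add((relation, a))

    def neighbors(self, concept):
        # .get avoids creating empty entries for unknown concepts.
        return {other for _, other in self.edges.get(concept, ())}

    def connected(self, start, goal):
        """Breadth-first search: can we get from one concept to the other?"""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                return True
            for nxt in self.neighbors(node):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False


# Hypothetical links built from different modalities of experience.
graph = KnowledgeGraph()
graph.link("cat", "meow", "sounds_like")
graph.link("cat", "fur", "feels_like")
graph.link("cat", "Egyptian", "associated_with")
```

Here graph.connected("Egyptian", "meow") is true because both concepts pass through “cat”, which is the kind of association the paragraph above describes.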

Goals and instructions

We need robots to help us in our everyday activities and support us in tedious repetitive tasks. Folding dry clothes after a laundry session is an example: it is a repetitive activity that is nevertheless never the same. There is always a difference in what clothes to fold, what their state is, where you pick them up, and where you lay them down. The possible configurations are practically endless and require a high level of generalization from a robot brain. More on this in the following section (learning).

A robot needs to listen to an instruction and execute it. Large language models make understanding human verbal instructions possible. The robot brain then needs to break the instruction down into a plan, a series of successive steps that achieve the goal. Consider, for example, the goal “pick up the blue bottle” when the robot is standing at a corner of a room. This simple text command needs to unleash a sequence of planned motions and real-world interactions: figuring out which areas of the room are traversable, where the obstacles are, and where the target bottle is, then planning a motion towards it and an approach to grasp it.
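The motion-planning part of this example can be sketched on a toy map. The snippet below is a minimal illustration, not how a robot brain would actually do it: it assumes the room is already given as a small grid of characters (“R” for the robot, “B” for the bottle, “#” for obstacles, “.” for traversable floor, all hypothetical conventions) and uses plain breadth-first search to find a shortest path. Perception, language understanding, and grasping are left out entirely.

```python
from collections import deque


def find(grid, symbol):
    """Return (row, col) of the first cell containing symbol."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in grid")


def plan_path(grid):
    """Shortest path from 'R' to 'B' over non-obstacle cells, or None."""
    rows, cols = len(grid), len(grid[0])
    start, goal = find(grid, "R"), find(grid, "B")
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through the parents to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # the bottle is unreachable


# Hypothetical room: robot in one corner, bottle in the opposite one.
ROOM = [
    "R.#.",
    ".##.",
    "...B",
]
path = plan_path(ROOM)
```

The hard part, of course, is everything this sketch takes for granted: building the map from raw sensors and knowing which object is “the blue bottle”.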

The robot brain’s sequential planning loop is always the same: observe the environment, consult the goal of the instructions, then perform some actions. This loop repeats many times until the goal is reached.
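This observe, consult, act loop can be written down in a few lines. The sketch below assumes a deliberately trivial world, a robot on a number line that must reach a target position; LineWorld and run_episode are hypothetical names invented for this illustration, and the action choice is hard-coded rather than learned.

```python
class LineWorld:
    """Toy environment: the robot lives on an integer number line."""

    def __init__(self, position=0):
        self.position = position

    def observe(self):
        return self.position

    def act(self, delta):
        self.position += delta


def run_episode(world, goal, max_steps=100):
    """Repeat observe -> consult goal -> act until the goal is reached.

    Returns the number of actions taken, or None if max_steps ran out.
    """
    for step in range(max_steps):
        observation = world.observe()             # 1. observe the environment
        if observation == goal:                   # 2. consult the goal
            return step
        action = 1 if observation < goal else -1  # pick a step towards it
        world.act(action)                         # 3. perform the action
    return None
```

For example, run_episode(LineWorld(0), goal=5) goes around the loop until the robot stands on position 5. A real robot runs the same loop, only with far richer observations and actions.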

Planning may consist of recalling the steps of an example or of a previously completed sequence. Or it may involve mental simulation in the representation space to evaluate multiple possible action sequences.
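One simple way to picture “mental simulation” is brute-force search over short action sequences using an internal model of the world. The Python sketch below is an illustration under stated assumptions, not a proposal for the real thing: the state is a single number, the internal model is a hand-written function (a robot would have to learn it), and simulate and plan_by_simulation are hypothetical names.

```python
from itertools import product


def simulate(state, actions, model):
    """Roll the internal model forward without touching the real world."""
    for action in actions:
        state = model(state, action)
    return state


def plan_by_simulation(state, goal, model, action_set=(-1, 0, 1), horizon=3):
    """Try every action sequence of length horizon, keep the one that is
    predicted to end closest to the goal."""
    best = min(
        product(action_set, repeat=horizon),
        key=lambda seq: abs(goal - simulate(state, seq, model)),
    )
    return list(best)


def toy_model(state, action):
    # Assumed internal model: each action simply shifts the state.
    return state + action


plan = plan_by_simulation(state=0, goal=2, model=toy_model)
```

Exhaustive enumeration grows exponentially with the horizon and the number of actions, which is one reason recalling a previously completed sequence, when one exists, is so attractive.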


Learning

Imagine we have a video demonstration of a task. Learning to repeat the same task requires learning a correlation between what we observe and the sequence of actions we want to perform. Here comes the first big problem: we need this data to come from the first-person perspective of the robot. In other words, the sensing and the actions need to come from the robot’s own body. Learning would be nearly impossible if we could not correlate perception and action. This requires the robot and its learning experience to be embodied. The complication is that we cannot easily get training data for such a robot unless we use an external “oracle” controller to gather them. Think of a human piloting a character in a videogame. Therefore, we need to control the robot ourselves and create a few examples of the task. This of course is a time-consuming data collection problem, given that we cannot train a robot brain with just 100 examples, or maybe even 1000. We may need a large amount, maybe of the same order of magnitude as the number of samples used to train an LLM.
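The idea of an “oracle” producing paired observations and actions, and a robot learning to reproduce them, is often called behavior cloning. Here is a minimal sketch of it in Python, with heavy assumptions: the observation is a single number (the signed distance to a target), the oracle is a made-up proportional controller standing in for a human operator, and the “robot brain” is a one-parameter linear policy fitted by least squares. The reason this toy works with 100 samples is precisely that it has one parameter; a real robot brain does not.

```python
import random


def oracle_controller(error):
    """Stand-in for a human teleoperator (an assumption of this sketch):
    move proportionally to the distance from the target."""
    return 0.5 * error


def collect_demonstrations(n, seed=0):
    """Record (observation, action) pairs from the robot's own perspective."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        error = rng.uniform(-1.0, 1.0)  # first-person observation
        demos.append((error, oracle_controller(error)))
    return demos


def fit_linear_policy(demos):
    """Least-squares fit of action = gain * observation (closed form)."""
    numerator = sum(obs * act for obs, act in demos)
    denominator = sum(obs * obs for obs, _ in demos)
    return numerator / denominator


demos = collect_demonstrations(100)
gain = fit_linear_policy(demos)


def policy(error):
    """The cloned controller: imitates the oracle on new observations."""
    return gain * error
```

Note that every sample pairs what the robot sensed with what the robot did, which is exactly the embodied, first-person data that is so expensive to collect.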

It would be much easier if the robot could learn to imitate actions from a video, just like we do with YouTube videos. But the chicken-and-egg problem is that the robot has not learned to control its own body and does not know how to “imitate” an action.

Where is the mirror-neuron for robots? How do we learn the capability of imitation?

We need to study how to develop a curriculum to progressively learn vision, 3D space, mobility, attention to other agents such as humans, and finally the ability to imitate them.

… end of part I

about the author

I have more than 20 years of experience in neural networks in both hardware and software (a rare combination).
