A picture is worth 1000 words

Eugenio Culurciello
6 min read · Mar 30, 2023

Evolution of large language models into artificial brains through multiple sensory modalities

Surprising as it may be, our very human predictive abilities hide a path to understanding our intelligence. How do we solve so many tasks and learn from so many new experiences? How do we adapt so easily to new environments?

The recent (2022–23) success of large language models (LLMs) like ChatGPT and GPT-X tells a story. Trained to predict the next word in a sentence, these models showed superb language proficiency. The key ingredients are large amounts of data (more than you can read in a lifetime) and a simple predictive algorithm.
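To make "predict the next word" concrete, here is a toy sketch. It is not how ChatGPT works internally: it swaps the neural network for simple word-pair counts (a bigram model), and the function names and the tiny corpus are made up for illustration. The objective is the same in spirit, though: given what came before, guess what comes next.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; a real LLM sees vastly more text than this.
corpus = "the rose is red . the rose is a flower . the sky is blue"
model = train_bigram(corpus)

print(predict_next(model, "rose"))     # is
print(predict_next(model, "the"))      # rose
print(predict_next(model, "unicorn"))  # None (word never seen in training)
```

With more data, the counts become a better guide to what follows what. Large language models replace the count table with a learned network and condition on much longer context.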

When we think of language, we often forget that it is an abstract representation of reality: a set of labels we attach to concepts and ideas in the world we live in so that we can communicate. But “the rose exists before its name”, meaning that any object in the real world, like a rose, is real even without a word for it. What is a “rose” then?

A picture is worth a thousand words

A concept in our environment is just a collection of data from our sensors. For us humans, a rose is equivalent to its image, its perfume, its delicate touch, its movement in the wind. That “concept” is the essence of an object, or our qualia: our subjective ensemble of sensory information fused into one token of knowledge.

This is the future of artificial brains and the evolution of LLMs: being able to represent “concepts” from their multi-modal information.

Multiple modalities

LLMs are trained on text only. Here, instead, we propose to use:

  • visual information
  • audio
  • touch, proprioception
  • text, tags, etc.

This requires the artificial brain to be embodied, or at the very least to have a fixed set of sensors that stays consistent over its entire learning lifetime.
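As a rough sketch of what "one token of knowledge" from a fixed set of sensors could look like, the snippet below fuses per-modality feature vectors into a single concept vector. Everything here is an assumption made for illustration: the sensor names, the feature sizes, the example numbers, and the choice of plain concatenation as the fusion step. A real system would learn the fusion rather than hard-code it.

```python
import math

# Hypothetical fixed sensor set. The names, order, and sizes are assumptions;
# the point is that they stay the same over the whole learning lifetime.
SENSOR_DIMS = {"vision": 4, "audio": 3, "touch": 2, "text": 3}


def l2_normalize(vec):
    """Scale a vector to unit length so no single modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(readings):
    """Concatenate normalized per-modality features into one concept vector.

    Missing modalities are zero-filled so the layout never changes.
    """
    fused = []
    for name, dim in SENSOR_DIMS.items():
        vec = readings.get(name, [0.0] * dim)
        if len(vec) != dim:
            raise ValueError(f"{name} expects {dim} features, got {len(vec)}")
        fused.extend(l2_normalize(vec))
    return fused


# A "rose" sensed through vision, audio, and touch, with no text label yet.
rose = fuse({
    "vision": [0.9, 0.1, 0.2, 0.0],
    "audio": [0.0, 0.1, 0.0],
    "touch": [0.3, 0.7],
})

print(len(rose))  # 12
```

Because the layout is fixed, the same positions always carry the same sensor, which is one way to read the consistency requirement above. Note that the rose still gets a representation even though its text slot is empty.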