Drop your dataset — part 2

Eugenio Culurciello
6 min readDec 6, 2022

In the previous post I have given some ideas on how we can overcome the limitation of current neural network training without using supervised datasets. I want to go into more details here to provide more ideas and insights into the evolution of neural network training: neural nets v2.0 or software v3.0.


First of all let us remember that in machine learning the focus is to learn relationships in the data. Training data is in the form of one or multiple samples: {input, desired_output} and the machine learning algorithm has to learn a relationship between “input” and “desired_output”. That is to say that if you give it: “input”, the machine learning algorithm will provide you with: “desired_output”.

This is simple dataset and a point-to-point application of machine learning. A neural network model can be used as “black box” to learn the above relationship. Other machine learning models can be used also but I will not discuss those here.


Neural networks today are mostly trained using the “relationships” approach of learning, fed with supervised data from a static dataset. We have hoped that point-to-point application of our dataset would generalize and interpret a large number of new data even if it was not identical to the exact examples in the dataset.

The issues with a static dataset and current neural network training techniques are:

  • dataset generally do not evolve, are static
  • neural network training cannot use new samples because of forgetting
  • we do not have labels for online learning — who labels unseen data?

These issues limit applicability only in confined and fixed domains where a dataset can be provided and such dataset is applicable to future online data.

label “cat”

what is a label anyway?

But it may be interesting to try to first think of what is a “label” for our data. For example the input is a picture of a cat, and the label is the word “cat”.

The concept of “cat” exists even without the word and if you happened to grow up alone observing the world, you would likely know what a cat is, but maybe have no word for it. Humans use word to communicate between each other because they all have disconnected brains and need words to share knowledge.

So in a way, we are training neural networks with supervised dataset just to learn our language! So that the neural network output is something we can understand. But this is not the only way. We can use unsupervised learning techniques like clustering which will, for example, put all pictures of cats in a category, and other pictures of dogs in another, and so forth. But these categories will not have a name because we have not given it one yet. But they will exist and be useful nevertheless. After all all we need is to see one sample input from the cat category to recognize: yes, that is a picture of a cat, and all pictures in this category are pictures of a cat, so this category is “cat”!


At the beginning of neural networks, late 2000s we started learning the relationship {images, label}. In teh dataset MNIST, for example, there are pictures of digits and the labels are “0”, ”1”, ”2”, etc. Later in 2002 onward, we had ImageNet with picture of dogs and the labels were “terrrier”, “pomeranian”, etc. Then in 2014 or so we moved to image captioning, to learn more complex relationship between our human language and some data (pictures). We are thus forming a multi-modal interaction of images and words.

The large language models neural networks that are popular these days (12/2022 ChatGPT, for example) are not trained supervised on couple of {input, desired_outputs}. They are trained to predict the next word in a sentence, unsing single-modal training on text only. This self-supervised technique is much preferred to a curated dataset, but will require more data to learn — not a problem for text, as we have huge amount of human data online and digitized!

The large neural network language models are impressive because they somehow look like conversational entities that are human-like, and can report and provide useful information through dialogues-like interactions and even some reasoning!

But these models were trained on text only, so we cannot the expect to really understand the world we live in without all the rest of sensing modalities. So if you find the recent large language models impressive, imagine what they could be if they were trained on text, images, videos, audio, more all together!

multiple modalities

We want to use neural networks applied to our world, and therefore we will likely start from a human-like perception. Vision, auditory, touch, smell, etc. there will be many sensing modalities that we need to use to acquire a large knowledge of the world in a way that we can scale it and continue to adapt it to the world during the neural network “life”.

I proposed the architecture Mix-Match before in another post to be able to absorb all the world’s data into a single neural network. This network is trained on a dataset, yes, but not a curated dataset of inputs/outputs samples. Rather is fed multi-modal data that is sparsely labeled, and mostly self-labeled by co-occurrence of multi-modal data in pace or time.

I believe that the future of applicable neural networks is in the training with multi-modal data, where the relationship between different data modalities are combined into powerful embedding of concepts. This can be done by simply applying a Hebbian-like learning regularization. This regularization strengthens the connection of multiple concepts occurring closely in space or in time, as for example a figure in a paper and its caption, or a video of walking and the sound of feet hitting the ground.

Supervision happens automatically in multi-modal data trained this way because all the modality, including textual descriptions are embedded in the same space. This is a great advantage to a curated but limited dataset.


Suppose we define intelligence as the ability to solve new problem based on our acquired knowledge. Without knowledge of any sort there is no intelligence, so we will have to start from somewhere.

The question one can ask on the Mix-Match model is how do we encode data from ultiple modalities, can we use pre-trained encoders. And the answer is we should not, because by doing so we are constrained to a set of neural networks that are trained supervised. But if we fine-tune or let these encoder evolve, maybe… I think these encoders should be trained from scratch on the new multi-modal set of data, and maybe shared across multiple training systems.

The other bug question is: will Mix-Match experience final concepts embeddings drift when trained? For example if we embed the knowledge of “cat” via image, videos, text, will a new set of documents and data move this embedding so much that it will be “forgotten”? I do not not yet have an answer to this question, but I see that data trained with contrastive embeddings in a similar way is more robust to learning shifts than training on a standard supervised dataset (see here for example).

If we think that new large natural language models with neural network provide some kind of intelligent behavior (for example solving simple math problems), I think you can understand that Mix-Match models will be able to do much more of that, and also reason about what they see and hear, thus providing a real robot brain!

about the author

I have more than 20 years of experience in neural networks in both hardware and software (a rare combination). About me: Medium, webpage, Scholar, LinkedIn.

If you found this article useful, please consider a donation to support more tutorials and blogs. Any contribution can make a difference!