Video: Teaching AI to learn language like a human child to advance understanding of human development

Credit: NYU
AI systems, such as GPT-4, can now learn and use human language, but they learn from astronomical amounts of language input, far more than children receive when learning how to understand and speak a language. The best AI systems train on text containing trillions of words, whereas children hear just millions per year.

Due to this enormous data gap, researchers have been skeptical that recent AI advances can tell us much about human learning and development. An ideal test for demonstrating a connection would involve training an AI model, not on massive data from the web, but on only the input that a single child receives. What would the model be able to learn then?

A team of New York University researchers ran this exact experiment. They trained a multimodal AI system through the eyes and ears of a single child, using headcam video recordings collected from when the child was six months old through their second birthday. They examined whether the AI model could learn the words and concepts present in a child's everyday experience.

Their findings, reported in the latest issue of the journal Science, showed that the model, or neural network, could in fact learn a substantial number of words and concepts using limited slices of what the child experienced. The video captured only about 1% of the child's waking hours, but that was sufficient for genuine language learning.

"We show, for the first time, that a neural network trained on this developmentally realistic input from a single child can learn to link words to their visual counterparts," says Wai Keen Vong, a research scientist at NYU's Center for Data Science and the paper's first author. "Our results demonstrate how recent algorithmic advances paired with one child's naturalistic experience have the potential to reshape our understanding of early language and concept acquisition."

"By using AI models to study the real language-learning problem faced by children, we can address classic debates about what ingredients children need to learn words: whether they need language-specific biases, innate knowledge, or just associative learning to get going," adds Brenden Lake, an assistant professor in NYU's Center for Data Science and Department of Psychology and the paper's senior author. "It seems we can get more with just learning than commonly thought."

Vong, Lake, and their NYU colleagues Wentao Wang and Emin Orhan analyzed a child's learning process captured on first-person video, recorded via a light, head-mounted camera on a weekly basis from six months through 25 months, yielding more than 60 hours of footage. The footage contained approximately a quarter of a million word instances (i.e., the number of words communicated, many of them repeated), linked with video frames of what the child saw when those words were spoken, and spanned a wide range of activities across development, including mealtimes, reading books, and play.

NYU researchers analyzed a child's learning process captured on first-person video, recorded via a light, head-mounted camera (similar to the one shown here), and trained a multimodal AI system through the eyes and ears of a single child. Video by Jonathan King/NYU's Office of Public Affairs.

The NYU researchers then trained a multimodal neural network with two separate modules: one that takes in single video frames (the vision encoder) and another that takes in the transcribed child-directed speech (the language encoder). The two encoders were combined and trained using an algorithm called contrastive learning, which aims to learn useful input features and their cross-modal associations. For instance, when a parent says something in view of the child, some of the words used likely refer to something the child can see, so comprehension emerges by linking visual and linguistic cues.
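To make the idea concrete, here is a minimal numpy sketch of the contrastive objective described above: co-occurring frame/utterance pairs are pushed together in a shared embedding space, while non-co-occurring pairs are pushed apart. This is an illustration only, not the authors' actual model; the real system uses learned vision and language encoders, and the function name and `temperature` value here are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(frame_emb, word_emb, temperature=0.07):
    """Contrastive loss over a batch of co-occurring frame/utterance pairs.

    frame_emb, word_emb: arrays of shape (batch, dim), where row i of each
    array comes from the same moment in the recordings (a matched pair).
    """
    # L2-normalize so the dot product is cosine similarity
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    # Pairwise similarity between every frame and every utterance
    logits = f @ w.T / temperature
    # Matched pairs sit on the diagonal; treat each row as a classification
    # problem whose correct answer is its own co-occurring utterance
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(f))
    return -log_probs[idx, idx].mean()

# Toy usage: 8 co-occurring frame/utterance pairs in a 16-d embedding space
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
words = frames + 0.1 * rng.normal(size=(8, 16))  # roughly aligned pairs
print(contrastive_loss(frames, words))
```

Minimizing this loss is what gradually pulls a word's embedding toward the embeddings of the scenes it tends to co-occur with.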

"This provides the model a clue as to which words should be associated with which objects," explains Vong. "Combining these cues is what enables contrastive learning to gradually determine which words belong with which visuals and to capture the learning of a child's first words."

After training the model, the researchers tested it using the same kinds of evaluations used to measure word learning in infants: presenting the model with a target word and an array of four image options and asking it to select the image that matches the word. The results showed that the model learned a substantial number of the words and concepts present in the child's everyday experience. Furthermore, for some of the learned words, the model could generalize to visual instances very different from those seen during training, reflecting an aspect of generalization also seen when children are tested in the lab.
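The four-alternative evaluation described above can be sketched in a few lines: embed the target word and the four candidate images, then pick the image most similar to the word. Chance performance is 25%. Again, this is a hedged illustration with assumed names, not the paper's evaluation code.

```python
import numpy as np

def four_afc_trial(word_emb, image_embs, target_idx):
    """One four-alternative forced-choice trial.

    word_emb: (dim,) embedding of the target word.
    image_embs: (4, dim) embeddings of the candidate images.
    Returns True if the model picks the target image.
    """
    # Normalize so scores are cosine similarities
    w = word_emb / np.linalg.norm(word_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    choice = int(np.argmax(imgs @ w))  # most word-like image wins
    return choice == target_idx

# Toy usage: the word embedding matches candidate image 2
rng = np.random.default_rng(1)
imgs = rng.normal(size=(4, 32))
print(four_afc_trial(imgs[2].copy(), imgs, 2))
```

Averaging this outcome over many trials and words gives the accuracy scores that the researchers compared against the 25% chance baseline.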

"These findings suggest that this aspect of word learning is feasible from the kind of naturalistic data that children receive while using relatively generic learning mechanisms such as those found in neural networks," observes Lake.

This is an excerpt. Read the full article here
