Natural Language Processing in 2025
I have been wondering a bit why there is so much collective hype over the latest, biggest Large Language Models (LLMs). Before we get deep into this topic, I wanted to go over the last few decades of AI research in Natural Language Processing (NLP) to lay the groundwork for how we got here and to share my own thoughts on why I think we are going in the wrong direction for meaningful breakthroughs in the space.
An Abridged History of Language Models
Natural Language Processing is the branch of computer science that studies the processing of ordinary human languages, the languages we speak, sign, and write. Alan Turing wrote about the topic in 1950 in Computing Machinery and Intelligence, and some of the first work on NLP in that decade was in the domain of machine translation. At the time, researchers were translating Russian to English and thought machine translation would be solved within the decade. Turns out it was a harder problem than that. 😂
60s
The 60s are when we got ELIZA, the first chatbot (called a "chatterbot" at the time). ELIZA used substitution methods to reply in a natural-sounding way. There were other notable projects in this decade, like SHRDLU, a natural language system that worked within a constrained world (called a blocks world). In both cases, the vocabulary was restricted due to the memory limitations of the time. Despite this, ELIZA mimicked language well enough that people grew emotionally attached to it (a phenomenon now called the ELIZA effect).
70s
In the 70s, research into ontologies and formalizing them within computer systems commenced. Ontologies (in computer science/information theory) are formal representations of knowledge that encompass the definitions, properties, and relationships between concepts, data, and other entities. This is where we started to delve more into the formal deconstruction of natural language and how other concepts of communication and mind relate to the creation and understanding of language. By formalizing the relationships between these concepts, better computer programs could be written that encoded them. One such chatterbot written in this period was PARRY, a bot built to simulate a person with paranoid schizophrenia. PARRY and ELIZA had a chat over ARPANET in 1972 at the ICCC. ARPANET was one of the first packet-switched computer networks, predating the internet as we know it today.
80s
In the 80s, research exploded in different directions trying to solve different problems within the space. For instance, there was research into generative grammar and rule-based parsing, exploring the idea that the human subconscious has some innate grammatical knowledge and trying to determine which parts are actually innate versus learned. One of the subfields here is semantics, which studies how language composes meaning and led to algorithms like word sense disambiguation, where computer programs figure out the meaning of an ambiguous word based upon its context. Other research built upon the formalized structures, giving rise to better text summarization algorithms. Also very notably, this is where the importance of quantitative evaluation started to arise, such as a 1986 paper discussing the state of NLP system evaluations.
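To make word sense disambiguation concrete, here is a minimal sketch using NLTK's implementation of the Lesk algorithm (which itself dates to 1986). It assumes NLTK is installed and its WordNet data has been downloaded, and the example sentences are my own; the simple dictionary-overlap heuristic does not always pick the sense a human would.

```python
# A minimal word sense disambiguation sketch using NLTK's Lesk algorithm.
# Assumes: pip install nltk, then nltk.download("wordnet") has been run.
from nltk.wsd import lesk

# "bank" is ambiguous: a financial institution vs. the side of a river.
sentence_1 = "I deposited my paycheck at the bank this morning".split()
sentence_2 = "We sat on the bank and watched the river flow past".split()

sense_1 = lesk(sentence_1, "bank")
sense_2 = lesk(sentence_2, "bank")

# Each result is a WordNet synset; its definition shows the chosen sense.
# Lesk's overlap heuristic is crude, so the picks can be surprising.
print(sense_1, "->", sense_1.definition())
print(sense_2, "->", sense_2.definition())
```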
90s
The 90s are when we started to use statistical systems for NLP, which formed the baseline for the current Large Language Model approach. Everything before this was based on hand-written rules, but with the increased compute power of the late 80s and access to large unstructured language sets (text corpora), researchers could let models infer grammars and other natural language components from data rather than hand-defining them. In the early 90s, for instance, machine translation with statistical methods became possible because Canadian and EU multilingual laws and governmental proceedings were available as parallel text. Without the right data set, statistical methods cannot be realized.
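To illustrate what "statistical" means here, below is a toy bigram language model in Python. The three-sentence corpus is a stand-in of my own; real systems of the era estimated these probabilities from millions of sentences (such as the Canadian Hansard for statistical machine translation).

```python
# A toy illustration of the 1990s-style statistical approach: instead of
# hand-written rules, estimate bigram probabilities directly from a corpus.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """P(curr | prev) by maximum likelihood estimation over the corpus."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

# The model "infers" grammar-like regularities purely from counts.
print(bigram_prob("sat", "on"))   # 1.0 in this tiny corpus
print(bigram_prob("the", "cat"))  # 2/6 ≈ 0.33
```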
2000s
In the early 2000s, while the web was growing in popularity, access to huge amounts of raw written language became available, leading to supervised and semi-supervised learning algorithms. Multilayer perceptrons started becoming popular. While backpropagation was invented in the 70s, it took the rise of this widely available unstructured data before these learning algorithms could outperform other methods. This is where the "more data and compute makes a better NLP system" theory started to emerge and give results.
2010s
In the 2010s, we see recurrent neural networks (RNNs) being applied. These have a hidden state that serves as the memory of the system, as opposed to deep learning methods that stack many hidden layers, and their connections are recurrent rather than the purely feedforward, fully connected layers of multilayer perceptrons. Neural language modeling work in this era also led to Word2Vec, enabling a high-dimensional vector representation of natural text. Since that vector represents some sort of "meaning", vectors close in this space have similar "meanings" (vector databases can use this to power semantic search algorithms). Deep learning approaches over text corpora proved to be really powerful, leading to an increase in research showing state-of-the-art results on different NLP tasks.
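As a rough illustration of why "close in the vector space" means "similar in meaning", here is a small Python sketch of cosine similarity over made-up 3-dimensional vectors; real Word2Vec embeddings are learned from a corpus and typically have 100-300 dimensions.

```python
# A minimal sketch of why nearby vectors imply related "meanings".
# The 3-dimensional vectors below are invented for illustration only.
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.80, 0.1]),
    "queen": np.array([0.9, 0.75, 0.2]),
    "apple": np.array([0.1, 0.20, 0.9]),
}

def cosine_similarity(a, b):
    """1.0 means same direction (similar meaning), near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" and "queen" point in nearly the same direction; "apple" does not.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

Semantic search is essentially this comparison run between a query vector and many stored document vectors, returning the nearest ones.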
As an aside, I worked in an NLP research lab in the 2010s for one of my first internships, where we took unstructured text, labeled it for parts of speech and n-grams, applied word sense disambiguation algorithms, and then fed the 600 or so "features" through classifiers to enable stylometry on the written works. My specific project was taking the features that identified a subject's writing style and trying to hide the writing from the classifiers. It was very hard to do at the time, especially with my still-developing computer science skills...
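For readers unfamiliar with stylometry, here is a hedged, present-day sketch of the general idea rather than that lab's actual pipeline: extract n-gram features from text and train a classifier to recognize who wrote it. The texts, author labels, and choice of scikit-learn are all illustrative.

```python
# A sketch of authorship attribution (stylometry) with n-gram features.
# Assumes: pip install scikit-learn. Texts and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I reckon the weather shall turn soon, as it always does.",
    "The weather, I reckon, shall turn; it always does.",
    "Data pipelines break at 3am. Always at 3am. Never at noon.",
    "The pipeline broke again at 3am, never at noon.",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

# Character n-grams capture style (punctuation habits, word endings)
# rather than topic, which is what stylometry is after.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(texts, authors)

print(model.predict(["It shall turn soon, I reckon."]))  # likely author_a
```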
If you have been studying the space more recently, you are certainly aware of the 2017 paper, Attention Is All You Need. RNNs (or convolutional neural networks, in the case of non-sequential data like images) were being used in an encoder/decoder configuration. From this, the transformer was developed, introducing an attention mechanism into these systems. The paper used multi-head self-attention over the inputs in both the encoder and the decoder. This led to drastically decreased training costs and an ability to scale training, giving us the start of the LLM trend. After this paper, we started throwing as much data as could be found on the internet at these types of algorithms to produce the models we see today.
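The mechanism at the core of that paper can be stated compactly as softmax(QK^T / sqrt(d_k))V. Below is a minimal numpy sketch of scaled dot-product self-attention with toy shapes; real transformers add learned projection matrices, multiple heads, masking, and feed-forward layers.

```python
# A minimal numpy sketch of scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how much each query "attends" to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V               # weighted sum of value vectors

# 3 tokens, each represented by a 4-dimensional vector (toy values).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))

# Self-attention: queries, keys, and values all come from the same input.
output = scaled_dot_product_attention(X, X, X)
print(output.shape)  # (3, 4): one updated representation per token
```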
History Leads to Today (2020s)
You can also see that over the last 70 years, we have already encountered many of the phenomena we think only exist in today's world:
- Human attachment to and anthropomorphization of computers based on their ability to mimic language
- Reinforcement of our own thinking patterns through mirroring, which we now call AI psychosis
- Computers communicating with each other in natural language as opposed to computer symbols
- Applications of deep learning algorithms to natural language
- Huge text corpora leading to better results
The history of NLP started more than 70 years ago. Of course, it makes sense that some of our earliest applications of computers were to understanding and mimicking our human languages. This is the primary way we communicate, after all, and computers have become more and more central over that timeframe.
Language is Important to AI
"Language stands as humanity’s most potent tool, enabling the expression of thoughts and emotions and facilitating communication with others." - Zichong Wang, et al in History, Development, and Principles of Large Language Models -- An Introductory Survey
Language is our primary means of communication, especially today, when we have the sum of books written throughout time, podcasts, videos, websites with written content, and of course person-to-person speaking and listening. These provide us all with a level of knowledge access that we could not have fathomed 1000+ years ago. Giving this collective data to deep learning algorithms so they can infer parts of how we create language and mimic it back has proven incredibly powerful in generalized NLP tasks like summarization, translation, and generation of text. All that said...
Language is something that is innately human. While we have passed language to other species, like some other great apes who have learned sign language, LLMs are more like parrots in their usage of language. According to some research, parrots can understand the words they are using in context, but that does not mean they are capable of human-level intelligence. Language understanding and generation is important to human-level AI, but that alone does not make a machine that can work with language capable of human-level intelligence; it sure can become good at convincing us it is, though.
The Wall for LLMs
While the tools we currently have are incredibly powerful, I think we are starting to hit a wall with generative AI models. There is not much more language or human data left to pull from to make bigger, better models beyond what has already been processed. While we do have synthetic language (text corpora generated via existing LLMs or other methods), I think we need to go back to some of the basics: understanding human language more deeply through grammars, syntax, and the relationships between language and how we operate in the world, in order to make meaningfully better progress toward generalizable AI models. At the end of the day, we humans do not rely solely on language, even if it is our primary means of communication. Despite language being so central to our world and our understanding of it, vision is our most relied-upon sense and the one that takes in the most information in terms of raw bytes. And our emotions are what most guide the actions we take. Large language models have neither of these, and so will never operate like humans, given how much our sight and emotions shape our understanding, our actions, and even our ways of communicating. These are fundamental limitations in building "AGI" with bigger language models by themselves.
I am not alone in this thinking:
"One consequence of the age of scaling is that scaling sucked out all the air in the room. Because scaling sucked out all the air in the room, everyone started to do the same thing." - Ilya Sutskever on Dwarkesh Podcast.
Scaling is the approach that led us to the Large Language Models.
"If you are interested in human-level AI, don't work on LLMs" - Yann LeCun
Both Ilya Sutskever and Yann LeCun have left (or are leaving) their companies, OpenAI and Meta respectively, to start their own new ventures. Notably, both are scientists/researchers who have seen the inner workings of the companies pushing the "scale is everything" narrative. Don't take my word for it; listen to the experts.
Conclusion
The path of NLP research has led to the scaling of language models using the accrued written texts of all time. The explosion in the popularity of the internet enabled us to build text corpora that let deep learning algorithms infer different natural language structures in unsupervised ways. The multi-headed self-attention mechanism enabled the vast scaling needed to utilize that information and create generalizable language models that perform well on many NLP tasks. But this approach has hit the wall of our produced language and requires ever larger amounts of power and hardware to scale beyond it. There are many more paths we can take outside of scaling to further build upon the research of the past and the recent success of LLMs. These include using higher-quality, smaller corpora, distilling language models into smaller, more focused systems, and (most importantly) exploring other areas of NLP research that have plenty of space to grow. While I personally believe that LLMs are awesome at converting one form of language to another, and I will continue to use them in this way, I do not think they have wide understanding, and so, alone, they will never be more than a stochastic parrot.
Next up, I explore Why LLMs are so Resource Intensive.