This year, AI systems that can write almost human-like texts have made a breakthrough worldwide. However, many academic questions on how exactly these systems work remain unanswered. Three UvA researchers are trying to make the underlying language models more transparent, reliable and human.

The launch of ChatGPT by OpenAI on 30 November 2022 was a game changer for artificial intelligence. All of a sudden the public became aware of the power of writing machines. Some two months later, ChatGPT already had 100 million users.

Now, students are using it to write essays, programmers are using it to generate code, and companies are automating everyday writing tasks. At the same time, there are significant concerns about the unreliable nature of automatically generated text, and about the adoption of stereotypes and discrimination found in the training data.

The media across the world quickly jumped on ChatGPT, with stories flying around on the good, the bad and everything in between. ‘Before the launch of ChatGPT, I hadn’t heard a peep from the media about this topic for a long time,’ says UvA researcher Jelle Zuidema, ‘while my colleagues and I have tried to tell them multiple times over the years that important developments were on the horizon.’

‘The problem with large language models is that knowledge isn’t stored in a way that’s understandable for people.’ Jelle Zuidema

Zuidema is an associate professor of Natural Language Processing, Explainable AI and Cognitive Modelling at the Institute for Logic, Language and Computation (ILLC). He is advocating for a measured discussion on the use of large language models, which is the kind of model that forms the basis for ChatGPT (see Box 1). Zuidema: ‘Downplaying or acting outraged about this development, saying things like “it’s all just plagiarism”, is pointless. Students use it, scientists use it, programmers use it, and many other groups in society are going to be dealing with it. Instead, we should be asking questions like: What consequences will language models have? What jobs will change? What will happen to the self-worth of copywriters?’

Under the hood

One important academic question is what actually happens under the hood of a large language model. ‘The problem with large language models is that knowledge isn’t stored in a way that’s understandable for people,’ says Zuidema. ‘That knowledge is represented by a whole bunch of numbers: the parameters of the deep neural network. But we don’t know what those numbers mean.’

This has significant consequences. We know that large language models often make things up: incorrect facts, biographies and references conjured out of thin air. They also adopt stereotypes or hateful expressions found in the training data. ‘It’s hard to combat this effectively with the current technology,’ says Zuidema. ‘That’s why my research group and I have spent the last years developing methods to understand what’s happening in these language models. We hope to make them more reliable.’

Two important aspects of human intelligence in which ChatGPT performs poorly are logical reasoning and arithmetic. Take the following riddle: ‘I leave five T-shirts out to dry in the sun. After five hours, all five are dry. How long will it take for 30 T-shirts to dry in the sun?’ ChatGPT’s answer: ‘30 hours’. The correct answer is five hours, because the T-shirts all dry at the same time. ‘Creating large language models that are capable of logical reasoning is one of the biggest challenges,’ says Zuidema. ‘In recent years, logic has been a neglected field, but it’s found new relevance in current AI research.’

To gain insight into the inner workings of a large language model, Zuidema developed several detectors of sorts, including one that checks whether logical reasoning is occurring in a model and one that reveals how a language model represents numbers. Zuidema: ‘Take these two sentences: “The trains are running” and “The trains are not running”. The second sentence is a negation of the first. Our detector investigates how this negation is represented in the deep neural network. If we can understand that, we can hopefully also intervene when the model makes mistakes in logical reasoning.’
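
The idea of such a detector can be illustrated with a small linear probe: a simple classifier trained to read a property (here, negation) off a model's internal vectors. The sketch below is not Zuidema's actual method. It assumes synthetic 8-dimensional vectors as stand-ins for hidden states, with the negation signal planted in one hypothetical dimension, and uses a plain perceptron as the probe.

```python
import random

def make_toy_states(n, dim=8, signal_dim=3, seed=0):
    """Synthetic stand-ins for hidden states (an assumption, not real model data).
    One dimension carries the negation signal; the rest is Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        negated = rng.random() < 0.5
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[signal_dim] = (2.0 if negated else -2.0) + rng.gauss(0.0, 0.3)
        data.append((vec, 1 if negated else 0))
    return data

def predict(w, b, x):
    """Linear decision: 1 means the probe reads the state as 'negated'."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_probe(data, epochs=50):
    """Train a perceptron probe: update weights only on misclassified states."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, b, x)
            if err != 0:
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
    return w, b

def accuracy(w, b, data):
    """Fraction of states whose negation label the probe recovers."""
    return sum(predict(w, b, x) == y for x, y in data) / len(data)

train = make_toy_states(200, seed=0)
held_out = make_toy_states(100, seed=1)
w, b = train_probe(train)
print("held-out probe accuracy:", accuracy(w, b, held_out))
```

If a probe like this scores well on held-out examples, that suggests the property is linearly readable from the vectors. With a real model, the vectors would come from its hidden layers rather than from a random generator.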

Zuidema also used a simple logic model to generate logic puzzles and then trained language models on those puzzles. His team then studied the trained models to see whether they could discover how a model solves the puzzles. ‘Until now, we’ve really only succeeded at this for very simple puzzles,’ says Zuidema. ‘These are baby steps, but they’re necessary to help a new generation of large language models improve at logical reasoning.’
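
As a rough illustration of generating puzzles from a simple logic model, the sketch below builds random propositional formulas and labels each with its truth value. The article does not describe the puzzle format used in the research, so the formula shape, the variable names and the helper functions here are all hypothetical.

```python
import random

def random_formula(rng, variables, depth):
    """Build a random formula as a nested tuple, e.g. ("and", "A", ("not", "B"))."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(variables)
    op = rng.choice(["not", "and", "or"])
    if op == "not":
        return ("not", random_formula(rng, variables, depth - 1))
    return (op,
            random_formula(rng, variables, depth - 1),
            random_formula(rng, variables, depth - 1))

def evaluate(formula, assignment):
    """Compute the truth value of a formula under a variable assignment."""
    if isinstance(formula, str):
        return assignment[formula]
    if formula[0] == "not":
        return not evaluate(formula[1], assignment)
    left = evaluate(formula[1], assignment)
    right = evaluate(formula[2], assignment)
    return (left and right) if formula[0] == "and" else (left or right)

def render(formula):
    """Turn a formula into readable text with explicit brackets."""
    if isinstance(formula, str):
        return formula
    if formula[0] == "not":
        return f"(not {render(formula[1])})"
    return f"({render(formula[1])} {formula[0]} {render(formula[2])})"

def make_puzzle(rng, variables=("A", "B", "C"), depth=2):
    """Return a (question text, correct answer) pair for training or testing."""
    assignment = {v: rng.choice([True, False]) for v in variables}
    formula = random_formula(rng, list(variables), depth)
    facts = " ".join(f"{v} is {'true' if val else 'false'}."
                     for v, val in assignment.items())
    return f"{facts} Is {render(formula)} true?", evaluate(formula, assignment)

rng = random.Random(42)
for _ in range(3):
    question, answer = make_puzzle(rng)
    print(question, "->", answer)
```

Because the generator knows the correct answer to every puzzle, the model's outputs can be checked automatically, and puzzle difficulty can be controlled through the formula depth.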


Large language models have been trained with so much unfiltered data that they automatically contain a multitude of stereotypes. ILLC PhD candidate Rochelle Choenni is investigating which stereotypes appear in the training data and how sensitive the language models are to them.

Using searches in the Google, Yahoo and DuckDuckGo search engines, Choenni compiled a database of more than 2,000 English-language stereotypes on professions, social groups, country of origin, gender, age and political beliefs. Choenni: ‘This resulted in stereotypes about – for example – black people being fast, athletic, hated, angry and loud. For millennials, we found stereotypes about them being fragile, nostalgic, lonely and broken.’

Next, she looked into what would happen if a language model were to be refined by training it with carefully selected new texts. Choenni: ‘For example, we took texts from Fox News and The New Yorker. We saw that the stereotypes changed fairly quickly. Using training data from Fox News, the stereotypical police officer is portrayed more positively. Training it with The New Yorker makes the stereotype more negative. This shows that stereotypes in language models can quickly change depending on the training data used.’

The PhD candidate emphasises that it is important to distinguish between stereotypical information in the training data and biases in the behaviour of the language model. ‘People have stereotypes but don’t necessarily act on them, and the same is true for language models,’ she says. ‘It’s only a problem if a language model actually uses stereotypes from training data when generating a new text.’

Initially, Choenni investigated stereotypes in a language model that had only been trained with English-language texts. But what is the situation for language models trained with multiple languages? These models are being used increasingly often. Currently, Choenni is investigating what happens with stereotypes in those models: ‘Different cultures have different stereotypes, so that’s how they end up in different languages. The problem with stereotypes may resolve itself as long as you train a model with enough languages, but we currently don’t know whether that’s true or not.’

What is Choenni’s view on the mass use of language models when we know that the training data is full of stereotypes? ‘I don’t think it’s realistic to remove all stereotypes from the training data. It’s much too complex, and anyway, not every stereotype is negative. If I say “Dutch people are tall”, that’s a stereotype that’s true as a statistical average. The most important thing is that people are aware of the fact that language models can produce stereotypes.’

Humanising chatbots

While Zuidema and Choenni are trying to look under the hood of large language models, Raquel Fernández, professor of Computational Linguistics & Dialogue Systems at the UvA, is also engaged in research at the ILLC. She is trying to connect large language models and the way in which people use language. Fernández: ‘I’m interested in how people talk to each other and how we can replicate this naturally with machines.’

For computational linguists such as Fernández, large language models offer a new tool to quantify characteristics of human dialogue and to test whether certain hypotheses on human language use are correct. One theory in psycholinguistics is that people subconsciously adjust their language use to make it as easy as possible for their conversational partner to understand them, for example by shortening a sentence or by using simpler words or constructions.

Fernández: ‘With these powerful language models, we can quantify to a certain degree how people use language. We do see people speaking in such a way that the other person can understand them without too much effort. But we also see that, for some sentences and language, the models underestimate that effort. This is because large language models are trained with far more texts than you and I could ever read.’

Besides this theoretical work, Fernández tries to expand language models by anchoring them in the visual world – giving them eyes, in a manner of speaking. ‘Humans learn language while in contact with our physical reality. Visual information is an important part of that. We expect a language model that also learns from images to be better at learning language. In one of our research projects, we connect language to the gestures that people make when speaking. Particularly when you want to implement automatic dialogue systems in fields such as education or health care, visual information on posture, gestures or facial expressions is very useful.’

While large language models are very good at generating language, it is difficult to assign them a specific task, such as booking a table at a restaurant or buying a ticket. Fernández: ‘Language models generate the output that they perceive as most likely. They’ve not been trained to achieve a goal that you have in mind. The system has to know what the goal is and how it can achieve it. That remains a major challenge.’
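
Fernández’s point that a model simply emits its most likely output can be illustrated with a toy example. The sketch below assumes a hand-written table of next-word probabilities (entirely hypothetical, not taken from any real model) and always picks the most probable next word, a strategy known as greedy decoding.

```python
# Hypothetical next-word probabilities, invented for illustration only.
NEXT_TOKEN_PROBS = {
    "<start>": {"I": 0.6, "Please": 0.4},
    "I": {"would": 0.7, "want": 0.3},
    "would": {"like": 0.9, "book": 0.1},
    "like": {"a": 0.8, "to": 0.2},
    "a": {"table": 0.5, "ticket": 0.3, "<end>": 0.2},
    "table": {"<end>": 1.0},
}

def greedy_generate(table, start="<start>", max_len=10):
    """Repeatedly pick the single most probable next word until <end>."""
    tokens = []
    current = start
    for _ in range(max_len):
        options = table.get(current)
        if not options:
            break  # no known continuation for this word
        current = max(options, key=options.get)
        if current == "<end>":
            break
        tokens.append(current)
    return tokens

print(" ".join(greedy_generate(NEXT_TOKEN_PROBS)))
```

Nothing in this loop knows whether the user wanted a table or a ticket. The output follows the probabilities alone, which is why steering a model toward a specific goal needs something beyond next-word prediction.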

Together with a consortium of a number of Dutch universities and businesses such as Ahold Delhaize, Achmea, and KPN, Fernández has been working on the LESSEN project, financed by the Dutch Research Council. ‘We want to develop chat-based conversational AI agents that are useful to Dutch businesses,’ says Fernández. ‘And we want to do so in such a way that we need less training data, for smaller languages such as Dutch, and for specific fields.’

Expectations for the future

What do the three UvA researchers expect for the future when it comes to the use of large language models in society?

Despite all the hurdles to the responsible use of large language models, Rochelle Choenni is somewhat optimistic: ‘It’s just like social media: they have downsides that we learn to deal with. Large language models are here to stay. We need to learn to work with them. The biggest advantage of large language models, I think, is that they’ll make knowledge more accessible than the internet has done so far. For successful applications, however, computer scientists will have to work with people from other fields, such as psychologists, philosophers and sociologists.’

Raquel Fernández points out that many sectors in society are dealing with manpower shortages and that this can partly be resolved with machines that can communicate with people. Fernández: ‘I see good opportunities for conversational AI agents in education and health care. Dialogue is a powerful tool, whether it’s used by a teacher or an educational chatbot. And when a chatbot helps vulnerable people because they experience empathy from a machine, why not? That said, we have a great responsibility in this regard. I think we need to make it clear when people are communicating with a machine.’

Jelle Zuidema thinks we will get language models trained on specific fields: ‘For instance, I expect a lot of applications in education. With language models, you can build teaching assistants with endless reserves of patience. On top of that, we’ll have much better interfaces, allowing users to work with large language models more easily.’ On the other hand, Zuidema calls on the government to take action and shepherd the use of large language models in society in the right direction: ‘I hope the government will create legislation that requires the use of large language models to comply with safety demands.’

About the interviewees

Jelle Zuidema
  • Associate professor in Computational Linguistics and Cognitive Science at the Institute for Logic, Language and Computation
Raquel Fernández
  • Professor in Computational Linguistics and Dialogue Systems at the Institute for Logic, Language and Computation
  • Leading the Dialogue Modelling Group
Rochelle Choenni
  • PhD Candidate in Natural Language Processing at the Institute for Logic, Language and Computation