Multilingual language technology that goes beyond where ChatGPT ends

16 May 2023

UvA’s Language and Technology Lab helps create language technologies for languages for which little data is available and which are not served by the big tech companies.

In recent months, the text generator ChatGPT has amazed the world by automatically writing humanlike texts in all kinds of styles. Based on prompts that you type in, ChatGPT can generate news articles, long reads, essays, poems, dialogues, scripts and even jokes or computer code. It can also answer questions and translate.

Scale up

The fundamental techniques behind ChatGPT date from 2017. Since then OpenAI, the company that developed the commercial text generator, has scaled up the model from 200 million parameters to 175 billion parameters as of last year. It has also scaled up computing power and training data to such an extent that this year’s results have astonished even experts in the field.

‘Scientists could see ChatGPT coming,’ says UvA professor Christof Monz, ‘but I was still surprised at how well it works. It is great to see how much interest there is now in language technology. That shows how close human thinking skills and language are and also how important language is to give the impression of an intelligent system.’


Having said that, ChatGPT hasn’t solved everything in natural language processing and generation. Monz: ‘It can, for example, generate plausible-looking text that is factually incorrect, logically inconsistent, or contains harmful prejudices. You should be well aware that you cannot fully trust ChatGPT’s texts.’

At the Informatics Institute, Monz leads the Language and Technology Lab (LTL), a group whose work goes beyond where ChatGPT ends. One of ChatGPT’s shortcomings is that it needs enormous amounts of data. The text generator is trained on so much text, scraped from the internet, Wikipedia, online libraries and other sources, that a single person reading eight hours per day, seven days per week, would need 22,000 years to read what ChatGPT processed during its training.
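To get a feel for that comparison, the arithmetic can be sketched in a few lines of Python. The 22,000 years and eight hours per day come from the text above. The reading speed of 250 words per minute is an assumption made here purely for illustration, so the resulting word count is a rough order of magnitude, not a figure for ChatGPT's actual training data.

```python
# Back-of-envelope check of the reading-time comparison.
YEARS = 22_000        # from the article
HOURS_PER_DAY = 8     # from the article
DAYS_PER_YEAR = 365   # seven days per week, leap years ignored

# ASSUMPTION: reading speed is not given in the article.
WORDS_PER_MINUTE = 250


def reading_hours(years, hours_per_day=HOURS_PER_DAY, days_per_year=DAYS_PER_YEAR):
    """Total hours spent reading over the given number of years."""
    return years * days_per_year * hours_per_day


def words_read(hours, words_per_minute=WORDS_PER_MINUTE):
    """Words covered in that many hours at the assumed reading speed."""
    return hours * 60 * words_per_minute


hours = reading_hours(YEARS)
print(f"{hours:,} hours of reading")
print(f"roughly {words_read(hours):,} words at {WORDS_PER_MINUTE} words per minute")
```

Changing the assumed reading speed scales the word count proportionally, while the number of reading hours follows directly from the figures in the text.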

‘Smaller’ languages

Of the more than seven thousand languages spoken worldwide, however, most have so little digital data available that ChatGPT cannot understand, generate or translate these ‘smaller’ languages, many of which still have many millions of speakers. ‘Google Translate works for something like 140 languages,’ says Monz, ‘and the European equivalent DeepL for something like twenty languages. From the point of view of inclusiveness though, you want to offer language technology for those smaller languages as well. There is a lot to be gained there, and that is an important part of what we do in our lab.’

The Language and Technology Lab that Monz leads focusses on machine translation, question-answering systems, summarising documents and on non-toxic language generation. Multilingual aspects of language technologies are a common thread.

Monz: ‘We want to be able to translate languages for which little or no data exists. Let’s take the example of translating between Arabic and Dutch. Surprisingly few texts translated from Arabic to Dutch are available, too few to train our deep learning models on. Therefore, we train our systems on other language pairs for which we do have a lot of data, for example Arabic-English, English-Chinese and Dutch-English. We try to develop a system that can find language-independent representations for sentences in different languages that have the same meaning.’

Neural networks

Deep learning systems are essentially neural networks in which artificial neurons are arranged in tens or hundreds of layers, connecting thousands to billions of neurons with each other. The number of connections between the neurons is roughly the number of parameters of the model. When the network processes a sentence, it turns that sentence into a list of numbers, the sentence's representation. Two sentences in two different languages have the same representation if all those numbers are equal or roughly equal.
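How connections add up to parameters can be illustrated with a small sketch. It assumes the simplest kind of network, in which every neuron in one layer is connected to every neuron in the next, and it counts only the connection weights. The function name `count_connections` and the layer sizes are made up for this example; real language models use more elaborate architectures.

```python
def count_connections(layer_sizes):
    """Count the weights in a fully connected feed-forward network.

    Each neuron in a layer connects to every neuron in the next layer,
    so two adjacent layers of sizes a and b contribute a * b connections.
    Bias terms are left out to keep the example minimal.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical toy network: 4 inputs, two hidden layers of 8 neurons, 2 outputs.
toy_network = [4, 8, 8, 2]
print(count_connections(toy_network))  # 4*8 + 8*8 + 8*2 = 112
```

Because the count grows with the product of adjacent layer sizes, widening the layers quickly pushes the total into the millions or billions.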

‘We are trying to invent techniques that give the same representation for multilingual sentences with the same meaning’, says Monz. ‘We are not there yet, but ideally, if an Arabic sentence has the same representation as a Dutch sentence, you have found the Dutch translation of the Arabic sentence without any explicit translation data from Arabic to Dutch being available.’
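The idea of finding a translation by comparing representations can be sketched as a nearest-neighbour search. This is a minimal illustration, not the lab's method: the three-number vectors below are invented by hand, the Dutch sentences are arbitrary examples, and cosine similarity is just one common way to measure how close two representations are. In a real system a trained multilingual model would produce the vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def nearest(query, candidates):
    """Return the candidate sentence whose vector is most similar to the query."""
    return max(candidates, key=lambda text: cosine(query, candidates[text]))


# HYPOTHETICAL representation of some Arabic sentence (hand-made numbers).
arabic_vector = [0.90, 0.10, 0.30]

# HYPOTHETICAL representations of Dutch candidate sentences.
dutch_candidates = {
    "De kat slaapt op de bank.": [0.88, 0.12, 0.31],
    "Het regent vandaag.": [0.10, 0.90, 0.20],
    "Ik heb honger.": [0.20, 0.30, 0.90],
}

# Picks the Dutch sentence whose representation lies closest to the Arabic one.
print(nearest(arabic_vector, dutch_candidates))
```

No Arabic-Dutch translation pairs appear anywhere in this lookup. It only works if the model places sentences with the same meaning close together, which is exactly the open research problem Monz describes.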

More information

https://openai.com/blog/chatgpt
