The first part of the thesis probes video-language models for their ability to distinguish simple ‘before’ and ‘after’ relationships, using synthetic data, and establishes that none of six modern foundation models passes this simple test of time. It then proposes a data-efficient post-pretraining procedure that injects a sense of time into one such model without retraining it from scratch, yielding substantial computational savings. The updated model performs well on general temporal video-language reasoning tasks, demonstrating for the first time that existing video foundation models can be ‘updated’ to inject new behavior. The second, more exploratory study extends temporal awareness to audio-visual models, covering the definition, representation, and benefits of time-awareness for state-changing actions such as ‘pouring water’ and reversible actions such as ‘turning a tap on and off’. It again demonstrates that existing foundation models are unable to pass a simple audiovisual arrow-of-time prediction task.
Part of Piyush’s thesis has been published as a paper at the Conference on Computer Vision and Pattern Recognition (CVPR), one of the world’s most prestigious AI conferences. According to Google Scholar, CVPR currently ranks as the fourth most cited scientific publication venue, just after Nature, The New England Journal of Medicine, and Science. Piyush was the first UvA Master AI student to publish at CVPR.
Piyush was previously a recipient of the Amsterdam Merit Scholarship, which supports international students studying in the Master’s programmes at the University of Amsterdam. He was also selected as an ELLIS Honors Student to pursue his Master’s thesis, advised by Prof. Cees Snoek in collaboration with Dr. Makarand Tapaswi (IIIT Hyderabad, India) and Prof. Andrew Zisserman at the University of Oxford. Piyush is currently a DPhil student at the University of Oxford.