Publications - Anna Bavaresco

2025

Modeling Human Concepts with Subspaces in Deep Vision Models

Anna Bavaresco, Nhut Truong, Uri Hasson

ACM Transactions on Intelligent Interactive Systems 2025

Improving the modeling of human representations of everyday semantic categories, such as animals or food, can lead to better alignment between AI systems and humans. Humans are thought to represent such categories using dimensions that capture relevant variance, in this way defining the relationship between category members. In AI systems, the representational space for a category is defined by the distances between its members. Importantly, in this context, the same features are used for distance computations across all categories. In two experiments, we show that pruning a model’s feature space to better align with human representations of a category selects for different model features and different subspaces for different categories. In addition, we provide a proof of concept demonstrating the relevance of these findings for evaluating the quality of images generated by AI systems.

[Paper]

Modeling Human Concepts with Subspaces in Deep Vision Models

Anna Bavaresco, Nhut Truong, Uri Hasson

ACM Transactions on Intelligent Interactive Systems 2025

[Paper]

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliot, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni

ACL 2025 2025

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

[Code] [Paper]

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

ACL 2025 2025

[Code] [Paper]

Testing Spatial Intuitions of Humans and Large Language and Multimodal Models in Analogies?

Ivo Bueno, Anna Bavaresco, João Miguel Cunha, Philipp Wicke

Proceedings of the 2nd Workshop on Analogical Abstraction in Cognition, Perception, and Language (Analogy-Angle II) 2025

Language and Vision-Language Models exhibit impressive language capabilities akin to human reasoning. However, unlike humans who acquire language through embodied, interactive experiences, these models learn from static datasets without real-world interaction. This difference raises questions about how they conceptualize abstract notions and whether their reasoning aligns with human cognition. We investigate spatial conceptualizations of LLMs and VLMs by conducting analogy prompting studies with LLMs, VLMs, and human participants. We assess their ability to generate and interpret analogies for spatial concepts. We quantitatively compare the analogies produced by each group, examining the impact of multimodal inputs and reasoning mechanisms. Our findings indicate that generative models can produce and interpret analogies but differ significantly from human reasoning in their abstraction of spatial concepts - variability influenced by input modality, model size, and prompting methods, with analogy-based prompts not consistently enhancing alignment. Contributions include a methodology for probing generative models through analogies; a comparative analysis of analogical reasoning among models, and humans; and insights into the effect of multimodal inputs on reasoning.

[Paper] [Code]

Testing Spatial Intuitions of Humans and Large Language and Multimodal Models in Analogies?

Ivo Bueno, Anna Bavaresco, João Miguel Cunha, Philipp Wicke

Proceedings of the 2nd Workshop on Analogical Abstraction in Cognition, Perception, and Language (Analogy-Angle II) 2025

[Paper] [Code]

Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?

Anna Bavaresco, Raquel Fernández

CoNLL 2025 2025

A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio—similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information—as defined by an existing norm-based 'experiential model'—and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.

[Paper] [Code]

Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?

Anna Bavaresco, Raquel Fernández

CoNLL 2025 2025

[Paper] [Code]

2024

Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models

Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández

ArXiv 2024

Representations from deep neural networks (DNNs) have proven remarkably predictive of neural activity involved in both visual and linguistic processing. Despite these successes, most studies to date concern unimodal DNNs, encoding either visual or textual input but not both. Yet, there is growing evidence that human meaning representations integrate linguistic and sensory-motor information. Here we investigate whether the integration of multimodal information operated by current vision-and-language DNN models (VLMs) leads to representations that are more aligned with human brain activity than those obtained by language-only and vision-only DNNs. We focus on fMRI responses recorded while participants read concept words in the context of either a full sentence or an accompanying picture. Our results reveal that VLM representations correlate more strongly than language- and vision-only DNNs with activations in brain areas functionally related to language processing. A comparison between different types of visuo-linguistic architectures shows that recent generative VLMs tend to be less brain-aligned than previous architectures with lower performance on downstream applications. Moreover, through an additional analysis comparing brain vs. behavioural alignment across multiple VLMs, we show that -- with one remarkable exception -- representations that strongly align with behavioural judgments do not correlate highly with brain responses. This indicates that brain similarity does not go hand in hand with behavioural similarity, and vice versa.

[Code] [Paper]

Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models

Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández

ArXiv 2024

[Code] [Paper]

Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models

Anna Bavaresco, Alberto Testoni, Raquel Fernández

ACL 2024

Image-based advertisements are complex multimodal stimuli that often contain unusual visual elements and figurative language. Previous research on automatic ad understanding has reported impressive zero-shot accuracy of contrastive vision-and-language models (VLMs) on an ad-explanation retrieval task. Here, we examine the original task setup and show that contrastive VLMs can solve it by exploiting grounding heuristics. To control for this confound, we introduce TRADE, a new evaluation test set with adversarial grounded explanations. While these explanations look implausible to humans, we show that they "fool" four different contrastive VLMs. Our findings highlight the need for an improved operationalisation of automatic ad understanding that truly evaluates VLMs' multimodal reasoning abilities. We make our code and TRADE available at https://github.com/dmg-illc/trade.

[Code] [Paper]

Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models

Anna Bavaresco, Alberto Testoni, Raquel Fernández

ACL 2024

[Code] [Paper]

2023

Improved prediction of behavioral and neural similarity spaces using pruned DNNs

Priya Tarigopula, Scott Laurence Fairhall, Anna Bavaresco, Nhut Truong, Uri Hasson

Neural Networks 2023

Deep Neural Networks (DNNs) have become an important tool for modeling brain and behavior. One key area of interest has been to apply these networks to model human similarity judgements. Several previous works have used the embeddings from the penultimate layer of vision DNNs and showed that a reweighting of these features improves the fit between human similarity judgments and DNNs. These studies underline the idea that these embeddings form a good basis set but lack the correct level of salience. Here we re-examined the grounds for this idea and on the contrary, we hypothesized that these embeddings, beyond forming a good basis set, also have the correct level of salience to account for similarity judgments. It is just that the huge dimensional embedding needs to be pruned to select those features relevant for the considered domain for which a similarity space is modeled. In Study 1 we supervised DNN pruning based on a subset of human similarity judgments. We found that pruning: i) improved out-of-sample prediction of human similarity judgments from DNN embeddings, ii) produced better alignment with WordNet hierarchy, and iii) retained much higher classification accuracy than reweighting. Study 2 showed that pruning by neurobiological data is highly effective in improving out-of-sample prediction of brain-derived representational dissimilarity matrices from DNN embeddings, at times fleshing out isomorphisms not otherwise observable. Using pruned DNNs, image-level heatmaps can be produced to identify image sections whose features load on dimensions coded by a brain area. Pruning supervised by human brain/behavior therefore effectively identifies alignable dimensions of knowledge between DNNs and humans and constitutes an effective method for understanding the organization of knowledge in neural networks.

[Paper]

Improved prediction of behavioral and neural similarity spaces using pruned DNNs

Priya Tarigopula, Scott Laurence Fairhall, Anna Bavaresco, Nhut Truong, Uri Hasson

Neural Networks 2023

[Paper]