BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change (ICLR 2026)
by Manuela González-González3,4, Soufiane Belharbi1, Muhammad Osama Zeeshan1, Masoumeh Sharafi1, Muhammad Haseeb Aslam1, Alessandro Lameiras Koerich2, Marco Pedersoli1, Simon L. Bacon3,4, Eric Granger1
1 LIVIA, Dept. of Systems Engineering, ETS Montreal, Canada
2 LIVIA, Dept. of Software and IT Engineering, ETS Montreal, Canada
3 Dept. of Health, Kinesiology, & Applied Physiology, Concordia University, Montreal, Canada
4 Montreal Behavioural Medicine Centre, CIUSSS Nord-de-l’Ile-de-Montréal, Canada
Contact: 


Abstract
Ambivalence and hesitancy (A/H) are closely related constructs and a primary
reason why individuals delay, avoid, or abandon health behaviour changes.
A/H is a subtle and conflicting emotional state that places a person between
positive and negative orientations, or between acceptance and refusal to do
something. It manifests as a discordance in affect across multiple modalities
or within a modality, such as facial and vocal expressions and body language.
Although experts can be trained to recognize A/H, as is done in in-person
interactions, integrating them into digital health interventions is costly and
less effective. Automatic A/H recognition is therefore critical for the
personalization and cost-effectiveness of digital behaviour change interventions.
However, no dataset currently exists for the design of machine learning models
to recognize A/H.
This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset
collected for multimodal recognition of A/H in videos.
It contains 1,427 videos with a total duration of 10.60 hours captured from 300
participants across Canada answering predefined questions to elicit A/H. It is
intended to mirror real-world online personalized behaviour change interventions.
BAH is annotated by three experts to provide timestamps that
indicate where A/H occurs, and frame- and video-level annotations with A/H cues.
Video transcripts, cropped and aligned faces, and participants' meta-data are
also provided. Since A and H manifest similarly in practice, we provide a binary
annotation indicating the presence or absence of A/H.
Additionally, this paper includes benchmarking results using baseline models on
BAH for frame- and video-level recognition, zero-shot prediction, and
personalization using source-free domain adaptation. The limited performance
highlights the need for adapted multimodal and spatio-temporal models for A/H
recognition. Results also show that specialized fusion methods that assess
conflict between modalities, and temporal modelling of within-modality
conflict, are essential for better A/H recognition.
The data, code, and pretrained weights are publicly available.
Code: PyTorch 2.2.2
Citation:
@inproceedings{gonzalez-26-bah,
  title={{BAH} Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change},
  author={González-González, M. and Belharbi, S. and Zeeshan, M. O. and
          Sharafi, M. and Aslam, M. H. and Pedersoli, M. and Koerich, A. L. and
          Bacon, S. L. and Granger, E.},
  booktitle={ICLR},
  year={2026}
}
Content:
BAH dataset: Download
To download the BAH dataset, please closely follow the instructions described here: BAH Download instructions.
Pretrained weights
The folder pretrained-models contains the weights of several pretrained models:
- Frame-level supervised learning: frame-level-supervised-learning contains facial expression vision models, BAH_DB vision models, and multimodal models with different fusion techniques (see the loading sketch below).
- Domain adaptation: coming soon.
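As a rough illustration of how these checkpoints could be loaded, here is a minimal PyTorch sketch. The file name below is hypothetical; check the actual contents of pretrained-models/frame-level-supervised-learning/ for the exact paths and build the matching model architecture before loading.

import torch

# Hypothetical checkpoint path; the real file names under
# pretrained-models/frame-level-supervised-learning/ may differ.
ckpt_path = "pretrained-models/frame-level-supervised-learning/model.pth"

# Load on CPU first; move the model to GPU afterwards if needed.
state = torch.load(ckpt_path, map_location="cpu")

# Checkpoints are commonly either a raw state_dict or a dict wrapping one.
state_dict = state.get("state_dict", state) if isinstance(state, dict) else state

# `model` must be instantiated with the matching architecture beforehand:
# model.load_state_dict(state_dict)
# model.eval()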
BAH presentation
BAH: Capture & Annotation


BAH: Variability

BAH: Experimental Protocol


Experiments: Baselines
1) Frame-level supervised classification using multimodal data
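To make the frame-level multimodal setup concrete, the sketch below shows a minimal PyTorch classifier that fuses pre-extracted visual, audio, and text embeddings by simple concatenation and predicts a binary A/H logit per frame. The embedding dimensions and the concatenation fusion are illustrative assumptions; the actual baselines benchmark several fusion techniques.

import torch
import torch.nn as nn

class LateFusionFrameClassifier(nn.Module):
    # Minimal frame-level A/H classifier with concatenation fusion.
    # Assumes per-frame visual, audio, and text embeddings are already
    # extracted and temporally aligned; dimensions are illustrative only.
    def __init__(self, d_visual=512, d_audio=256, d_text=768, d_hidden=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(d_visual + d_audio + d_text, d_hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        self.head = nn.Linear(d_hidden, 1)  # binary A/H presence per frame

    def forward(self, v, a, t):
        z = torch.cat([v, a, t], dim=-1)    # (batch, frames, d_v + d_a + d_t)
        return self.head(self.fusion(z))    # (batch, frames, 1) logits

# Training would use a binary loss over the per-frame annotations,
# e.g. nn.BCEWithLogitsLoss() against the frame-level A/H labels.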



2) Video-level supervised classification using multimodal data
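For video-level recognition, per-frame features have to be aggregated into a single prediction for the whole clip. The sketch below uses a GRU as one possible temporal aggregator; this is an assumption for illustration, not necessarily the aggregation used by the reported baselines.

import torch
import torch.nn as nn

class VideoLevelHead(nn.Module):
    # Aggregates per-frame fused features into one video-level A/H logit.
    def __init__(self, d_in=256, d_hidden=128):
        super().__init__()
        self.gru = nn.GRU(d_in, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, frame_feats):       # frame_feats: (batch, frames, d_in)
        _, h_n = self.gru(frame_feats)    # h_n: (1, batch, d_hidden)
        return self.head(h_n.squeeze(0))  # (batch, 1) video-level logit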

3) Zero-shot performance: Frame- & video-level
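One common way to obtain zero-shot predictions is to score each frame against textual prompts with a vision-language model such as CLIP. The sketch below uses the Hugging Face transformers CLIP API with hand-written prompts purely as an illustration; it is not necessarily the zero-shot protocol used for the BAH benchmarks.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hand-written prompts describing the two classes (illustrative only).
prompts = ["a person who appears ambivalent or hesitant",
           "a person who appears confident and decisive"]

frame = Image.open("frame_0001.png")  # hypothetical cropped-face frame
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2)
probs = logits.softmax(dim=-1)
# probs[0, 0] is the zero-shot A/H score for this frame; video-level scores
# can be obtained by averaging frame scores over a clip.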


4) Personalization using domain adaptation (frame-level)
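Source-free domain adaptation personalizes a source-trained model to each test subject using only that subject's unlabelled frames. The sketch below applies TENT-style entropy minimization to the normalization-layer parameters as one well-known example of such adaptation; the personalization method actually benchmarked on BAH may differ.

import torch

def adapt_to_subject(model, target_loader, steps=50, lr=1e-4):
    # Source-free, per-subject adaptation sketch (TENT-style entropy
    # minimization). Only normalization-layer affine parameters are updated,
    # using unlabelled frames of the target subject; no source data or target
    # labels are required. Assumes `model` outputs a binary A/H logit and
    # `target_loader` yields batches of input tensors.
    norm_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.LayerNorm)
    params = [p for m in model.modules() if isinstance(m, norm_types)
              for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)
    model.train()
    data_iter = iter(target_loader)
    for _ in range(steps):
        try:
            x = next(data_iter)
        except StopIteration:
            data_iter = iter(target_loader)
            x = next(data_iter)
        probs = torch.sigmoid(model(x))            # unlabelled target frames
        entropy = -(probs * torch.log(probs + 1e-8)
                    + (1 - probs) * torch.log(1 - probs + 1e-8)).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return model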

Conclusion
This work introduces BAH, a new and unique multimodal, subject-based video dataset for A/H recognition in videos. BAH contains 300 participants from 9 provinces across Canada. Participants answered 7 questions designed to elicit A/H while recording themselves via webcam and microphone on our web platform. The dataset amounts to 1,427 videos for a total duration of 10.60 hours, with 1.79 hours of A/H. It was annotated by our behavioural team at the video and frame levels.
Our initial benchmarking yielded limited performance, highlighting the difficulty of A/H recognition. Our results also showed that leveraging context, multimodality, and adapted feature fusion is a promising first direction for designing robust models. Our dataset and code are publicly available.
The appendix contains related work, more detailed statistics about the dataset and its diversity, dataset limitations, implementation details, and additional results.
Acknowledgments
This work was supported in part by the Fonds de recherche du Québec – Santé, the Natural Sciences and Engineering Research Council of Canada, the Canada Foundation for Innovation, and the Digital Research Alliance of Canada. We thank the interns who participated in the dataset annotation: Jessica Almeida (Concordia University, Université du Québec à Montréal) and Laura Lucia Ortiz (MBMC).