Simultaneous tracking of objects with loose context constraints from multiple views: human–human interaction paradigm

Bibliographic Details
Published in: Machine vision and applications Vol. 36; no. 4; p. 74
Main Authors: Vatti, Jay; Tsechpenakis, Gavriil
Format: Journal Article
Language: English
Published: Berlin/Heidelberg, Springer Berlin Heidelberg, 01.07.2025; Springer Nature B.V
Summary: When a scene consists of multiple regions of interest related in some context, their tracking can (and often should) incorporate topology constraints that depict the inherent structural properties of the target scene subset. This principle is commonly implemented with pictorial structure representations, graph networks, siamese trackers, etc., and, in general, part-based spatio-temporal modeling. It contrasts with the multiple object tracking principle, where all objects of the desired categories are detected and tracked, including objects that do not ‘belong’ to the context. Context topology is often not fixed over time. We use the notion of ‘loose context’ to denote partial conditionality among the targets: relationships with respect to labels and locations are preserved within each part set (part-defined entity), while conditionality between part sets is assumed only where dictated by the given scenario. An indicative example is human–human interaction, where one person can move independently from the other, while their approximate relative positions are given. We encode context with a small graph, where tracked regions are represented by nodes and context topology is captured by edges. Instead of using image patches in a fully connected graph representation, we employ region proposals: we decouple the graph definition from the image domain, and the search space consists of a proposal set that is sampled to derive candidate solutions at each time instance. We use sequences from multiple views to alleviate missing data from occlusions, and the corresponding proposals are also considered, through projection regression, in the candidate graph solution for a reference plane (view). The objective function incorporates spatio-temporal topology information and the appearance similarity of the proposal regions, encoded by a definition of residual loss of a siamese graph attention network (embedding similarity).
Our architecture consists of four parts: a region proposal network, a plane-to-plane reciprocal projection regression module, the siamese graph attention network (GAT) for evaluating target set appearance similarity between successive instances, and the objective optimizer. We validate our method using a ‘round table’ setup with four subjects and three cameras: one camera provides a bird’s-eye view, where the desired targets are hands, and each of the other two cameras captures two subjects, with faces and hands as the desired targets.
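Illustrative note (not part of the published abstract): the loose-context graph and proposal-based search described in the summary can be pictured with a small toy sketch. Everything in it is an assumption made for illustration: the node names, the single cross-person edge, the 2-D positions, the two-component embeddings, and the cost itself. In particular, plain cosine distance stands in for the siamese GAT residual loss, exhaustive enumeration stands in for proposal sampling and the objective optimizer, and the multi-view projection regression is left out entirely.

```python
from itertools import permutations
from math import dist, sqrt

# Hypothetical loose-context graph: two part sets (persons A and B).
NODES = ["face_A", "hand_A", "face_B", "hand_B"]
# Edges inside each part set, plus one cross-set edge assumed for this scenario.
EDGES = [("face_A", "hand_A"), ("face_B", "hand_B"), ("hand_A", "hand_B")]


def cosine_distance(u, v):
    """1 - cosine similarity; embeddings are assumed to be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cost(candidate, previous, weight=1.0):
    """Toy objective; both arguments map node -> (position, embedding)."""
    # Topology term: penalize changes in edge length between time instances.
    topology = 0.0
    for a, b in EDGES:
        now = dist(candidate[a][0], candidate[b][0])
        before = dist(previous[a][0], previous[b][0])
        topology += (now - before) ** 2
    # Appearance term: embedding dissimilarity per node (stand-in for the GAT loss).
    appearance = sum(cosine_distance(candidate[n][1], previous[n][1]) for n in NODES)
    return topology + weight * appearance


def best_assignment(proposals, previous):
    """Try every assignment of distinct proposals to nodes (toy-sized sets only)."""
    best = min(
        permutations(range(len(proposals)), len(NODES)),
        key=lambda idx: cost({n: proposals[i] for n, i in zip(NODES, idx)}, previous),
    )
    return dict(zip(NODES, best))


# Targets at the previous time instance: (position, embedding), all made up.
previous = {
    "face_A": ((0.0, 0.0), (1.0, 0.0)),
    "hand_A": ((0.0, 2.0), (0.0, 1.0)),
    "face_B": ((5.0, 0.0), (1.0, 0.2)),
    "hand_B": ((5.0, 2.0), (0.2, 1.0)),
}
# Unordered proposal set for the current instance, including one distractor.
proposals = [
    ((5.1, 2.0), (0.2, 1.0)),
    ((0.1, 0.0), (1.0, 0.0)),
    ((9.0, 9.0), (0.7, 0.7)),
    ((0.1, 2.1), (0.0, 1.0)),
    ((5.0, 0.1), (1.0, 0.2)),
]

result = best_assignment(proposals, previous)
print(result)  # {'face_A': 1, 'hand_A': 3, 'face_B': 4, 'hand_B': 0}
```

In this toy setting the distractor proposal is rejected because it would stretch an edge far beyond its previous length, and swapping the two persons is rejected by the appearance term. There is no edge between the two faces, which is one way to read the "loose" part of the context: conditionality between part sets exists only where an edge has been declared.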
ISSN: 0932-8092, 1432-1769
DOI: 10.1007/s00138-025-01695-8