Towards Neuro-Symbolic Video Understanding

Anonymous Authors


Abstract

The unprecedented surge in video data production in recent years necessitates efficient tools to extract meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, such as VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT-4 for reasoning on state-of-the-art self-driving datasets such as Waymo and nuScenes.

Methodology


We introduce a neuro-symbolic approach to identifying scenes of interest in video. Given a video stream or clip together with a temporal logic specification Φ, Neuro-Symbolic Visual Search with Temporal Logic (NSVS-TL) retrieves the frames that satisfy the specification.
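
As a rough sketch of this two-stage pipeline, the snippet below pairs a per-frame perception model with a simple sequencing check over the resulting trace. The names (label_frames, satisfies_sequence) and the dummy perception model are illustrative placeholders, not the actual NSVS-TL implementation, which compiles the full TL specification Φ into an automaton.

# Illustrative sketch of the neuro-symbolic pipeline: a neural model labels each
# frame with atomic propositions, and a symbolic evaluator reasons over the
# resulting trace. Names are placeholders, not the NSVS-TL API.
from typing import Callable, Iterable, List

def label_frames(frames: Iterable, perceive: Callable[[object], set]) -> List[set]:
    """Stage 1: per-frame semantic understanding with any perception model."""
    return [perceive(frame) for frame in frames]

def satisfies_sequence(trace: List[set], events: List[str]) -> List[int]:
    """Stage 2: symbolic temporal reasoning.

    Checks the specification "eventually events[0], then eventually events[1], ..."
    by advancing a small state machine over the trace. Returns the frame indices
    at which each event is first satisfied, or [] if the specification never holds.
    """
    state, hits = 0, []
    for t, props in enumerate(trace):
        if state < len(events) and events[state] in props:
            hits.append(t)
            state += 1
    return hits if state == len(events) else []

# Usage with a dummy perception model standing in for a vision-language model.
frames = range(5)
perceive = lambda f: {"pedestrian"} if f == 1 else ({"car"} if f == 3 else set())
print(satisfies_sequence(label_frames(frames, perceive), ["pedestrian", "car"]))  # [1, 3]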


Method Overview


Autonomous Driving Example



Key Capabilities


Long Horizon Video Understanding

We evaluate multi-event sequences separated by temporally extended gaps, which substantially increase video length. Performance remains consistent on videos spanning up to 40 minutes, indicating reliability in handling long videos.
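
One reason the state-machine formulation scales to such horizons is that the evaluator only carries the current automaton state across frames, so memory use is independent of video length. The streaming sketch below is a hypothetical illustration of this property; the frame rate, event names, and the function stream_search are assumptions, not part of our released code.

# Illustrative streaming evaluation: the only state carried between frames is the
# automaton stage, so a 40-minute stream costs no more memory than a short clip.
from typing import Iterable, Iterator, Set, Tuple

def stream_search(frame_props: Iterable[Set[str]],
                  events: Tuple[str, ...]) -> Iterator[Tuple[int, int]]:
    """Yield (event_index, frame_index) as each stage of a sequenced spec is met."""
    stage = 0
    for t, props in enumerate(frame_props):
        if stage < len(events) and events[stage] in props:
            yield stage, t
            stage += 1
        if stage == len(events):
            break

# Two events separated by a large temporal gap (~40 minutes at 30 fps).
stream = ({"pedestrian"} if t == 100 else {"stopped_car"} if t == 60_000 else set()
          for t in range(72_000))
for stage, t in stream_search(stream, ("pedestrian", "stopped_car")):
    print(f"event {stage} satisfied at frame {t}")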

Plug In Your Own Model

Our framework allows any neural perception model to be integrated, strengthening per-frame semantic understanding of videos. This lets us localize frames of interest with respect to a given query.
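
A minimal sketch of what plugging in your own model can look like, assuming a small adapter interface: the names PerceptionModel, YoloDetector, and frame_holds are hypothetical and only illustrate that any frame-to-propositions mapping can feed the symbolic reasoning layer unchanged.

from typing import Dict, Protocol

class PerceptionModel(Protocol):
    """Anything that maps a frame to confidence scores over atomic propositions."""
    def propositions(self, frame) -> Dict[str, float]:
        ...

class YoloDetector:
    """Example adapter around an object detector; the detector itself is omitted
    and assumed to return (label, score) pairs for a frame."""
    def __init__(self, detector, threshold: float = 0.3):
        self.detector = detector
        self.threshold = threshold

    def propositions(self, frame) -> Dict[str, float]:
        return {label: score for label, score in self.detector(frame)
                if score >= self.threshold}

def frame_holds(model: PerceptionModel, frame, prop: str, threshold: float = 0.5) -> bool:
    """Ground a TL atomic proposition in a frame using whichever model is plugged in."""
    return model.propositions(frame).get(prop, 0.0) >= threshold

# Usage with a stand-in detector.
fake_detector = lambda frame: [("pedestrian", 0.91), ("car", 0.20)]
print(frame_holds(YoloDetector(fake_detector), frame=None, prop="pedestrian"))  # True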

Comparison to Benchmarks

From our experiments, we observe that the performance of NSVS-TL with different neural perception models depends on the complexity of the TL specification and on the dataset. For single-event scenarios, both our method and LLM-based reasoning perform reasonably well, since such events do not require complex temporal reasoning; for multi-event scenarios, our TL-based reasoning outperforms all LLM-based baselines.
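
For intuition on the two regimes, compare an illustrative single-event specification with a sequenced multi-event one, where F denotes the "eventually" operator (the event names are hypothetical and not drawn from our benchmarks):

Φ_single = F(pedestrian)                    "a pedestrian eventually appears in the clip"
Φ_multi  = F(pedestrian ∧ F(car_stops))     "a pedestrian appears and, at that frame or later, a car stops"

The single-event query can be answered frame by frame without memory, whereas the multi-event query requires remembering whether the pedestrian has already been observed; the state-machine evaluation of Φ provides exactly this memory, which per-frame LLM reasoning tends to lose over long temporal gaps.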



BibTeX

@article{anonymous,
  author    = {anonymous},
  title     = {anonymous},
  journal   = {anonymous},
  year      = {anonymous},
}