
Apple trained a large language model to efficiently understand long-form video


Apple researchers have developed an adapted version of the SlowFast-LLaVA model that beats larger models at long-form video analysis and understanding. Here’s what that means.

The nerdy bits

Very basically, when an LLM is trained to also understand video, the pipeline splits the video into frames, applies computer vision to extract visual features from each one, analyzes how those features change over time, and aligns all of that with language so the model can describe or reason about the video in text.
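To make that general pipeline a bit more concrete, here is a minimal, runnable sketch of those steps. Everything in it (the frame sampler, the per-frame encoder, the projector, and the specific sizes) is an illustrative stand-in, not Apple's actual components:

```python
# Toy sketch of the generic video-LLM pipeline described above.
# All components here are hypothetical numpy stand-ins, not Apple's models.
import numpy as np

def sample_frames(num_frames: int, height: int = 224, width: int = 224) -> np.ndarray:
    """Stand-in for decoding a video and sampling frames (random pixels here)."""
    return np.random.rand(num_frames, height, width, 3).astype(np.float32)

def encode_frame(frame: np.ndarray, feature_dim: int = 768) -> np.ndarray:
    """Stand-in for a vision encoder: map one frame to a fixed-size feature vector."""
    rng = np.random.default_rng(int(frame.sum() * 1e3) % (2**32))
    return rng.standard_normal(feature_dim).astype(np.float32)

def project_to_tokens(features: np.ndarray, tokens_per_frame: int = 196) -> np.ndarray:
    """Stand-in for the projector that turns each frame's features into
    visual 'tokens' the language model can attend to alongside the text prompt."""
    return np.repeat(features[:, None, :], tokens_per_frame, axis=1)

frames = sample_frames(num_frames=16)                   # 1. split the video into frames
features = np.stack([encode_frame(f) for f in frames])  # 2. extract visual features per frame
visual_tokens = project_to_tokens(features)             # 3. map features into the LLM's token space
print(visual_tokens.shape)  # (16, 196, 768): 16 frames x 196 visual tokens, fed to the LLM with the prompt
```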

One very inefficient way to do this is to analyze every single frame of a video, which produces an overwhelming amount of duplicated information, since consecutive frames rarely differ significantly from one another.

With that much duplicated information at hand, it is very easy to blow past the LLM’s context window, which is the maximum amount of information it can retain at once. Once the conversation exceeds that window, the model drops the oldest tokens to make room for new ones as it predicts each new token, so earlier parts of the video effectively fall out of view.
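As a rough back-of-the-envelope illustration (the frame rate, tokens-per-frame, and context size below are assumed round numbers, not figures from Apple's paper), keeping every frame adds up very quickly:

```python
# Illustrative token budget for naive every-frame processing (assumed numbers).
fps = 30                  # typical video frame rate
duration_s = 10 * 60      # a 10-minute clip
tokens_per_frame = 196    # assumed visual tokens produced per frame
context_window = 128_000  # a common LLM context size, for comparison

total_frames = fps * duration_s
total_tokens = total_frames * tokens_per_frame
print(f"{total_frames:,} frames -> {total_tokens:,} visual tokens "
      f"({total_tokens / context_window:.0f}x a {context_window:,}-token window)")
# 18,000 frames -> 3,528,000 visual tokens (28x a 128,000-token window)
```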

Of course, there are more efficient ways to train video LLMs (NVIDIA recently published an interesting paper on this), but this is the general idea to keep in mind for Apple’s study.

Apple’s study

As Apple’s researchers explain it in the paper SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding:

“Video large language models (LLMs) integrate video perception into pre-trained LLMs to process videos and generate responses to user commands. Although significant progress has been made, notable limitations remain in existing Video LLMs.”

The limitations, according to them, are threefold:
