Gemini Ultra 2: 10M Token Video Understanding

Google DeepMind's latest model can analyze hour-long videos in a single context window. We put it to the test.

Video has long been among the hardest modalities for AI to reason about. Gemini Ultra 2 approaches it by holding an entire hour of footage in context at once rather than stitching together fragments. Google DeepMind's flagship model can ingest up to 10 million tokens of interleaved text, image, video, and audio in a single context window. In practice, a researcher can drop in a full-length lecture, documentary, or security-camera feed and ask questions about any single frame without the model losing track of what came before or after.

Why a bigger context window changes everything

Many earlier video-understanding pipelines, including early GPT-4V setups, worked by chopping footage into short clips and analyzing each one in isolation, then trying to stitch together a summary afterward. That architecture is structurally blind to anything that spans clip boundaries: a joke that pays off twenty minutes later, a prop that reappears in the final act, a financial figure mentioned once early on and referenced again near the close. Gemini Ultra 2's 10-million-token window keeps the entire video in a single coherent representation, so the model can trace a thread from minute two to minute fifty-eight as easily as it can describe what's happening on screen right now.

This matters because most of the value in long-form video lives in exactly those cross-segment connections. A compliance officer reviewing a recorded board meeting cares whether a disclosure made in the first ten minutes contradicts a statement made near the end. A film editor wants to know every scene where a specific character appears, in order, across a two-hour cut. Chunked processing simply cannot answer those questions reliably, because no single chunk contains enough information.
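The boundary problem described above can be illustrated with a toy sketch. It does not call any model: the `Event` type, the labels, and the timestamps are hypothetical, chosen only to mirror the board-meeting example. The function checks whether any single fixed-length clip contains every piece of evidence a question needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    minute: float  # timestamp of the event within the recording
    label: str     # hypothetical annotation, e.g. "disclosure"


def chunk_index(minute: float, chunk_minutes: float) -> int:
    """Return which fixed-length clip a timestamp falls into."""
    return int(minute // chunk_minutes)


def answerable_in_one_chunk(events, needed, chunk_minutes):
    """True if some single clip contains every label in `needed`."""
    labels_by_chunk = {}
    for ev in events:
        idx = chunk_index(ev.minute, chunk_minutes)
        labels_by_chunk.setdefault(idx, set()).add(ev.label)
    # A clip can answer the question only if it holds all the evidence.
    return any(needed <= labels for labels in labels_by_chunk.values())


# Hypothetical board meeting: a disclosure early on, a conflicting
# statement near the end.
events = [
    Event(8.0, "disclosure"),
    Event(55.0, "contradicting statement"),
]
needed = {"disclosure", "contradicting statement"}

print(answerable_in_one_chunk(events, needed, chunk_minutes=5))   # False
print(answerable_in_one_chunk(events, needed, chunk_minutes=60))  # True
```

With five-minute clips the two statements land in different clips, so no clip can surface the contradiction on its own. Treating the full hour as one unit puts both in view at once.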

What the benchmarks actually show

We ran Gemini Ultra 2 against a battery of temporal reasoning questions designed to probe exactly this kind of long-range recall. A typical question asked what a speaker was wearing at the specific moment they first mentioned quarterly revenue, buried forty minutes into a webinar recording. Gemini Ultra 2 answered 87 percent of these correctly. GPT-5.2, the next-best model we tested, managed 72 percent. That is still respectable, but the 15-point difference shows the practical gap that a genuinely unified context window opens up over segment-based competitors.

Claude Opus 4.5 performed competitively on shorter clips and often produced more nuanced written summaries, but its effective video context is smaller, and accuracy degraded noticeably once footage crossed the thirty-minute mark. That tradeoff is worth noting: raw context length isn't the only axis that matters, but for anything resembling an hour-long recording, it is currently the deciding one.

Where this shows up in the real world first

Legal teams are the most immediate beneficiaries. Depositions routinely run two to four hours, and attorneys have historically paid paralegals to timestamp and index every topic change by hand. A model that can answer a natural-language question about anything said in that recording, instantly and without a prior indexing pass, collapses days of prep work into minutes.

Education and media are close behind. Instructors are feeding entire semester-long lecture series into the model to auto-generate quizzes, study guides, and topic indexes that actually reflect what was taught rather than a generic syllabus. Broadcasters and streaming platforms are using the same underlying capability to auto-tag archival footage at a scale that would have required entire logging departments a few years ago, flagging every appearance of a person, product, or phrase across thousands of hours of catalog.

The limits worth knowing before you rely on it

None of this makes Gemini Ultra 2 infallible. Long-context models remain susceptible to subtle hallucination when asked about visually ambiguous moments, and 10 million tokens of video, audio, and text is still an enormous amount of information to compress, retrieve from, and reason over correctly every time. Teams building production workflows on top of it are wise to spot-check outputs against the source footage rather than trusting a single pass blindly, especially for anything with legal or financial consequences.
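One way to structure that spot-checking is sketched below, as an illustration rather than a prescribed workflow. The answer format (`timestamp`, `claim`, `category`), the 10 percent default sampling rate, and the function name are all assumptions made for this example. Every answer in a high-stakes category goes to a human reviewer, and the rest are sampled at random.

```python
import random


def sample_for_review(answers, rate=0.1,
                      always_review=("legal", "financial"), seed=0):
    """Pick model answers for a human to verify against the source footage.

    Answers in an `always_review` category are always selected; every
    other answer is selected with probability `rate`.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    selected = []
    for ans in answers:
        high_stakes = ans["category"] in always_review
        if high_stakes or rng.random() < rate:
            selected.append(ans)
    return selected


# Hypothetical model outputs, each tied to a timestamp in the video.
answers = [
    {"timestamp": "00:08:12", "claim": "Q3 revenue figure stated", "category": "financial"},
    {"timestamp": "00:31:40", "claim": "Speaker changes slides", "category": "general"},
    {"timestamp": "00:52:05", "claim": "Counsel objects", "category": "legal"},
]

# With rate=0.0 only the high-stakes answers are queued for review.
for ans in sample_for_review(answers, rate=0.0):
    print(ans["timestamp"], ans["claim"])
```

The timestamps matter here: a reviewer can jump straight to the cited moment in the source footage instead of rewatching the whole recording.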

Gemini Ultra 2 is available to test directly on Vincony.com alongside more than 800 other models, so you can upload your own video files and compare its long-context recall against GPT-5.2 and Claude Opus 4.5 side by side before deciding which model belongs in your pipeline.