
How AI Watches Cooking Videos: Inside Pluck's Multi-Modal Extraction Pipeline

Pluck Team · 12 min read

Tags: technology · AI · video · recipe extraction · deep-dive

Imagine a 30-second Instagram Reel where someone makes shakshuka. The camera tilts down into a cast iron skillet. A hand cracks eggs into simmering tomato sauce. Text flashes on screen for two seconds — “1 can crushed tomatoes” — while the creator says “I like to add a little cumin here, maybe half a teaspoon.” There’s no written recipe in the caption. There’s no blog post. The entire recipe exists as a tangle of audio, video, and fleeting on-screen text.

How does an AI turn that into a structured recipe with an ingredient list and numbered steps?

This is the question at the center of what Pluck does, and the answer involves a pipeline that’s more sophisticated than most people realize. If you’ve read our post on whether AI can actually extract recipes from video, you got the overview. This post goes deeper — into the engineering decisions, the tradeoffs between extraction modes, the confidence math, and the reasons different video platforms demand different strategies.

What “watching” a video actually means for an AI

When a human watches a cooking video, everything happens at once. You see the ingredients, hear the instructions, read the text overlay, and mentally assemble the recipe in real time. Your brain fuses all of those inputs without you thinking about it.

An AI has to do the same thing, but deliberately and in stages. Pluck’s extraction pipeline breaks a cooking video into five distinct signal channels, processes each one independently, and then merges them through a fusion step that resolves conflicts and fills gaps. The five channels are: video frame analysis, audio transcription, on-screen text recognition (OCR), caption and subtitle parsing, and metadata analysis.

Each channel has different strengths. Each one fails in different ways. The power comes from combining them.
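The channel-then-fusion structure can be sketched as a simple data model. This is an illustrative shape, not Pluck's actual code: each channel emits observations about recipe facts, and the fusion step later resolves each fact across channels. The names (`Observation`, `group_by_field`) are hypothetical.

```python
from dataclasses import dataclass
from collections import defaultdict

# The five independent extraction channels:
CHANNELS = ("frames", "audio", "ocr", "captions", "metadata")

@dataclass
class Observation:
    channel: str       # which of the five channels produced it
    field: str         # the recipe fact, e.g. "quantity:garam_masala"
    value: str         # what the channel read, heard, or saw
    confidence: float  # channel-local confidence in [0, 1]

def group_by_field(observations):
    """Bucket observations by recipe fact so the fusion step can
    resolve each fact (ingredient, quantity, step) across channels."""
    buckets = defaultdict(list)
    for obs in observations:
        buckets[obs.field].append(obs)
    return buckets

obs = [
    Observation("audio", "quantity:garam_masala", "1 tablespoon", 0.8),
    Observation("ocr",   "quantity:garam_masala", "1 tbsp", 0.9),
]
print(len(group_by_field(obs)["quantity:garam_masala"]))  # prints 2
```

Keeping channels independent until fusion means a failure in one channel (say, a garbled caption) never contaminates the others.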

Video frame sampling: seeing the kitchen

The AI doesn’t process every frame of a video. A 10-minute YouTube video at 30 frames per second contains 18,000 frames. Most of them are nearly identical — the camera sitting on a tripod watching a pot simmer. Processing all of them would waste compute and produce no additional signal.

Instead, Pluck uses intelligent frame sampling. The system detects scene changes — a cut from a prep shot to a stovetop shot, a zoom into a measuring cup, a title card appearing. These transition points are where the information density is highest. The system also samples at regular intervals to catch slow-moving scenes where a creator is assembling ingredients on a counter without dramatic cuts.
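The two sampling triggers described above — scene changes plus a periodic fallback — can be sketched with a simple frame-difference heuristic. This is a minimal stand-in (frames as flat lists of grayscale pixel values), not Pluck's production sampler; the threshold and interval values are assumptions.

```python
def sample_frames(frames, scene_threshold=30.0, interval=90):
    """Pick frame indices worth analyzing: scene changes (large
    pixel difference from the previous frame) plus a regular
    interval to cover slow, static shots."""
    selected = {0}  # always keep the first frame
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        # mean absolute pixel difference between consecutive frames
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > scene_threshold:   # hard cut or big camera move
            selected.add(i)
        if i % interval == 0:        # periodic fallback sampling
            selected.add(i)
    return sorted(selected)

# A static shot with one cut at frame 3:
static, bright = [10] * 4, [200] * 4
print(sample_frames([static, static, static, bright, bright], interval=2))
# prints [0, 2, 3, 4]: first frame, interval picks, and the cut
```

A production version would work the same way on decoded video frames (e.g. via a frame-differencing primitive), but the selection logic is the interesting part.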

What does the AI actually look for in a sampled frame? Several things simultaneously. Ingredient layouts — a cutting board with identifiable vegetables. Measurements visible on cups, spoons, or kitchen scales. Packaging labels that identify specific products. Text overlays the creator has added. The state of the food at different stages: raw ingredients, something browning in a pan, a finished dish on a plate. Each of these observations becomes a data point that contributes to the final recipe.

Frame analysis is indispensable for a specific category of content: the silent cooking video. These ASMR-style clips, popular on YouTube and Instagram, show someone cooking in total silence. No narration. Sometimes no text. Just hands, ingredients, and technique. Without frame analysis, these videos are completely opaque. With it, the AI can identify most of the ingredients and reconstruct the preparation sequence from visual evidence alone.

Why frame rate and resolution matter

Not all video sources are equal. A 4K YouTube upload gives the AI crisp, legible text overlays and identifiable packaging. A low-resolution TikTok repost compressed through three platforms might have text that’s barely readable even to a human. Pluck’s pipeline adapts its OCR confidence thresholds based on the source resolution — lower-resolution inputs get lower text confidence scores, which means the system leans more heavily on audio and caption data to compensate.
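The resolution-aware down-weighting can be sketched as a simple scaling rule. The tiers and factors below are illustrative assumptions, not Pluck's actual numbers:

```python
def scale_ocr_confidence(raw_conf: float, height_px: int) -> float:
    """Down-weight OCR readings from low-resolution sources so the
    fusion step leans more heavily on audio and caption data."""
    if height_px >= 2160:      # 4K source: text is crisp, trust it
        factor = 1.0
    elif height_px >= 1080:
        factor = 0.9
    elif height_px >= 720:
        factor = 0.75
    else:                      # multiply-compressed reposts
        factor = 0.5
    return raw_conf * factor
```

The same OCR read that scores 0.8 from a 4K upload would carry only 0.4 after passing through a heavily compressed repost, which automatically shifts the fusion vote toward the other channels.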

Audio transcription: listening to the cook

Speech-to-text is the single richest extraction channel for most cooking videos. Creators narrate as they cook, and that narration contains the core recipe: ingredients, quantities, timing, temperatures, and technique. “Two cups of all-purpose flour, sifted.” “Sear the chicken on medium-high for about four minutes per side.” “Now I’m going to add a teaspoon of smoked paprika — you could use regular paprika if that’s what you have.”

Modern speech recognition handles the chaos of a real kitchen surprisingly well. Sizzling oil, running water, clanking pans, background music — the models filter these out and isolate the human voice. Accents, fast speech, and mumbling that would have killed older systems are handled reliably by current-generation models.

But the AI isn’t just transcribing words. It’s parsing cooking language. It understands that “throw in a knob of butter” means butter is an ingredient, not a mechanical instruction. It knows that “let it go for about twenty minutes” is a timing cue. It recognizes that “season to taste” means salt and pepper are ingredients, even though no quantity was stated.

The transcription also captures conversational asides that contain real information. When a creator says “my grandmother always used lard here but I’m using butter,” the AI correctly identifies butter as the ingredient being used in this recipe while noting the substitution context.
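The kind of cooking-language parsing described above can be sketched with a toy rule-based pass. In practice this is handled by a language model, but the structure of the problem — quantities, timing cues, and ingredients implied by stock phrases — looks like this. All patterns and the `IMPLIED` table are illustrative assumptions:

```python
import re

# Stock phrases that imply ingredients without naming quantities:
IMPLIED = {"season to taste": ["salt", "black pepper"]}

def parse_utterance(text: str) -> dict:
    """Toy parser for spoken cooking language: extracts explicit
    quantities, timing cues in minutes, and implied ingredients."""
    lowered = text.lower()
    out = {"quantities": [], "minutes": [], "implied": []}
    # "a teaspoon of smoked paprika", "2 cups of flour", "1/2 cup sugar"
    for qty, unit, ing in re.findall(
            r"(\d+(?:/\d+)?|half|a)\s+(teaspoons?|tablespoons?|cups?)"
            r"\s+(?:of\s+)?([\w -]+)", lowered):
        out["quantities"].append((qty, unit, ing.strip()))
    # "let it go for about twenty minutes" needs number-word handling;
    # this sketch only catches digit forms like "20 minutes"
    out["minutes"] = [int(m) for m in re.findall(r"(\d+)\s+minutes?", lowered)]
    for phrase, ingredients in IMPLIED.items():
        if phrase in lowered:
            out["implied"].extend(ingredients)
    return out

print(parse_utterance("Add a teaspoon of smoked paprika, season to taste, "
                      "and let it go for 20 minutes."))
```

Even this crude version shows why transcription alone is not enough: the structured facts live in the phrasing, not just the words.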

On-screen text recognition: reading what creators write

A huge portion of short-form recipe video — TikTok, Instagram Reels, YouTube Shorts — uses text overlays as the primary communication method. A trending audio track plays while the creator layers text on screen: “2 cups flour,” “1/2 cup sugar,” “350F for 25 min.” The recipe lives entirely in this visual text layer.

Pluck’s OCR goes beyond raw character recognition. It applies spatial and temporal reasoning to the text it reads. Text that appears at the top of the frame in a large font at the beginning of a video is probably a recipe title. Text that flashes briefly while an ingredient is being added is probably a measurement. Text that persists across multiple cuts might be a running ingredient list.

The temporal dimension is important. A text overlay that appears at the same moment the creator pours milk into a bowl gets associated with that action. If the overlay says “1 cup whole milk” and the audio says “add some milk,” the AI takes the more specific OCR reading and merges it with the audio’s confirmation that milk is being added at this step.
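The spatial and temporal reasoning can be sketched as a few heuristic rules over an OCR'd overlay's position, size, and timing. The rules and thresholds below are illustrative, not Pluck's actual classifier:

```python
def classify_overlay(y_frac: float, height_frac: float,
                     t_start: float, t_end: float,
                     video_len: float) -> str:
    """Guess an overlay's role from where and when it appears.
    Positions are fractions of frame height; times in seconds."""
    duration = t_end - t_start
    if y_frac < 0.2 and height_frac > 0.08 and t_start < 0.1 * video_len:
        return "title"            # big text, top of frame, early on
    if duration > 0.5 * video_len:
        return "ingredient_list"  # persists across most of the video
    if duration < 3.0:
        return "measurement"      # quick flash tied to an action
    return "instruction"

# Large top-of-frame text in the first seconds of a 60s clip:
print(classify_overlay(0.1, 0.12, 1.0, 4.0, 60))  # prints "title"
```

A real system would feed these signals to a model rather than hard-code thresholds, but the inputs — position, size, onset time, persistence — are the same.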

The 300-millisecond problem

Some creators use editing styles where text appears for 300 milliseconds — barely long enough for a human to read, let alone an AI to process. Pluck’s frame sampling strategy specifically targets text appearance events, sampling more densely when it detects text entering the frame. Even so, extremely rapid text flashing remains one of the harder challenges. The system flags these extractions with lower confidence so you know to double-check them.

Caption parsing and metadata: the cross-reference layer

Closed captions, video descriptions, and platform metadata form the fourth and fifth channels. These are often treated as secondary, but they serve a critical role as a cross-reference layer that validates and enriches data from the primary channels.

Platform-generated auto-captions are essentially the platform’s own speech-to-text output. They’re imperfect — YouTube auto-captions are notorious for mangling cooking terms — but they provide an independent second opinion on what was said. When Pluck’s own transcription and YouTube’s auto-captions agree on “two tablespoons of soy sauce,” the confidence for that data point goes up. When they disagree, the system examines the context to decide which is more likely correct.
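The "independent second opinion" effect can be sketched with a simple confidence update: when two independent transcriptions agree on the same normalized phrase, the combined error rate is roughly the product of the individual error rates. The update rule is illustrative, not Pluck's actual math:

```python
def corroborate(primary: str, secondary: str, base_conf: float) -> float:
    """Boost confidence when two independent transcriptions agree
    on the same normalized phrase; leave it unchanged otherwise."""
    norm = lambda s: " ".join(s.lower().split())
    if norm(primary) == norm(secondary):
        # independent agreement: combined error = product of errors
        return 1 - (1 - base_conf) ** 2
    return base_conf

print(corroborate("two tablespoons of soy sauce",
                  "Two  tablespoons of soy sauce", 0.8))  # ~0.96
```

A disagreement does not automatically lower confidence here, matching the behavior described above: the system instead examines context to decide which reading is more likely correct.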

Creator-uploaded subtitles are more valuable because they’ve been manually reviewed. Video descriptions on YouTube frequently contain partial or full ingredient lists, timestamps, or links to blog posts with the written recipe. TikTok captions might mention key ingredients or the recipe name. All of this text gets parsed and correlated with the other channels.

Multi-modal fusion: where the signals merge

This is the part that makes the pipeline more than the sum of its parts. After each channel has been processed independently, the fusion step brings all five data streams together and resolves conflicts.

Consider a concrete example. A YouTube creator is making chicken tikka masala. Here’s what each channel produces:

  • Frame analysis identifies chicken, tomatoes, cream, onions, and several spices on the counter. It sees a measuring spoon.
  • Audio transcription captures “about a tablespoon of garam masala” and “two cups of yogurt for the marinade.”
  • OCR reads a text overlay that says “1 tbsp garam masala.”
  • Auto-captions render it as “a tablespoon of grandma’s salad” (because auto-captions are unreliable with cooking vocabulary).
  • Description lists “garam masala” in a partial ingredient list without a quantity.

The fusion engine resolves this to “1 tablespoon garam masala” — taking the quantity from OCR and audio (which agree), the ingredient identity from audio, OCR, and description (three-way agreement), and discarding the auto-caption’s garbled version. The auto-caption error actually doesn’t hurt anything because the other channels outvote it.
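The resolution above can be sketched as a weighted vote over normalized readings. The channel weights and the unit normalization are assumptions for illustration, not Pluck's real values:

```python
from collections import Counter

# Assumed per-channel trust weights:
WEIGHTS = {"ocr": 1.0, "audio": 1.0, "description": 0.8, "auto_captions": 0.4}

def normalize(value: str) -> str:
    """Fold unit abbreviations so 'tbsp' and 'tablespoon' vote together."""
    return value.lower().replace("tbsp", "tablespoon")

def fuse(readings: dict) -> tuple:
    """Weighted vote over per-channel readings of one recipe fact.
    Returns the winning value and its share of the total vote."""
    tally = Counter()
    for channel, value in readings.items():
        tally[normalize(value)] += WEIGHTS.get(channel, 0.5)
    value, score = tally.most_common(1)[0]
    return value, round(score / sum(tally.values()), 3)

print(fuse({
    "audio": "1 tablespoon garam masala",
    "ocr": "1 tbsp garam masala",
    "description": "garam masala",       # ingredient only, no quantity
    "auto_captions": "grandma's salad",  # garbled; gets outvoted
}))
```

The garbled auto-caption contributes so little weight that it cannot flip the result, which is exactly the "outvoted" behavior described above.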

This conflict resolution happens for every ingredient, every measurement, every instruction in the recipe. Each data point gets a confidence score based on how many channels contributed to it and how consistent they were.

Confidence scoring: knowing what you don’t know

Not every extraction is equally reliable, and pretending otherwise would erode trust. Pluck assigns a confidence score to every extraction, and that score directly affects the user experience.

High-confidence extractions — where multiple channels corroborated each other and the recipe is complete — can be saved quickly with minimal review. Lower-confidence extractions get surfaced with indicators showing which parts of the recipe the system is less sure about. Maybe the audio was noisy and the AI isn’t sure if the creator said “one teaspoon” or “one tablespoon” of cayenne. That ingredient gets flagged so you can verify it.

The confidence system also accounts for recipe completeness. A recipe with a title, 12 well-specified ingredients, and 8 numbered steps scores higher than one with a title, 4 vague ingredients, and a single paragraph of instructions. Completeness and specificity both factor into the final score.
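A completeness heuristic along those lines might look like the following. The weights and the six-step cap are illustrative assumptions, not Pluck's actual formula:

```python
def completeness_score(recipe: dict) -> float:
    """Score recipe completeness in [0, 1]: rewards a title,
    well-specified ingredients (quantity and unit present),
    and a reasonable number of steps."""
    score = 0.0
    if recipe.get("title"):
        score += 0.2
    ingredients = recipe.get("ingredients", [])
    if ingredients:
        specified = sum(1 for i in ingredients
                        if i.get("quantity") and i.get("unit"))
        score += 0.4 * (specified / len(ingredients))
    steps = recipe.get("steps", [])
    score += 0.4 * min(len(steps) / 6, 1.0)  # 6+ steps earns full marks
    return round(score, 2)
```

Under this sketch, the 12-ingredient, 8-step recipe from the example scores a full 1.0, while a title with four vague ingredients and one instruction paragraph lands well under 0.5.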

Why different platforms need different strategies

A 45-minute YouTube video and a 15-second TikTok clip are fundamentally different extraction problems, even though both contain recipes. Pluck’s pipeline adapts to the source platform in several ways.

YouTube videos tend to be longer, better narrated, and often include description text. Audio transcription is usually the dominant channel. The AI has more data to work with, but also needs to handle multi-recipe videos (like “5 easy weeknight dinners”) by identifying recipe boundaries within the content. For a detailed walkthrough, see our guide on saving YouTube cooking videos as recipes.

TikTok videos are short, often loud, and heavily reliant on text overlays. OCR becomes more important here. The audio track is frequently a trending song rather than narration, which means audio transcription produces less signal. Frame analysis matters more because the compressed format means every visual frame carries proportionally more information. And as we’ve covered extensively, TikTok’s own save feature is unreliable for long-term recipe storage.

Instagram Reels sit somewhere in between. They often combine narration with text overlays, and Instagram’s caption field regularly contains the full recipe in text form. When a creator writes out the recipe in the caption, that becomes the highest-confidence source, and the video analysis serves as validation rather than the primary extraction channel.
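The platform differences above amount to different channel-weight profiles. A minimal sketch, with profile values assumed for illustration (YouTube leaning on audio, TikTok on OCR and frames, Instagram on caption metadata):

```python
# Assumed per-platform channel weight profiles:
PLATFORM_WEIGHTS = {
    "youtube":   {"audio": 1.0, "ocr": 0.6, "frames": 0.6,
                  "captions": 0.8, "metadata": 0.7},
    "tiktok":    {"audio": 0.4, "ocr": 1.0, "frames": 0.9,
                  "captions": 0.5, "metadata": 0.5},
    "instagram": {"audio": 0.8, "ocr": 0.8, "frames": 0.7,
                  "captions": 0.6, "metadata": 1.0},
}

def channel_weights(url: str) -> dict:
    """Pick a weighting profile from the source URL; fall back to
    a balanced profile for unknown hosts."""
    for platform, weights in PLATFORM_WEIGHTS.items():
        if platform in url:
            return weights
    return dict.fromkeys(("audio", "ocr", "frames", "captions", "metadata"), 0.7)
```

The same fusion machinery then runs unchanged; only the trust placed in each channel shifts with the source.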

AI model routing: picking the right tool for the job

Pluck doesn’t rely on a single AI model. Different models have different strengths, and the pipeline routes extractions to the provider best suited for the content type.

The system can use Claude, GPT, and Gemini, among others. Some models are stronger at vision tasks — parsing cluttered frames with multiple text overlays against a busy background. Others are better at understanding spoken language with heavy accents or background noise. Some excel at structured output generation, producing clean JSON with well-normalized ingredient quantities.

The routing decision is based on the content analysis from the early stages of the pipeline. A video that’s primarily narrated with clear audio might get routed to a model with strong language understanding. A silent cooking video with dense text overlays might go to a model with superior vision capabilities. The user never sees this routing — they just get the best extraction the system can produce.
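A routing rule over an early content profile might be sketched like this. The model names, profile fields, and thresholds are all placeholders, not Pluck's actual routing table:

```python
def route_model(profile: dict) -> str:
    """Route an extraction to a model class based on a content
    profile produced by the pipeline's early stages."""
    if (profile["speech_ratio"] < 0.1
            and profile["text_overlay_density"] > 0.5):
        return "vision-strong-model"   # silent video, dense overlays
    if profile["audio_noise"] > 0.7:
        return "robust-speech-model"   # heavy accent or noisy kitchen
    return "structured-output-model"   # default: clean JSON generation

# A near-silent clip covered in text overlays:
print(route_model({"speech_ratio": 0.02, "text_overlay_density": 0.8,
                   "audio_noise": 0.1}))  # prints "vision-strong-model"
```

The point of routing this way is that the decision costs almost nothing relative to the extraction itself, so there is no reason to send every video to the same model.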

Why any of this matters when you’re cooking dinner

All of this engineering serves a single practical purpose: you saw a recipe in a video, and you want to cook it. You don’t want to pause and rewind 40 times. You don’t want to screenshot it and lose it in your camera roll. You don’t want to transcribe it by hand.

The multi-modal pipeline means Pluck can handle the messy reality of how recipes actually exist on the internet today. Not neatly formatted on food blogs with a “Print Recipe” button — but spoken fast over sizzling pans, flashed on screen for two seconds, half-written in a caption, and never formally documented anywhere. The AI watches, listens, reads, and cross-references so you don’t have to. What comes out the other end is a clean recipe card: title, ingredients, steps, times, servings. The same format whether the source was a 45-minute YouTube masterclass or a 15-second TikTok. You can read more about how all five extraction modes work on our How Pluck Works page.

The technology is genuinely new. Two years ago, none of this was possible at a quality level that was useful for real cooking. Today, multi-modal AI has crossed the line from research demo to practical kitchen tool. And it’s still getting better — the models improve, the confidence calibration tightens, and the category of videos that produce incomplete extractions shrinks with every update.


Want to see multi-modal extraction in action? Pluck is available now on Android — get it on Google Play. Paste any cooking video URL and watch the AI turn it into a recipe you can actually cook from. iOS coming soon; join the waitlist to be notified.


Pluck Team

We're a small team of home cooks and engineers building the recipe app we always wanted. We write about recipe saving, AI extraction, and cooking smarter.

Learn more about us
