
Narrativ MVP Technical Blueprint — Document-to-Video Pipeline Architecture, Rendering Stack, and AI Integration


Date: 2026-03-20
Context: Narrativ has minimal technical research coverage (1 market report + 1 compatibility report). This blueprint synthesizes the document-to-video opportunity, Remotion rendering architecture, and AI pipeline design into a concrete MVP plan.


Executive Summary

  • Narrativ’s core pipeline is a 4-stage transformation: Document Ingestion → Content Intelligence → Scene Composition → Video Rendering
  • Remotion is the right rendering engine — React-based composition, Lambda-powered parallel rendering, and programmatic control over every frame. The Remotion + Next.js 16 compatibility issue (MOKA-309) is solvable with webpack mode enforcement
  • AI orchestration drives three critical stages: content extraction (LLM), narration generation (TTS), and visual asset selection (multimodal LLM)
  • Google’s NotebookLM Cinematic Video Overviews (launched March 2026) validates the document-to-video category but targets consumers, not professional/enterprise workflows
  • Key differentiation: Narrativ should own the “iterative refinement” loop — edit scenes, swap assets, adjust timing, re-render — unlike one-shot generators

1. Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        Narrativ Pipeline                                │
│                                                                         │
│  ┌──────────┐   ┌──────────────┐   ┌───────────────┐   ┌─────────────┐ │
│  │ Document  │──▶│   Content    │──▶│    Scene      │──▶│   Video     │ │
│  │ Ingestion │   │ Intelligence │   │ Composition   │   │  Rendering  │ │
│  └──────────┘   └──────────────┘   └───────────────┘   └─────────────┘ │
│                                                                         │
│  PDF, PPTX,       Extract key       Generate scenes     Remotion SSR    │
│  DOCX, MD,        points, create    with visual cues,   or Lambda for   │
│  Web URLs,        narrative arc,    TTS narration,      parallel MP4    │
│  Google Docs      chapter splits    asset matching       output          │
└─────────────────────────────────────────────────────────────────────────┘

Stage 1: Document Ingestion

| Input Format | Extraction Method | Output |
|---|---|---|
| PDF | pdf-parse + LLM for layout understanding | Structured text + images |
| PPTX | pptx npm package → slide-by-slide extraction | Text + speaker notes + embedded images |
| DOCX | mammoth → HTML → structured content | Headings, paragraphs, images |
| Markdown | Direct parsing (remark) | AST with sections |
| Web URL | @mozilla/readability + Playwright screenshot | Clean text + page visuals |
| Google Docs | Google Docs API → JSON | Structured document |

Key design decision: Extract structure, not just text. Headings define chapters. Speaker notes become narration hints. Embedded images become scene visuals.

interface ExtractedDocument {
  title: string;
  sections: Section[];
  metadata: {
    author?: string;
    language: string;
    totalWords: number;
    estimatedReadTime: number;
    images: ExtractedImage[];
  };
}

interface Section {
  heading: string;
  level: number;       // h1=1, h2=2, etc.
  content: string;     // main text
  speakerNotes?: string;  // from PPTX
  images: ExtractedImage[];
  bulletPoints: string[];
}
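
As an illustration of the structure-first principle, here is a minimal Markdown splitter that fills a Section-like shape. It is a regex-based sketch only; the real pipeline would walk remark's AST instead.

```typescript
interface SectionSketch {
  heading: string;
  level: number;        // h1=1, h2=2, etc.
  content: string;      // main text
  bulletPoints: string[];
}

// Illustrative only: split Markdown into sections by ATX headings,
// separating bullet lines from body text as the Section interface does.
function splitMarkdown(md: string): SectionSketch[] {
  const sections: SectionSketch[] = [];
  let current: SectionSketch | null = null;
  for (const line of md.split('\n')) {
    const h = /^(#{1,6})\s+(.*)$/.exec(line);
    if (h) {
      current = { heading: h[2], level: h[1].length, content: '', bulletPoints: [] };
      sections.push(current);
    } else if (current) {
      const b = /^[-*]\s+(.*)$/.exec(line.trim());
      if (b) current.bulletPoints.push(b[1]);
      else current.content += line + '\n';
    }
  }
  return sections;
}
```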

Stage 2: Content Intelligence (LLM-Powered)

This is where AI transforms raw document content into a video script:

interface VideoScript {
  title: string;
  totalDuration: number;  // estimated seconds
  scenes: SceneScript[];
  style: VideoStyle;
}

interface SceneScript {
  id: string;
  title: string;
  narration: string;         // text for TTS
  visualDescription: string; // prompt for asset selection
  keyPoints: string[];       // on-screen text overlays
  duration: number;          // seconds
  transition: 'fade' | 'slide' | 'cut' | 'zoom';
  visualType: 'illustration' | 'screenshot' | 'chart' | 'photo' | 'animation';
  sourceSection: string;     // reference back to document section
}

interface VideoStyle {
  tone: 'professional' | 'casual' | 'educational' | 'marketing';
  pacing: 'slow' | 'normal' | 'fast';
  colorScheme: string[];
  fontFamily: string;
  musicMood?: string;
}

LLM Prompt Strategy:

  1. Script Generation — Claude or GPT-4o receives the extracted document and produces a structured VideoScript JSON. System prompt defines scene composition rules: max 30 seconds per scene, 1-3 key points per scene, narration matches visual descriptions.

  2. Visual Asset Matching — Multimodal LLM matches visualDescription to:

    • Stock image APIs (Unsplash, Pexels)
    • Document’s own embedded images
    • Generated illustrations (Flux via API)
    • Auto-generated charts from data in the document
  3. Narration Refinement — LLM adjusts narration text for spoken delivery: shorter sentences, natural pauses, pronunciation hints for technical terms.
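
Because LLM output can drift from the system prompt's rules, the scene-composition constraints (max 30 seconds per scene, 1-3 key points) are worth enforcing post-hoc as well. A sketch of such a validation pass, using a narrowed shape that mirrors the SceneScript interface above:

```typescript
interface SceneCheck {
  id: string;
  duration: number;     // seconds
  keyPoints: string[];
}

// Validate the LLM's script against the same rules the system prompt states.
// Returns a list of violations; an empty array means the script passes.
function validateScript(scenes: SceneCheck[], totalDuration: number): string[] {
  const errors: string[] = [];
  for (const s of scenes) {
    if (s.duration > 30) errors.push(`${s.id}: scene exceeds 30s`);
    if (s.keyPoints.length < 1 || s.keyPoints.length > 3)
      errors.push(`${s.id}: needs 1-3 key points`);
  }
  const sum = scenes.reduce((acc, s) => acc + s.duration, 0);
  if (Math.abs(sum - totalDuration) > 5)
    errors.push('totalDuration deviates from scene sum by more than 5s');
  return errors;
}
```

Failing scenes can be sent back to the LLM for a single-scene regeneration rather than rerunning the whole script.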

Model Selection:

| Task | Primary Model | Fallback | Cost per Video (est.) |
|---|---|---|---|
| Script generation | Claude Sonnet 4.6 | GPT-4o | $0.02-0.05 |
| Visual matching | GPT-4o (vision) | Claude Sonnet 4.6 | $0.01-0.03 |
| Narration polish | Claude Haiku 4.5 | GPT-4o-mini | $0.001-0.005 |
| Total AI cost per video | | | $0.03-0.09 |
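
The primary/fallback pairing can be wired generically (a sketch; the two thunks would wrap the actual Anthropic and OpenAI SDK calls):

```typescript
// Generic primary/fallback routing for any of the LLM tasks above.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>
): Promise<T> {
  try {
    return await primary();
  } catch {
    // Primary model failed or timed out; retry once on the fallback.
    return fallback();
  }
}
```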

Stage 3: Scene Composition (React + Remotion)

Each scene becomes a Remotion <Composition>:

// scenes/KeyPointScene.tsx
import { AbsoluteFill, Img, useCurrentFrame, interpolate } from 'remotion';

interface KeyPointSceneProps {
  narration: string;
  backgroundImage: string;
  keyPoints: string[];
  duration: number;  // frames
}

export const KeyPointScene: React.FC<KeyPointSceneProps> = ({
  backgroundImage,
  keyPoints,
  duration,
}) => {
  const frame = useCurrentFrame();

  return (
    <AbsoluteFill style={{ backgroundColor: '#0f172a' }}>
      <Img
        src={backgroundImage}
        style={{
          width: '100%',
          height: '100%',
          objectFit: 'cover',
          opacity: 0.3,
        }}
      />
      <div style={{ position: 'absolute', padding: 80 }}>
        {keyPoints.map((point, i) => {
          const enterFrame = (duration / keyPoints.length) * i;
          const opacity = interpolate(
            frame,
            [enterFrame, enterFrame + 15],
            [0, 1],
            { extrapolateRight: 'clamp' }
          );
          return (
            <div key={i} style={{ opacity, marginBottom: 24 }}>
              <span style={{ fontSize: 36, color: 'white' }}>{point}</span>
            </div>
          );
        })}
      </div>
    </AbsoluteFill>
  );
};

Scene Template Library (MVP):

| Template | Use Case | Visual Elements |
|---|---|---|
| TitleScene | Opening/chapter titles | Animated text, gradient background |
| KeyPointScene | Bullet points | Staggered text reveal, background image |
| ChartScene | Data visualization | Animated Recharts, data from document |
| ImageShowcaseScene | Product/screenshot | Ken Burns effect, caption overlay |
| QuoteScene | Pull quotes | Large text, attribution |
| ComparisonScene | Before/after, pros/cons | Split screen, animated transitions |
| SummaryScene | Closing recap | List of key takeaways |
| CallToActionScene | Outro | CTA text, link, QR code |

Stage 4: Video Rendering

Two rendering paths based on use case:

Path A: Server-Side Rendering (self-hosted)

import { bundle } from '@remotion/bundler';
import { renderMedia, selectComposition } from '@remotion/renderer';

async function renderVideo(script: VideoScript): Promise<string> {
  const serveUrl = await bundle({
    entryPoint: './src/remotion/index.ts',
    webpackOverride: (config) => config,
  });

  // Resolve composition metadata (fps, dimensions, duration) from the
  // Remotion root instead of hard-coding it at the call site.
  const composition = await selectComposition({
    serveUrl,
    id: 'NarrativVideo',
    inputProps: { script },
  });

  const outputPath = `/tmp/output-${Date.now()}.mp4`;

  await renderMedia({
    composition,
    serveUrl,
    codec: 'h264',
    outputLocation: outputPath,
    inputProps: { script },
  });

  return outputPath;
}

Path B: Lambda Rendering (scalable)

import { renderMediaOnLambda, getRenderProgress } from '@remotion/lambda/client';

async function renderOnLambda(script: VideoScript): Promise<string> {
  const { renderId, bucketName } = await renderMediaOnLambda({
    region: 'us-east-1',
    functionName: 'narrativ-render',
    serveUrl: process.env.REMOTION_SERVE_URL!,  // deployed Remotion site
    composition: 'NarrativVideo',
    codec: 'h264',
    inputProps: { script },
    framesPerLambda: 20,  // parallelism factor
  });

  // Poll for completion
  while (true) {
    const progress = await getRenderProgress({
      renderId,
      bucketName,
      functionName: 'narrativ-render',
      region: 'us-east-1',
    });
    if (progress.fatalErrorEncountered) {
      throw new Error(progress.errors[0]?.message ?? 'Render failed');
    }
    if (progress.done && progress.outputFile) {
      return progress.outputFile;  // S3 URL
    }
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}

Rendering Cost Comparison:

| Method | 3-min Video | 10-min Video | Scalability |
|---|---|---|---|
| Self-hosted (VPS) | ~2 min render, $0 | ~7 min render, $0 | Limited by CPU |
| Lambda (parallel) | ~30 sec, ~$0.15 | ~45 sec, ~$0.40 | Near-infinite |
| Lambda + SQS queue | ~30 sec, ~$0.15 | ~45 sec, ~$0.40 | Rate-limited but queued |
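
The near-flat Lambda render times follow from the fan-out: with framesPerLambda: 20, each function renders a 20-frame chunk, so longer videos spawn more functions rather than taking longer. The arithmetic:

```typescript
// Number of Lambda invocations for a render at a given parallelism factor.
function lambdaCount(durationSec: number, fps: number, framesPerLambda: number): number {
  return Math.ceil((durationSec * fps) / framesPerLambda);
}
```

A 3-minute video at 30 fps is 5,400 frames, i.e. 270 concurrent functions; account-level Lambda concurrency limits are the practical ceiling, which is what the SQS queue path absorbs.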

2. TTS Integration — Voice Narration

Provider Comparison (March 2026)

| Provider | Latency (TTFB) | Quality | Price per 1M chars | Languages | Voice Cloning | Best For |
|---|---|---|---|---|---|---|
| ElevenLabs | ~150ms | Best-in-class | ~$30 | 70+ | Yes (instant + pro) | Quality-first narration |
| OpenAI TTS-1 | ~200ms | Good | ~$15 | 58 | No | Cost-effective, ecosystem |
| OpenAI TTS-1-HD | ~300ms | Very good | ~$30 | 58 | No | Higher quality at 2× the price |
| Cartesia Sonic | ~95ms | Good | ~$6 | 15 | Limited | Ultra-low latency |
| Deepgram Aura-2 | ~150ms | Good | ~$15 | 20+ | No | Real-time apps |
| Google Cloud TTS | ~200ms | Good | ~$16 | 50+ | No | Google ecosystem |

MVP: OpenAI TTS-1-HD — good quality, simple API, cost-effective at $30/M chars. A 3-minute video needs 4,500 characters of narration ($0.14).

V1.1: Add ElevenLabs as premium option — voice cloning, superior expressiveness, 70+ languages.
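
Narration cost scales linearly with character count, so per-video TTS spend is easy to estimate up front:

```typescript
// TTS cost for a narration at a given per-million-character rate.
// Assumes ~1,500 characters per minute of speech at a normal pace.
function ttsCostUsd(chars: number, pricePerMillionChars: number): number {
  return (chars / 1_000_000) * pricePerMillionChars;
}
```

At $30/M chars, the 4,500-character narration of a 3-minute video comes to roughly $0.14, matching the cost model in section 5.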

interface TTSProvider {
  synthesize(text: string, options: TTSOptions): Promise<AudioBuffer>;
}

interface TTSOptions {
  voice: string;
  speed: number;     // 0.5-2.0
  language: string;
  format: 'mp3' | 'wav';
}

// Usage in pipeline
async function generateNarration(scenes: SceneScript[]): Promise<NarrationResult[]> {
  const tts = new OpenAITTS({ model: 'tts-1-hd', voice: 'nova' });

  return Promise.all(
    scenes.map(async (scene) => {
      const audio = await tts.synthesize(scene.narration, {
        voice: 'nova',
        speed: 1.0,
        language: 'en',
        format: 'mp3',
      });

      return {
        sceneId: scene.id,
        audioUrl: await uploadToStorage(audio),
        durationMs: audio.duration * 1000,
      };
    })
  );
}

Audio-Visual Sync

Narration duration determines scene duration (not the other way around):

  1. Generate TTS for each scene’s narration text
  2. Measure actual audio duration
  3. Adjust Remotion composition durationInFrames to match: Math.ceil(audioDurationSec * fps)
  4. Key point animations are timed proportionally within the audio duration
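
Steps 3 and 4 reduce to two small helpers, mirroring the staggering logic in KeyPointScene:

```typescript
const FPS = 30;

// Step 3: scene length in frames is derived from the measured audio.
function sceneDurationInFrames(audioDurationMs: number): number {
  return Math.ceil((audioDurationMs / 1000) * FPS);
}

// Step 4: key point i enters at an even fraction of the scene duration.
function keyPointEnterFrames(durationInFrames: number, count: number): number[] {
  return Array.from({ length: count }, (_, i) =>
    Math.round((durationInFrames / count) * i)
  );
}
```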

3. Application Architecture

┌──────────────────────────────────────────────────────────┐
│                    Next.js 16 App                         │
│                                                           │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐ │
│  │  Upload UI   │  │  Editor UI   │  │  Dashboard UI    │ │
│  │  (drag-drop) │  │  (scene edit)│  │  (video library) │ │
│  └──────┬──────┘  └──────┬───────┘  └──────────────────┘ │
│         │                │                                 │
│  ┌──────▼────────────────▼────────────────────────────┐   │
│  │              API Routes (Next.js)                   │   │
│  │  POST /api/projects       — create from document    │   │
│  │  POST /api/projects/:id/generate — run AI pipeline  │   │
│  │  PATCH /api/scenes/:id    — edit scene              │   │
│  │  POST /api/projects/:id/render — trigger render     │   │
│  │  GET  /api/projects/:id/status — render progress    │   │
│  └──────┬────────────────────────────────────────────┘   │
│         │                                                 │
│  ┌──────▼──────────────────────────────────────────────┐ │
│  │           Background Jobs (BullMQ + Redis)           │ │
│  │                                                       │ │
│  │  IngestJob → ContentIntelligenceJob → TTSJob →       │ │
│  │  CompositionJob → RenderJob → NotifyJob              │ │
│  └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘

              ┌────────────┼─────────────┐
              │            │             │
       ┌──────▼──┐  ┌─────▼────┐  ┌─────▼──────┐
       │ Postgres │  │ R2/S3    │  │ Remotion   │
       │ (meta)   │  │ (assets) │  │ Lambda     │
       └─────────┘  └──────────┘  └────────────┘

Tech Stack

| Layer | Technology | Rationale |
|---|---|---|
| Framework | Next.js 16 (App Router) | Moklabs standard, SSR for SEO, API routes |
| UI | React 19, Tailwind, Radix | Design system consistency |
| State | TanStack Query + Zustand | Server state + local editor state |
| Database | PostgreSQL (Drizzle ORM) | Projects, scenes, render history |
| Queue | BullMQ + Redis | Job pipeline with retries, progress tracking |
| Storage | Cloudflare R2 | Assets, renders, thumbnails |
| Rendering | Remotion 4.x + Lambda | Deterministic video from React components |
| TTS | OpenAI TTS-1-HD (MVP) | Balance of quality and cost |
| AI | Claude Sonnet 4.6 / GPT-4o | Script generation, visual matching |
| Auth | Better Auth | Moklabs SSO integration |

Database Schema (Core)

-- Projects
CREATE TABLE projects (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID REFERENCES users(id),
  title TEXT NOT NULL,
  source_type TEXT NOT NULL,  -- 'pdf', 'pptx', 'docx', 'md', 'url'
  source_url TEXT,
  extracted_content JSONB,
  video_script JSONB,
  style JSONB,
  status TEXT DEFAULT 'draft',  -- draft, generating, ready, rendering, complete
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now()
);

-- Scenes (editable independently)
CREATE TABLE scenes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
  sort_order INTEGER NOT NULL,
  template TEXT NOT NULL,  -- 'title', 'keypoint', 'chart', etc.
  narration TEXT,
  visual_description TEXT,
  key_points JSONB,
  background_asset_url TEXT,
  audio_url TEXT,
  audio_duration_ms INTEGER,
  duration_frames INTEGER,
  transition TEXT DEFAULT 'fade',
  props JSONB,  -- template-specific properties
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Renders
CREATE TABLE renders (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID REFERENCES projects(id),
  status TEXT DEFAULT 'queued',  -- queued, rendering, complete, failed
  output_url TEXT,
  codec TEXT DEFAULT 'h264',
  resolution TEXT DEFAULT '1920x1080',
  fps INTEGER DEFAULT 30,
  duration_sec FLOAT,
  render_time_sec FLOAT,
  cost_usd FLOAT,
  error_message TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

4. Editor UX — The Iterative Refinement Loop

The key differentiator vs one-shot generators:

Upload → Auto-generate script → Preview → Edit → Re-render → Download
                                    ▲         │
                                    └─────────┘
                                  (iterate on scenes)

Editor Capabilities (MVP)

  1. Scene Timeline — Horizontal strip of scene thumbnails. Drag to reorder. Click to select.
  2. Scene Editor Panel — Edit narration text, key points, visual description. Swap background image. Adjust duration. Change template.
  3. Live Preview — Remotion <Player> component renders scene in-browser. Instant feedback without full render.
  4. Regenerate Scene — Re-run AI for a single scene with modified instructions.
  5. Full Render — Trigger server-side or Lambda render for final MP4.

Remotion Player Integration

import { Player } from '@remotion/player';

function ScenePreview({ scene }: { scene: Scene }) {
  return (
    <Player
      component={sceneTemplates[scene.template]}
      inputProps={scene.props}
      durationInFrames={scene.duration_frames}
      fps={30}
      compositionWidth={1920}
      compositionHeight={1080}
      style={{ width: '100%', aspectRatio: '16/9' }}
      controls
    />
  );
}

5. Cost Model

Per-Video Cost Breakdown (3-minute video, 10 scenes)

| Component | Cost |
|---|---|
| Document extraction | $0.00 (local processing) |
| Script generation (Claude Sonnet) | $0.03 |
| Visual matching (GPT-4o vision) | $0.02 |
| Narration polish (Claude Haiku) | $0.005 |
| TTS narration (OpenAI TTS-1-HD, ~4,500 chars) | $0.14 |
| Stock images (Unsplash/Pexels free tier) | $0.00 |
| Lambda rendering | $0.15 |
| R2 storage (100MB video, 1 month) | $0.0015 |
| Total COGS per video | ~$0.35 |

Pricing Potential

At $0.35 COGS, a freemium model with:

  • Free: 3 videos/month (COGS: $1.05/user/month)
  • Pro $19/month: 30-video cap → $10.50 COGS at full usage (~45% gross margin worst case; far higher at typical usage below the cap)
  • Business $49/month: 100 videos + team features → $35.00 COGS at full usage (~29% gross margin worst case)
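
Worst-case margin (every subscriber exhausting their quota) is a one-liner worth sanity-checking whenever tiers or COGS change:

```typescript
// Gross margin if a subscriber renders their full monthly video quota.
function worstCaseGrossMargin(
  priceUsd: number,
  videoQuota: number,
  cogsPerVideo = 0.35
): number {
  return (priceUsd - videoQuota * cogsPerVideo) / priceUsd;
}
```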

6. Competitive Moat

| Feature | NotebookLM Cinematic | Synthesia | Lumen5 | Narrativ |
|---|---|---|---|---|
| Document input | Yes (Google Docs only) | No | Blog/text | Any format (PDF, PPTX, DOCX, MD, URL) |
| Iterative editing | No (one-shot) | Scene editor | Template editor | Full scene editor + live preview |
| Custom branding | No | Yes (Enterprise) | Yes | Yes (all tiers) |
| Programmatic API | No | Yes ($$$) | No | Yes (API-first) |
| Self-hosted option | No | No | No | Yes (Remotion SSR) |
| AI voice narration | Yes | Avatar lip-sync | No | TTS with voice selection |
| Cost per video | Free (Google lock-in) | $0.50-2.00 | $0.30+ | $0.35 |
| Open source | No | No | No | Core engine (MIT) |

Differentiation Summary

  1. Format-agnostic ingestion — not locked to Google Docs or specific templates
  2. Iterative refinement — edit any scene, re-render selectively, not one-shot generation
  3. API-first — developers can build on top of Narrativ’s pipeline
  4. Transparent cost — no per-minute or per-credit pricing games

7. MVP Scope & Milestones

Phase 1: Core Pipeline (4 weeks)

  • Document ingestion (PDF + PPTX + Markdown)
  • LLM script generation with VideoScript output
  • 4 scene templates (Title, KeyPoint, ImageShowcase, Summary)
  • OpenAI TTS-1-HD narration
  • Remotion SSR rendering (server-side)
  • Basic upload → generate → download flow
  • PostgreSQL schema + Drizzle ORM

Phase 2: Editor + Preview (3 weeks)

  • Remotion Player in-browser preview
  • Scene editor (narration, key points, images)
  • Drag-and-drop scene reordering
  • Re-generate single scene
  • BullMQ job pipeline with progress
  • R2 asset storage

Phase 3: Scale + Polish (3 weeks)

  • Lambda rendering integration
  • Additional templates (Chart, Comparison, Quote, CTA)
  • DOCX + URL ingestion
  • ElevenLabs TTS integration (premium)
  • User dashboard with video library
  • Better Auth integration

Phase 4: Growth (ongoing)

  • Public API for programmatic video generation
  • Custom template builder
  • Team workspaces
  • Localization (multi-language TTS)
  • Background music library
  • SCORM export for L&D market

8. Risk Assessment

| Risk | Impact | Mitigation |
|---|---|---|
| Remotion + Next.js 16 compatibility (MOKA-309) | High | Enforce webpack mode; keep Remotion server-only; see existing research report |
| NotebookLM makes doc-to-video free | High | Differentiate on editing, API, format support, non-Google lock-in |
| TTS quality perception | Medium | Default to HD model; offer ElevenLabs upgrade; allow custom voice upload |
| Lambda cold start latency | Low | Pre-warm functions; SQS queue for background rendering |
| LLM script quality variance | Medium | Structured output schemas; human review step; template constraints |
