Narrativ MVP Technical Blueprint — Document-to-Video Pipeline Architecture, Rendering Stack, and AI Integration
Date: 2026-03-20
Context: Narrativ has minimal technical research coverage (one market report and one compatibility report). This blueprint synthesizes the document-to-video opportunity, Remotion rendering architecture, and AI pipeline design into a concrete MVP plan.
Executive Summary
- Narrativ’s core pipeline is a 4-stage transformation: Document Ingestion → Content Intelligence → Scene Composition → Video Rendering
- Remotion is the right rendering engine — React-based composition, Lambda-powered parallel rendering, and programmatic control over every frame. The Remotion + Next.js 16 compatibility issue (MOKA-309) is solvable with webpack mode enforcement
- AI orchestration drives three critical stages: content extraction (LLM), narration generation (TTS), and visual asset selection (multimodal LLM)
- Google’s NotebookLM Cinematic Video Overviews (launched March 2026) validates the document-to-video category but targets consumers, not professional/enterprise workflows
- Key differentiation: Narrativ should own the “iterative refinement” loop — edit scenes, swap assets, adjust timing, re-render — unlike one-shot generators
1. Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Narrativ Pipeline │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │ Document │──▶│ Content │──▶│ Scene │──▶│ Video │ │
│ │ Ingestion │ │ Intelligence │ │ Composition │ │ Rendering │ │
│ └──────────┘ └──────────────┘ └───────────────┘ └─────────────┘ │
│ │
│ PDF, PPTX, Extract key Generate scenes Remotion SSR │
│ DOCX, MD, points, create with visual cues, or Lambda for │
│ Web URLs, narrative arc, TTS narration, parallel MP4 │
│ Google Docs chapter splits asset matching output │
└─────────────────────────────────────────────────────────────────────────┘
Stage 1: Document Ingestion
| Input Format | Extraction Method | Output |
|---|---|---|
| PDF | pdf-parse + LLM for layout understanding | Structured text + images |
| PPTX | pptx npm package → slide-by-slide extraction | Text + speaker notes + embedded images |
| DOCX | mammoth → HTML → structured content | Headings, paragraphs, images |
| Markdown | Direct parsing (remark) | AST with sections |
| Web URL | @mozilla/readability + Playwright screenshot | Clean text + page visuals |
| Google Docs | Google Docs API → JSON | Structured document |
Key design decision: Extract structure, not just text. Headings define chapters. Speaker notes become narration hints. Embedded images become scene visuals.
interface ExtractedDocument {
title: string;
sections: Section[];
metadata: {
author?: string;
language: string;
totalWords: number;
estimatedReadTime: number;
images: ExtractedImage[];
};
}
interface Section {
heading: string;
level: number; // h1=1, h2=2, etc.
content: string; // main text
speakerNotes?: string; // from PPTX
images: ExtractedImage[];
bulletPoints: string[];
}
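To make the structure-first principle concrete, here is a minimal sketch of a heading-based Markdown splitter producing simplified `Section` records (a regex pass standing in for the remark AST walk; `splitMarkdown` and `SimpleSection` are illustrative names, not the production parser):

```typescript
// Minimal sketch: split Markdown into section records by ATX headings.
// Production code would walk the remark AST instead of using regexes.
interface SimpleSection {
  heading: string;
  level: number;          // h1=1, h2=2, ...
  content: string;        // body text under the heading
  bulletPoints: string[];
}

function splitMarkdown(md: string): SimpleSection[] {
  const sections: SimpleSection[] = [];
  let current: SimpleSection | null = null;
  for (const line of md.split('\n')) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) {
      if (current) sections.push(current);
      current = { heading: m[2].trim(), level: m[1].length, content: '', bulletPoints: [] };
    } else if (current) {
      const bullet = /^[-*]\s+(.*)$/.exec(line.trim());
      if (bullet) current.bulletPoints.push(bullet[1]);
      else current.content += line + '\n';
    }
  }
  if (current) sections.push(current);
  return sections;
}
```

Headings become chapter boundaries and list items feed `bulletPoints`, which Stage 2 later turns into on-screen key points.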
Stage 2: Content Intelligence (LLM-Powered)
This is where AI transforms raw document content into a video script:
interface VideoScript {
title: string;
totalDuration: number; // estimated seconds
scenes: SceneScript[];
style: VideoStyle;
}
interface SceneScript {
id: string;
title: string;
narration: string; // text for TTS
visualDescription: string; // prompt for asset selection
keyPoints: string[]; // on-screen text overlays
duration: number; // seconds
transition: 'fade' | 'slide' | 'cut' | 'zoom';
visualType: 'illustration' | 'screenshot' | 'chart' | 'photo' | 'animation';
sourceSection: string; // reference back to document section
}
interface VideoStyle {
tone: 'professional' | 'casual' | 'educational' | 'marketing';
pacing: 'slow' | 'normal' | 'fast';
colorScheme: string[];
fontFamily: string;
musicMood?: string;
}
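Before any audio exists, `totalDuration` has to be estimated from narration length. A hedged sketch, assuming a ~150 words-per-minute speaking rate and a 30-second scene cap (both values are assumptions, the cap mirroring the scene-composition rules):

```typescript
// Rough per-scene duration estimate from narration text.
// 150 wpm and the 3s/30s clamp bounds are assumed defaults.
const WORDS_PER_MINUTE = 150;
const MAX_SCENE_SECONDS = 30;
const MIN_SCENE_SECONDS = 3;

function estimateSceneSeconds(narration: string): number {
  const words = narration.trim().split(/\s+/).filter(Boolean).length;
  const seconds = (words / WORDS_PER_MINUTE) * 60;
  return Math.min(MAX_SCENE_SECONDS, Math.max(MIN_SCENE_SECONDS, Math.ceil(seconds)));
}

function estimateTotalDuration(narrations: string[]): number {
  return narrations.reduce((sum, n) => sum + estimateSceneSeconds(n), 0);
}
```

Once TTS runs, the measured audio duration replaces these estimates (see the audio-visual sync rules later in this document).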
LLM Prompt Strategy:
- Script Generation — Claude or GPT-4o receives the extracted document and produces a structured VideoScript JSON. The system prompt defines scene composition rules: max 30 seconds per scene, 1-3 key points per scene, narration that matches the visual descriptions.
- Visual Asset Matching — A multimodal LLM matches each `visualDescription` to:
  - Stock image APIs (Unsplash, Pexels)
  - The document's own embedded images
  - Generated illustrations (Flux via API)
  - Auto-generated charts from data in the document
- Narration Refinement — An LLM adjusts narration text for spoken delivery: shorter sentences, natural pauses, pronunciation hints for technical terms.
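LLM JSON output can drift from the schema, so scenes should be validated before composition. A minimal runtime guard sketch (a schema library such as zod would likely replace this in production; the checks mirror the `SceneScript` interface above):

```typescript
// Minimal runtime guard for LLM-produced scene objects.
// A schema library (e.g. zod) would replace this in production.
const TRANSITIONS = ['fade', 'slide', 'cut', 'zoom'] as const;

function isValidScene(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const s = value as Record<string, unknown>;
  return (
    typeof s.id === 'string' &&
    typeof s.narration === 'string' &&
    Array.isArray(s.keyPoints) &&
    s.keyPoints.every((p: unknown) => typeof p === 'string') &&
    typeof s.duration === 'number' &&
    s.duration > 0 &&
    s.duration <= 30 && // scene-length cap from the composition rules
    TRANSITIONS.includes(s.transition as (typeof TRANSITIONS)[number])
  );
}
```

Scenes that fail validation can be re-requested from the model with the error attached, rather than silently producing a broken render.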
Model Selection:
| Task | Primary Model | Fallback | Cost per Video (est.) |
|---|---|---|---|
| Script generation | Claude Sonnet 4.6 | GPT-4o | $0.02-0.05 |
| Visual matching | GPT-4o (vision) | Claude Sonnet 4.6 | $0.01-0.03 |
| Narration polish | Claude Haiku 4.5 | GPT-4o-mini | $0.001-0.005 |
| Total AI cost per video | | | $0.03-0.09 |
Stage 3: Scene Composition (React + Remotion)
Each scene maps to a React scene component composed into the top-level Remotion <Composition>:
// scenes/KeyPointScene.tsx
import React from 'react';
import { AbsoluteFill, Img, useCurrentFrame, interpolate } from 'remotion';
interface KeyPointSceneProps {
  narration: string; // audio is attached separately; kept here for editor display
  backgroundImage: string;
  keyPoints: string[];
  duration: number; // frames
}
export const KeyPointScene: React.FC<KeyPointSceneProps> = ({
backgroundImage,
keyPoints,
duration,
}) => {
const frame = useCurrentFrame();
return (
<AbsoluteFill style={{ backgroundColor: '#0f172a' }}>
<Img
src={backgroundImage}
style={{
width: '100%',
height: '100%',
objectFit: 'cover',
opacity: 0.3,
}}
/>
<div style={{ position: 'absolute', padding: 80 }}>
{keyPoints.map((point, i) => {
const enterFrame = (duration / keyPoints.length) * i;
const opacity = interpolate(
frame,
[enterFrame, enterFrame + 15],
[0, 1],
{ extrapolateRight: 'clamp' }
);
return (
<div key={i} style={{ opacity, marginBottom: 24 }}>
<span style={{ fontSize: 36, color: 'white' }}>{point}</span>
</div>
);
})}
</div>
</AbsoluteFill>
);
};
Scene Template Library (MVP):
| Template | Use Case | Visual Elements |
|---|---|---|
| TitleScene | Opening/chapter titles | Animated text, gradient background |
| KeyPointScene | Bullet points | Staggered text reveal, background image |
| ChartScene | Data visualization | Animated Recharts, data from document |
| ImageShowcaseScene | Product/screenshot | Ken Burns effect, caption overlay |
| QuoteScene | Pull quotes | Large text, attribution |
| ComparisonScene | Before/after, pros/cons | Split screen, animated transitions |
| SummaryScene | Closing recap | List of key takeaways |
| CallToActionScene | Outro | CTA text, link, QR code |
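The editor later looks up components by template id; the metadata half of such a registry might look like the sketch below (the ids, minimum durations, and default transitions are assumptions keyed to the table above):

```typescript
// Hypothetical template metadata registry; ids mirror the table above.
type TemplateId = 'title' | 'keypoint' | 'chart' | 'imageShowcase'
  | 'quote' | 'comparison' | 'summary' | 'cta';

interface TemplateMeta {
  minDurationSec: number;                                // floor so scenes stay readable
  defaultTransition: 'fade' | 'slide' | 'cut' | 'zoom';
}

const templateMeta: Record<TemplateId, TemplateMeta> = {
  title:         { minDurationSec: 3, defaultTransition: 'fade' },
  keypoint:      { minDurationSec: 5, defaultTransition: 'fade' },
  chart:         { minDurationSec: 6, defaultTransition: 'slide' },
  imageShowcase: { minDurationSec: 4, defaultTransition: 'zoom' },
  quote:         { minDurationSec: 4, defaultTransition: 'fade' },
  comparison:    { minDurationSec: 6, defaultTransition: 'slide' },
  summary:       { minDurationSec: 5, defaultTransition: 'fade' },
  cta:           { minDurationSec: 4, defaultTransition: 'cut' },
};

// Clamp an LLM- or user-requested duration to the template's floor.
function clampDuration(template: TemplateId, requestedSec: number): number {
  return Math.max(templateMeta[template].minDurationSec, requestedSec);
}
```

The component half of the registry (template id → React component) is what the Remotion Player integration in Section 4 indexes into.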
Stage 4: Video Rendering
Two rendering paths based on use case:
Path A: Server-Side Rendering (self-hosted)
import { bundle } from '@remotion/bundler';
import { renderMedia, selectComposition } from '@remotion/renderer';
async function renderVideo(script: VideoScript, projectId: string): Promise<string> {
  // Bundle the Remotion project (cacheable across renders)
  const serveUrl = await bundle({
    entryPoint: './src/remotion/index.ts',
    webpackOverride: (config) => config,
  });
  // Resolve the registered <Composition> so duration/fps/dimensions
  // are calculated from the input props, not hardcoded here
  const composition = await selectComposition({
    serveUrl,
    id: 'NarrativVideo',
    inputProps: { script },
  });
  const outputPath = `/tmp/output-${projectId}.mp4`;
  await renderMedia({
    composition,
    serveUrl,
    codec: 'h264',
    outputLocation: outputPath,
    inputProps: { script },
  });
  return outputPath;
}
Path B: Lambda Rendering (scalable)
import { renderMediaOnLambda, getRenderProgress } from '@remotion/lambda/client';
async function renderOnLambda(script: VideoScript): Promise<string> {
  const functionName = 'narrativ-render';
  const region = 'us-east-1';
  const { renderId, bucketName } = await renderMediaOnLambda({
    region,
    functionName,
    serveUrl: process.env.REMOTION_SERVE_URL!, // deployed Remotion site
    composition: 'NarrativVideo',
    codec: 'h264',
    inputProps: { script },
    framesPerLambda: 20, // parallelism factor
  });
  // Poll for completion
  while (true) {
    const progress = await getRenderProgress({ renderId, bucketName, functionName, region });
    if (progress.fatalErrorEncountered) {
      throw new Error(progress.errors[0]?.message ?? 'Render failed');
    }
    if (progress.done && progress.outputFile) {
      return progress.outputFile; // S3 URL
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}
Rendering Cost Comparison:
| Method | 3-min Video | 10-min Video | Scalability |
|---|---|---|---|
| Self-hosted (VPS) | ~2 min render, $0 | ~7 min render, $0 | Limited by CPU |
| Lambda (parallel) | ~30 sec, ~$0.15 | ~45 sec, ~$0.40 | Near-infinite |
| Lambda + SQS queue | ~30 sec, ~$0.15 | ~45 sec, ~$0.40 | Rate-limited but queued |
2. TTS Integration — Voice Narration
Provider Comparison (March 2026)
| Provider | Latency (TTFB) | Quality | Price per 1M chars | Languages | Voice Cloning | Best For |
|---|---|---|---|---|---|---|
| ElevenLabs | ~150ms | Best-in-class | ~$30 | 70+ | Yes (instant + pro) | Quality-first narration |
| OpenAI TTS-1 | ~200ms | Good | ~$15 | 58 | No | Cost-effective, ecosystem |
| OpenAI TTS-1-HD | ~300ms | Very good | ~$30 | 58 | No | Higher quality at same price |
| Cartesia Sonic | ~95ms | Good | ~$6 | 15 | Limited | Ultra-low latency |
| Deepgram Aura-2 | ~150ms | Good | ~$15 | 20+ | No | Real-time apps |
| Google Cloud TTS | ~200ms | Good | ~$16 | 50+ | No | Google ecosystem |
Recommended Strategy
MVP: OpenAI TTS-1-HD — good quality, simple API, cost-effective at $30/M chars. A 3-minute video needs 4,500 characters of narration ($0.14).
V1.1: Add ElevenLabs as premium option — voice cloning, superior expressiveness, 70+ languages.
interface TTSProvider {
synthesize(text: string, options: TTSOptions): Promise<AudioBuffer>;
}
interface TTSOptions {
voice: string;
speed: number; // 0.5-2.0
language: string;
format: 'mp3' | 'wav';
}
// Usage in pipeline. OpenAITTS is an app-level wrapper around the
// OpenAI speech API that implements the TTSProvider interface above.
async function generateNarration(scenes: SceneScript[]): Promise<NarrationResult[]> {
  const tts = new OpenAITTS({ model: 'tts-1-hd', voice: 'nova' });
return Promise.all(
scenes.map(async (scene) => {
const audio = await tts.synthesize(scene.narration, {
voice: 'nova',
speed: 1.0,
language: 'en',
format: 'mp3',
});
return {
sceneId: scene.id,
audioUrl: await uploadToStorage(audio),
durationMs: audio.duration * 1000,
};
})
);
}
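Firing `Promise.all` across every scene at once can trip TTS provider rate limits. A small bounded-concurrency helper is one mitigation (the sketch below is generic; the actual limit would be tuned per provider):

```typescript
// Run async tasks over a list with bounded concurrency, preserving
// result order. Useful for batching TTS calls without hitting rate limits.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: single-threaded between awaits
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

`generateNarration` could then call `mapWithConcurrency(scenes, 3, synthesizeScene)` instead of an unbounded `Promise.all` (the limit of 3 is an assumption, not a provider requirement).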
Audio-Visual Sync
Narration duration determines scene duration (not the other way around):
1. Generate TTS for each scene's narration text
2. Measure the actual audio duration
3. Adjust the Remotion composition `durationInFrames` to match: `Math.ceil(audioDurationSec * fps)`
4. Time key point animations proportionally within the audio duration
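In code, the sync rule and the proportional key-point timing reduce to two pure functions (a sketch using the 30 fps default assumed elsewhere in this document):

```typescript
// Scene length follows the measured narration audio, never the reverse.
function framesForAudio(audioDurationSec: number, fps: number = 30): number {
  return Math.ceil(audioDurationSec * fps);
}

// Frame at which key point `i` of `count` starts fading in,
// spreading entrances evenly across the scene's duration.
function keyPointEnterFrame(i: number, count: number, durationInFrames: number): number {
  return Math.floor((durationInFrames / count) * i);
}
```

These are the same formulas the `KeyPointScene` component applies via `interpolate`, just factored out so the scene editor and the renderer agree on timing.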
3. Application Architecture
┌──────────────────────────────────────────────────────────┐
│ Next.js 16 App │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Upload UI │ │ Editor UI │ │ Dashboard UI │ │
│ │ (drag-drop) │ │ (scene edit)│ │ (video library) │ │
│ └──────┬──────┘ └──────┬───────┘ └──────────────────┘ │
│ │ │ │
│ ┌──────▼────────────────▼────────────────────────────┐ │
│ │ API Routes (Next.js) │ │
│ │ POST /api/projects — create from document │ │
│ │ POST /api/projects/:id/generate — run AI pipeline │ │
│ │ PATCH /api/scenes/:id — edit scene │ │
│ │ POST /api/projects/:id/render — trigger render │ │
│ │ GET /api/projects/:id/status — render progress │ │
│ └──────┬────────────────────────────────────────────┘ │
│ │ │
│ ┌──────▼──────────────────────────────────────────────┐ │
│ │ Background Jobs (BullMQ + Redis) │ │
│ │ │ │
│ │ IngestJob → ContentIntelligenceJob → TTSJob → │ │
│ │ CompositionJob → RenderJob → NotifyJob │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
│
┌────────────┼─────────────┐
│ │ │
┌──────▼──┐ ┌─────▼────┐ ┌─────▼──────┐
│ Postgres │ │ R2/S3 │ │ Remotion │
│ (meta) │ │ (assets) │ │ Lambda │
└─────────┘ └──────────┘ └────────────┘
Tech Stack
| Layer | Technology | Rationale |
|---|---|---|
| Framework | Next.js 16 (App Router) | Moklabs standard, SSR for SEO, API routes |
| UI | React 19, Tailwind, Radix | Design system consistency |
| State | TanStack Query + Zustand | Server state + local editor state |
| Database | PostgreSQL (Drizzle ORM) | Projects, scenes, render history |
| Queue | BullMQ + Redis | Job pipeline with retries, progress tracking |
| Storage | Cloudflare R2 | Assets, renders, thumbnails |
| Rendering | Remotion 4.x + Lambda | Deterministic video from React components |
| TTS | OpenAI TTS-1-HD (MVP) | Balance of quality and cost |
| AI | Claude Sonnet 4.6 / GPT-4o | Script generation, visual matching |
| Auth | Better Auth | Moklabs SSO integration |
Database Schema (Core)
-- Projects
CREATE TABLE projects (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
title TEXT NOT NULL,
source_type TEXT NOT NULL, -- 'pdf', 'pptx', 'docx', 'md', 'url'
source_url TEXT,
extracted_content JSONB,
video_script JSONB,
style JSONB,
status TEXT DEFAULT 'draft', -- draft, generating, ready, rendering, complete
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Scenes (editable independently)
CREATE TABLE scenes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
sort_order INTEGER NOT NULL,
template TEXT NOT NULL, -- 'title', 'keypoint', 'chart', etc.
narration TEXT,
visual_description TEXT,
key_points JSONB,
background_asset_url TEXT,
audio_url TEXT,
audio_duration_ms INTEGER,
duration_frames INTEGER,
transition TEXT DEFAULT 'fade',
props JSONB, -- template-specific properties
created_at TIMESTAMPTZ DEFAULT now()
);
-- Renders
CREATE TABLE renders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID REFERENCES projects(id),
status TEXT DEFAULT 'queued', -- queued, rendering, complete, failed
output_url TEXT,
codec TEXT DEFAULT 'h264',
resolution TEXT DEFAULT '1920x1080',
fps INTEGER DEFAULT 30,
duration_sec FLOAT,
render_time_sec FLOAT,
cost_usd FLOAT,
error_message TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
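The `status` columns imply simple state machines that the API layer should enforce. A sketch of a transition guard for renders (the transition map is an assumption derived from the schema comments, including a retry edge from `failed` back to `queued`):

```typescript
// Allowed render status transitions, derived from the schema comment:
// queued -> rendering -> complete | failed
const RENDER_TRANSITIONS: Record<string, string[]> = {
  queued: ['rendering', 'failed'],
  rendering: ['complete', 'failed'],
  complete: [],           // terminal
  failed: ['queued'],     // assumed retry path: re-queue a failed render
};

function canTransition(from: string, to: string): boolean {
  return (RENDER_TRANSITIONS[from] ?? []).includes(to);
}
```

A `PATCH` handler would reject any update whose `status` change fails `canTransition`, preventing, e.g., a stale Lambda callback from moving a completed render back to `rendering`.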
4. Editor UX — The Iterative Refinement Loop
The key differentiator vs one-shot generators:
Upload → Auto-generate script → Preview → Edit → Re-render → Download
▲ │
└─────────┘
(iterate on scenes)
Editor Capabilities (MVP)
- Scene Timeline — Horizontal strip of scene thumbnails. Drag to reorder. Click to select.
- Scene Editor Panel — Edit narration text, key points, visual description. Swap background image. Adjust duration. Change template.
- Live Preview — Remotion `<Player>` component renders the scene in-browser for instant feedback without a full render.
- Regenerate Scene — Re-run AI for a single scene with modified instructions.
- Full Render — Trigger server-side or Lambda render for final MP4.
Remotion Player Integration
import { Player } from '@remotion/player';
function ScenePreview({ scene }: { scene: Scene }) {
return (
<Player
component={sceneTemplates[scene.template]}
inputProps={scene.props}
durationInFrames={scene.duration_frames}
fps={30}
compositionWidth={1920}
compositionHeight={1080}
style={{ width: '100%', aspectRatio: '16/9' }}
controls
/>
);
}
5. Cost Model
Per-Video Cost Breakdown (3-minute video, 10 scenes)
| Component | Cost |
|---|---|
| Document extraction | $0.00 (local processing) |
| Script generation (Claude Sonnet) | $0.03 |
| Visual matching (GPT-4o vision) | $0.02 |
| Narration polish (Claude Haiku) | $0.005 |
| TTS narration (OpenAI TTS-1-HD, ~4,500 chars) | $0.14 |
| Stock images (Unsplash/Pexels free tier) | $0.00 |
| Lambda rendering | $0.15 |
| R2 storage (100MB video, 1 month) | $0.0015 |
| Total COGS per video | ~$0.35 |
Pricing Potential
At $0.35 COGS, a freemium model with:
- Free: 3 videos/month (COGS: $1.05/user/month)
- Pro $19/month: 30 videos/month → COGS $10.50 at full usage, ~45% gross margin (higher at typical partial utilization)
- Business $49/month: 100 videos + team features → COGS $35 at full usage, ~29% gross margin
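Gross margin here is sensitive to quota utilization, so it is worth making the arithmetic checkable (a sketch; `utilization` is the fraction of the monthly quota an average subscriber actually renders):

```typescript
// Gross margin for a subscription tier, given per-video COGS and the
// fraction of the monthly quota users actually consume.
function grossMargin(
  priceUsd: number,
  quota: number,
  cogsPerVideo: number,
  utilization: number,
): number {
  const cogs = quota * utilization * cogsPerVideo;
  return (priceUsd - cogs) / priceUsd;
}
```

At $0.35 COGS the Pro tier lands around 45% margin if every subscriber burns the full 30-video quota, and climbs toward 80%+ at the partial utilization typical of SaaS quotas.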
6. Competitive Moat
| Feature | NotebookLM Cinematic | Synthesia | Lumen5 | Narrativ |
|---|---|---|---|---|
| Document input | Yes (Google Docs only) | No | Blog/text | Any format (PDF, PPTX, DOCX, MD, URL) |
| Iterative editing | No (one-shot) | Scene editor | Template editor | Full scene editor + live preview |
| Custom branding | No | Yes (Enterprise) | Yes | Yes (all tiers) |
| Programmatic API | No | Yes ($$$) | No | Yes (API-first) |
| Self-hosted option | No | No | No | Yes (Remotion SSR) |
| AI voice narration | Yes | Avatar lip-sync | No | TTS with voice selection |
| Cost per video | Free (Google lock-in) | $0.50-2.00 | $0.30+ | $0.35 |
| Open source | No | No | No | Core engine (MIT) |
Differentiation Summary
- Format-agnostic ingestion — not locked to Google Docs or specific templates
- Iterative refinement — edit any scene, re-render selectively, not one-shot generation
- API-first — developers can build on top of Narrativ’s pipeline
- Transparent cost — no per-minute or per-credit pricing games
7. MVP Scope & Milestones
Phase 1: Core Pipeline (4 weeks)
- Document ingestion (PDF + PPTX + Markdown)
- LLM script generation with VideoScript output
- 4 scene templates (Title, KeyPoint, ImageShowcase, Summary)
- OpenAI TTS-1-HD narration
- Remotion SSR rendering (server-side)
- Basic upload → generate → download flow
- PostgreSQL schema + Drizzle ORM
Phase 2: Editor + Preview (3 weeks)
- Remotion Player in-browser preview
- Scene editor (narration, key points, images)
- Drag-and-drop scene reordering
- Re-generate single scene
- BullMQ job pipeline with progress
- R2 asset storage
Phase 3: Scale + Polish (3 weeks)
- Lambda rendering integration
- Additional templates (Chart, Comparison, Quote, CTA)
- DOCX + URL ingestion
- ElevenLabs TTS integration (premium)
- User dashboard with video library
- Better Auth integration
Phase 4: Growth (ongoing)
- Public API for programmatic video generation
- Custom template builder
- Team workspaces
- Localization (multi-language TTS)
- Background music library
- SCORM export for L&D market
8. Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| Remotion + Next.js 16 compatibility (MOKA-309) | High | Enforce webpack mode; keep Remotion server-only; see existing research report |
| NotebookLM makes doc-to-video free | High | Differentiate on editing, API, format support, non-Google lock-in |
| TTS quality perception | Medium | Default to HD model; offer ElevenLabs upgrade; allow custom voice upload |
| Lambda cold start latency | Low | Pre-warm functions; SQS queue for background rendering |
| LLM script quality variance | Medium | Structured output schemas; human review step; template constraints |