Narrativ MVP Technical Blueprint — Document-to-Video Pipeline Architecture, Rendering Stack, and AI Integration
Date: 2026-03-20
Context: Narrativ has minimal technical research coverage (one market report and one compatibility report). This blueprint synthesizes the document-to-video opportunity, Remotion rendering architecture, and AI pipeline design into a concrete MVP plan.
Executive Summary
- Narrativ’s core pipeline is a 4-stage transformation: Document Ingestion → Content Intelligence → Scene Composition → Video Rendering
- Remotion is the right rendering engine — React-based composition, Lambda-powered parallel rendering, and programmatic control over every frame. The Remotion + Next.js 16 compatibility issue (MOKA-309) is solvable with webpack mode enforcement
- AI orchestration drives three critical stages: content extraction (LLM), narration generation (TTS), and visual asset selection (multimodal LLM)
- Google’s NotebookLM Cinematic Video Overviews (launched March 2026) validates the document-to-video category but targets consumers, not professional/enterprise workflows
- Key differentiation: Narrativ should own the “iterative refinement” loop — edit scenes, swap assets, adjust timing, re-render — unlike one-shot generators
1. Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Narrativ Pipeline │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │ Document │──▶│ Content │──▶│ Scene │──▶│ Video │ │
│ │ Ingestion │ │ Intelligence │ │ Composition │ │ Rendering │ │
│ └──────────┘ └──────────────┘ └───────────────┘ └─────────────┘ │
│ │
│ PDF, PPTX, Extract key Generate scenes Remotion SSR │
│ DOCX, MD, points, create with visual cues, or Lambda for │
│ Web URLs, narrative arc, TTS narration, parallel MP4 │
│ Google Docs chapter splits asset matching output │
└─────────────────────────────────────────────────────────────────────────┘
Stage 1: Document Ingestion
| Input Format | Extraction Method | Output |
|---|---|---|
| PDF | pdf-parse + LLM for layout understanding | Structured text + images |
| PPTX | pptx npm package → slide-by-slide extraction | Text + speaker notes + embedded images |
| DOCX | mammoth → HTML → structured content | Headings, paragraphs, images |
| Markdown | Direct parsing (remark) | AST with sections |
| Web URL | @mozilla/readability + Playwright screenshot | Clean text + page visuals |
| Google Docs | Google Docs API → JSON | Structured document |
Key design decision: Extract structure, not just text. Headings define chapters. Speaker notes become narration hints. Embedded images become scene visuals.
interface ExtractedDocument {
title: string;
sections: Section[];
metadata: {
author?: string;
language: string;
totalWords: number;
estimatedReadTime: number;
images: ExtractedImage[];
};
}
interface Section {
heading: string;
level: number; // h1=1, h2=2, etc.
content: string; // main text
speakerNotes?: string; // from PPTX
images: ExtractedImage[];
bulletPoints: string[];
}
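To make the structure-first principle concrete, here is a minimal sketch of a heading-based Markdown splitter producing simplified `Section` records (a regex pass standing in for the remark AST walk; `splitMarkdown` and `SimpleSection` are illustrative names, not the production parser):

```typescript
// Minimal sketch: split Markdown into section records by ATX headings.
// Production code would walk the remark AST instead of using regexes.
interface SimpleSection {
  heading: string;
  level: number;          // h1=1, h2=2, ...
  content: string;        // body text under the heading
  bulletPoints: string[];
}

function splitMarkdown(md: string): SimpleSection[] {
  const sections: SimpleSection[] = [];
  let current: SimpleSection | null = null;
  for (const line of md.split('\n')) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) {
      if (current) sections.push(current);
      current = { heading: m[2].trim(), level: m[1].length, content: '', bulletPoints: [] };
    } else if (current) {
      const bullet = /^[-*]\s+(.*)$/.exec(line.trim());
      if (bullet) current.bulletPoints.push(bullet[1]);
      else current.content += line + '\n';
    }
  }
  if (current) sections.push(current);
  return sections;
}
```

Headings become chapter boundaries and list items feed `bulletPoints`, which Stage 2 later turns into on-screen key points.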
Stage 2: Content Intelligence (LLM-Powered)
This is where AI transforms raw document content into a video script:
interface VideoScript {
title: string;
totalDuration: number; // estimated seconds
scenes: SceneScript[];
style: VideoStyle;
}
interface SceneScript {
id: string;
title: string;
narration: string; // text for TTS
visualDescription: string; // prompt for asset selection
keyPoints: string[]; // on-screen text overlays
duration: number; // seconds
transition: 'fade' | 'slide' | 'cut' | 'zoom';
visualType: 'illustration' | 'screenshot' | 'chart' | 'photo' | 'animation';
sourceSection: string; // reference back to document section
}
interface VideoStyle {
tone: 'professional' | 'casual' | 'educational' | 'marketing';
pacing: 'slow' | 'normal' | 'fast';
colorScheme: string[];
fontFamily: string;
musicMood?: string;
}
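Before any audio exists, `totalDuration` has to be estimated from narration length. A hedged sketch, assuming a ~150 words-per-minute speaking rate and a 30-second scene cap (both values are assumptions, the cap mirroring the scene-composition rules):

```typescript
// Rough per-scene duration estimate from narration text.
// 150 wpm and the 3s/30s clamp bounds are assumed defaults.
const WORDS_PER_MINUTE = 150;
const MAX_SCENE_SECONDS = 30;
const MIN_SCENE_SECONDS = 3;

function estimateSceneSeconds(narration: string): number {
  const words = narration.trim().split(/\s+/).filter(Boolean).length;
  const seconds = (words / WORDS_PER_MINUTE) * 60;
  return Math.min(MAX_SCENE_SECONDS, Math.max(MIN_SCENE_SECONDS, Math.ceil(seconds)));
}

function estimateTotalDuration(narrations: string[]): number {
  return narrations.reduce((sum, n) => sum + estimateSceneSeconds(n), 0);
}
```

Once TTS runs, the measured audio duration replaces these estimates (see the audio-visual sync rules later in this document).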
LLM Prompt Strategy:
- Script Generation — Claude or GPT-4o receives the extracted document and produces a structured VideoScript JSON. The system prompt defines scene composition rules: max 30 seconds per scene, 1-3 key points per scene, narration that matches the visual descriptions.
- Visual Asset Matching — A multimodal LLM matches each `visualDescription` to:
  - Stock image APIs (Unsplash, Pexels)
  - The document's own embedded images
  - Generated illustrations (Flux via API)
  - Auto-generated charts from data in the document
- Narration Refinement — An LLM adjusts narration text for spoken delivery: shorter sentences, natural pauses, pronunciation hints for technical terms.
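LLM JSON output can drift from the schema, so scenes should be validated before composition. A minimal runtime guard sketch (a schema library such as zod would likely replace this in production; the checks mirror the `SceneScript` interface above):

```typescript
// Minimal runtime guard for LLM-produced scene objects.
// A schema library (e.g. zod) would replace this in production.
const TRANSITIONS = ['fade', 'slide', 'cut', 'zoom'] as const;

function isValidScene(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const s = value as Record<string, unknown>;
  return (
    typeof s.id === 'string' &&
    typeof s.narration === 'string' &&
    Array.isArray(s.keyPoints) &&
    s.keyPoints.every((p: unknown) => typeof p === 'string') &&
    typeof s.duration === 'number' &&
    s.duration > 0 &&
    s.duration <= 30 && // scene-length cap from the composition rules
    TRANSITIONS.includes(s.transition as (typeof TRANSITIONS)[number])
  );
}
```

Scenes that fail validation can be re-requested from the model with the error attached, rather than silently producing a broken render.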
Model Selection:
| Task | Primary Model | Fallback | Cost per Video (est.) |
|---|---|---|---|
| Script generation | Claude Sonnet 4.6 | GPT-4o | $0.02-0.05 |
| Visual matching | GPT-4o (vision) | Claude Sonnet 4.6 | $0.01-0.03 |
| Narration polish | Claude Haiku 4.5 | GPT-4o-mini | $0.001-0.005 |
| Total AI cost per video | | | $0.03-0.09 |
Stage 3: Scene Composition (React + Remotion)
Each scene maps to a React scene component composed into the top-level Remotion <Composition>:
// scenes/KeyPointScene.tsx
import React from 'react';
import { AbsoluteFill, Img, useCurrentFrame, interpolate } from 'remotion';
interface KeyPointSceneProps {
  narration: string; // audio is attached separately; kept here for editor display
  backgroundImage: string;
  keyPoints: string[];
  duration: number; // frames
}
export const KeyPointScene: React.FC<KeyPointSceneProps> = ({
backgroundImage,
keyPoints,
duration,
}) => {
const frame = useCurrentFrame();
return (
<AbsoluteFill style={{ backgroundColor: '#0f172a' }}>
<Img
src={backgroundImage}
style={{
width: '100%',
height: '100%',
objectFit: 'cover',
opacity: 0.3,
}}
/>
<div style={{ position: 'absolute', padding: 80 }}>
{keyPoints.map((point, i) => {
const enterFrame = (duration / keyPoints.length) * i;
const opacity = interpolate(
frame,
[enterFrame, enterFrame + 15],
[0, 1],
{ extrapolateRight: 'clamp' }
);
return (
<div key={i} style={{ opacity, marginBottom: 24 }}>
<span style={{ fontSize: 36, color: 'white' }}>{point}</span>
</div>
);
})}
</div>
</AbsoluteFill>
);
};
Scene Template Library (MVP):
| Template | Use Case | Visual Elements |
|---|---|---|
| TitleScene | Opening/chapter titles | Animated text, gradient background |
| KeyPointScene | Bullet points | Staggered text reveal, background image |
| ChartScene | Data visualization | Animated Recharts, data from document |
| ImageShowcaseScene | Product/screenshot | Ken Burns effect, caption overlay |
| QuoteScene | Pull quotes | Large text, attribution |
| ComparisonScene | Before/after, pros/cons | Split screen, animated transitions |
| SummaryScene | Closing recap | List of key takeaways |
| CallToActionScene | Outro | CTA text, link, QR code |
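The editor later looks up components by template id; the metadata half of such a registry might look like the sketch below (the ids, minimum durations, and default transitions are assumptions keyed to the table above):

```typescript
// Hypothetical template metadata registry; ids mirror the table above.
type TemplateId = 'title' | 'keypoint' | 'chart' | 'imageShowcase'
  | 'quote' | 'comparison' | 'summary' | 'cta';

interface TemplateMeta {
  minDurationSec: number;                                // floor so scenes stay readable
  defaultTransition: 'fade' | 'slide' | 'cut' | 'zoom';
}

const templateMeta: Record<TemplateId, TemplateMeta> = {
  title:         { minDurationSec: 3, defaultTransition: 'fade' },
  keypoint:      { minDurationSec: 5, defaultTransition: 'fade' },
  chart:         { minDurationSec: 6, defaultTransition: 'slide' },
  imageShowcase: { minDurationSec: 4, defaultTransition: 'zoom' },
  quote:         { minDurationSec: 4, defaultTransition: 'fade' },
  comparison:    { minDurationSec: 6, defaultTransition: 'slide' },
  summary:       { minDurationSec: 5, defaultTransition: 'fade' },
  cta:           { minDurationSec: 4, defaultTransition: 'cut' },
};

// Clamp an LLM- or user-requested duration to the template's floor.
function clampDuration(template: TemplateId, requestedSec: number): number {
  return Math.max(templateMeta[template].minDurationSec, requestedSec);
}
```

The component half of the registry (template id → React component) is what the Remotion Player integration in Section 4 indexes into.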
Stage 4: Video Rendering
Two rendering paths based on use case:
Path A: Server-Side Rendering (self-hosted)
import { bundle } from '@remotion/bundler';
import { renderMedia, selectComposition } from '@remotion/renderer';
async function renderVideo(script: VideoScript, projectId: string): Promise<string> {
  // Bundle the Remotion project (cacheable across renders)
  const serveUrl = await bundle({
    entryPoint: './src/remotion/index.ts',
    webpackOverride: (config) => config,
  });
  // Resolve the registered <Composition> so duration/fps/dimensions
  // are calculated from the input props, not hardcoded here
  const composition = await selectComposition({
    serveUrl,
    id: 'NarrativVideo',
    inputProps: { script },
  });
  const outputPath = `/tmp/output-${projectId}.mp4`;
  await renderMedia({
    composition,
    serveUrl,
    codec: 'h264',
    outputLocation: outputPath,
    inputProps: { script },
  });
  return outputPath;
}
Path B: Lambda Rendering (scalable)
import { renderMediaOnLambda, getRenderProgress } from '@remotion/lambda/client';
async function renderOnLambda(script: VideoScript): Promise<string> {
  const functionName = 'narrativ-render';
  const region = 'us-east-1';
  const { renderId, bucketName } = await renderMediaOnLambda({
    region,
    functionName,
    serveUrl: process.env.REMOTION_SERVE_URL!, // deployed Remotion site
    composition: 'NarrativVideo',
    codec: 'h264',
    inputProps: { script },
    framesPerLambda: 20, // parallelism factor
  });
  // Poll for completion
  while (true) {
    const progress = await getRenderProgress({ renderId, bucketName, functionName, region });
    if (progress.fatalErrorEncountered) {
      throw new Error(progress.errors[0]?.message ?? 'Render failed');
    }
    if (progress.done && progress.outputFile) {
      return progress.outputFile; // S3 URL
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}
Rendering Cost Comparison:
| Method | 3-min Video | 10-min Video | Scalability |
|---|---|---|---|
| Self-hosted (VPS) | ~2 min render, $0 | ~7 min render, $0 | Limited by CPU |
| Lambda (parallel) | ~30 sec, ~$0.15 | ~45 sec, ~$0.40 | Near-infinite |
| Lambda + SQS queue | ~30 sec, ~$0.15 | ~45 sec, ~$0.40 | Rate-limited but queued |
2. TTS Integration — Voice Narration
Provider Comparison (March 2026)
| Provider | Latency (TTFB) | Quality | Price per 1M chars | Languages | Voice Cloning | Best For |
|---|---|---|---|---|---|---|
| ElevenLabs | ~150ms | Best-in-class | ~$30 | 70+ | Yes (instant + pro) | Quality-first narration |
| OpenAI TTS-1 | ~200ms | Good | ~$15 | 58 | No | Cost-effective, ecosystem |
| OpenAI TTS-1-HD | ~300ms | Very good | ~$30 | 58 | No | Higher quality at same price |
| Cartesia Sonic | ~95ms | Good | ~$6 | 15 | Limited | Ultra-low latency |
| Deepgram Aura-2 | ~150ms | Good | ~$15 | 20+ | No | Real-time apps |
| Google Cloud TTS | ~200ms | Good | ~$16 | 50+ | No | Google ecosystem |
Recommended Strategy
MVP: OpenAI TTS-1-HD — good quality, simple API, cost-effective at $30/M chars. A 3-minute video needs 4,500 characters of narration ($0.14).
V1.1: Add ElevenLabs as premium option — voice cloning, superior expressiveness, 70+ languages.
interface TTSProvider {
synthesize(text: string, options: TTSOptions): Promise<AudioBuffer>;
}
interface TTSOptions {
voice: string;
speed: number; // 0.5-2.0
language: string;
format: 'mp3' | 'wav';
}
// Usage in pipeline. OpenAITTS is an app-level wrapper around the
// OpenAI speech API that implements the TTSProvider interface above.
async function generateNarration(scenes: SceneScript[]): Promise<NarrationResult[]> {
  const tts = new OpenAITTS({ model: 'tts-1-hd', voice: 'nova' });
return Promise.all(
scenes.map(async (scene) => {
const audio = await tts.synthesize(scene.narration, {
voice: 'nova',
speed: 1.0,
language: 'en',
format: 'mp3',
});
return {
sceneId: scene.id,
audioUrl: await uploadToStorage(audio),
durationMs: audio.duration * 1000,
};
})
);
}
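Firing `Promise.all` across every scene at once can trip TTS provider rate limits. A small bounded-concurrency helper is one mitigation (the sketch below is generic; the actual limit would be tuned per provider):

```typescript
// Run async tasks over a list with bounded concurrency, preserving
// result order. Useful for batching TTS calls without hitting rate limits.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: single-threaded between awaits
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

`generateNarration` could then call `mapWithConcurrency(scenes, 3, synthesizeScene)` instead of an unbounded `Promise.all` (the limit of 3 is an assumption, not a provider requirement).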
Audio-Visual Sync
Narration duration determines scene duration (not the other way around):
1. Generate TTS for each scene's narration text
2. Measure the actual audio duration
3. Adjust the Remotion composition `durationInFrames` to match: `Math.ceil(audioDurationSec * fps)`
4. Time key point animations proportionally within the audio duration
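In code, the sync rule and the proportional key-point timing reduce to two pure functions (a sketch using the 30 fps default assumed elsewhere in this document):

```typescript
// Scene length follows the measured narration audio, never the reverse.
function framesForAudio(audioDurationSec: number, fps: number = 30): number {
  return Math.ceil(audioDurationSec * fps);
}

// Frame at which key point `i` of `count` starts fading in,
// spreading entrances evenly across the scene's duration.
function keyPointEnterFrame(i: number, count: number, durationInFrames: number): number {
  return Math.floor((durationInFrames / count) * i);
}
```

These are the same formulas the `KeyPointScene` component applies via `interpolate`, just factored out so the scene editor and the renderer agree on timing.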
3. Application Architecture
┌──────────────────────────────────────────────────────────┐
│ Next.js 16 App │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Upload UI │ │ Editor UI │ │ Dashboard UI │ │
│ │ (drag-drop) │ │ (scene edit)│ │ (video library) │ │
│ └──────┬──────┘ └──────┬───────┘ └──────────────────┘ │
│ │ │ │
│ ┌──────▼────────────────▼────────────────────────────┐ │
│ │ API Routes (Next.js) │ │
│ │ POST /api/projects — create from document │ │
│ │ POST /api/projects/:id/generate — run AI pipeline │ │
│ │ PATCH /api/scenes/:id — edit scene │ │
│ │ POST /api/projects/:id/render — trigger render │ │
│ │ GET /api/projects/:id/status — render progress │ │
│ └──────┬────────────────────────────────────────────┘ │
│ │ │
│ ┌──────▼──────────────────────────────────────────────┐ │
│ │ Background Jobs (BullMQ + Redis) │ │
│ │ │ │
│ │ IngestJob → ContentIntelligenceJob → TTSJob → │ │
│ │ CompositionJob → RenderJob → NotifyJob │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
│
┌────────────┼─────────────┐
│ │ │
┌──────▼──┐ ┌─────▼────┐ ┌─────▼──────┐
│ Postgres │ │ R2/S3 │ │ Remotion │
│ (meta) │ │ (assets) │ │ Lambda │
└─────────┘ └──────────┘ └────────────┘
Tech Stack
| Layer | Technology | Rationale |
|---|---|---|
| Framework | Next.js 16 (App Router) | Moklabs standard, SSR for SEO, API routes |
| UI | React 19, Tailwind, Radix | Design system consistency |
| State | TanStack Query + Zustand | Server state + local editor state |
| Database | PostgreSQL (Drizzle ORM) | Projects, scenes, render history |
| Queue | BullMQ + Redis | Job pipeline with retries, progress tracking |
| Storage | Cloudflare R2 | Assets, renders, thumbnails |
| Rendering | Remotion 4.x + Lambda | Deterministic video from React components |
| TTS | OpenAI TTS-1-HD (MVP) | Balance of quality and cost |
| AI | Claude Sonnet 4.6 / GPT-4o | Script generation, visual matching |
| Auth | Better Auth | Moklabs SSO integration |
Database Schema (Core)
-- Projects
CREATE TABLE projects (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
title TEXT NOT NULL,
source_type TEXT NOT NULL, -- 'pdf', 'pptx', 'docx', 'md', 'url'
source_url TEXT,
extracted_content JSONB,
video_script JSONB,
style JSONB,
status TEXT DEFAULT 'draft', -- draft, generating, ready, rendering, complete
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Scenes (editable independently)
CREATE TABLE scenes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
sort_order INTEGER NOT NULL,
template TEXT NOT NULL, -- 'title', 'keypoint', 'chart', etc.
narration TEXT,
visual_description TEXT,
key_points JSONB,
background_asset_url TEXT,
audio_url TEXT,
audio_duration_ms INTEGER,
duration_frames INTEGER,
transition TEXT DEFAULT 'fade',
props JSONB, -- template-specific properties
created_at TIMESTAMPTZ DEFAULT now()
);
-- Renders
CREATE TABLE renders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID REFERENCES projects(id),
status TEXT DEFAULT 'queued', -- queued, rendering, complete, failed
output_url TEXT,
codec TEXT DEFAULT 'h264',
resolution TEXT DEFAULT '1920x1080',
fps INTEGER DEFAULT 30,
duration_sec FLOAT,
render_time_sec FLOAT,
cost_usd FLOAT,
error_message TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
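The `status` columns imply simple state machines that the API layer should enforce. A sketch of a transition guard for renders (the transition map is an assumption derived from the schema comments, including a retry edge from `failed` back to `queued`):

```typescript
// Allowed render status transitions, derived from the schema comment:
// queued -> rendering -> complete | failed
const RENDER_TRANSITIONS: Record<string, string[]> = {
  queued: ['rendering', 'failed'],
  rendering: ['complete', 'failed'],
  complete: [],           // terminal
  failed: ['queued'],     // assumed retry path: re-queue a failed render
};

function canTransition(from: string, to: string): boolean {
  return (RENDER_TRANSITIONS[from] ?? []).includes(to);
}
```

A `PATCH` handler would reject any update whose `status` change fails `canTransition`, preventing, e.g., a stale Lambda callback from moving a completed render back to `rendering`.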
4. Editor UX — The Iterative Refinement Loop
The key differentiator vs one-shot generators:
Upload → Auto-generate script → Preview → Edit → Re-render → Download
▲ │
└─────────┘
(iterate on scenes)
Editor Capabilities (MVP)
- Scene Timeline — Horizontal strip of scene thumbnails. Drag to reorder. Click to select.
- Scene Editor Panel — Edit narration text, key points, visual description. Swap background image. Adjust duration. Change template.
- Live Preview — Remotion `<Player>` component renders the scene in-browser for instant feedback without a full render.
- Regenerate Scene — Re-run AI for a single scene with modified instructions.
- Full Render — Trigger server-side or Lambda render for final MP4.
Remotion Player Integration
import { Player } from '@remotion/player';
function ScenePreview({ scene }: { scene: Scene }) {
return (
<Player
component={sceneTemplates[scene.template]}
inputProps={scene.props}
durationInFrames={scene.duration_frames}
fps={30}
compositionWidth={1920}
compositionHeight={1080}
style={{ width: '100%', aspectRatio: '16/9' }}
controls
/>
);
}
5. Cost Model
Per-Video Cost Breakdown (3-minute video, 10 scenes)
| Component | Cost |
|---|---|
| Document extraction | $0.00 (local processing) |
| Script generation (Claude Sonnet) | $0.03 |
| Visual matching (GPT-4o vision) | $0.02 |
| Narration polish (Claude Haiku) | $0.005 |
| TTS narration (OpenAI TTS-1-HD, ~4,500 chars) | $0.14 |
| Stock images (Unsplash/Pexels free tier) | $0.00 |
| Lambda rendering | $0.15 |
| R2 storage (100MB video, 1 month) | $0.0015 |
| Total COGS per video | ~$0.35 |
Pricing Potential
At $0.35 COGS, a freemium model with:
- Free: 3 videos/month (COGS: $1.05/user/month)
- Pro $19/month: 30 videos/month → COGS $10.50 at full usage, ~45% gross margin (higher at typical partial utilization)
- Business $49/month: 100 videos + team features → COGS $35 at full usage, ~29% gross margin
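Gross margin here is sensitive to quota utilization, so it is worth making the arithmetic checkable (a sketch; `utilization` is the fraction of the monthly quota an average subscriber actually renders):

```typescript
// Gross margin for a subscription tier, given per-video COGS and the
// fraction of the monthly quota users actually consume.
function grossMargin(
  priceUsd: number,
  quota: number,
  cogsPerVideo: number,
  utilization: number,
): number {
  const cogs = quota * utilization * cogsPerVideo;
  return (priceUsd - cogs) / priceUsd;
}
```

At $0.35 COGS the Pro tier lands around 45% margin if every subscriber burns the full 30-video quota, and climbs toward 80%+ at the partial utilization typical of SaaS quotas.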
6. Competitive Moat
| Feature | NotebookLM Cinematic | Synthesia | Lumen5 | Narrativ |
|---|---|---|---|---|
| Document input | Yes (Google Docs only) | No | Blog/text | Any format (PDF, PPTX, DOCX, MD, URL) |
| Iterative editing | No (one-shot) | Scene editor | Template editor | Full scene editor + live preview |
| Custom branding | No | Yes (Enterprise) | Yes | Yes (all tiers) |
| Programmatic API | No | Yes ($$$) | No | Yes (API-first) |
| Self-hosted option | No | No | No | Yes (Remotion SSR) |
| AI voice narration | Yes | Avatar lip-sync | No | TTS with voice selection |
| Cost per video | Free (Google lock-in) | $0.50-2.00 | $0.30+ | $0.35 |
| Open source | No | No | No | Core engine (MIT) |
Differentiation Summary
- Format-agnostic ingestion — not locked to Google Docs or specific templates
- Iterative refinement — edit any scene, re-render selectively, not one-shot generation
- API-first — developers can build on top of Narrativ’s pipeline
- Transparent cost — no per-minute or per-credit pricing games
7. MVP Scope & Milestones
Phase 1: Core Pipeline (4 weeks)
- Document ingestion (PDF + PPTX + Markdown)
- LLM script generation with VideoScript output
- 4 scene templates (Title, KeyPoint, ImageShowcase, Summary)
- OpenAI TTS-1-HD narration
- Remotion SSR rendering (server-side)
- Basic upload → generate → download flow
- PostgreSQL schema + Drizzle ORM
Phase 2: Editor + Preview (3 weeks)
- Remotion Player in-browser preview
- Scene editor (narration, key points, images)
- Drag-and-drop scene reordering
- Re-generate single scene
- BullMQ job pipeline with progress
- R2 asset storage
Phase 3: Scale + Polish (3 weeks)
- Lambda rendering integration
- Additional templates (Chart, Comparison, Quote, CTA)
- DOCX + URL ingestion
- ElevenLabs TTS integration (premium)
- User dashboard with video library
- Better Auth integration
Phase 4: Growth (ongoing)
- Public API for programmatic video generation
- Custom template builder
- Team workspaces
- Localization (multi-language TTS)
- Background music library
- SCORM export for L&D market
8. Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| Remotion + Next.js 16 compatibility (MOKA-309) | High | Enforce webpack mode; keep Remotion server-only; see existing research report |
| NotebookLM makes doc-to-video free | High | Differentiate on editing, API, format support, non-Google lock-in |
| TTS quality perception | Medium | Default to HD model; offer ElevenLabs upgrade; allow custom voice upload |
| Lambda cold start latency | Low | Pre-warm functions; SQS queue for background rendering |
| LLM script quality variance | Medium | Structured output schemas; human review step; template constraints |