Accessibility Converter — Architecture

Current pipeline vs. proposed VLM-based pipeline

Current
Deterministic Pipeline
Sequential processing. Text extraction → heuristic semantic analysis → rule-based HTML rendering. LLM is optional, only for ambiguous nodes.
📄 Upload .pptx or .pdf main.py
File validation, job creation; job state stored in an in-memory dict
🔍 Parse File parsers.py
python-pptx for PPTX, pdfplumber for PDF. Extracts text, tables, images as NormalizedNode objects with bounding boxes and font sizes.
📐 Reconstruct Reading Order semantic.py
Column-aware spatial sorting. Uses bounding box positions to guess left-right / top-down reading order.
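A minimal sketch of the column-aware sort. The `Block` type, the `column_gap` threshold, and the midpoint-free "cluster by left edge" heuristic are illustrative assumptions, not the actual semantic.py implementation:

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float
    y1: float

def reading_order(blocks: list[Block], column_gap: float = 40.0) -> list[Block]:
    """Column-aware sort: group blocks into columns by left edge,
    order columns left-to-right, then top-to-bottom within each."""
    ordered = sorted(blocks, key=lambda b: b.x0)
    columns: list[list[Block]] = []
    for b in ordered:
        # Start a new column when the block's left edge is far from
        # the current column's left edge.
        if columns and abs(b.x0 - columns[-1][0].x0) <= column_gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    result: list[Block] = []
    for col in columns:
        result.extend(sorted(col, key=lambda b: b.y0))
    return result
```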
🧠 Infer Semantics Sequential semantic.py
Heuristic role assignment: font size → heading, bullet chars → list items, bounding box patterns → tables. Assigns SemanticIntent (role + confidence) to each node.
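The heuristic pass reduces to rules of this shape. The size thresholds, bullet set, and the 0.5-confidence ambiguous branch are assumed values for illustration; the real semantic.py will differ:

```python
from dataclasses import dataclass

BULLET_CHARS = ("•", "-", "*", "◦")

@dataclass
class SemanticIntent:
    role: str         # "h2", "h3", "li", "p", ...
    confidence: float

def infer_role(text: str, font_size: float, body_size: float = 12.0) -> SemanticIntent:
    """Toy heuristic role assignment: large fonts become headings,
    bullet prefixes become list items, everything else a paragraph."""
    stripped = text.lstrip()
    if font_size >= body_size * 1.8:
        return SemanticIntent("h2", 0.9)
    if font_size >= body_size * 1.4:
        return SemanticIntent("h3", 0.8)
    if stripped.startswith(BULLET_CHARS):
        return SemanticIntent("li", 0.85)
    # Short, slightly enlarged text is ambiguous: heading or label?
    # Low confidence here is what triggers the LLM fallback.
    if font_size > body_size and len(stripped) < 40:
        return SemanticIntent("h3", 0.5)
    return SemanticIntent("p", 0.7)
```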
🤖 Groq LLM (Ambiguity Only) Optional groq_client.py
Text-only. Invoked only for low-confidence nodes (confidence < 0.6). Classifies ambiguous nodes into h2/h3/p/li/table/img/input. Single batch call, no vision.
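The single batch call can be sketched as a numbered prompt plus a line-oriented reply parser. The prompt wording, the 200-character truncation, and the `<number>: <role>` reply format are assumptions about groq_client.py, not its actual contract:

```python
def build_disambiguation_prompt(nodes: list[tuple[int, str]]) -> str:
    """One batch prompt covering every low-confidence node; roles are
    restricted to the closed set the renderer understands."""
    lines = [
        "Classify each numbered text fragment as one of: "
        "h2, h3, p, li, table, img, input.",
        "Answer with '<number>: <role>' per line.",
    ]
    for node_id, text in nodes:
        lines.append(f"{node_id}: {text[:200]}")  # truncate long fragments
    return "\n".join(lines)

def parse_disambiguation_reply(reply: str) -> dict[int, str]:
    """Parse '<number>: <role>' lines back into a node-id -> role map,
    skipping any line that does not start with a number."""
    roles: dict[int, str] = {}
    for line in reply.strip().splitlines():
        num, _, role = line.partition(":")
        if num.strip().isdigit():
            roles[int(num.strip())] = role.strip()
    return roles
```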
🏗️ Render HTML html_renderer.py
Iterates NormalizedNodes, maps SemanticIntent roles → HTML tags. Builds full document with nav, skip links, CSS, ARIA landmarks. Rule-based, no intelligence.
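At its core the renderer is a role-to-tag lookup with escaping. A minimal sketch (the mapping table and `<p>` fallback are assumptions; the real html_renderer.py also emits the shell, nav, and CSS):

```python
import html

ROLE_TO_TAG = {"h2": "h2", "h3": "h3", "p": "p", "li": "li"}

def render_node(role: str, text: str) -> str:
    """Map a SemanticIntent role to an HTML element, escaping the text.
    Unknown roles fall back to <p> so nothing is silently dropped."""
    tag = ROLE_TO_TAG.get(role, "p")
    return f"<{tag}>{html.escape(text)}</{tag}>"

def render_list(items: list[str]) -> str:
    """Runs of consecutive li nodes get wrapped in a <ul>."""
    inner = "".join(render_node("li", t) for t in items)
    return f"<ul>{inner}</ul>"
```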
✅ Validate A11y validator.py
Checks heading hierarchy, empty alt text, missing table headers, form labels. Regex-based HTML checks.
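Two of the regex-based checks, sketched. The exact messages and patterns are illustrative, not validator.py's actual output:

```python
import re

def check_heading_hierarchy(html_doc: str) -> list[str]:
    """Flag heading-level jumps (e.g. h2 followed directly by h4)."""
    issues = []
    levels = [int(m.group(1)) for m in re.finditer(r"<h([1-6])\b", html_doc)]
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            issues.append(f"heading jump: h{prev} -> h{cur}")
    return issues

def check_missing_alt(html_doc: str) -> list[str]:
    """Flag <img> tags with no alt attribute at all (alt="" is still
    allowed, since it marks an image as decorative)."""
    return [
        "img missing alt attribute"
        for m in re.finditer(r"<img\b[^>]*>", html_doc)
        if "alt=" not in m.group(0)
    ]
```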
💾 Write Outputs converter.py
document.html + manifest.json saved to results/{job_id}/
Proposed
VLM-Powered Pipeline
Per-page parallel processing. Page screenshot + text extraction → DeepSeek VLM generates semantic HTML directly → deterministic stitch. LLM is the core, not a sidecar.
📄 Upload .pptx or .pdf main.py
Same job lifecycle. Unchanged API contract.
✂️ Split into Pages New page_splitter.py
PDF → N individual pages. PPTX → convert to PDF first, then split. Each page becomes an independent work unit.
Per-Page Processing (Parallel) Concurrent
All pages processed simultaneously with semaphore rate limiting (5-10 max concurrent)
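The fan-out pattern is standard asyncio: a semaphore bounds in-flight VLM calls while gather preserves page order. `process_page` is a stand-in for the real screenshot + extract + VLM step:

```python
import asyncio

async def process_page(page_no: int) -> str:
    # Placeholder for screenshot + text extraction + VLM call.
    await asyncio.sleep(0)
    return f'<section data-page="{page_no}"></section>'

async def process_all(page_count: int, max_concurrent: int = 5) -> list[str]:
    """Run every page concurrently, with at most max_concurrent calls
    in flight. Results come back in page order because gather preserves
    the order of its arguments."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(page_no: int) -> str:
        async with sem:
            return await process_page(page_no)

    return await asyncio.gather(*(bounded(i) for i in range(1, page_count + 1)))
```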
📸 Render Page Screenshot New page_splitter.py
Rasterize page to PNG at ~150-200 DPI using pdf2image / poppler
📝 Extract Text Layer New page_splitter.py
pdfplumber or PyMuPDF extracts raw text per page. This is the "authoritative" text — VLM uses it as ground truth, screenshot for structure only.
👁️ DeepSeek VLM Call Core deepseek_client.py
Vision + text. Receives: page PNG + extracted text + system prompt (output contract). Returns: single <section> of semantic HTML. Handles layout, hierarchy, alt text, reading order — all in one shot.
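The request can be sketched as payload construction. This assumes an OpenAI-compatible multimodal chat API with base64 data-URL images; the model name and system prompt are placeholders, not confirmed DeepSeek identifiers:

```python
import base64

SYSTEM_PROMPT = (
    "You convert one document page into accessible semantic HTML. "
    "Output exactly one <section> element. Use the provided text as "
    "ground truth; use the image only for layout and reading order."
)

def build_vlm_request(png_bytes: bytes, page_text: str, page_no: int) -> dict:
    """Build the per-page chat-completions payload: screenshot plus
    authoritative extracted text in a single user turn."""
    image_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
    return {
        "model": "deepseek-vl",  # placeholder model name
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text",
                     "text": f"Page {page_no} extracted text:\n{page_text}"},
                ],
            },
        ],
        "temperature": 0.0,
    }
```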
🧩 Deterministic Stitch New stitcher.py
Pure code, no LLM. Concatenates sections in order, wraps in HTML shell, builds <nav> from <h2> tags, injects IDs, strips duplicate headers/footers.
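A minimal sketch of the stitch, minus the duplicate header/footer stripping. The `s{i}` id scheme and the shell markup are assumptions:

```python
import re

def stitch(sections: list[str], title: str = "Document") -> str:
    """Concatenate per-page <section> fragments, inject an id on each
    <h2>, build a <nav> from those headings, and wrap everything in a
    minimal document shell."""
    body = "\n".join(sections)
    matches = list(re.finditer(r"<h2[^>]*>(.*?)</h2>", body))
    nav_items = []
    for i, m in enumerate(matches, start=1):
        heading = m.group(1)
        # Replace this exact occurrence so the nav link has a target.
        body = body.replace(m.group(0), f'<h2 id="s{i}">{heading}</h2>', 1)
        nav_items.append(f'<li><a href="#s{i}">{heading}</a></li>')
    nav = f"<nav><ul>{''.join(nav_items)}</ul></nav>" if nav_items else ""
    return (
        '<!doctype html><html lang="en"><head><meta charset="utf-8">'
        f"<title>{title}</title></head><body>{nav}"
        f"<main>{body}</main></body></html>"
    )
```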
✅ Validate A11y validator.py
Updated checks: data-page on every section, alt on every img, heading hierarchy, output contract compliance.
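The contract checks stay regex-based. A sketch with illustrative messages (the real validator.py wording will differ):

```python
import re

def check_output_contract(html_doc: str) -> list[str]:
    """Contract checks on stitched VLM output: every section carries
    data-page, every img carries alt, and no inline style attributes."""
    issues = []
    for m in re.finditer(r"<section\b[^>]*>", html_doc):
        if "data-page=" not in m.group(0):
            issues.append("section missing data-page")
    for m in re.finditer(r"<img\b[^>]*>", html_doc):
        if "alt=" not in m.group(0):
            issues.append("img missing alt")
    if re.search(r"\sstyle=", html_doc):
        issues.append("inline style found")
    return issues
```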
💾 Write Outputs converter.py
Same output format: document.html + manifest.json

File-by-File Impact

app/main.py → app/main.py
Before: API routes, job lifecycle. After: minor changes; the converter call becomes async.

app/config.py → app/config.py
Before: Groq settings, file limits. After: Groq settings replaced by a DeepSeek API key, concurrency limits, and a DPI setting.

app/services/parsers.py → app/services/page_splitter.py
Before: 350 lines; PptxParser, PdfParser, NormalizedNode extraction. After: splits the PDF into pages, renders PNGs, extracts text per page. Simpler; no node/bbox modeling.

app/services/semantic.py → deleted
Before: 148 lines; heuristic role inference, reading order, Groq disambiguation. After: the VLM handles all semantic decisions directly.

app/services/groq_client.py → app/services/deepseek_client.py
Before: 70 lines; text-only Groq wrapper for ambiguous nodes. After: vision LLM client that sends image + text and receives <section> HTML, with retry logic and rate limiting.

app/services/html_renderer.py → app/services/stitcher.py
Before: 224 lines; node-by-node HTML rendering, CSS, document shell. After: concatenates VLM sections, wraps them in the HTML document shell, builds the nav, injects IDs. Much simpler.

app/services/converter.py → app/services/converter.py
Before: orchestrates parse → semantic → render → validate. After: orchestrates split → parallel VLM calls → stitch → validate. Now async.

app/services/validator.py → app/services/validator.py
Before: checks heading hierarchy, alt text, table headers. After: updated checks for data-page attributes, output-contract compliance, and no inline styles.

app/models/schemas.py → app/models/schemas.py
Before: NormalizedNode, BBox, SemanticIntent, SlideDocument. After: simplified; NormalizedNode/BBox/SemanticIntent dropped, PageResult (text, image_path, html_section) added.
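The replacement model is small. Sketched here as a dataclass for self-containment; the real schemas.py presumably uses pydantic, and the page_number ordering field is my assumption, not listed in the proposal:

```python
from dataclasses import dataclass

@dataclass
class PageResult:
    """Output of one per-page work unit; replaces the old
    NormalizedNode/BBox/SemanticIntent hierarchy."""
    page_number: int   # assumed: the stitch needs an ordering key
    text: str          # authoritative extracted text layer
    image_path: str    # rendered PNG fed to the VLM
    html_section: str  # <section> fragment returned by the VLM
```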

What Changes

LLM goes from optional sidecar → core of the pipeline
Text-only Groq → vision-capable DeepSeek VLM
Sequential whole-doc processing → parallel per-page processing
Heuristic semantic analysis eliminated entirely
Rule-based HTML rendering eliminated — VLM generates HTML directly
NormalizedNode/BBox/SemanticIntent data model no longer needed
converter.py becomes async to support parallelism

What Stays

FastAPI app structure and API contract unchanged
Job lifecycle (create → poll → result) unchanged
Output format: document.html + manifest.json
A11y validation pass (updated, not replaced)
Frontend (static/index.html) unchanged
File upload validation logic unchanged
PPTX support (convert to PDF first, then same pipeline)