2023 AI 编年史：GPT-4V 与 SDXL 多模态浪潮

发表于 2023-06-05 分类于 mechine 阅读次数：

2023 年 AI 编年史：GPT-4 Vision 与 Stable Diffusion XL 引领多模态理解与生成，视觉-语言模型架构与应用场景中英文对照分析。

2023 AI 编年史：GPT-4V 与 SDXL 多模态浪潮 | AI Timeline 2023: GPT-4V and SDXL Multimodal Wave

一、背景 | Background

English

In June 2023, multimodal AI reached a tipping point on two fronts: understanding (GPT-4 Vision / GPT-4V) and generation (Stable Diffusion XL / SDXL). GPT-4V, released in September 2023 preview but architecturally finalized by mid-2023, could analyze images, charts, screenshots, and handwritten notes. SDXL 1.0 (July 2023) delivered photorealistic 1024×1024 image generation as an open-source alternative to Midjourney and DALL·E 3.

Multimodal AI extends LLMs beyond text to process vision, audio, and other modalities — enabling applications from visual Q&A to automated design.

Key terms:

Multimodal: Models processing multiple input types (text + image + audio).
VLM (Vision-Language Model): Model jointly understanding images and text.
GPT-4V: GPT-4 with vision capabilities — image input alongside text prompts.
SDXL (Stable Diffusion XL): Open-source text-to-image model at 1024×1024 resolution.
CLIP (Contrastive Language-Image Pre-training): Foundation model aligning text and image embeddings.
Diffusion Model: Generative model that iteratively denoises random noise into structured images.

中文

2023 年 6 月，多模态 AI 在两条战线上达到临界点：理解（GPT-4 Vision / GPT-4V）与生成（Stable Diffusion XL / SDXL）。GPT-4V（2023 年 9 月预览发布，架构于年中定型）可分析图像、图表、截图与手写笔记。SDXL 1.0（2023 年 7 月）以开源方案提供 1024×1024 逼真图像生成，对标 Midjourney 与 DALL·E 3。

多模态 AI 将 LLM 从纯文本扩展到视觉、音频等模态——解锁视觉问答到自动化设计等应用。

关键词解释：

Multimodal（多模态）：处理多种输入类型（文本 + 图像 + 音频）的模型。
VLM（视觉-语言模型）：联合理解图像与文本的模型。
GPT-4V：具备视觉能力的 GPT-4——文本 Prompt 搭配图像输入。
SDXL：1024×1024 分辨率的开源文生图模型。
CLIP：对齐文本与图像嵌入的基础模型。
Diffusion Model（扩散模型）：迭代去噪生成结构化图像的生成模型。

二、架构 | Architecture

2.1 GPT-4V 视觉理解架构 | GPT-4V Vision Architecture

English

GPT-4V integrates a vision encoder with the GPT-4 language model:

Image Input → Vision Encoder（ViT）→ Image Tokens
                                        ↓
Text Prompt ──────────────────→  Unified Transformer
                                        ↓
                              Text Response（描述 / 分析 / 推理）

Processing pipeline:

Image resized/tiled to fit context (detail: low/high/auto modes)
Vision Transformer (ViT) encodes patches into embedding vectors
Image embeddings interleaved with text tokens in the transformer
Autoregressive generation produces text analysis

Capabilities demonstrated in 2023:

OCR and handwriting recognition
Chart and diagram interpretation
Spatial reasoning (“which object is left of the red box?”)
Multi-image comparison
Meme and cultural context understanding

中文

GPT-4V 将 视觉编码器 与 GPT-4 语言模型集成：图像经 ViT 编码为 Image Tokens，与文本 Token 交错输入统一 Transformer，自回归生成文本分析。2023 年展示的能力包括 OCR、图表解读、空间推理、多图对比与 Meme 文化理解。

2.2 SDXL 生成架构 | SDXL Generation Architecture

English

SDXL uses a two-stage diffusion pipeline:

Text Prompt → Dual Text Encoders（OpenCLIP + OpenCLIP ViT-L）
                    ↓
            UNet（Base Model, 3.5B params）
                    ↓
            Latent Image（128×128 in latent space = 1024×1024 pixel）
                    ↓
            Refiner Model（optional second UNet pass）
                    ↓
            VAE Decoder → Final 1024×1024 Image

Key innovations over SD 1.5:

Feature	SD 1.5	SDXL
分辨率	512×512	1024×1024
文本编码器	1× CLIP	2× CLIP（更大语义空间）
UNet 参数	~860M	~3.5B
Refiner	无	可选精修阶段
训练数据	LAION-2B 子集	更大更干净的数据集

中文

SDXL 采用 两阶段扩散流水线：双文本编码器（OpenCLIP + OpenCLIP ViT-L）→ Base UNet（3.5B 参数）→ 潜空间图像 → 可选 Refiner 精修 → VAE 解码为 1024×1024 最终图像。相比 SD 1.5，分辨率翻倍、双 CLIP 编码器、参数量 4 倍增长。

2.3 多模态技术栈对比 | Multimodal Stack Comparison

方向 Direction	闭源 Closed	开源 Open
视觉理解	GPT-4V, Gemini Pro Vision	LLaVA, Qwen-VL, InternVL
图像生成	DALL·E 3, Midjourney	SDXL, Flux, Playground v2
视频生成	— (Sora 2024)	AnimateDiff, ModelScope
语音	Whisper (OpenAI)	Whisper (open), Bark

三、趋势 | Trends

English

June–December 2023 multimodal trends:

GPT-4V API launch (Nov 2023): Vision API at $0.01/image enabled visual apps at scale.
LLaVA open-source wave: LLaVA-1.5 matched GPT-4V on many benchmarks with 13B params.
SDXL ecosystem explosion: ControlNet, LoRA, IP-Adapter, ComfyUI workflows.
DALL·E 3 + ChatGPT integration: Text-to-image inside ChatGPT conversations.
Document AI: GPT-4V replaced OCR pipelines for invoice/receipt/form processing.
Multimodal RAG: Image + text retrieval for visual knowledge bases.

中文

2023 年 6–12 月多模态趋势：

GPT-4V API 上线（11 月）：$0.01/图使视觉应用规模化。
LLaVA 开源浪潮：LLaVA-1.5 以 13B 参数在多数基准对标 GPT-4V。
SDXL 生态爆发：ControlNet、LoRA、IP-Adapter、ComfyUI 工作流。
DALL·E 3 + ChatGPT 集成：对话中直接文生图。
文档 AI：GPT-4V 替代 OCR 流水线处理发票/收据/表单。
多模态 RAG：图像 + 文本检索构建视觉知识库。

四、优缺点 | Pros and Cons

4.1 GPT-4V（视觉理解）

优点 Advantages	缺点 Disadvantages
通用视觉理解——图表、照片、截图	闭源，API 定价按图计费
与 GPT-4 文本能力无缝结合	空间推理仍有错误
支持多图输入与对比	无法生成图像（仅理解）
减少专用 OCR/分类模型需求	隐私——图像上传至 OpenAI 服务器

4.2 SDXL（图像生成）

优点 Advantages	缺点 Disadvantages
开源（CreativeML 许可）可自部署	1024×1024 仍不够商业印刷
ComfyUI/Automatic1111 生态丰富	手部/文字生成仍有问题
LoRA 微调门槛低	需要 GPU（8GB+ VRAM）
ControlNet 精确控制构图	版权与深度伪造伦理争议

五、应用场景 | Use Cases

场景 Scenario	技术 Tech	中文说明
发票/收据 OCR	GPT-4V	拍照即结构化提取字段
UI 设计评审	GPT-4V	截图分析布局与可用性
电商商品图	SDXL + LoRA	品牌风格一致的产品图生成
医学影像辅助	GPT-4V / 专用 VLM	X 光/病理切片初步分析
教育可视化	SDXL + ControlNet	根据课文自动生成插图
安防监控	GPT-4V + 视频帧	异常事件自然语言描述
广告创意	SDXL + IP-Adapter	品牌 IP 风格广告素材
无障碍辅助	GPT-4V	为视障用户描述周围环境

六、GitHub 与开源生态 | GitHub and Open Source

项目 Project	说明 Description
Stability-AI/generative-models	SDXL 官方代码
haotian-liu/LLaVA	开源 VLM，对标 GPT-4V
lllyasviel/ControlNet	SD/SDXL 精确构图控制
comfyanonymous/ComfyUI	节点式 SDXL 工作流 UI
openai/openai-python	GPT-4V API 客户端

七、总结 | Summary

中文：2023 年 6 月，GPT-4V 与 SDXL 分别在多模态「理解」与「生成」两端树立新标杆。GPT-4V 让 LLM 「看见」世界，SDXL 让每个人都能「创造」视觉内容。两者共同推动 AI 从纯文本时代进入真正的多模态时代，也为后续的 Sora 视频生成与 GPT-4o 原生多模态奠定基础。

English: In June 2023, GPT-4V and SDXL set new benchmarks for multimodal understanding and generation respectively. GPT-4V let LLMs “see” the world; SDXL let everyone “create” visual content. Together they propelled AI from the text-only era into true multimodality, laying groundwork for Sora video generation and GPT-4o native multimodal capabilities.

参考链接 | References

OpenAI: GPT-4V System Card
Stability AI: SDXL 1.0 Release
论文: LLaVA: Visual Instruction Tuning
论文: SDXL: Improving Latent Diffusion Models
论文: CLIP: Learning Transferable Visual Models