Overview
On March 11, DeepSeek-AI open-sourced the DeepSeek-VL series of multimodal large models, comprising four model versions at two scales: 1.3B and 7B.
The official summary of DeepSeek-VL's strengths:
- Integrates multimodal capability without losing language ability, giving detailed, well-organized answers to questions in the vast majority of real-world scenarios;
- Accepts large, high-resolution images as input (up to 1024x1024) and recognizes small objects within them;
- Offers general multimodal understanding, handling logical diagrams, web pages, formula recognition, scientific literature, and natural images.
DeepSeek-VL combines multimodal pre-training and fine-tuning over visual and language information to build a unified model that handles cross-modal tasks efficiently, with particular attention to performance in zero-shot settings. The work is organized into data construction, methodology, evaluation, and future directions.
For data construction, DeepSeek-VL draws on a diverse set of datasets for joint vision-language pre-training and supervised fine-tuning, including the public datasets ShareGPT4V, LAION-GPTV, LVIS-Instruct4V, textOCR-GPT4V, LLaVA1.6-GPT4V, IconQA, and Ureader, covering geography, science, screen code, image captioning, and other domains. Text-only data such as the DeepSeek-LLM corpus is also deliberately mixed in to preserve the model's performance on language-only tasks.
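To make the idea of mixing text-only data into multimodal training concrete, here is a minimal sketch of ratio-based sampling between two data sources. The datasets and the 30% text-only ratio are placeholders for illustration, not values reported for DeepSeek-VL:

```python
import random

# Hypothetical stand-ins for the multimodal (image-text) corpus and the
# text-only corpus (e.g. DeepSeek-LLM style data).
multimodal_data = [{"image": f"img_{i}.jpg", "text": f"caption {i}"} for i in range(1000)]
text_only_data = [{"text": f"plain text sample {i}"} for i in range(1000)]

def sample_batch(batch_size: int, text_ratio: float = 0.3):
    """Draw a mixed batch in which roughly `text_ratio` of samples are text-only.

    Keeping a share of text-only data in every batch is a simple way to
    retain language-only ability while learning vision-language alignment.
    """
    batch = []
    for _ in range(batch_size):
        source = text_only_data if random.random() < text_ratio else multimodal_data
        batch.append(random.choice(source))
    return batch

if __name__ == "__main__":
    batch = sample_batch(batch_size=8)
    print(sum("image" not in x for x in batch), "text-only samples in this batch")
```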
The training pipeline consists of three key stages (a schematic sketch of which components are trained at each stage follows the list):
1. Vision-language adapter training: first, contrastive pre-training projects the outputs of the image encoder and the text encoder into a shared latent space, narrowing the representational gap between the two so the model can capture and relate similarities and correspondences across modalities.
2. Joint vision-language pre-training: the model is then trained on large-scale multimodal data, building a dataset classifier from label text, that is, converting text labels into feature vectors and feeding them together with image embeddings into the classifier, strengthening the model's understanding of the semantic link between images and text.
3. Supervised fine-tuning: finally, fine-grained fine-tuning is performed on labeled data for specific tasks or domains, such as complex scenarios involving tables, charts, and program code, so that in concrete applications the model can accurately make predictions about images according to text instructions or descriptions.
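As a rough schematic of these three stages (not DeepSeek-VL's actual training code), the sketch below uses placeholder torch modules and shows one common way the components could be frozen or unfrozen per stage; the exact trainable split is an assumption based on the stage names:

```python
import torch.nn as nn

# Placeholder components standing in for a vision encoder, a small
# vision-language adapter, and a language model.
vision_encoder = nn.Linear(1024, 2048)
vl_adapter = nn.Linear(2048, 4096)
language_model = nn.Linear(4096, 4096)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int) -> None:
    """Freeze/unfreeze components per training stage (assumed split)."""
    if stage == 1:    # stage 1: train only the vision-language adapter
        set_trainable(vision_encoder, False)
        set_trainable(vl_adapter, True)
        set_trainable(language_model, False)
    elif stage == 2:  # stage 2: joint vision-language pre-training
        set_trainable(vision_encoder, False)
        set_trainable(vl_adapter, True)
        set_trainable(language_model, True)
    else:             # stage 3: supervised fine-tuning
        set_trainable(vision_encoder, True)
        set_trainable(vl_adapter, True)
        set_trainable(language_model, True)

for stage in (1, 2, 3):
    configure_stage(stage)
    n_trainable = sum(
        p.numel()
        for m in (vision_encoder, vl_adapter, language_model)
        for p in m.parameters()
        if p.requires_grad
    )
    print(f"stage {stage}: {n_trainable} trainable parameters")
```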
The ModelScope community provides hands-on tutorials for inference and fine-tuning of the DeepSeek-VL series; we hope they are helpful to anyone interested.
Model links and downloads
The deepseek-vl series is now open-sourced on the ModelScope community, including:
deepseek-vl-1.3b-chat:
https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat
deepseek-vl-7b-chat:
https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat
deepseek-vl-7b-base:
https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-base
deepseek-vl-1.3b-base:
https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-base
The community supports downloading the model repo directly:
```python
# Model download
from modelscope import snapshot_download
model_dir = snapshot_download('deepseek-ai/deepseek-vl-7b-chat')
```
Model inference
To run the deepseek-vl series on the free compute provided by the ModelScope community, the py38 image is recommended. Taking deepseek-vl-1.3b-chat as an example, the inference workflow is as follows.
Environment setup:
```shell
git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .
```
Inference code:
```python
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images
from modelscope import snapshot_download

# specify the path to the model
model_path = snapshot_download("deepseek-ai/deepseek-vl-1.3b-chat")
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["/mnt/workspace/DeepSeek-VL/images/training_pipelines.jpg"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
Sample generation:
""" The image depicts a sequence of three stages, each labeled with a different color and a corresponding icon. The stages are labeled as "Stage 1: Training VL Adapter," "Stage 2: Joint VL Pre-training," and "Stage 3: Supervised Fine-tuning." In the first stage, labeled "Stage 1: Training VL Adapter," the image shows a process where a VL Adapter is trained. The VL Adapter is represented by a blue rectangle with a white "V" inside it, and it is connected to a "Video Adapter" with a white arrow pointing to it. The "Video Adapter" is depicted as a white rectangle with a black "V" inside it, and it is connected to a "Video Sequence" with a white arrow pointing to it. In the second stage, labeled "Stage 2: Joint VL Pre-training," the image shows a process where a joint VL pre-training is performed. The VL Adapter is again represented by a blue rectangle with a white "V" inside it, and it is connected to a "Video Adapter" with a white arrow pointing to it. The "Video Adapter" is depicted as a white rectangle with a black "V" inside it, and it is connected to a "Video Sequence" with a white arrow pointing to it. In the third stage, labeled "Stage 3: Supervised Fine-tuning," the image shows a process where a supervised fine-tuning is performed. The VL Adapter is represented by a blue rectangle with a white "V" inside it, and it is connected to a "Video Adapter" with a white arrow pointing to it. The "Video Adapter" is depicted as a white rectangle with a black "V" inside it, and it is connected to a "Video Sequence" with a white arrow pointing to it. Each stage is accompanied by a visual representation of the VL Adapter, Video Adapter, and Video Sequence, as well as a "Hybrid Vision" icon, which is a combination of a "Video Adapter" and a "Video Sequence" icon. The "Image Test Pairs" icon is also present, indicating that the VL Adapter is being evaluated on a set of test images. The text "Stage 1: Training VL Adapter," "Stage 2: Joint VL Pre-training," and "Stage 3: Supervised Fine-tuning" are also present, providing a clear understanding of the stages and their respective tasks. """
Fine-tuning and post-fine-tuning inference
We use swift to fine-tune the model; swift is the LLM fine-tuning and inference framework officially provided by the ModelScope community.
Fine-tuning code (open source): https://github.com/modelscope/swift
Best practice for deepseek-vl inference and fine-tuning with swift:
Multimodal LLM fine-tuning usually relies on a custom dataset. Here we show a demo that can be run directly.
We fine-tune with the `coco-mini-en-2` dataset, whose task is to describe the content of an image.
Environment setup:
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e .[llm]
```
Fine-tuning script (LoRA):
By default, LoRA is applied only to the qkv projections of the LLM part. To apply it to all linear layers of the LLM part, specify `--lora_target_modules ALL`. Fine-tuning the vision part of this model is not yet supported.
```shell
# Experimental environment: A10, 3090, V100
# 20GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --dataset coco-mini-en-2 \
```
If you use a custom dataset, specify the following parameters:
```shell
    --custom_train_dataset_path xxx.jsonl \
    --custom_val_dataset_path yyy.jsonl \
```
Custom datasets support the json and jsonl formats. Below is an example of a custom dataset, followed by a small script for generating a file in this format.
(Multi-turn conversations are supported; each turn must include one image; images can be passed as local paths or URLs.)
{"query": "55555", "response": "66666", "images": ["image_path"]} {"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]} {"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}
Post-fine-tuning inference script (change ckpt_dir to the checkpoint folder produced by training):
```shell
# Experimental environment: A10, 3090, V100
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/deepseek-vl-7b-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true \
```
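If you prefer driving the same fine-tune-then-infer workflow from Python rather than the shell, swift also exposes Python entry points. The sketch below assumes the argument names mirror the CLI flags shown above and that the training result exposes the best checkpoint path; both may differ across swift versions:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import InferArguments, SftArguments, infer_main, sft_main

# Fine-tune: the string values mirror the --model_type and --dataset flags above.
sft_result = sft_main(SftArguments(
    model_type='deepseek-vl-7b-chat',
    dataset=['coco-mini-en-2'],
))

# Inference on the fine-tuned checkpoint, mirroring `swift infer` above.
# Assumes the returned dict contains the best checkpoint path; otherwise set
# ckpt_dir to the checkpoint folder produced by training.
best_ckpt = sft_result['best_model_checkpoint']
infer_main(InferArguments(
    ckpt_dir=best_ckpt,
    load_dataset_config=True,
))
```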
Visualization of the fine-tuning run:
Examples after fine-tuning:
```
[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. User: <image_placeholder>please describe the image. Assistant:
[OUTPUT]A large airplane is suspended from the ceiling.<|end▁of▁sentence|>
[LABELS]People walking in a museum with a airplane hanging from the celing.
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000492132.jpg']

[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. User: <image_placeholder>please describe the image. Assistant:
[OUTPUT]A bowl of berries and a cup of coffee.<|end▁of▁sentence|>
[LABELS]a bowl of fruit and pastry on a table
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000558642.jpg']
```