DeepSeek-VL Series Open-Sourced: ModelScope Community Best-Practice Tutorial for Model Fine-Tuning!


Overview

On March 11, DeepSeek-AI open-sourced the DeepSeek-VL series of multimodal large models, comprising four models across two sizes, 1.3B and 7B.

The officially highlighted strengths of DeepSeek-VL:

  • Integrates multimodal capabilities without losing language ability, giving detailed, well-structured answers to most real-world questions;
  • Accepts high-resolution images as input (up to 1024x1024) and can recognize small objects within them;
  • Offers general multimodal understanding, handling logical diagrams, web pages, formula recognition, scientific literature, and natural images.

DeepSeek-VL combines multimodal pre-training and fine-tuning over visual and language data to build a unified model that handles cross-modal tasks efficiently, with particular attention to performance in zero-shot settings. The work is organized into data construction, methodology, evaluation, and future directions.

In the data construction stage, DeepSeek-VL uses a diverse set of datasets for joint vision-language pre-training and supervised fine-tuning, including public datasets such as ShareGPT4V, LAION-GPTV, LVIS-Instruct4V, textOCR-GPT4V, LLaVA1.6-GPT4V, IconQA, and Ureader, covering domains such as geography, science, screen content and code, and image captioning. In addition, pure-text data such as the DeepSeek-LLM corpus is deliberately included to preserve the model's performance on language-only tasks.

The training pipeline consists of three key stages:

1. Vision-language adapter training: first, via contrastive pre-training, the outputs of the image encoder and the text encoder are projected into a shared latent space, reducing the representation gap between the two modalities so the model can capture and align similarities and correspondences across them (a toy sketch of this alignment idea follows the stage list).

2. Joint vision-language pre-training: the model is then trained on large-scale multimodal data; text labels are turned into feature vectors and combined with image embeddings as classifier inputs, strengthening the model's understanding of the semantic links between images and text.

3. Supervised fine-tuning: finally, the model is fine-tuned on labeled data for specific tasks or domains, such as tables, charts, and program code, ensuring that in concrete applications it can accurately interpret images according to textual instructions or descriptions.
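
The contrastive alignment idea in stage 1 can be illustrated with a minimal PyTorch sketch. This is not DeepSeek-VL's actual training code: the adapter architecture, feature dimensions, and temperature below are illustrative assumptions, and the real encoders are replaced by random features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionAdapter(nn.Module):
    """Toy adapter that maps encoder outputs into a shared latent space."""
    def __init__(self, in_dim: int, latent_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )
    def forward(self, x):
        # L2-normalize so that dot products become cosine similarities
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Dummy batch: 8 paired image/text features with hypothetical encoder dims.
img_feats, txt_feats = torch.randn(8, 1024), torch.randn(8, 768)
img_adapter, txt_adapter = ProjectionAdapter(1024), ProjectionAdapter(768)
loss = contrastive_loss(img_adapter(img_feats), txt_adapter(txt_feats))
print(loss.item())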

The ModelScope community provides hands-on tutorials for inference and fine-tuning of the DeepSeek-VL series, which we hope will be useful to anyone interested.

Model Links and Downloads

The deepseek-vl series is now open-sourced on the ModelScope community, including:

deepseek-vl-1.3b-chat:

https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat

deepseek-vl-7b-chat:

https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat

deepseek-vl-7b-base:

https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-base

deepseek-vl-1.3b-base:

https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-base

The community supports downloading the model repo directly:

# Model download
from modelscope import snapshot_download
model_dir = snapshot_download('deepseek-ai/deepseek-vl-7b-chat')
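
If you want to pin a specific revision or control where the files land, `snapshot_download` also accepts `revision` and `cache_dir` arguments; the cache path below is a hypothetical example.

from modelscope import snapshot_download

# Download a pinned revision into a custom cache directory.
model_dir = snapshot_download(
    'deepseek-ai/deepseek-vl-7b-chat',
    revision='master',                   # 'master' is the default branch on ModelScope
    cache_dir='/mnt/workspace/models',   # hypothetical local path
)
print(model_dir)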

Model Inference

The deepseek-vl series can be run on the ModelScope community's free compute; a py38 image is recommended.

Taking deepseek-vl-1.3b-chat as an example, the inference code is as follows.

Environment setup:

git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .

Inference code:

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images
from modelscope import snapshot_download
# specify the path to the model
model_path = snapshot_download("deepseek-ai/deepseek-vl-1.3b-chat")
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["/mnt/workspace/DeepSeek-VL/images/training_pipelines.jpg"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)
# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Sample generated output:

"""
The image depicts a sequence of three stages, each labeled with a different color and a corresponding icon. The stages are labeled as "Stage 1: Training VL Adapter," "Stage 2: Joint VL Pre-training," and "Stage 3: Supervised Fine-tuning."
In the first stage, labeled "Stage 1: Training VL Adapter," the image shows a process where a VL Adapter is trained. The VL Adapter is represented by a blue rectangle with a white "V" inside it, and it is connected to a "Video Adapter" with a white arrow pointing to it. The "Video Adapter" is depicted as a white rectangle with a black "V" inside it, and it is connected to a "Video Sequence" with a white arrow pointing to it.
In the second stage, labeled "Stage 2: Joint VL Pre-training," the image shows a process where a joint VL pre-training is performed. The VL Adapter is again represented by a blue rectangle with a white "V" inside it, and it is connected to a "Video Adapter" with a white arrow pointing to it. The "Video Adapter" is depicted as a white rectangle with a black "V" inside it, and it is connected to a "Video Sequence" with a white arrow pointing to it.
In the third stage, labeled "Stage 3: Supervised Fine-tuning," the image shows a process where a supervised fine-tuning is performed. The VL Adapter is represented by a blue rectangle with a white "V" inside it, and it is connected to a "Video Adapter" with a white arrow pointing to it. The "Video Adapter" is depicted as a white rectangle with a black "V" inside it, and it is connected to a "Video Sequence" with a white arrow pointing to it.
Each stage is accompanied by a visual representation of the VL Adapter, Video Adapter, and Video Sequence, as well as a "Hybrid Vision" icon, which is a combination of a "Video Adapter" and a "Video Sequence" icon. The "Image Test Pairs" icon is also present, indicating that the VL Adapter is being evaluated on a set of test images.
The text "Stage 1: Training VL Adapter," "Stage 2: Joint VL Pre-training," and "Stage 3: Supervised Fine-tuning" are also present, providing a clear understanding of the stages and their respective tasks.
"""

Model Fine-Tuning and Inference After Fine-Tuning

We use SWIFT to fine-tune the model. SWIFT is the official LLM fine-tuning and inference framework provided by the ModelScope community.

Fine-tuning code (open source): https://github.com/modelscope/swift

SWIFT best practices for deepseek-vl inference and fine-tuning:

https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/deepseek-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

Fine-tuning a multimodal large model typically uses a custom dataset. Here is a demo that can be run directly:

We fine-tune on the `coco-mini-en-2` dataset, whose task is to describe the content of an image.

Environment setup:

git clone https://github.com/modelscope/swift.git
cd swift
pip install -e .[llm]

Fine-tuning script (LoRA):

By default, LoRA is applied only to the qkv projections of the LLM. To apply LoRA to all linear layers of the LLM, specify `--lora_target_modules ALL`. Fine-tuning the vision part of this model is not yet supported.

# Experimental environment: A10, 3090, V100
# 20GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --dataset coco-mini-en-2 \

If you use a custom dataset, specify the following arguments:

--custom_train_dataset_path xxx.jsonl \
--custom_val_dataset_path yyy.jsonl \

Custom datasets support json and jsonl formats. Below is an example of a custom dataset:

(Multi-turn conversations are supported; each turn must include an image, which can be passed as a local path or a URL.)

{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}

Inference script after fine-tuning (replace ckpt_dir with the checkpoint folder produced by training):

# Experimental environment: A10, 3090, V100
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/deepseek-vl-7b-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true \

Visualized fine-tuning results:

Sample outputs after fine-tuning:

[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
User: <image_placeholder>please describe the image.
Assistant:[OUTPUT]A large airplane is suspended from the ceiling.<|end▁of▁sentence|>
[LABELS]People walking in a museum with a airplane hanging from the celing.
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000492132.jpg']

[PROMPT]<|begin▁of▁sentence|>You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
User: <image_placeholder>please describe the image.
Assistant:[OUTPUT]A bowl of berries and a cup of coffee.<|end▁of▁sentence|>
[LABELS]a bowl of fruit and pastry on a table
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000558642.jpg']

Direct link to the open-source models:

ModelScope (modelscope.cn)
