社区供稿 | 达摩院多模态对话大模型猫头鹰mPLUG-Owl大升级，登顶MMBench

2023-08-22 631

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

交互式建模 PAI-DSW，每月250计算时 3个月

模型训练 PAI-DLC，5000CU*H 3个月

模型在线服务 PAI-EAS，A10/V100等 500元 1个月

简介： 近日，在上海人工智能实验室发布的多模态大模型榜单MMBench中，来自达摩院的mPLUG-Owl 超过MiniGPT4，LLaVA，VisualGLM等14个多模态大模型，登顶榜首。目前，mPLUG-Owl最新的预训练，SFT模型都已在ModelScope开源，欢迎大家体验。

近日，在上海人工智能实验室发布的多模态大模型榜单MMBench中，来自达摩院的mPLUG-Owl 超过MiniGPT4，LLaVA，VisualGLM等14个多模态大模型，登顶榜首。目前，mPLUG-Owl最新的预训练，SFT模型都已在ModelScope开源，欢迎大家体验。

据悉，达摩院模块化多模态对话大模型mPLUG-Owl进行了大升级，通过加入细粒度预训练，多样性的指令微调提升模型多维度综合能力。除本次在MMBench登顶的优异成绩外，多模态模块化基础模型mPLUG-2也获得CVPR2023 STAR Challenge Best Performance Award，论文已被ICML2023接收。

模型架构与训练

mPLUG-Owl采用两阶段训练策略和混合指令增强提升模型多维度综合能力：

一阶段：训练视觉相关的模块（Visual Encoder和Visual Abstractor），使得模型能够关联视觉知识和文本知识的能力；

二阶段：将对齐好的视觉模块的文本模型进行冻结，利用非常少量的参数(~4M)对语言模型进行微调；

混合指令增强：基于文本和多模态数据联合训练，使得其能够在保证文本能力的基础上理解多模态的指令。

ModelScope实战

快速上手mPLUG系列模型，仅需在ModelScope中搜索mPLUG，即可出现mPLUG相关模型

接下来，我们以图像描述和多模态对话为例，用户需安装最新版的ModelScope。

图像/视频描述

对于图像描述或者视频描述，用户仅需提供一张图像或者视频的链接，并使用以下代码：

# 图像描述
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
model_id = 'damo/mplug_image-captioning_coco_base_en'
input_caption = 'https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_captioning.png'
pipeline_caption = pipeline(Tasks.image_captioning, model=model_id)
result = pipeline_caption(input_caption)
print(result)
# >>> {'caption': 'the man is angry'}
# 视频描述
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
model_id = 'damo/multi-modal_hitea_video-captioning_base_en'
input_caption = 'http://xke-repo.oss-cn-hangzhou.aliyuncs.com/models/release/vid_hitea_videocap.avi'
pipeline_caption = pipeline(Tasks.video_captioning, model=model_id)
result = pipeline_caption(input_caption)
print(result)
# >>> {'caption': 'potato is being peeled'}

多模态对话

对于多模态对话，用户需要提供想要问的问题或者图片，即可获得回复：

from modelscope.pipelines import pipeline
chatbot = pipeline('multimodal-dialogue', 'damo/multi-modal_mplug_owl_multimodal-dialogue_7b')
image = 'http://mm-chatgpt.oss-cn-zhangjiakou.aliyuncs.com/mplug_owl_demo/released_checkpoint/portrait_input.png'
system_prompt_1 = 'The following is a conversation between a curious human and AI assistant.'
system_prompt_2 = "The assistant gives helpful, detailed, and polite answers to the user's questions."
messages = {
    'messages': [
        {'role': 'system', 'content': system_prompt_1 + ' ' + system_prompt_2},
        {'role': 'user', 'content': [{'image': image}]},
        {'role': 'user', 'content': 'Describe the mood of the man.'},
    ]
}
print(chatbot(messages))
# >>> {'text': 'The man is angry and frustrated, as he is clenching his fists and scowling.'}