ModelScope Community Weekly Digest (Jan 5 - Jan 18) - Alibaba Cloud Developer Community

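🙋 ModelScope community updates for this period:

📟 3,239 models: MiniCPM-o 2.6, internlm3-8b-instruct, Valley-Eagle-7B, phi-4, 麦橘超然 (majicflus_v1), memo, Qwen2.5-Math-PRM, and more;

📁 711 datasets: squad, msmarco-distilbert-margin-mse-cls-dot-v2, coliee, and more;

🎨 192 innovative applications: AI Spring Festival Greeting Card Generator, Dynamic Interactive Text Adventure Game DEMO, VITA1.5_demo, WebWalker, ACE++ editing and generation model, Qwen translation model, and more;

📄 16 articles: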

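The Qwen team open-sources a new process reward model (PRM)!
ModelScope January 2025 release: monthly report
It's Spring Festival: build your own AI Spring Festival greeting card generator with ModelScope + 魔笔!
MiniCPM-o 2.6: a streaming, omni-modal, end-to-end multimodal model for edge devices is here!
Hands-on course on building AI applications with Gradio ③: AI model deployment and inference, extending application features without limit
InternLM3 open-sourced! 4T of data achieves the effect of 18T, cutting cost by 75% and combining deep thinking with conversation for the first time!
Valley2, a multimodal large model built for e-commerce scenarios
Microsoft phi-4 is here! A standout among small models: at 14B, its science and coding abilities surpass 70B models!
Hands-on course on building AI applications with Gradio ②: Gradio basics, with a fully customizable application UI
Study together | Building Agents more effectively in 2025
Paper Reading | MEMO: memory-guided diffusion for expressive Talking Head generation
DashInfer-VLM: SOTA multimodal inference performance, surpassing vLLM!
Learn in 10 minutes how to fine-tune a large model to change its self-cognition and build your own custom chatbot
麦橘超然 is now on the ModelScope community: free image generation and training, with prizes for sharing your results (see the end of the article)
Build your Gradio application with modelscope-studio
TransferTOD: using LLMs to address poor slot generalization of TOD systems in out-of-domain scenarios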

01. Featured Models

MiniCPM-o 2.6

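MiniCPM-o 2.6 is the latest and best-performing model in the MiniCPM-o series. It is built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, for a total of 8B parameters, and is trained and run for inference end to end. Compared with MiniCPM-V 2.6, it delivers a significant performance improvement and adds support for real-time voice conversation and multimodal streaming interaction.

Model link: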

https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6

Example code:

Install dependencies

!pip install vector-quantize-pytorch==1.18.5
!pip install vocos==0.1.0
!pip install transformers==4.44.2

Inference code

import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer
# Load the omni model by default: init_vision, init_audio, and init_tts all default to True
# To load a vision-only model, set init_audio=False and init_tts=False
# To load an audio-only model, set init_vision=False
model = AutoModel.from_pretrained(
    'OpenBMB/MiniCPM-o-2_6',  # same ModelScope ID as the model link and the tokenizer below
    trust_remote_code=True,
    attn_implementation='sdpa', # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-o-2_6', trust_remote_code=True)
# Except in vision-only mode, the TTS processor and vocos also need to be initialized
model.init_tts()
model.tts.float()
import math
import numpy as np
# moviepy 1.x import path; moviepy 2.x removed the moviepy.editor module
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf
def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)
    # 1 frame + 1s audio chunk
    contents= []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])
    return contents
video_path="/mnt/workspace/video.mp4"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# To use a voice-clone prompt, set ref_audio:
# ref_audio_path = '/path/to/ref_audio'
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
contents = get_video_chunk_content(video_path)
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]
# Set generate_audio=True and output_audio_path to save the TTS result
generate_audio = True
output_audio_path = '/mnt/workspace/4.wav'
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True, # required for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
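
The chunking step in get_video_chunk_content can be hard to picture without a video file at hand. The following is a minimal sketch of the same slicing arithmetic on toy data, using plain Python lists instead of moviepy and librosa. The helper name split_into_units and the toy sample rate are illustrative assumptions and are not part of the MiniCPM-o API.

```python
import math


def split_into_units(audio, sr, duration):
    """Pair each started second with its frame timestamp and audio slice.

    Mirrors the loop in get_video_chunk_content: unit i uses the frame
    at time i + 1 and the samples in [sr * i, sr * (i + 1)).
    """
    num_units = math.ceil(duration)
    units = []
    for i in range(num_units):
        frame_time = i + 1                       # timestamp passed to get_frame
        chunk = audio[sr * i: sr * (i + 1)]      # up to one second of samples
        units.append((frame_time, chunk))
    return units


# Toy data: 4 samples per second (the real snippet uses 16000), 2.5 s of audio.
toy_sr = 4
toy_audio = list(range(10))
toy_units = split_into_units(toy_audio, toy_sr, duration=2.5)
for frame_time, chunk in toy_units:
    print(frame_time, chunk)
```

With a duration that is not a whole number of seconds, the last unit carries a shorter audio slice, and its frame timestamp (i + 1) lies past the end of the clip.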

internlm3-8b-instruct

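InternLM3 is a major upgrade of the InternLM (书生) large model from Shanghai AI Laboratory. Its refined data framework substantially improves data efficiency and "thinking density". Trained on only 4T tokens, InternLM3-8B-Instruct outperforms open-source models of the same scale overall and matches the effect that mainstream models reach with 18T tokens of training, saving more than 75% of the training cost. It is also the first general-purpose model to combine regular conversation with deep thinking, which greatly broadens the real-world scenarios it can handle.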
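
The "more than 75%" figure can be sanity-checked with simple arithmetic. The sketch below assumes, as a simplification that the source does not state, that training cost scales linearly with the number of training tokens for a fixed model size. The function name token_saving is illustrative.

```python
def token_saving(used_tokens_t, baseline_tokens_t):
    """Fraction of training tokens saved relative to a baseline (both in trillions)."""
    return 1 - used_tokens_t / baseline_tokens_t


saving = token_saving(4, 18)   # 4T tokens used vs. an 18T baseline
print(f"{saving:.1%}")         # prints 77.8%
```

Under that assumption the saving is about 77.8%, which is consistent with the stated "more than 75%".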

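InternLM3 follows a path that integrates general and specialized capabilities. Combined with a large-scale data refinement framework, it raises the quality of the training data, introduces the concept of "thinking density" to improve model performance, and offers a new paradigm for Scaling Law research. It also builds a synthetic-data exploration scheme: instructions are annotated against a world knowledge tree and high-quality responses are generated by multiple agents, producing a fine-tuning instruction dataset of several hundred thousand entries and improving the conversational experience. Evaluations show that InternLM3 performs well on several authoritative benchmarks, with overall performance close to GPT-4o-mini.

Model link: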

https://www.modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct

Example code:

Run inference with transformers:

import torch
from modelscope import AutoTokenizer, AutoModelForCausalLM
model_dir = "Shanghai_AI_Laboratory/internlm3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded as float32 and might cause an OOM error.
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save GPU memory.
# InternLM3 8B in 4-bit needs nearly 8GB of GPU memory.
# pip install -U bitsandbytes
# 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
# 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()
model = model.cuda()
system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").cuda()
generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)
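# Drop the prompt tokens so that only the newly generated tokens are decoded.
# The result is a plain Python list of tensors, so it has no .cuda() method.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]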
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
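
The list comprehension near the end of the snippet removes the prompt from each generated sequence, because generate returns the prompt tokens followed by the new tokens. A minimal sketch of that slicing on plain Python lists, with made-up token IDs and an illustrative helper name strip_prompt:

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Return only the tokens that follow each prompt in the generated output."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Made-up IDs: a 3-token prompt followed by 3 newly generated tokens.
prompts = [[1, 2, 3]]
outputs = [[1, 2, 3, 40, 41, 2]]
new_tokens = strip_prompt(prompts, outputs)
print(new_tokens)  # prints [[40, 41, 2]]
```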

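麦橘超然 (majicflus_v1)

麦橘超然 is a Flux.1-based model created by 麦橘. It generates highly photorealistic images with a strong sense of light and shadow, and is especially good at rendering facial and skin detail. 麦橘's earlier work, 麦橘写实, is one of the most popular models on the major open-source text-to-image sites.

The model fuses several model architectures to produce a realistic portrait-photography style and renders fine details such as hair, eyes, and freckles. Its handling of light and shadow is a particular strength: it reproduces light-dark contrast and enhances depth and atmosphere, which suits dark and shadowed scenes. In addition, in collaboration with more than 30 community creators, over 50 LoRAs trained on this model have been released.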

Model link:

https://modelscope.cn/models/MAILAND/majicflus_v1

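How to use the model:

ModelScope has prepared a one-click ComfyUI toolkit. Combined with the community's 麦橘超然 workflow and the free notebook compute on ModelScope, it gives you a dedicated environment for generating images freely.

One-click ComfyUI toolkit link: https://modelscope.cn/models/AI-ModelScope/ComfyUI-MajicFlus
麦橘超然 workflow link: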

https://modelscope.oss-cn-beijing.aliyuncs.com/resource/majicflus.json

!wget "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/cloudflared-linux-amd64.deb"
!dpkg -i cloudflared-linux-amd64.deb
!git clone https://www.modelscope.cn/AI-ModelScope/ComfyUI-MajicFlus.git
%cd /mnt/workspace/ComfyUI-MajicFlus
import subprocess
import threading
import time
import socket
import urllib.request
def iframe_thread(port):
  while True:
      time.sleep(0.5)
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      result = sock.connect_ex(('127.0.0.1', port))
      if result == 0:
        break
      sock.close()
  print("\nComfyUI finished loading, trying to launch cloudflared (if it gets stuck here cloudflared is having issues)\n")
  p = subprocess.Popen(["cloudflared", "tunnel", "--url", "http://127.0.0.1:{}".format(port)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  for line in p.stderr:
    l = line.decode()
    if "trycloudflare.com " in l:
      print("This is the URL to access ComfyUI:", l[l.find("http"):], end='')
    #print(l, end='')
threading.Thread(target=iframe_thread, daemon=True, args=(8188,)).start()
!python main.py --dont-print-server
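
The iframe_thread helper above polls the port forever until ComfyUI starts listening. As a hedged variation, the sketch below shows the same polling idea with a timeout, so the caller can detect a server that never comes up. The function name wait_for_port and its parameters are illustrative assumptions, not part of ComfyUI or cloudflared.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", interval=0.5, timeout=60.0):
    """Poll host:port until a TCP connection succeeds or the timeout expires.

    Returns True once the port accepts connections, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # The context manager closes the socket on every iteration.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(interval)
            if sock.connect_ex((host, port)) == 0:
                return True
        time.sleep(interval)
    return False


# Example use (not executed here): wait up to a minute for ComfyUI on port 8188.
# if wait_for_port(8188):
#     print("ComfyUI is listening")
```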

phi-4

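Phi-4 is a model developed by Microsoft and hosted on ModelScope under the LLM-Research organization. It is suited to tasks such as language understanding, text generation, multilingual support, and knowledge reasoning.

Model link: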

https://modelscope.cn/models/LLM-Research/phi-4

Usage:

Input format

Given the nature of its training data, phi-4 is best suited to prompts in the following chat format:

<|im_start|>system<|im_sep|>
You are a medieval knight and must provide explanations to modern people.<|im_end|>
<|im_start|>user<|im_sep|>
How should I explain the Internet?<|im_end|>
<|im_start|>assistant<|im_sep|>
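
As an illustration of the format above, the following sketch assembles a prompt string from a list of role/content messages. It mirrors the layout exactly as printed here, including the line breaks; the helper name format_phi4_prompt is hypothetical. In practice the tokenizer's own chat template (for example via tokenizer.apply_chat_template) remains the authoritative source of the format.

```python
def format_phi4_prompt(messages):
    """Render messages in the chat layout shown above, ending with an open assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}<|im_sep|>\n"
            f"{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so that the model continues from here.
    parts.append("<|im_start|>assistant<|im_sep|>\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
]
print(format_phi4_prompt(example_messages))
```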

With transformers

import transformers
from modelscope import snapshot_download
model_dir = snapshot_download("LLM-Research/phi-4")
pipeline = transformers.pipeline(
    "text-generation",
    model=model_dir,
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
]
outputs = pipeline(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1])

memo

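MEMO is a video generation model released by research teams from Skywork AI, Nanyang Technological University, and the National University of Singapore. From a single image and an audio clip, it generates realistic portrait videos with natural, fluid expressions and lip movements synchronized to the audio.

Model link: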

https://www.modelscope.cn/models/ltzheng/memo

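Installation (run from the root of the MEMO source checkout, since pip install -e . installs the current directory)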

conda create -n memo python=3.10 -y
conda activate memo
conda install -c conda-forge ffmpeg -y
pip install -e .

Inference

python inference.py --config configs/inference.yaml --input_image <IMAGE_PATH> --input_audio <AUDIO_PATH> --output_dir <SAVE_PATH>

For example:

python inference.py --config configs/inference.yaml --input_image assets/examples/dicaprio.jpg --input_audio assets/examples/speech.wav --output_dir outputs

Valley2

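Valley2 is a novel multimodal large language model that aims to improve performance across domains through a scalable vision-language design and to extend the practical reach of e-commerce and short-video applications. Valley2 achieves state-of-the-art performance in the e-commerce and short-video domains. It introduces innovations such as a large visual vocabulary, a convolutional adapter (ConvAdapter), and the Eagle module, which make it more flexible with diverse real-world inputs while improving training and inference efficiency. Valley2 uses Qwen2.5 as its LLM backbone and SigLIP-384 as its vision encoder, and combines MLP layers with convolutions for efficient feature transformation.

Model link: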

https://www.modelscope.cn/models/bytedance-research/Valley-Eagle-7B

Example code:

Model inference

from valley_eagle_chat import ValleyEagleChat
from modelscope import snapshot_download
import urllib.request
# In the model's config.json, change eagle_vision_tower and mm_vision_tower to local paths (for example, the directories downloaded below)
model_dir = snapshot_download("bytedance-research/Valley-Eagle-7B")
!modelscope download --model=Qwen/Qwen2-VL-7B-Instruct --local_dir=./Qwen2-VL-7B-Instruct
!modelscope download --model=AI-ModelScope/siglip-so400m-patch14-384 --local_dir=./siglip-so400m-patch14-384
model = ValleyEagleChat(
    model_path=model_dir,
    padding_side = 'left',
)
url = 'http://p16-goveng-va.ibyteimg.com/tos-maliva-i-wtmo38ne4c-us/4870400481414052507~tplv-wtmo38ne4c-jpeg.jpeg'
img = urllib.request.urlopen(url=url, timeout=5).read()
request = {
    "chat_history": [
        {'role': 'system', 'content': 'You are Valley, developed by ByteDance. Your are a helpfull Assistant.'},
        {'role': 'user', 'content': 'Describe the given image.'},
    ],
    "images": [img],
}
result = model(request)
print(f"\n>>> Assistant:\n")
print(result)
# Video inference example (reuses model_dir from the image example above)
from valley_eagle_chat import ValleyEagleChat
import decord
import requests
import numpy as np
from torchvision import transforms
model = ValleyEagleChat(
    model_path=model_dir,
    padding_side = 'left',
)
url = 'https://videos.pexels.com/video-files/29641276/12753127_1920_1080_25fps.mp4'
video_file = './video.mp4'
response = requests.get(url)
if response.status_code == 200:
    with open(video_file, "wb") as f:
        f.write(response.content)
else:
    print("download error!")
    exit(1)
video_reader = decord.VideoReader(video_file)
decord.bridge.set_bridge("torch")
video = video_reader.get_batch(
    np.linspace(0,  len(video_reader) - 1, 8).astype(np.int_)
).byte()
print([transforms.ToPILImage()(image.permute(2, 0, 1)).convert("RGB") for image in video])
request = {
    "chat_history": [
        {'role': 'system', 'content': 'You are Valley, developed by ByteDance. Your are a helpfull Assistant.'},
        {'role': 'user', 'content': 'Describe the given video.'},
    ],
    "images": [transforms.ToPILImage()(image.permute(2, 0, 1)).convert("RGB") for image in video],
}
result = model(request)
print(f"\n>>> Assistant:\n")
print(result)
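
The video example picks 8 frames spread evenly across the clip with np.linspace(0, len(video_reader) - 1, 8). The sketch below reproduces that index selection in plain Python to show which frames are chosen. The function name uniform_frame_indices is illustrative, and the results may differ from NumPy in rare floating-point edge cases.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Evenly spaced frame indices from 0 to num_frames - 1, truncated to integers."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    return [int(i * last / (num_samples - 1)) for i in range(num_samples)]


indices = uniform_frame_indices(100)
print(indices)  # prints [0, 14, 28, 42, 56, 70, 84, 99]
```

The first and last frames are always included, and a clip with fewer frames than samples yields repeated indices.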