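🙋 ModelScope community updates in this issue:
📟 1,340 models: Tongyi Qianwen QwQ-32B, HunyuanVideo-I2V, CogView4-6B, the Phi-4 series, and more;
📁 220 datasets: Big-Math-RL-UNVERIFIED, msmarco-msmarco-MiniLM-L6-v3, KodCode-V1, and more;
🎨 91 innovative applications: CogView4, QwQ-32B-Demo, Reasoning Model Showdown (QwQ-32B vs DeepSeek-R1), and more;
📄 8 articles:
- Tencent open-sources the HunyuanVideo-I2V image-to-video model plus LoRA training scripts: the community deployment and inference tutorial is here!
- QwQ-32B is open source! Smaller size, matching the full-strength R1 with only 1/20 of the parameters
- Microsoft open-sources the Phi-4 series: innovative breakthroughs in multimodal and text processing
- Building cross-lingual intelligent tools and applications: the "万卷·丝路" (WanJuan Silk Road) special project is open for applications
- CogView4 open-source release! Zhipu AI's text-to-image model supports bilingual input of any length, excels at Chinese character generation, and is available for commercial use!
- CLIPer: a pioneering framework that improves CLIP's spatial representation and achieves a breakthrough in open-vocabulary semantic segmentation
- Efficient deployment of Tongyi Wanxiang Wan2.1: hands-on text-to-video and image-to-video with ComfyUI, workflows ready to use!
- Efficient deployment of Tongyi Wanxiang Wan2.1: hands-on WebUI built with Gradio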
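01. Featured Models
Tongyi Qianwen QwQ-32B
QwQ-32B is the latest open-source reasoning model from the Tongyi Qianwen team. With 32 billion parameters, it delivers performance comparable to much larger models. Highlights:
- Strong reasoning: on par with DeepSeek-R1 on math (AIME24), close to it on coding (LiveCodeBench), and ahead of DeepSeek-R1 on general-capability evaluations such as LiveBench;
- Staged reinforcement learning: the early stage optimizes with math answer verification and code execution feedback; the later stage adds training with a general reward model, balancing performance across domains;
- Lightweight deployment: runs on a single card (e.g., a 3090 or M4 Max), lowering the hardware barrier;
- Open-source benchmark value: at roughly 1/20 the parameter count of DeepSeek-R1, it demonstrates the efficiency gains of reinforcement learning and gives developers a cost-effective option.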
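The early-stage signal mentioned in the list above, checking a model's math answer against a reference, can be pictured as a simple rule-based reward. The following is a minimal sketch under stated assumptions, not the Qwen team's implementation: `normalize_answer` and `math_reward` are hypothetical names, and the normalization rules are illustrative only.

```python
from fractions import Fraction


def normalize_answer(text):
    """Parse a final answer into an exact Fraction when possible, else a cleaned string."""
    cleaned = text.strip().rstrip(".").replace(",", "").replace("$", "")
    try:
        return Fraction(cleaned)  # handles "3", "0.5", "1/2", "-7"
    except (ValueError, ZeroDivisionError):
        return cleaned.lower()  # fall back to case-insensitive string comparison


def math_reward(model_answer, reference):
    """Outcome-only reward: 1.0 if the answers match after normalization, else 0.0."""
    return 1.0 if normalize_answer(model_answer) == normalize_answer(reference) else 0.0


print(math_reward("0.5", "1/2"))  # same value written two ways
print(math_reward("3", "4"))      # different values
```

A reward like this only looks at the final answer, which is what makes it cheap to verify at scale; a real pipeline would need far more robust answer extraction and parsing.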
Model collection link:
https://www.modelscope.cn/collections/QwQ-32B-0f1806b8a8514a
Example code:
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Load the model and tokenizer; device_map="auto" spreads weights across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a single-turn chat prompt with the model's chat template
prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Keep only the newly generated tokens (drop the prompt tokens)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
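As a rough sanity check on the single-card claim, the memory needed just to hold the weights can be estimated from the parameter count. This is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical helper, the 32e9 figure is the rounded parameter count from the text, and KV cache, activations, and framework overhead are ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


QWQ_PARAMS = 32e9  # rounded "32 billion parameters" from the text; the exact count differs slightly

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(QWQ_PARAMS, bits):.0f} GB")
```

Under these assumptions, 16-bit weights alone need about 64 GB, while a 4-bit quantized copy needs about 16 GB. That suggests (as an inference, not a statement from the source) that fitting on a 24 GB card such as a 3090 relies on a quantized variant rather than the 16-bit checkpoint.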
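For more hands-on tutorials on deploying, running inference with, and fine-tuning the model, see:
QwQ-32B is open source! Smaller size, matching the full-strength R1 with only 1/20 of the parameters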
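HunyuanVideo-I2V image-to-video model
Tencent Hunyuan has released and open-sourced the HunyuanVideo-I2V image-to-video model. Built on the HunyuanVideo text-to-video foundation model, it leverages the base model's video generation capabilities to extend to image-to-video generation. The LoRA training code is open-sourced alongside it for customized special-effect generation, enabling more entertaining video effects.
The research team uses an image latent concatenation technique to reconstruct the reference image information and incorporate it into the video generation process. A pretrained multimodal large language model with a decoder-only architecture serves as the text encoder, strengthening the model's understanding of the input image's semantic content and deeply fusing the image with the text description.
Model link: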
https://modelscope.cn/models/AI-ModelScope/HunyuanVideo-i2v
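Example code:
A local inference setup for HunyuanVideo-I2V is provided. The hardware requirements are as follows:
| Model | Resolution | Peak GPU memory |
| --- | --- | --- |
| HunyuanVideo-I2V | 720p | 60GB |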
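- An NVIDIA GPU with CUDA support is required
- Tested on a single 80GB GPU
- Minimum: at least 60GB of GPU memory for 720p resolution
- Recommended: an 80GB GPU for better generation quality
- Tested operating system: Linux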
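Before cloning anything, it can help to compare the local GPU against the requirements listed above. The sketch below is illustrative: `vram_verdict` and `detect_total_vram_gb` are hypothetical helpers, the thresholds are the nominal 60GB and 80GB figures from the list, and the detection step assumes PyTorch is installed (it returns `None` otherwise).

```python
MIN_VRAM_GB = 60          # minimum for 720p, from the list above
RECOMMENDED_VRAM_GB = 80  # recommended configuration, from the list above


def vram_verdict(total_gb):
    """Classify a GPU's total memory against the stated requirements."""
    if total_gb >= RECOMMENDED_VRAM_GB:
        return "recommended"
    if total_gb >= MIN_VRAM_GB:
        return "minimum"
    return "insufficient"


def detect_total_vram_gb():
    """Return total memory of GPU 0 in GB (1e9 bytes), or None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


total = detect_total_vram_gb()
print("no CUDA GPU detected" if total is None else f"{total:.0f} GB: {vram_verdict(total)}")
```

Total memory is only a coarse proxy: other processes on the same GPU reduce what is actually available at run time.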
Clone the code
git clone https://github.com/tencent/HunyuanVideo-I2V
cd HunyuanVideo-I2V
Set up the environment
pip install -r requirements.txt
pip install ninja
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
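Hunyuan image-to-video consists of three models: the base model hunyuan-video-i2v-720p and two text encoders (text_encoder_i2v and text_encoder_2). After download, the models are placed in the HunyuanVideo-I2V/ckpts folder by default, with the following file structure: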
HunyuanVideo-I2V
  ├──ckpts
  │  ├──README.md
  │  ├──hunyuan-video-i2v-720p
  │  │  ├──transformers
  │  │  │  ├──mp_rank_00_model_states.pt
  │  │  ├──vae
  │  │  ├──lora
  │  │  │  ├──embrace_kohaya_weights.safetensors
  │  │  │  ├──hair_growth_kohaya_weights.safetensors
  │  ├──text_encoder_i2v
  │  ├──text_encoder_2
  ├──...
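Once the checkpoints have been downloaded, a short script can confirm that the layout matches the tree above before launching a long inference run. This is a minimal sketch: `missing_paths` is a hypothetical helper, and the list checks only a subset of the entries shown in the tree.

```python
from pathlib import Path

# Relative paths taken from the directory tree above (a subset, not the full listing).
EXPECTED = [
    "ckpts/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt",
    "ckpts/hunyuan-video-i2v-720p/vae",
    "ckpts/text_encoder_i2v",
    "ckpts/text_encoder_2",
]


def missing_paths(repo_root, expected=EXPECTED):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


# An empty list means every expected file and folder is in place.
print(missing_paths("HunyuanVideo-I2V"))
```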
All three models can be downloaded from ModelScope with the following commands:
cd HunyuanVideo-I2V
# Download the base model
modelscope download --model AI-ModelScope/HunyuanVideo-I2V --local_dir ./ckpts
# Download the MLLM text encoder
modelscope download --model AI-ModelScope/llava-llama-3-8b-v1_1-transformers --local_dir ./ckpts/text_encoder_i2v
# Download the CLIP text encoder
modelscope download --model AI-ModelScope/clip-vit-large-patch14 --local_dir ./ckpts/text_encoder_2
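The same downloads can also be scripted from Python. The sketch below mirrors the three commands above; it assumes the installed `modelscope` version exposes `snapshot_download` with a `local_dir` argument, and `download_all` is a hypothetical wrapper. The call is left commented out because the downloads are large.

```python
# Model IDs and target folders copied from the commands above.
DOWNLOADS = [
    ("AI-ModelScope/HunyuanVideo-I2V", "./ckpts"),
    ("AI-ModelScope/llava-llama-3-8b-v1_1-transformers", "./ckpts/text_encoder_i2v"),
    ("AI-ModelScope/clip-vit-large-patch14", "./ckpts/text_encoder_2"),
]


def download_all(downloads=DOWNLOADS):
    """Fetch every model into its target folder; run from the HunyuanVideo-I2V directory."""
    # Imported lazily so the plan above can be inspected without modelscope installed.
    from modelscope import snapshot_download

    for model_id, local_dir in downloads:
        # Assumption: this modelscope version accepts `local_dir`.
        snapshot_download(model_id, local_dir=local_dir)


# download_all()  # uncomment to start the downloads
```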
Inference code
cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "A man with short gray hair plays a red electric guitar." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.png \
    --i2v-resolution 720p \
    --video-length 129 \
    --infer-steps 50 \
    --flow-reverse \
    --flow-shift 17.0 \
    --seed 0 \
    --use-cpu-offload \
    --save-path ./results
Time: generating a 5-second 1280*704 video with 50 steps on an A100 takes roughly 50 minutes
GPU memory usage: about 60GB
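The numbers in the command and the timing note can be related with a little arithmetic. The sketch below rests on assumptions that are not stated in this article: an output frame rate of 24 fps, and a video VAE that compresses time by 4x (keeping the first frame) and space by 8x. `video_seconds` and `latent_shape` are hypothetical helpers.

```python
def video_seconds(num_frames, fps):
    """Clip duration in seconds."""
    return num_frames / fps


def latent_shape(num_frames, height, width, t_stride=4, s_stride=8):
    """Latent (frames, height, width) for a video VAE that keeps the first frame as-is."""
    return ((num_frames - 1) // t_stride + 1, height // s_stride, width // s_stride)


FPS = 24  # assumption: output frame rate

print(video_seconds(129, FPS))       # 5.375 seconds, i.e. the "5-second" clip
print(latent_shape(129, 704, 1280))  # (33, 88, 160) under the assumed strides
print(50 * 60 / 50)                  # 60.0 seconds per denoising step, from 50 minutes / 50 steps
```

Under these assumptions, `--video-length 129` is of the form 4k + 1, and the reported run time works out to about one minute per step with CPU offload enabled.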
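For more hands-on deployment and inference tutorials, see:
Tencent open-sources the HunyuanVideo-I2V image-to-video model plus LoRA training scripts: the community deployment and inference tutorial is here!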
CogView4-6B
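Zhipu AI has released and open-sourced its latest image generation model, CogView4. The model offers complex semantic alignment and instruction following, accepts Chinese and English input of any length, can generate images at arbitrary resolutions, and renders text well, with particularly strong Chinese character generation.
It ranks first in overall score on the DPG-Bench benchmark, reaching SOTA among open-source text-to-image models. The model uses 2D rotary position encoding, flow-matching diffusion modeling, and a multi-stage training strategy to overcome the limits of earlier models on text length and image resolution. It suits creative work in advertising, short video, and similar fields in China, and is billed as the first image generation model open-sourced under the Apache 2.0 license.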
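To make the flow-matching idea mentioned above more concrete: sampling amounts to integrating a velocity field from a noise sample to a data sample. The sketch below is a one-dimensional toy, not CogView4's sampler. It assumes the convention that t=0 is noise and t=1 is data, and `toy_velocity` is an analytic stand-in for the trained network.

```python
def toy_velocity(x, t, target):
    """Exact velocity for the straight path x_t = (1 - t) * x0 + t * target."""
    return (target - x) / (1.0 - t)


def euler_sample(x0, target, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = x0
    for k in range(num_steps):
        t = k / num_steps  # t stays strictly below 1, so the division above is safe
        x = x + toy_velocity(x, t, target) / num_steps
    return x


# Starting from a "noise" value, the sample lands (numerically) on the target.
print(euler_sample(x0=-2.0, target=1.5, num_steps=50))
```

In a real model the velocity comes from a network conditioned on the text prompt, and the number of Euler-style steps corresponds to a setting such as `num_inference_steps`.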
Model link:
https://modelscope.cn/models/ZhipuAI/CogView4-6B
Example code:
Install dependencies
pip install git+https://github.com/huggingface/diffusers.git
Inference code
from diffusers import CogView4Pipeline
from modelscope import snapshot_download
import torch

model_dir = snapshot_download("ZhipuAI/CogView4-6B")
pipe = CogView4Pipeline.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

# Enable these options to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
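The `guidance_scale=3.5` argument in the call above controls classifier-free guidance. A common formulation, shown here as a plain-Python sketch rather than the pipeline's actual internals, pushes the prediction from the unconditional output toward the prompt-conditioned one. `apply_guidance` is a hypothetical helper and the short lists stand in for model outputs.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, -1.0]  # toy prediction without the prompt
cond = [1.0, 1.0, 0.0]     # toy prediction with the prompt

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 1.0, 0.0], identical to cond
print(apply_guidance(uncond, cond, 3.5))  # [3.5, 1.0, 2.5], pushed further along cond - uncond
```

A scale of 1.0 reproduces the conditional prediction; larger values follow the prompt more strongly, usually at some cost in diversity.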
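Phi-4 series
Microsoft's newly open-sourced Phi-4 series includes Phi-4-multimodal and Phi-4-mini. Phi-4-multimodal is a 5.6B-parameter multimodal language model that processes speech, vision, and text together, opening new possibilities for context-aware applications. Phi-4-mini is a compact 3.8B-parameter model designed for speed and efficiency that performs well on text-based tasks.
Model collection link: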
https://www.modelscope.cn/collections/phi-4-4ce2630c1b664f
Example code:
Phi-4-mini-instruct inference code:
from vllm import LLM, SamplingParams

# Assumption: VLLM_USE_MODELSCOPE=True is set in the environment so that vLLM resolves this ID on ModelScope
llm = LLM(model="LLM-Research/Phi-4-mini-instruct", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

# temperature=0.0 gives greedy (deterministic) decoding
sampling_params = SamplingParams(
    max_tokens=500,
    temperature=0.0,
)

output = llm.chat(messages=messages, sampling_params=sampling_params)
print(output[0].outputs[0].text)
Phi-4-multimodal-instruct inference code:
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
from modelscope import snapshot_download

# Define model path
model_path = snapshot_download("LLM-Research/Phi-4-multimodal-instruct")

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    #attn_implementation='flash_attention_2',
).cuda()

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Drop the prompt tokens so only the newly generated text is decoded
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
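Both prompts in the script above follow the same pattern: a user tag, numbered media placeholders, the instruction text, an end tag, and an assistant tag. A small helper can make that pattern explicit. `build_prompt` is a hypothetical function; numbering beyond `_1` and placing image placeholders before audio ones are assumptions that go beyond what the script shows.

```python
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"


def build_prompt(text, num_images=0, num_audios=0):
    """Build a single-turn prompt with numbered image/audio placeholders before the text."""
    images = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audios = "".join(f"<|audio_{i}|>" for i in range(1, num_audios + 1))
    return f"{USER}{images}{audios}{text}{END}{ASSISTANT}"


# Reproduces the image prompt used in the script above.
print(build_prompt("What is shown in this image?", num_images=1))
```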
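For more hands-on fine-tuning tutorials for the Phi-4 series, see:
02. Recommended Datasets
Big-Math-RL-UNVERIFIED
Big-Math-RL-UNVERIFIED is a high-quality dataset focused on math problems. It contains more than 250,000 math problems with verifiable solutions and is designed for applying reinforcement learning (RL) to language models.
Dataset link: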
https://www.modelscope.cn/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED
msmarco-msmarco-MiniLM-L6-v3
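msmarco-MiniLM-L6-v3 is a pretrained language model based on Sentence-Transformers, designed for text embedding and semantic similarity tasks. It was fine-tuned on the MS MARCO dataset and suits information retrieval, text matching, and semantic search.
Model link: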
https://www.modelscope.cn/models/sentence-transformers/msmarco-MiniLM-L6-v3
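Semantic search with an embedding model of this kind usually comes down to ranking documents by cosine similarity to the query embedding. The sketch below uses toy 3-dimensional vectors in place of real embeddings, which in practice would come from something like a Sentence-Transformers `encode` call. `cosine_similarity` and `rank_by_similarity` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real query and passage embeddings.
query = [1.0, 0.0, 1.0]
docs = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
print(rank_by_similarity(query, docs))  # [0, 2, 1]
```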
KodCode-V1
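KodCode is the largest fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning domains (from algorithms to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for supervised fine-tuning (SFT) and RL tuning.
Dataset link: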
https://www.modelscope.cn/datasets/AI-ModelScope/KodCode-V1
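"Verifiable solutions and tests" means a candidate solution can be checked mechanically by running it against its tests. The sketch below illustrates that idea with a subprocess; it does not use KodCode's actual schema or tooling, `passes_tests` is a hypothetical helper, and a subprocess is not a security sandbox, so untrusted code needs real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code, test_code, timeout=10):
    """Run solution + tests in a fresh interpreter; True if it exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hang as a failure
    return result.returncode == 0


solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(solution, tests))  # True
```

A pass/fail result like this can serve as a filter for SFT data or as a binary reward in RL.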
03. Featured Applications
CogView4
Try it here:
https://www.modelscope.cn/studios/ZhipuAI/CogView4
QwQ-32B-Demo
Try it here:
https://modelscope.cn/studios/Qwen/QwQ-32B-Demo
Reasoning Model Showdown (QwQ-32B vs DeepSeek-R1)
Try it here:
https://www.modelscope.cn/studios/AI-ModelScope/QwQ-32B-vs-DeepSeek-R1