CodeFuse-MFTCoder project:
https://github.com/codefuse-ai/MFTCoder
CodeFuse-Qwen-14B model:
https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B
https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B/summary
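1. Code Capabilities of the Qwen-14B Base Model
Qwen (Tongyi Qianwen) is a family of large language models open-sourced by Alibaba Cloud. It currently comes in two parameter sizes: Qwen-7B and Qwen-14B.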
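Qwen builds on the LLaMA architecture with the following changes:
- Vocabulary: starting from an original vocabulary of 110k tokens, Chinese characters and words were added and the digits 0-9 were added as separate tokens, giving a final vocabulary size of 152K.
- Input embedding: the weights of the input embedding and the output projection are not tied.
- Positional embedding: RoPE is used, computed in FP32 rather than BF16 or FP16 for better results.
- Bias: biases are removed from all layers except the QKV layer of attention.
- RMSNorm: RMSNorm replaces the more common LayerNorm because it is more efficient.
- Activation function: SwiGLU.
- FFN layer: the inner dimension is 8/3 of the hidden size instead of the usual 4 times.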
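- Context extension: NTK-aware interpolation and LogN-Scaling extend the token length to 8192 while preserving the modeling of both high-frequency information and long-range dependencies.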
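To make two of the items above concrete, here is a minimal pure-Python sketch; it is not Qwen's actual implementation. `rms_norm` follows the standard RMSNorm definition (rescale by the reciprocal root mean square, with no mean subtraction), and `ffn_param_count` illustrates a commonly cited reason for the 8/3 factor: a gated FFN such as SwiGLU has three weight matrices, so an inner size of 8/3 times the hidden size gives the same weight count as a conventional two-matrix FFN at 4 times. The function names, the `eps` default, and the example hidden size are assumptions made for illustration.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Standard RMSNorm: x / sqrt(mean(x^2) + eps) * weight, with no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def ffn_param_count(hidden: int, ratio: float, num_matrices: int) -> int:
    """Weight count of an FFN built from `num_matrices` matrices of shape hidden x (ratio * hidden)."""
    inner = round(hidden * ratio)
    return num_matrices * hidden * inner


# Hypothetical hidden size, chosen divisible by 3 so that 8/3 * hidden is an integer.
HIDDEN = 3072
classic = ffn_param_count(HIDDEN, 4.0, 2)      # up + down projections
gated = ffn_param_count(HIDDEN, 8.0 / 3.0, 3)  # gate + up + down projections (SwiGLU-style)
print(classic == gated)
```

With these assumed sizes the two layouts have the same number of weights, which is why the smaller inner dimension does not shrink the layer.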
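According to Qwen's official evaluation, it outperforms open-source large language models of comparable size, including LLaMA, LLaMA2, ChatGLM2, Baichuan2, and InternLM, on every capability measured. See the figure below for the detailed comparison; the solid green line marks the highest score achieved by the other open-source models on each capability.
Figure 1: Radar chart of Qwen's capabilities compared with other models (taken from the Qwen technical report)
For code ability, we combined the official evaluation with our own comparison against other models on Python, Java, JavaScript, C++, and Go.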
Table 1: Code completion ability of Qwen-14B and other models.

| Model | Python | Java | JavaScript | C++ | Go | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-14B | 32.93% | 35.37% | 32.93% | 30.49% | 21.34% | 30.61% |
| Baichuan2-13B | 17.1% | 20.73% | 5.49% | 16.45% | 6.71% | 13.30% |
| CodeGeeX2-6B | 35.90% | 30.80% | 32.20% | 29.30% | 22.50% | 30.14% |
| StarCoder-15B | 33.57% | 30.22% | 30.79% | 31.55% | 17.62% | 28.75% |
| CodeLLama-13B | 43.29% | 41.46% | 38.41% | 34.76% | 29.27% | 37.44% |
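The table shows that Qwen-14B's code ability is far ahead of Baichuan2-13B, a general-purpose language model of similar size, and is on par with dedicated code models such as CodeGeeX2 and StarCoder. CodeLLama still outperforms Qwen on code, possibly because CodeLLama received additional training on code on top of LLaMA2. We also note that Qwen has a code-specific version called Code-Qwen, which, like CodeLLama, was further trained on a large amount of code data. Code-Qwen has not been open-sourced yet, and we look forward to seeing its stronger performance on code.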
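2. Multi-Task Fine-Tuning of Qwen-14B with CodeFuse-MFTCoder
With the multi-task fine-tuning (MFT) capability of CodeFuse-MFTCoder, we can fine-tune Qwen-14B on several code task datasets at once. We selected six core code tasks: Code Completion, code generation (Text2Code), Code Translation, Unittest Generation, SQL generation (Text2Sql), and error repair (Exception Retry), for a total of 1.1 million instruction Q&A samples.
Training used MFTCoder's multi-task QLoRA (4-bit quantization) fine-tuning mode with the following configuration: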
{ "lora_rank": 96, "lora_alpha": 32, "lora_dropout": 0.05, "targeting_modules": ["c_proj", "c_attn", "w1", "w2"] }
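As a rough guide to what these numbers mean, the sketch below parses the settings and derives two quantities under the standard LoRA formulation, which is assumed here rather than confirmed for MFTCoder: the update scaling `alpha / rank`, and the number of weights a rank-`r` adapter adds to one linear layer. The helper names and the 4096 x 4096 layer shape are hypothetical and are not taken from Qwen-14B.

```python
import json

# The adapter settings quoted above, parsed as JSON.
config = json.loads(
    '{"lora_rank": 96, "lora_alpha": 32, "lora_dropout": 0.05, '
    '"targeting_modules": ["c_proj", "c_attn", "w1", "w2"]}'
)


def lora_scaling(rank: int, alpha: float) -> float:
    """Scaling of the low-rank update in the standard LoRA formulation: alpha / rank."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights added to one d_in x d_out linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


scaling = lora_scaling(config["lora_rank"], config["lora_alpha"])

# Hypothetical square projection, used only to show the order of magnitude.
D = 4096
added = lora_trainable_params(D, D, config["lora_rank"])
full = D * D
print(f"scaling = {scaling:.3f}, adapter/full = {added / full:.2%}")
```

Under these assumptions a rank of 96 with alpha 32 scales the update by one third, and the adapter adds a few percent of the weights of the layer it wraps.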
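We trained on this data for about 4 epochs, until convergence. The training loss is shown in the figure below.
After multi-task fine-tuning, CodeFuse-Qwen-14B improves substantially across all of the code tasks.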
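3. CodeFuse-Qwen-14B Results
Next we look at some generations from the MFT fine-tuned Qwen-14B. It performs well on code completion, code translation, code repair, SQL generation, and unit test generation; more detailed samples are shown below.
Code Generation (Code Completion / Text2Code)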
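The MFT fine-tuned Qwen-14B generates better code. We evaluated it on HumanEval-x, a code completion benchmark covering five programming languages (see Table 2). Compared with Baichuan2-13B-Base, Qwen-14B-Base, CodeGeeX2-6B, and StarCoder-15B, the fine-tuned Qwen-14B-MFT ranks first on Java, Python, C++, and JavaScript, and improves on the base model by about 9.75 percentage points on average (30.61% -> 40.36%). Against CodeLLama, the best of the remaining models, the largest gain is on JavaScript (38.41% -> 46.34%), and the mean pass@1 improves by 2.92 percentage points (37.44% -> 40.36%):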
Table 2: Code completion ability of CodeFuse-Qwen-14B and other models

| Model | Python | Java | JavaScript | C++ | Go | Average |
| --- | --- | --- | --- | --- | --- | --- |
| CodeFuse-Qwen-14B | 48.78% | 41.46% | 46.34% | 38.41% | 26.83% | 40.36% |
| Qwen-14B | 32.93% | 35.37% | 32.93% | 30.49% | 21.34% | 30.61% |
| Baichuan2-13B | 17.1% | 20.73% | 5.49% | 16.45% | 6.71% | 13.30% |
| CodeGeeX2-6B | 35.90% | 30.80% | 32.20% | 29.30% | 22.50% | 30.14% |
| StarCoder-15B | 33.57% | 30.22% | 30.79% | 31.55% | 17.62% | 28.75% |
| CodeLLama-13B | 43.29% | 41.46% | 38.41% | 34.76% | 29.27% | 37.44% |
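The averages and gains quoted for Table 2 can be rechecked with a few lines of arithmetic. The sketch below copies the per-language scores of three of the models from the table and recomputes the mean and the differences in percentage points; only the variable names are new.

```python
from statistics import mean

# pass@1 scores (percent) copied from Table 2, in the order Python, Java, JavaScript, C++, Go.
SCORES = {
    "CodeFuse-Qwen-14B": [48.78, 41.46, 46.34, 38.41, 26.83],
    "Qwen-14B": [32.93, 35.37, 32.93, 30.49, 21.34],
    "CodeLLama-13B": [43.29, 41.46, 38.41, 34.76, 29.27],
}

# Mean over the five languages, rounded to two decimals as in the table.
averages = {name: round(mean(vals), 2) for name, vals in SCORES.items()}

# Gains in percentage points (absolute differences, not relative improvements).
gain_over_base = round(averages["CodeFuse-Qwen-14B"] - averages["Qwen-14B"], 2)
gain_over_codellama = round(averages["CodeFuse-Qwen-14B"] - averages["CodeLLama-13B"], 2)
print(averages, gain_over_base, gain_over_codellama)
```

The recomputed means match the Average column, and the gains come out to 9.75 points over the base model and 2.92 points over CodeLLama-13B.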
MFTCoder also gave Qwen-14B some improvement on natural language tasks, possibly because code tasks help strengthen logical reasoning in natural language.
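Table 3: Natural language ability of CodeFuse-Qwen-14B compared with the base model Qwen-14B

| Model / Test set | COPA: causal inference | C3: Chinese reading comprehension | CSL: Chinese scientific literature | AFQMC: financial sentence similarity | Ax-b: logical relation reasoning | RTE: logical relations |
| --- | --- | --- | --- | --- | --- | --- |
| CodeFuse-Qwen-14B | 95 | 91.1 | 68.5 | 71.99 | 67.84 | 76.53 |
| Qwen-14B | 93 | 90.8 | 54.4 | 69 | 49.5 | 71.50 |
| GPT-4 | 98 | 90.6 | 65 | 58.2 | 68.3 | 80.9 |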
Below are concrete samples for multi-language code completion and Text2Code:
Code Completion
Python

Question:

    from typing import List


    def filter_by_substring(strings: List[str], substring: str) -> List[str]:
        """ Filter an input list of strings only for ones that contain given substring
        >>> filter_by_substring([], 'a')
        []
        >>> filter_by_substring(['abc', 'bacd', 'cde', 'array'], 'a')
        ['abc', 'bacd', 'array']
        """

Answer:

        return [s for s in strings if substring in s]

Question:

    from typing import List


    def below_zero(operations: List[int]) -> bool:
        """ You're given a list of deposit and withdrawal operations on a bank account that starts with
        zero balance. Your task is to detect if at any point the balance of account fallls below zero, and
        at that point function should return True. Otherwise it should return False.
        >>> below_zero([1, 2, 3])
        False
        >>> below_zero([1, 2, -4, 5])
        True
        """

Answer:

        balance = 0
        for op in operations:
            balance += op
            if balance < 0:
                return True
        return False
C++

Question:

    /*
    You"re given a vector of deposit and withdrawal operations on a bank account that starts with
    zero balance. Your task is to detect if at any point the balance of account falls below zero, and
    at that point function should return true. Otherwise it should return false.
    >>> below_zero({1, 2, 3})
    false
    >>> below_zero({1, 2, -4, 5})
    true
    */
    #include<stdio.h>
    #include<vector>
    using namespace std;
    bool below_zero(vector<int> operations){

Answer:

        int balance = 0;
        for(int i=0; i<operations.size(); i++){
            balance += operations[i];
            if(balance < 0){
                return true;
            }
        }
        return false;
    }

Question:

    /*
    Filter an input vector of strings only for ones that contain given substring
    >>> filter_by_substring({}, "a")
    {}
    >>> filter_by_substring({"abc", "bacd", "cde", "vector"}, "a")
    {"abc", "bacd", "vector"}
    */
    #include<stdio.h>
    #include<vector>
    #include<string>
    using namespace std;
    vector<string> filter_by_substring(vector<string> strings, string substring){

Answer:

        vector<string> filtered_strings;
        for(string s : strings){
            if(s.find(substring) != string::npos){
                filtered_strings.push_back(s);
            }
        }
        return filtered_strings;
    }
Java

Question:

    import java.util.*;
    import java.lang.*;

    class Solution {
        /**
        You're given a list of deposit and withdrawal operations on a bank account that starts with
        zero balance. Your task is to detect if at any point the balance of account fallls below zero, and
        at that point function should return True. Otherwise it should return False.
        >>> belowZero(Arrays.asList(1, 2, 3))
        false
        >>> belowZero(Arrays.asList(1, 2, -4, 5))
        true
         */
        public boolean belowZero(List<Integer> operations) {

Answer:

            int balance = 0;
            for (int op : operations) {
                balance += op;
                if (balance < 0) {
                    return true;
                }
            }
            return false;
        }
    }

Question:

    import java.util.*;
    import java.lang.*;

    class Solution {
        /**
        Filter an input list of strings only for ones that contain given substring
        >>> filterBySubstring(List.of(), "a")
        []
        >>> filterBySubstring(Arrays.asList("abc", "bacd", "cde", "array"), "a")
        ["abc", "bacd", "array"]
         */
        public List<String> filterBySubstring(List<String> strings, String substring) {

Answer:

            List<String> filteredStrings = new ArrayList<>();
            for (String str : strings) {
                if (str.contains(substring)) {
                    filteredStrings.add(str);
                }
            }
            return filteredStrings;
        }
    }
JavaScript

Question:

    /* You're given a list of deposit and withdrawal operations on a bank account that starts with
      zero balance. Your task is to detect if at any point the balance of account fallls below zero, and
      at that point function should return true. Otherwise it should return false.
      >>> belowZero([1, 2, 3])
      false
      >>> belowZero([1, 2, -4, 5])
      true
      */
    const belowZero = (operations) => {

Answer:

      let balance = 0;
      for (let i = 0; i < operations.length; i++) {
        balance += operations[i];
        if (balance < 0) {
          return true;
        }
      }
      return false;
    };

Question:

    /* Filter an input list of strings only for ones that contain given substring
      >>> filterBySubstring([], 'a')
      []
      >>> filterBySubstring(['abc', 'bacd', 'cde', 'array'], 'a')
      ['abc', 'bacd', 'array']
      */
    const filterBySubstring = (strings, substring) => {

Answer:

      return strings.filter(string => string.includes(substring));
    };
Go

Question:

    // You're given a list of deposit and withdrawal operations on a bank account that starts with
    // zero balance. Your task is to detect if at any point the balance of account fallls below zero, and
    // at that point function should return true. Otherwise it should return false.
    // >>> BelowZero([1, 2, 3])
    // false
    // >>> BelowZero([1, 2, -4, 5])
    // true
    func BelowZero(operations []int) bool {

Answer:

        balance := 0
        for _, op := range operations {
            balance += op
            if balance < 0 {
                return true
            }
        }
        return false
    }

Question:

    import (
        "fmt"
    )

    // Input are two strings a and b consisting only of 1s and 0s.
    // Perform binary XOR on these inputs and return result also as a string.
    // >>> StringXor('010', '110')
    // '100'
    func StringXor(a string, b string) string {

Answer:

        var result string
        for i := 0; i < len(a); i++ {
            if a[i] != b[i] {
                result += "1"
            } else {
                result += "0"
            }
        }
        return result
Text2Code Tests
The model handles code generation instructions well in both Chinese and English. A few Python test cases are excerpted below.
Python, Chinese prompt |
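Question (translated from Chinese): "Write a function to check whether a given integer array contains any duplicate elements" |
Answer (translated from Chinese): This is a function in Python that uses Python's built-in set data structure to find duplicate elements. The elements of a set are unique, so if an element is already in the set it is ignored; otherwise it is added to the set. If the length of the set is greater than 0, duplicate elements exist.

    def has_duplicate(nums):
        return len(set(nums)) != len(nums)

The function first converts the integer array into a set and then compares the length of the set with the length of the original array. If they differ, there are duplicate elements and the function returns True. Otherwise it returns False. For example, for the array [1, 2, 3, 4, 5] the function returns False because there are no duplicates. For the array [1, 2, 2, 3, 4, 5] the function returns True because the element 2 is duplicated. |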
Python, English prompt |
Question: "Write a python function to find the product of non-repeated elements in a given array." |
Here is a Python function that solves the problem:

```python
def find_Product(arr, n):
    product = 1
    for i in range(n):
        if arr[i] not in arr[:i] and arr[i] not in arr[i+1:]:
            product *= arr[i]
    return product
```
This function works by iterating over the array and checking if the current element is not repeated in the array. If it is not, it multiplies the current element to the product. The `not in` operator is used to check if an element is not in a list. The `arr[:i]` and `arr[i+1:]` are used to check if the element is not repeated before and after the current index respectively. The function returns the product of all non-repeated elements in the array. The time complexity of this function is O(n^2) because for each element in the array, we are checking if it is not repeated in the array. The space complexity is O(1) because we are not using any extra space that scales with the input size. |
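Unit Test Generation (Unittest Generation)
Given a prompt, the model can generate unit tests for a function. The examples below are generated tests that were manually verified to pass:
Python unit test generation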
Question (translated from Chinese): Generate unit tests for the following Python code

    def total_match(lst1, lst2):
        l1 = 0
        for st in lst1:
            l1 += len(st)

        l2 = 0
        for st in lst2:
            l2 += len(st)

        if l1 <= l2:
            return lst1
        else:
            return lst2

Answer (comments translated from Chinese):

    import unittest

    class TestTotalMatch(unittest.TestCase):
        def test_equal_length(self):
            # Test the case where the two lists have equal length
            lst1 = ["abc", "def", "ghi"]
            lst2 = ["abc", "def", "jkl"]
            self.assertEqual(total_match(lst1, lst2), lst1)

        def test_different_length(self):
            # Test the case where the two lists have different lengths
            lst1 = ["abc", "def", "ghi"]
            lst2 = ["abc", "def"]
            self.assertEqual(total_match(lst1, lst2), lst2)

        def test_empty_list(self):
            # Test the case where both lists are empty
            lst1 = []
            lst2 = []
            self.assertEqual(total_match(lst1, lst2), lst1)

    if __name__ == "__main__":
        unittest.main()
Question (translated from Chinese): Generate unit tests for the following Python code

    def add_elements(arr, k):
        return sum(elem for elem in arr[:k] if len(str(elem)) <= 2)

Answer (comments translated from Chinese):

    import unittest

    class TestAddElements(unittest.TestCase):
        def test_add_elements_with_k_0(self):
            # Test that 0 is returned when k is 0
            arr = [1, 2, 3, 4, 5]
            k = 0
            result = add_elements(arr, k)
            self.assertEqual(result, 0)

        def test_add_elements_with_k_1(self):
            # Test that 1 is returned when k is 1
            arr = [1, 2, 3, 4, 5]
            k = 1
            result = add_elements(arr, k)
            self.assertEqual(result, 1)

        def test_add_elements_with_k_2(self):
            # Test that 1+2=3 is returned when k is 2
            arr = [1, 2, 3, 4, 5]
            k = 2
            result = add_elements(arr, k)
            self.assertEqual(result, 3)

    if __name__ == '__main__':
        unittest.main()
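The manual verification mentioned above could also be scripted. As a sketch of one possible approach (not the procedure the authors used), the block below puts the `total_match` function and its generated test class in one module and runs them with the standard `unittest` loader; `run_generated_tests` is a hypothetical helper name.

```python
import unittest


def total_match(lst1, lst2):
    """Function under test, as given in the first question above."""
    l1 = 0
    for st in lst1:
        l1 += len(st)
    l2 = 0
    for st in lst2:
        l2 += len(st)
    if l1 <= l2:
        return lst1
    else:
        return lst2


class TestTotalMatch(unittest.TestCase):
    """Generated test case from the first answer above (comments omitted)."""

    def test_equal_length(self):
        lst1 = ["abc", "def", "ghi"]
        lst2 = ["abc", "def", "jkl"]
        self.assertEqual(total_match(lst1, lst2), lst1)

    def test_different_length(self):
        lst1 = ["abc", "def", "ghi"]
        lst2 = ["abc", "def"]
        self.assertEqual(total_match(lst1, lst2), lst2)

    def test_empty_list(self):
        lst1 = []
        lst2 = []
        self.assertEqual(total_match(lst1, lst2), lst1)


def run_generated_tests(test_case: type) -> unittest.TestResult:
    """Load every test method of `test_case` and run it, collecting the outcome."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_generated_tests(TestTotalMatch)
print(result.testsRun, result.wasSuccessful())
```

A passing run only shows that the generated assertions agree with the function; it says nothing about how thoroughly they cover it.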
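Code Translation
Beyond code generation, the model also does quite well at translating code between several common programming languages, for example C++ <-> Python, Python <-> Java, and Java <-> C++. More detailed translation examples follow:
Code translation cases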
C++->Python

Question:

    # Translate the following C++ program to Python
    # C++
    #include<stdio.h>
    #include<math.h>
    #include<string>
    #include<algorithm>
    using namespace std;
    #include<stdlib.h>
    int hex_key(string num){
        string key="2357BD";
        int out=0;
        for (int i=0;i<num.length();i++)
        if (find(key.begin(),key.end(),num[i])!=key.end()) out+=1;
        return out;
    }
    # Python

Answer:

    def hex_key(num):
        key = "2357BD"
        out = 0
        for i in range(len(num)):
            if num[i] in key:
                out += 1
        return out
Python->C++

Question:

    # Translate the following Python program to C++
    # Python
    def check_Validity(a,b,c):
        if (a + b <= c) or (a + c <= b) or (b + c <= a) :
            return False
        else:
            return True
    # C++
    /**
     * The result code must follow this signature:
     * bool checkValidity(int a, int b, int c)
     */

Answer:

    bool checkValidity(int a, int b, int c) {
        if (a + b <= c || a + c <= b || b + c <= a) {
            return false;
        } else {
            return true;
        }
    }
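Code Repair (Exception Retry)
Beyond the code generation tasks above, such as code completion and code translation, the MFT fine-tuned model also does quite well at code repair. To evaluate it, we took some data from StackOverflow as a test set to show the model's repair ability. The cases are shown below: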
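Question (translated from Chinese):

    # Dividing by zero raises a ZeroDivisionError exception
    x = 10
    y = 0
    result = x / y  # ZeroDivisionError is raised here
    print(result)

The code above fails with the following error. Please fix the code based on the error message:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ZeroDivisionError: division by zero

Answer (translated from Chinese): The divisor in the code is 0, which causes the ZeroDivisionError exception. A check should be added before dividing to avoid a zero divisor.

    # Add a check to avoid a zero divisor
    x = 10
    y = 0
    if y != 0:
        result = x / y
        print(result)
    else:
        print("The divisor cannot be 0")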
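Question (translated from Chinese):

    # Generate a list of length 10
    my_list = [i for i in range(10)]
    # Sort the list, but with a wrong argument
    sorted_list = sorted(my_list, key=lambda x: 1/x)

The code above fails with the following error. Please fix the code based on the error message:

    Traceback (most recent call last):
      File "<stdin>", line 4, in <module>
    ZeroDivisionError: division by zero

Answer (translated from Chinese): Here is the fixed code:

    # Generate a list of length 10
    my_list = [i for i in range(10)]
    # Sort the list, but with a wrong argument
    sorted_list = sorted(my_list, key=lambda x: 1/x if x!=0 else 0)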
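SQL Generation (Text2Sql)
The model does well on both simple and complex SQL generation. For common SQL constructs such as multi-table JOIN ON, INTERSECT, GROUP BY, LIKE, ORDER BY, DISTINCT, COUNT, LIMIT, and MAX/MIN/</>/=, it also generates SQL statements that parse and are syntactically correct. Some concrete SQL generation cases follow: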
Question: You are a professional SQL engineer and you are writing SQL queries for data query tasks.\n\nWhat are the medicine and trade names that can interact as an inhibitor and activitor with enzymes?\nschema:medicine(id|name|Trade_Name|FDA_approved),medicine_enzyme_interaction(enzyme_id|medicine_id|interaction_type) |
Answer: SELECT T1.name , T1.trade_name FROM medicine AS T1 JOIN medicine_enzyme_interaction AS T2 ON T2.medicine_id = T1.id WHERE interaction_type = 'inhibitor' INTERSECT SELECT T1.name , T1.trade_name FROM medicine AS T1 JOIN medicine_enzyme_interaction AS T2 ON T2.medicine_id = T1.id WHERE interaction_type = 'activitor' |
Question: You are a professional SQL engineer and you are writing SQL queries for data query tasks.\n\nHow many schools are in the basketball match?\nschema:basketball_match(Team_ID|School_ID|Team_Name|ACC_Regular_Season|ACC_Percent|ACC_Home|ACC_Road|All_Games|All_Games_Percent|All_Home|All_Road|All_Neutral) |
Answer: SELECT count(DISTINCT school_id) FROM basketball_match |
Question: You are a professional SQL engineer and you are writing SQL queries for data query tasks.\n\nWhat is the primary conference of the school that has the lowest acc percent score in the competition?\nschema:university(School_ID|School|Location|Founded|Affiliation|Enrollment|Nickname|Primary_conference),basketball_match(Team_ID|School_ID|Team_Name|ACC_Regular_Season|ACC_Percent|ACC_Home|ACC_Road|All_Games|All_Games_Percent|All_Home|All_Road|All_Neutral) |
Answer: SELECT t1.Primary_conference FROM university AS t1 JOIN basketball_match AS t2 ON t1.school_id = t2.school_id ORDER BY t2.acc_percent LIMIT 1 |
Question: You are a professional SQL engineer and you are writing SQL queries for data query tasks.\n\nWhat are the different affiliations, and how many schools with each have an enrollment size of above 20000?\nschema:university(School_ID|School|Location|Founded|Affiliation|Enrollment|Nickname|Primary_conference) |
Answer: SELECT count(*) , affiliation FROM university WHERE enrollment > 20000 GROUP BY affiliation (predict: SELECT affiliation , count(*) FROM university WHERE enrollment > 20000 GROUP BY affiliation) |
Question: You are a professional SQL engineer and you are writing SQL queries for data query tasks.\n\nReturn the maximum enrollment across all schools.\nschema:university(School_ID|School|Location|Founded|Affiliation|Enrollment|Nickname|Primary_conference) |
Answer: SELECT max ( enrollment ) FROM university |
Question: |
Answer: SELECT MIN(成交住宅土地宗数) FROM Table_5a4edf57312b11e9b253542696d6e445 WHERE 成交住宅建筑面积万平米 > 50 AND 成交楼面地价元_平米 > 3000 |
Question: You are a professional SQL engineer and you are writing SQL queries for data query tasks.\n\n你好,容积率大于1的楼盘里平均卖价最低的是多少钱\nschema:Table_c9896f30332111e98b80542696d6e445(楼盘|楼盘均价|楼盘容积率|楼盘装修标准|价格比较) (The Chinese question reads: "Hi, among housing developments with a plot ratio greater than 1, what is the lowest average selling price?") |
Answer: SELECT MIN(楼盘均价) FROM Table_c9896f30332111e98b80542696d6e445 WHERE 楼盘容积率 > 1 |
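One lightweight way to check that generated SQL of this kind at least parses and runs is to execute it against a small in-memory database. The sketch below does this with Python's built-in `sqlite3` module for two of the statements shown above. The tables contain only a trimmed subset of the schema columns and the rows are made up for illustration; they are not taken from any benchmark.

```python
import sqlite3

# In-memory database with a trimmed, hypothetical subset of the schema columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE university (School_ID INTEGER, School TEXT, Primary_conference TEXT)")
conn.execute("CREATE TABLE basketball_match (Team_ID INTEGER, School_ID INTEGER, ACC_Percent REAL)")

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO university VALUES (?, ?, ?)",
    [(1, "School X", "Conference X"), (2, "School Y", "Conference Y")],
)
conn.executemany(
    "INSERT INTO basketball_match VALUES (?, ?, ?)",
    [(1, 1, 0.875), (2, 2, 0.625), (3, 2, 0.5)],
)

# Two of the generated statements shown above.
count_sql = "SELECT count(DISTINCT school_id) FROM basketball_match"
lowest_sql = (
    "SELECT t1.Primary_conference FROM university AS t1 "
    "JOIN basketball_match AS t2 ON t1.school_id = t2.school_id "
    "ORDER BY t2.acc_percent LIMIT 1"
)

(school_count,) = conn.execute(count_sql).fetchone()
(lowest_conference,) = conn.execute(lowest_sql).fetchone()
print(school_count, lowest_conference)
```

Running without an error shows only that the statements are valid for SQLite on this toy data; judging whether a query answers the question still requires comparing its result with a reference.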
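Other Examples
Beyond code instructions, the fine-tuned model also does fairly well on Chinese-English translation, basic question answering, text generation, and similar tasks. You can download the model to generate more samples. Below are some everyday dialogue examples, shown unfiltered: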
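Q (translated from Chinese): Please write a story about a giant panda and a little rabbit, in no more than 50 characters |
A (translated from Chinese): A giant panda and a little rabbit played together. The panda taught the rabbit to climb trees, and the rabbit taught the panda to dance. They became good friends. |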
Q (translated from Chinese): Translate Li Bai's "静夜思" into English |
A: 床前明月光, 疑是地上霜. 举头望明月, 低头思故乡. Thoughts on a Quiet Night Bright moonlight before my bed, I suspect it is frost on the ground. I lift my head and gaze at the bright moon, Then lower it, missing my hometown. |
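Q (translated from Chinese): Can you explain Newton's third law? |
A (translated from Chinese): When two objects interact, the forces they exert on each other are equal in magnitude, opposite in direction, and act along the same line. This is Newton's third law. In other words, for every action force there is an equal and opposite reaction force. |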
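Q (translated from Chinese): Is Tianjin the capital of China? |
A (translated from Chinese, apart from the leading "No," which was already in English): No, Tianjin is not the capital of China. The capital of China is Beijing. |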
Q (translated from Chinese): Write an email to a teacher you have not seen for a long time, asking how they are doing |
A: Dear [Teacher's Name], It's been a long time since we last spoke, and I hope this email finds you well. I wanted to take a moment to reach out and say hello, and to express my gratitude for all that you've done for me over the years. I remember how much I appreciated your dedication to teaching and your passion for your subject matter. Your classes were always engaging and thought-provoking, and I learned so much from you. I also appreciated your patience and understanding, especially during those times when I struggled with the material. I've been keeping up with some of the news about your work and your research, and I'm impressed by all that you've accomplished since we last spoke. I'm sure you're still making a difference in the lives of your students and in the field of [your teacher's field of expertise]. I hope you're doing well and that you're enjoying your time away from the classroom. Please let me know if there's anything I can do to help or if you'd like to catch up sometime. Best regards, [Your Name] |
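Observations: the dialogue is fluent overall, but the model sometimes answers a Chinese question in English. This is a common issue with large models and is probably related to the base model being pretrained on data in multiple languages. It can be addressed by specifying the target language in the prompt, and it could also be improved through further fine-tuning.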
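As a sketch of the prompt-level workaround just mentioned, the helper below appends an explicit language instruction to a user prompt before it is sent to the model. The function name and the wording of the instruction are assumptions made for illustration; how reliably this fixes the language mismatch has not been measured here.

```python
def with_target_language(prompt: str, language: str = "Chinese") -> str:
    """Append an explicit instruction naming the language the answer should use."""
    return f"{prompt.rstrip()}\n\nPlease answer in {language}."


# Example: the email request above, with the answer language pinned to Chinese.
pinned = with_target_language("给自己许久不见的老师写一封邮件,问候他的情况")
print(pinned)
```

Keeping the instruction in a single helper makes it easy to apply the same wording to every request and to change it later if a different phrasing works better.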