I am fine-tuning the nlp_mt5_zero-shot-augment_chinese-base model, but the pytorch_model.bin file is never written and the run fails with:
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "F:\study\graduationProject\vue2_mt5\vue_flask\finetune.py", line 61, in <module>
trainer.train()
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\trainer.py", line 711, in train
self.train_loop(self.train_dataloader)
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\trainer.py", line 1243, in train_loop
self.invoke_hook(TrainerStages.after_train_epoch)
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\trainer.py", line 1395, in invoke_hook
getattr(hook, fn_name)(self)
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_hook.py", line 177, in after_train_epoch
self._do_save(trainer, CheckpointStrategy.by_epoch)
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_hook.py", line 160, in _do_save
self._save_checkpoint(trainer, prefix)
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_hook.py", line 224, in _save_checkpoint
self.processor.save_checkpoints(trainer, checkpoint_path_prefix,
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_processor.py", line 126, in save_checkpoints
self.save_trainer_state(trainer, model, _train_state_file, meta,
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_processor.py", line 192, in save_trainer_state
save_checkpoint(
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\utils\checkpoint.py", line 114, in save_checkpoint
torch.save(checkpoint, f)
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\torch\serialization.py", line 620, in save
return
File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\torch\serialization.py", line 482, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 1237267904 vs 1237267856
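Reduce the batch size: try a smaller training batch size to lower memory consumption. The snippet below uses the Hugging Face transformers TrainingArguments API; the traceback above comes from ModelScope's own trainer, which takes its batch size from the model configuration instead.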
training_args = TrainingArguments(
...,
per_device_train_batch_size=8, # adjust the batch size
...
)
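Since the traceback goes through modelscope.trainers rather than transformers, the same idea can be sketched for a ModelScope-style config. This is only an illustration: the field name train.dataloader.batch_size_per_gpu is an assumption based on ModelScope's published fine-tuning examples and should be checked against the model's configuration.json, and make_cfg_modify_fn and demo_cfg are hypothetical names made up for this sketch.

```python
from types import SimpleNamespace


def make_cfg_modify_fn(batch_size):
    """Return a callback that overrides the per-GPU training batch size."""

    def cfg_modify_fn(cfg):
        # Assumed field name; verify it against the model's configuration.json.
        cfg.train.dataloader.batch_size_per_gpu = batch_size
        return cfg

    return cfg_modify_fn


# Stand-in config object so the sketch runs without ModelScope installed.
demo_cfg = SimpleNamespace(
    train=SimpleNamespace(dataloader=SimpleNamespace(batch_size_per_gpu=32))
)

demo_cfg = make_cfg_modify_fn(8)(demo_cfg)
print(demo_cfg.train.dataloader.batch_size_per_gpu)
```

In a real run the callback would presumably be handed to the trainer as its cfg_modify_fn argument, as ModelScope's examples do, instead of being applied to a stand-in object.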
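File system or path problems. Check disk space: make sure the drive that holds the output directory has enough free space. The "unexpected pos" error raised inside torch.save usually means a write to the checkpoint file came up short, and a full disk is a common cause.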
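df -h # show disk usage (Linux/macOS; the paths in the traceback are Windows, where PowerShell's Get-PSDrive reports free space)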
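The same check can be done from Python on any platform, for example right before training starts. The following is a minimal sketch using the standard library's shutil.disk_usage; has_free_space is a hypothetical helper, and the 1237267904 figure is simply the file position reported in the error message, used here as a rough lower bound on the checkpoint size rather than a measured value.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the drive containing `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes


# Rough lower bound taken from the position in the error message (about 1.2 GB).
# Leave headroom: the trainer may write more than one file per checkpoint.
approx_checkpoint_bytes = 1237267904
output_dir = "."  # replace with the directory the trainer writes checkpoints to

if has_free_space(output_dir, 2 * approx_checkpoint_bytes):
    print("Enough free space for a checkpoint.")
else:
    print("Not enough free space; free up disk or change the output directory.")
```

If the check fails, pointing the trainer's output directory at a drive with more room is usually simpler than shrinking the checkpoint.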