1. Loading a single file. Sample code that loads the file through a relative path:
from modelscope.msdatasets import MsDataset
# The default delimiter is a comma ','
ds = MsDataset.load('path/to/my_file.csv')
print(next(iter(ds)))
# Custom delimiter
# input_kwargs = {'delimiter': '\t'}
# ds = MsDataset.load('/path/to/my_file.csv', **input_kwargs)
# print(next(iter(ds)))
# Pass the file as a list via the data_files parameter
# my_csv = '/path/to/my_file.csv'
# ds = MsDataset.load('csv', data_files=[my_csv])
# print(next(iter(ds)))
In the Alibaba Cloud lab environment the code runs successfully:
2023-12-01 09:27:50,975 - modelscope - INFO - PyTorch version 2.0.1+cpu Found.
2023-12-01 09:27:50,978 - modelscope - INFO - TensorFlow version 2.13.0 Found.
2023-12-01 09:27:50,978 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer
2023-12-01 09:27:50,979 - modelscope - INFO - No valid ast index found from /mnt/workspace/.cache/modelscope/ast_indexer, generating ast index from prebuilt!
2023-12-01 09:27:51,034 - modelscope - INFO - Loading done! Current index file version is 1.9.5, with md5 79827826d04c54fc06982662c5095533 and a total number of 945 components indexed
/opt/conda/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
/opt/conda/lib/python3.8/site-packages/datasets/load.py:1748: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification_mode' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=no_checks' instead.
warnings.warn(
Downloading and preparing dataset csv/default to /root/.cache/modelscope/hub/datasets/csv/default-da73667c81cf1c01/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 8542.37it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1621.93it/s]
Dataset csv downloaded and prepared to /root/.cache/modelscope/hub/datasets/csv/default-da73667c81cf1c01/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 406.23it/s]
{'name\tage': 'Tom\t23'}
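Note that the final output line has a single key, `name\tage`. The sample file appears to be tab-separated, so the default comma delimiter reads each row as one column. The sketch below reproduces the effect with Python's standard `csv` module rather than with ModelScope itself; the two-line sample text is reconstructed from the output above, and `first_row` is a hypothetical helper.

```python
import csv
import io

# Sample content reconstructed from the output above:
# a header line and one data row, separated by tabs.
SAMPLE = "name\tage\nTom\t23\n"


def first_row(text, delimiter):
    """Return the first data row of delimited text as a dict."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return next(reader)


# Parsed with a comma, the whole line ends up in one column.
comma_row = first_row(SAMPLE, ",")
# Parsed with a tab, the two columns are separated as intended.
tab_row = first_row(SAMPLE, "\t")

print(comma_row)
print(tab_row)
```

Passing the tab delimiter, as in the commented-out `input_kwargs` variant of the sample code, should give the two-column result.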
Running the same code locally on Windows raises an error:
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
Up to the point where the file is downloaded to the local cache directory, the log matches the cloud run:
2023-12-01 09:29:41,747 - modelscope - INFO - PyTorch version 2.1.1 Found.
2023-12-01 09:29:41,757 - modelscope - INFO - TensorFlow version 2.13.0 Found.
2023-12-01 09:29:41,758 - modelscope - INFO - Loading ast index from C:\Users\ccqtgb\.cache\modelscope\ast_indexer
2023-12-01 09:29:42,168 - modelscope - INFO - Loading done! Current index file version is 1.9.5, with md5 d5475f1c512234788c88217bd5eeb500 and a total number of 945 components indexed
C:\Users\ccqtgb\.conda\envs\modelscope\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
C:\Users\ccqtgb\.conda\envs\modelscope\lib\site-packages\datasets\load.py:1748: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification_mode' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=no_checks' instead.
warnings.warn(
Downloading and preparing dataset csv/default to file://C:/Users/ccqtgb/.cache/modelscope/hub/datasets/csv/default-82dc50d428d587db/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...
Downloading data files: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1003.90it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 501.23it/s]
Dataset csv downloaded and prepared to file://C:/Users/ccqtgb/.cache/modelscope/hub/datasets/csv/default-82dc50d428d587db/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.
The error is raised immediately afterwards:
NotImplementedError Traceback (most recent call last)
Cell In[1], line 4
1 from modelscope.msdatasets import MsDataset
3 # The default delimiter is a comma ','
----> 4 ds = MsDataset.load('C:/Users/ccqtgb/Downloads/path/to/my_file.csv')
5 print(next(iter(ds)))
7 # Custom delimiter
File ~\.conda\envs\modelscope\lib\site-packages\modelscope\msdatasets\ms_dataset.py:256, in MsDataset.load(dataset_name, namespace, target, version, hub, subset_name, split, data_dir, data_files, download_mode, cache_dir, use_streaming, stream_batch_size, custom_cfg, token, **config_kwargs)
253 # Load from local disk
254 if dataset_name in _PACKAGED_DATASETS_MODULES or os.path.isdir(
255 dataset_name) or os.path.isfile(dataset_name):
--> 256 dataset_inst = LocalDataLoaderManager(
257 dataset_context_config).load_dataset(
258 LocalDataLoaderType.HF_DATA_LOADER)
259 dataset_inst = MsDataset.to_ms_dataset(dataset_inst, target=target)
260 if isinstance(dataset_inst, MsDataset):
File ~\.conda\envs\modelscope\lib\site-packages\modelscope\msdatasets\data_loader\data_loader_manager.py:74, in LocalDataLoaderManager.load_dataset(self, data_loader_type)
70 # Select local data loader
71 # TODO: more loaders to be supported.
72 if data_loader_type == LocalDataLoaderType.HF_DATA_LOADER:
73 # Build huggingface data loader and return dataset.
---> 74 return hf_data_loader(
75 dataset_name,
76 name=subset_name,
77 revision=version,
78 split=split,
79 data_dir=data_dir,
80 data_files=data_files,
81 cache_dir=cache_root_dir,
82 download_mode=download_mode.value,
83 streaming=use_streaming,
84 ignore_verifications=True,
85 **input_config_kwargs)
86 raise f'Expected local data loader type: {LocalDataLoaderType.HF_DATA_LOADER.value}.'
File ~\.conda\envs\modelscope\lib\site-packages\datasets\load.py:1810, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
1806 # Build dataset for splits
1807 keep_in_memory = (
1808 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
1809 )
-> 1810 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
1811 # Rename and cast features to match task schema
1812 if task is not None:
File ~\.conda\envs\modelscope\lib\site-packages\datasets\builder.py:1128, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
1126 is_local = not is_remote_filesystem(self._fs)
1127 if not is_local:
-> 1128 raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
1129 if not os.path.exists(self._output_dir):
1130 raise FileNotFoundError(
1131 f"Dataset {self.name}: could not find data in {self._output_dir}. Please make sure to call "
1132 "builder.download_and_prepare(), or use "
1133 "datasets.load_dataset() before trying to access the Dataset object."
1134 )
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
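On Windows, ModelScope does not support loading datasets from the local file system. You can try uploading the dataset to the ModelScope dataset repository and then loading it by its ModelScope dataset name.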
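The error message and stack trace show that calling MsDataset.load() on a local file under Windows raises NotImplementedError. The exception occurs when the library tries to load the dataset from the cached LocalFileSystem.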
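The current implementation does not support loading local files on Windows. This is a known issue, possibly caused by platform-specific behavior. To work around it, you can try the following alternatives: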
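1. Upload the dataset to the ModelScope dataset repository and load it by name ('<dataset_name>' and '<namespace>' are placeholders for your own values):
ds = MsDataset.load('<dataset_name>', namespace='<namespace>')
print(next(iter(ds)))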
2. On Windows, load the local file directly with pandas or another library and convert it into whatever format your data pipeline needs. For example:
```python
import pandas as pd

# Load the local file with pandas
df = pd.read_csv('C:/Users/ccqtgb/Downloads/path/to/my_file.csv')

# Convert the DataFrame into the data structure you need
# ...

# Continue with the rest of the processing
```
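For the "another library" route, here is a minimal sketch that uses only Python's standard `csv` module and returns each row as a dict, similar in shape to what `next(iter(ds))` printed in the cloud run. It assumes the file is tab-separated (as that output suggests); `read_rows` is a hypothetical helper, and the temporary file only stands in for `my_file.csv`.

```python
import csv
import os
import tempfile


def read_rows(path, delimiter=","):
    """Read a local delimited text file and return its rows as a list of dicts."""
    # newline="" is the mode the csv module documentation recommends for file objects.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=delimiter))


# Demo with a throwaway tab-separated file standing in for my_file.csv.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "my_file.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        f.write("name\tage\nTom\t23\n")
    rows = read_rows(demo_path, delimiter="\t")

first = rows[0]
print(first)
```

Reading everything into a list closes the file before the temporary directory is removed, which matters on Windows, where open files cannot be deleted.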
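According to the error message, the failure occurs because loading a cached dataset from a LocalFileSystem is not supported in the local Windows environment. This may be because some libraries or features behave differently across operating systems. To solve the problem, you can try the following approaches: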
I found that on Windows the is_local check in Lib\site-packages\datasets\builder.py evaluates incorrectly. The relevant code is:
if self._file_format is not None and self._file_format != "arrow":
    raise FileFormatError('Loading a dataset not written in the "arrow" format is not supported.')
# is_local evaluates to False here
is_local = not is_remote_filesystem(self._fs)
# so this exception is raised
if not is_local:
    raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
if not os.path.exists(self._output_dir):
    raise FileNotFoundError(
        f"Dataset {self.name}: could not find data in {self._output_dir}. Please make sure to call "
        "builder.download_and_prepare(), or use "
        "datasets.load_dataset() before trying to access the Dataset object."
    )
Based on what debugging on Windows showed, editing Lib\site-packages\datasets\filesystems\__init__.py works around the problem for now:
# On Windows, fs.protocol can come back as a list-like object instead of a string (2023/12/01)
# if fs is not None and fs.protocol != "file":
if fs is not None and fs.protocol != "file" and fs.protocol[0] != "file":
    return True
else:
    return False
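To see why the unpatched comparison misfires, and as a possible alternative to indexing `fs.protocol[0]`, the sketch below works on plain protocol values instead of real filesystem objects. The function names are hypothetical and not part of `datasets` or `fsspec`, and the sample values such as `("file", "local")` and `("s3", "s3a")` are assumed examples of what a filesystem might report.

```python
def protocol_names(protocol):
    """Normalize a protocol given as a string or as a sequence of aliases into a tuple."""
    if isinstance(protocol, str):
        return (protocol,)
    return tuple(protocol)


def original_check(protocol):
    """Mirror the unpatched comparison: anything other than the string "file" counts as remote."""
    return protocol != "file"


def normalized_check(protocol):
    """Count the protocol as remote only if none of its names is "file"."""
    return "file" not in protocol_names(protocol)


# Assumed example values; the second and third are the list-like local case.
for proto in ("file", ("file", "local"), ["file"], "s3", ("s3", "s3a")):
    print(repr(proto), original_check(proto), normalized_check(proto))
```

A tuple or list never compares equal to the string "file", so the original comparison reports a local filesystem as remote. Normalizing first also avoids relying on "file" being the first alias in the sequence.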
After restarting Jupyter, the code ran successfully.