
ModelScope on Windows: the example in 数据集使用指南.ipynb (dataset usage guide) raises an error


I modified the example code from section "1. Loading a single file" to load the file through a relative path:

from modelscope.msdatasets import MsDataset

# The default delimiter is a comma ','
ds = MsDataset.load('path/to/my_file.csv')
print(next(iter(ds)))

# Custom delimiter
# input_kwargs = {'delimiter': '\t'}
# ds = MsDataset.load('/path/to/my_file.csv', **input_kwargs)
# print(next(iter(ds)))

# Pass the file(s) as a list through the data_files parameter
# my_csv = '/path/to/my_file.csv'
# ds = MsDataset.load('csv', data_files=[my_csv])
# print(next(iter(ds)))

In the Alibaba Cloud platform lab environment it runs successfully:



2023-12-01 09:27:50,975 - modelscope - INFO - PyTorch version 2.0.1+cpu Found.
2023-12-01 09:27:50,978 - modelscope - INFO - TensorFlow version 2.13.0 Found.
2023-12-01 09:27:50,978 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer
2023-12-01 09:27:50,979 - modelscope - INFO - No valid ast index found from /mnt/workspace/.cache/modelscope/ast_indexer, generating ast index from prebuilt!
2023-12-01 09:27:51,034 - modelscope - INFO - Loading done! Current index file version is 1.9.5, with md5 79827826d04c54fc06982662c5095533 and a total number of 945 components indexed
/opt/conda/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
/opt/conda/lib/python3.8/site-packages/datasets/load.py:1748: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification_mode' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=no_checks' instead.
  warnings.warn(
Downloading and preparing dataset csv/default to /root/.cache/modelscope/hub/datasets/csv/default-da73667c81cf1c01/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 8542.37it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1621.93it/s]

Dataset csv downloaded and prepared to /root/.cache/modelscope/hub/datasets/csv/default-da73667c81cf1c01/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 406.23it/s]
{'name\tage': 'Tom\t23'}
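
Note that the row printed above has a single key, `name\tage`, which suggests that my_file.csv is tab-separated and was parsed with the default comma delimiter. A minimal sketch of that effect, using Python's standard `csv` module rather than MsDataset, and assuming the file contents shown in the `sample` string (inferred from the output, not given in the post):

```python
import csv
import io

# Assumed stand-in for the contents of my_file.csv (tab-separated).
sample = "name\tage\nTom\t23\n"

# With the default delimiter ',' no commas are found,
# so the header and the data row each become a single field.
row_comma = next(csv.DictReader(io.StringIO(sample)))

# With an explicit tab delimiter the line splits into two fields.
row_tab = next(csv.DictReader(io.StringIO(sample), delimiter="\t"))

print(row_comma)  # {'name\tage': 'Tom\t23'}
print(row_tab)    # {'name': 'Tom', 'age': '23'}
```

If the file really is tab-separated, the commented-out "custom delimiter" variant in the example code is probably the one to use.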

Running locally on Windows raises an error:

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

In both environments the file is downloaded to the local temporary directory in the same way:


2023-12-01 09:29:41,747 - modelscope - INFO - PyTorch version 2.1.1 Found.
2023-12-01 09:29:41,757 - modelscope - INFO - TensorFlow version 2.13.0 Found.
2023-12-01 09:29:41,758 - modelscope - INFO - Loading ast index from C:\Users\ccqtgb\.cache\modelscope\ast_indexer
2023-12-01 09:29:42,168 - modelscope - INFO - Loading done! Current index file version is 1.9.5, with md5 d5475f1c512234788c88217bd5eeb500 and a total number of 945 components indexed
C:\Users\ccqtgb\.conda\envs\modelscope\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
C:\Users\ccqtgb\.conda\envs\modelscope\lib\site-packages\datasets\load.py:1748: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification_mode' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=no_checks' instead.
  warnings.warn(

Downloading and preparing dataset csv/default to file://C:/Users/ccqtgb/.cache/modelscope/hub/datasets/csv/default-82dc50d428d587db/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...
Downloading data files: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1003.90it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 501.23it/s]

Dataset csv downloaded and prepared to file://C:/Users/ccqtgb/.cache/modelscope/hub/datasets/csv/default-82dc50d428d587db/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.

The error follows immediately:

NotImplementedError                       Traceback (most recent call last)
Cell In[1], line 4
      1 from modelscope.msdatasets import MsDataset
      3 # The default delimiter is a comma ','
----> 4 ds = MsDataset.load('C:/Users/ccqtgb/Downloads/path/to/my_file.csv')
      5 print(next(iter(ds)))
      7 # Custom delimiter

File ~\.conda\envs\modelscope\lib\site-packages\modelscope\msdatasets\ms_dataset.py:256, in MsDataset.load(dataset_name, namespace, target, version, hub, subset_name, split, data_dir, data_files, download_mode, cache_dir, use_streaming, stream_batch_size, custom_cfg, token, **config_kwargs)
    253 # Load from local disk
    254 if dataset_name in _PACKAGED_DATASETS_MODULES or os.path.isdir(
    255         dataset_name) or os.path.isfile(dataset_name):
--> 256     dataset_inst = LocalDataLoaderManager(
    257         dataset_context_config).load_dataset(
    258             LocalDataLoaderType.HF_DATA_LOADER)
    259     dataset_inst = MsDataset.to_ms_dataset(dataset_inst, target=target)
    260     if isinstance(dataset_inst, MsDataset):

File ~\.conda\envs\modelscope\lib\site-packages\modelscope\msdatasets\data_loader\data_loader_manager.py:74, in LocalDataLoaderManager.load_dataset(self, data_loader_type)
     70 # Select local data loader
     71 # TODO: more loaders to be supported.
     72 if data_loader_type == LocalDataLoaderType.HF_DATA_LOADER:
     73     # Build huggingface data loader and return dataset.
---> 74     return hf_data_loader(
     75         dataset_name,
     76         name=subset_name,
     77         revision=version,
     78         split=split,
     79         data_dir=data_dir,
     80         data_files=data_files,
     81         cache_dir=cache_root_dir,
     82         download_mode=download_mode.value,
     83         streaming=use_streaming,
     84         ignore_verifications=True,
     85         **input_config_kwargs)
     86 raise f'Expected local data loader type: {LocalDataLoaderType.HF_DATA_LOADER.value}.'

File ~\.conda\envs\modelscope\lib\site-packages\datasets\load.py:1810, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1806 # Build dataset for splits
   1807 keep_in_memory = (
   1808     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1809 )
-> 1810 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
   1811 # Rename and cast features to match task schema
   1812 if task is not None:

File ~\.conda\envs\modelscope\lib\site-packages\datasets\builder.py:1128, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1126 is_local = not is_remote_filesystem(self._fs)
   1127 if not is_local:
-> 1128     raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
   1129 if not os.path.exists(self._output_dir):
   1130     raise FileNotFoundError(
   1131         f"Dataset {self.name}: could not find data in {self._output_dir}. Please make sure to call "
   1132         "builder.download_and_prepare(), or use "
   1133         "datasets.load_dataset() before trying to access the Dataset object."
   1134     )

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

toly_sun 2023-12-01 10:42:14
4 answers
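  • On Windows, ModelScope does not support loading datasets from the local file system. You can try uploading the dataset to the ModelScope dataset repository and then loading it by the dataset name that ModelScope provides.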

    2023-12-02 16:16:08
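  • Judging from the error message and the stack trace, a NotImplementedError is raised when MsDataset.load() is used to load a local file on Windows. The exception occurs while the dataset is being loaded from its cache on a LocalFileSystem.

    The current implementation does not support loading local files on Windows. This is a known issue, probably caused by platform-specific behavior. To work around it, you can try the following alternatives:

    1. Upload the dataset file to cloud storage (such as S3 or GCS) or to a network location, then load the dataset from the remote URL. For example:
    ```python
    from modelscope.msdatasets import MsDataset

    # Load the dataset from a remote URL
    ds = MsDataset.load(...)  # remote URL not given
    print(next(iter(ds)))
    ```

    2. On Windows, load the local file directly with Pandas or another library and convert it into whatever format suits your data-processing pipeline. For example:
    ```python
    import pandas as pd

    # Load the local file with Pandas
    df = pd.read_csv('C:/Users/ccqtgb/Downloads/path/to/my_file.csv')

    # Convert the DataFrame into the data structure you need
    # ...

    # Continue with downstream processing
    ```
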
    2023-12-01 21:15:06
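  • According to the error message, the failure occurs because loading a cached dataset from a LocalFileSystem is not supported in the local Windows environment. This may be because some libraries or features behave differently across operating systems. To resolve it, you can try the following:

    1. Run the example code in a Docker container. This keeps the environment consistent and avoids problems caused by operating-system differences.
    2. If you must run locally on Windows, consider downloading the dataset to a supported file system (for example, a network file system) and then loading the dataset again.
    3. Check the official documentation and community forums of the relevant libraries (such as datasets and modelscope) for a fix or a known workaround.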
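
When searching the documentation or forums for a known workaround, it helps to know exactly which versions are installed in the failing environment. A small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and `fsspec` is included on the assumption that the `LocalFileSystem` named in the error comes from that package:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages that appear in the traceback, plus fsspec (an assumption).
    report = installed_versions(["modelscope", "datasets", "fsspec"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the working and the failing environment shows whether the two differ in any of these packages.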
    2023-12-01 14:28:00
  • I found that on Windows, Lib\site-packages\datasets\builder.py computes is_local incorrectly. The relevant code is:

            if self._file_format is not None and self._file_format != "arrow":
                raise FileFormatError('Loading a dataset not written in the "arrow" format is not supported.')
    
            # is_local evaluates to False
            is_local = not is_remote_filesystem(self._fs)
            # so the exception is raised
            if not is_local:
                raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
            if not os.path.exists(self._output_dir):
                raise FileNotFoundError(
                    f"Dataset {self.name}: could not find data in {self._output_dir}. Please make sure to call "
                    "builder.download_and_prepare(), or use "
                    "datasets.load_dataset() before trying to access the Dataset object."
                )
    

    Based on debugging on Windows, I worked around it for now by modifying Lib\site-packages\datasets\filesystems\__init__.py:

        # On Windows, fs.protocol can be returned as a list 2023/12/01
        # if fs is not None and fs.protocol != "file":
        if fs is not None and fs.protocol != "file" and fs.protocol[0] != "file":
            return True
        else:
            return False
    

    After reopening Jupyter, the code ran successfully.
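
The patch above handles two shapes of `fs.protocol`: the string "file", or a list whose first element is "file". A slightly more general sketch is given here, assuming only that `protocol` is either a single string or a sequence of strings. The names `is_local_protocol` and `FakeFS` are hypothetical, and this is not the code shipped in `datasets`:

```python
def is_local_protocol(protocol):
    """Return True if `protocol` names the local filesystem.

    Accepts a single string ("file") or a sequence of aliases
    such as ["file", "local"].
    """
    if isinstance(protocol, str):
        protocols = (protocol,)
    else:
        protocols = tuple(protocol)
    return "file" in protocols


def is_remote_filesystem(fs):
    """Illustrative check: remote unless the protocol is local."""
    return fs is not None and not is_local_protocol(fs.protocol)


class FakeFS:
    """Minimal stand-in for a filesystem object with a `protocol` attribute."""

    def __init__(self, protocol):
        self.protocol = protocol


print(is_remote_filesystem(FakeFS("file")))             # False
print(is_remote_filesystem(FakeFS(("file", "local"))))  # False
print(is_remote_filesystem(FakeFS("s3")))               # True
print(is_remote_filesystem(None))                       # False
```

Unlike comparing only `fs.protocol[0]`, the membership test does not depend on where "file" appears in the sequence.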

    2023-12-01 12:56:52