I have a text and a list of concepts, as follows.

concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"
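I want to check whether any of the concepts in the list appear in text, and replace every occurrence of the items in concepts[1:] with concepts[0]. The result for the text above should therefore be: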

"levels and data mining of dna data mining methylation"
My code is as follows:

concepts = ["data mining", "data", "data source"]
text = "levels and data mining of dna data source methylation"

if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text = text.replace(terms, concepts[0])
        text = ' '.join(text.split())
print(text)

However, the output I get is:

levels and data mining mining of dna data mining source methylation
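It looks like the concept data is being replaced with data mining, which is not what I want. More specifically, I want the longest option to be considered first when replacing.

It does not work even if I change the order of concepts.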

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

if any(word in text for word in concepts):
    for terms in concepts[1:]:
        if terms in text:
            text = text.replace(terms, concepts[0])
        text = ' '.join(text.split())
print(text)

The code above gives me the following output.

levels and data mining mining of dna data mining mining methylation

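The problem here is your iteration strategy, which replaces one term at a time. Because your replacement term contains one of the terms you are replacing, a later iteration ends up making substitutions inside text that an earlier iteration already changed into the replacement term.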
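To make the cascade concrete, here is a minimal sketch that records the text after each iteration of the loop from the question's second attempt. It uses only the question's own data and str.replace; the names steps and snapshot are introduced here for illustration.

```python
concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Record the text after each sequential replacement to expose the cascade.
steps = []
for term in concepts[1:]:
    text = text.replace(term, concepts[0])
    steps.append((term, text))

for term, snapshot in steps:
    print(f"after replacing {term!r}: {snapshot}")
```

The first step gives the desired text, and the second step then rewrites the data inside every data mining, including the ones the first step had just produced.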

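One way to solve this is to perform all of the replacements atomically, so that they all happen at once and the output of one replacement cannot affect the others. There are a couple of strategies: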

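1. You can break the string into tokens that match your various terms and replace them after the fact (making sure that none of them overlap).
2. You can use a function that performs an atomic replacement across multiple options.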
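Strategy 1 could be sketched roughly as follows. This is only one possible reading of it, and it assumes the tokenizing is done with re.split and a capturing group; the names pattern, tokens, and result_tokens are hypothetical.

```python
import re

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Longest-first alternation so that "data source" wins over "data".
terms = sorted(concepts, key=len, reverse=True)
pattern = re.compile("(" + "|".join(re.escape(t) for t in terms) + ")")

# With one capturing group, re.split keeps the matched terms in the result:
# even indices hold plain text, odd indices hold matched terms.
tokens = pattern.split(text)

# Replace the term tokens after the fact; plain-text tokens stay untouched.
replaced = [concepts[0] if i % 2 == 1 else tok for i, tok in enumerate(tokens)]
result_tokens = "".join(replaced)
print(result_tokens)
```

Because the tokens come from a single left-to-right scan, they cannot overlap, and a replaced token is never scanned again.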
An example of #2 is the sub() function in Python's re library. Here is an example of its use:

import re

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
targets = sorted(concepts[1:], key=len, reverse=True)

# Use re.escape to generate versions of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)

result = re.sub(target_re, concepts[0], text)
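Note that with the original set of replacement targets this still produces data mining mining, because the pattern has no notion of an existing mining following data. To avoid this, you can simply include the replacement term itself among the targets, so that it is matched ahead of the shorter term: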

import re

concepts = ["data mining", "data source", "data"]
text = "levels and data mining of dna data source methylation"

# Sort targets by descending length, so longer targets that
# might contain shorter ones are found first
# !!! No [1:] !!!
targets = sorted(concepts, key=len, reverse=True)

# Use re.escape to generate versions of the targets with special characters escaped
target_re = "|".join(re.escape(item) for item in targets)

result = re.sub(target_re, concepts[0], text)
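For convenience, the same approach can be wrapped in a small helper and checked against both orderings from the question. This is a sketch; the function name replace_concepts is made up here and is not part of the original answer.

```python
import re

def replace_concepts(text, concepts):
    """Replace every concept found in text with concepts[0] in one pass."""
    # Longest first, and including concepts[0] itself, so that existing
    # occurrences of the replacement are matched whole and left unchanged.
    targets = sorted(concepts, key=len, reverse=True)
    pattern = "|".join(re.escape(item) for item in targets)
    return re.sub(pattern, concepts[0], text)

text = "levels and data mining of dna data source methylation"
print(replace_concepts(text, ["data mining", "data source", "data"]))
print(replace_concepts(text, ["data mining", "data", "data source"]))
```

Both calls print the text the question asked for, since the targets are sorted by length inside the helper and the order of the list after its first element no longer matters.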
