Regular Expressions
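A regular expression is a powerful text-processing tool for matching, searching, replacing, or extracting specific patterns in strings. It is made up of ordinary characters and special characters (called "metacharacters"); the metacharacters carry special meanings and are used to define the matching rules.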
The practice file sample.txt has 22 lines, the last of which is empty. Its contents are shown below; cat -v is used so that the carriage returns at the end of some lines are visible as ^M:
[root@RHEL7-1 ~]# pwd
/root
[root@RHEL7-1 ~]# cat -v /root/sample.txt
"Open Source" is a good mechanism to develop programs.
apple is my favorite food.
Football game is not use feet only.
this dress doesn't fit me.
However, this dress is about $ 3183 dollars.^M
GNU is free air not free beer.^M
Her hair is very beauty.^M
I can't finish the test.^M
Oh! The soup taste good.^M
motorcycle is cheap than car.
This window is clear.
the symbol '*' is represented as start.
Oh! My god!
The gd software is a library for drafting programs.^M
You are the best is mean you are the no. 1.
The world <Happy> is the same with "glad".
I like dog.
google is the best tools for search keyword.
goooooogle yes!
go! go! Let's go.
# I am Bobby

(1) Finding a specific string.
Suppose we want to pick out the lines of sample.txt that contain the string "the". The simplest way is:
[root@RHEL7-1 ~]# grep -n 'the' /root/sample.txt
8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.
What about the inverse selection, that is, displaying a line only when it does not contain "the"?
[root@RHEL7-1 ~]# grep -vn 'the' /root/sample.txt
To match "the" regardless of case, run:
[root@RHEL7-1 ~]# grep -in 'the' /root/sample.txt
8:I can't finish the test.
9:Oh! The soup taste good.
12:the symbol '*' is represented as start.
14:The gd software is a library for drafting programs.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.
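The relationship between the plain, inverted, and case-insensitive searches can be checked by counting lines with grep -c. The following is a minimal sketch that uses a hypothetical three-line file (not sample.txt) created in a temporary location; the variable names are illustrative only.

```shell
# Hypothetical three-line sample written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'I like the test.' 'The soup is good.' 'No match in this line.' > "$tmp"

match=$(grep -c 'the' "$tmp")      # lines containing lowercase "the"
nomatch=$(grep -vc 'the' "$tmp")   # lines that do NOT contain lowercase "the"
anycase=$(grep -ic 'the' "$tmp")   # lines containing "the" in any letter case
total=$(wc -l < "$tmp")            # total number of lines

echo "match=$match nomatch=$nomatch anycase=$anycase total=$total"
rm -f "$tmp"
```

Every line either matches or does not, so the counts reported for 'the' with and without -v should always add up to the total number of lines.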
(2) Using square brackets [] to search for a set of characters.
Comparing the words "test" and "taste" shows that they share the form "t?st". Both can be found with a single search:
[root@RHEL7-1 ~]# grep -n 't[ae]st' /root/sample.txt
8:I can't finish the test.
9:Oh! The soup taste good.
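However many characters are listed inside [], the brackets stand for exactly one character, so the example above matches the string tast or test. To find the lines that contain "oo", use: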
[root@RHEL7-1 ~]# grep -n 'oo' /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
Suppose an "oo" that is preceded by "g" should not count as a match. This can be done with the negated set [^]:
[root@RHEL7-1 ~]# grep -n '[^g]oo' /root/sample.txt
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!
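Lines 18 and 19 are still listed: in line 18 the "oo" of "tools" is preceded by "t", and in line 19 the run of o's contains an "oo" that is preceded by another "o". Now suppose the character before "oo" must not be a lowercase letter. This could be written as [^abcd....z]oo, but that is inconvenient. Because the lowercase letters are encoded consecutively in ASCII, the expression can be shortened: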
[root@RHEL7-1 ~]# grep -n '[^a-z]oo' sample.txt
3:Football game is not use feet only.
What if the character may be either a digit or a letter? Write the ranges together: [a-zA-Z0-9]. For example, to get the lines that contain a digit:
[root@RHEL7-1 ~]# grep -n '[0-9]' /root/sample.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
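Because the locale can affect the encoding order, ranges written with the hyphen "-" are not the only option. The two previous results can also be obtained with the following character classes: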
[root@RHEL7-1 ~]# grep -n '[^[:lower:]]oo' /root/sample.txt
# [:lower:] stands for the lowercase letters a-z
[root@RHEL7-1 ~]# grep -n '[[:digit:]]' /root/sample.txt
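To see that the range form and the character-class form select the same lines, the two can be run side by side. This is a small sketch on a hypothetical three-line file; setting LC_ALL=C for the range version is an assumption made here to pin the range to plain ASCII order, and it is not part of the original commands.

```shell
# Hypothetical sample lines written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'Football game' 'good food' 'room 101' > "$tmp"

# Range form, with the locale pinned to C so that a-z means ASCII a..z.
by_range=$(LC_ALL=C grep -n '[^a-z]oo' "$tmp")
# Character-class form, which does not depend on the encoding order.
by_class=$(grep -n '[^[:lower:]]oo' "$tmp")
# Digit class, equivalent to [0-9].
digits=$(grep -n '[[:digit:]]' "$tmp")

echo "$by_range"
echo "$by_class"
echo "$digits"
rm -f "$tmp"
```

Only "Football game" has an "oo" preceded by a character that is not a lowercase letter, and only "room 101" contains digits.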
(3) Line-start and line-end anchors ^ and $.
Earlier we found the lines that contain "the" anywhere. What if a line should be listed only when "the" is at its beginning?
[root@RHEL7-1 ~]# grep -n '^the' /root/sample.txt
12:the symbol '*' is represented as start.
To list the lines that begin with a lowercase letter, write:
[root@RHEL7-1 ~]# grep -n '^[a-z]' /root/sample.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.
To list the lines that do not begin with a letter, write:
[root@RHEL7-1 ~]# grep -n '^[^a-zA-Z]' /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
21:# I am Bobby
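Special note: the "^" symbol means different things inside and outside the character-set brackets []. Inside [] it means negation; outside [] it anchors the match at the beginning of the line. Conversely, how do we find the lines that end with a period (.)?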
[root@RHEL7-1 ~]# grep -n '\.$' /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
11:This window is clear.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
20:go! go! Let's go.
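Special note: because the period has another meaning (introduced below), the escape character (\) must be used to remove its special meaning. You may wonder why lines 5 to 9 are not printed even though they also end with ".". The reason is the way Windows software terminates lines. Displaying these lines with cat -A shows the difference (the -A option of cat displays non-printing characters and marks the end of each line with "$"):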
[root@RHEL7-1 ~]# cat -An /root/sample.txt | head -n 10 | tail -n 6
     5  However, this dress is about $ 3183 dollars.^M$
     6  GNU is free air not free beer.^M$
     7  Her hair is very beauty.^M$
     8  I can't finish the test.^M$
     9  Oh! The soup taste good.^M$
    10  motorcycle is cheap than car.$
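Lines 5 to 9 end with the Windows line ending (^M$), whereas a normal Linux line ends as line 10 does ($). In lines 5 to 9 the last character before the end of the line is the carriage return (^M), not the period, so these lines are not found. This illustrates the meaning of "^" and "$".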
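One way to make such lines match is to remove the carriage returns before searching. The sketch below assumes the standard tr utility and a hypothetical two-line input in which only the first line has a Windows line ending; it is an illustration, not a step from the original walkthrough.

```shell
# Hypothetical input: the first line ends with CR+LF, the second with LF only.
tmp=$(mktemp)
printf 'dos line.\r\nunix line.\n' > "$tmp"

# Searching the raw file: the CR sits between "." and the end of the line.
raw=$(grep -c '\.$' "$tmp")
# Deleting every CR first lets both lines end with "." again.
stripped=$(tr -d '\r' < "$tmp" | grep -c '\.$')

echo "raw=$raw stripped=$stripped"
rm -f "$tmp"
```

The raw search counts one line and the stripped search counts two, which mirrors why lines 5 to 9 of sample.txt are missing from the '\.$' result.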
How do we find the blank lines, that is, the lines on which nothing was entered?
[root@RHEL7-1 ~]# grep -n '^$' /root/sample.txt
22:
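Tip: suppose you know that in a shell script or a configuration file the blank lines and the lines that begin with # are only spacing and comments, and you want to leave them out when printing the file for reference, to save paper. How is this done? Taking /etc/rsyslog.conf as an example, see the output below (the -v option prints every line except those that match the pattern):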
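[root@RHEL7-1 ~]# cat -n /etc/rsyslog.conf
# The output has 91 lines, many of them blank lines or comment lines that begin with #
[root@RHEL7-1 ~]# grep -v '^$' /etc/rsyslog.conf | grep -v '^#'
# Only 10 lines remain: the first "-v '^$'" drops the blank lines,
# and the second "-v '^#'" drops the lines that begin with #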
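The same filtering can also be done in one grep invocation by giving both patterns with -e. The sketch below compares the two approaches on a hypothetical configuration fragment (the file contents are invented for illustration and are not taken from rsyslog.conf).

```shell
# Hypothetical configuration fragment with comments and a blank line.
tmp=$(mktemp)
printf '%s\n' '# a comment' '' 'option_a = 1' '#another comment' 'option_b = 2' > "$tmp"

# Two passes, as in the pipeline above.
two_pass=$(grep -v '^$' "$tmp" | grep -v '^#')
# One pass: -v drops every line that matches either of the -e patterns.
one_pass=$(grep -v -e '^$' -e '^#' "$tmp")

echo "$one_pass"
rm -f "$tmp"
```

Both variables should hold the same two option lines; the single-pass form simply avoids the second process.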
(4) Any single character "." and the repetition character "*".
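In the shell, the wildcard "*" stands for any string of zero or more characters, but regular expressions are not wildcards, and the two are not the same. In a regular expression, "." means "exactly one character, whatever it is". The two symbols have the following meanings in regular expressions.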
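• . (period): stands for exactly one arbitrary character.
• * (asterisk): repeats the preceding character zero or more times, so it is always used in combination with the character before it.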
Suppose we need the strings of the form "g??d", that is, four characters that begin with "g" and end with "d". This can be written as:
[root@RHEL7-1 ~]# grep -n 'g..d' /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".
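Because there must be exactly two characters between g and d, "god" in line 13 and "gd" in line 14 are not listed. To list oo, ooo, oooo and so on, that is, at least two consecutive o's, which pattern should be used: o*, oo*, or ooo*?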
Because * means "repeat the preceding RE character zero or more times", "o*" means "an empty string, or one or more o's".
Special note: because the empty string is allowed (the character may be absent altogether), "grep -n 'o*' sample.txt" lists every line.
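What about "oo*"? The first o must be present, and the second o may occur any number of times, including none, so every line that contains o, oo, ooo, oooo and so on is listed.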
Likewise, a string of at least two o's requires ooo*:
[root@RHEL7-1 ~]# grep -n 'ooo*' /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
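The difference between o*, oo*, and ooo* is easy to verify by counting matching lines. The following sketch uses four hypothetical one-word lines with zero to three o's; the file and variable names are illustrative.

```shell
# Hypothetical words containing 0, 1, 2 and 3 consecutive o's.
tmp=$(mktemp)
printf '%s\n' 'gd' 'god' 'good' 'goood' > "$tmp"

zero_or_more=$(grep -c 'o*' "$tmp")    # the empty string matches, so every line counts
one_or_more=$(grep -c 'oo*' "$tmp")    # needs at least one o
two_or_more=$(grep -c 'ooo*' "$tmp")   # needs at least two consecutive o's

echo "o*=$zero_or_more oo*=$one_or_more ooo*=$two_or_more"
rm -f "$tmp"
```

The counts drop from four to three to two as each additional literal o is required before the starred one.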
How do we match a string that begins and ends with g and has only o's, at least one of them, between the two g's (gog, goog, gooog and so on)?
[root@RHEL7-1 ~]# grep -n 'goo*g' sample.txt
18:google is the best tools for search keyword.
19:goooooogle yes!
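To find a string that begins with g and ends with g, with anything in between, use the any-character ".", that is, "g.*g". Because "*" repeats the preceding character zero or more times and "." is any character, ".*" stands for zero or more arbitrary characters.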
[root@RHEL7-1 ~]# grep -n 'g.*g' /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
14:The gd software is a library for drafting programs.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.
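It can help to see exactly which part of a line "g.*g" matches. The sketch below uses the -o option, which GNU grep provides to print only the matched text (it is not part of POSIX grep, so its availability is an assumption about the environment), on line 20 of sample.txt.

```shell
# Line 20 of sample.txt, held in a variable for illustration.
line="go! go! Let's go."

# -o prints only the matched part; ".*" takes as many characters as it can,
# so the match runs from the first g to the last g on the line.
longest=$(printf '%s\n' "$line" | grep -o 'g.*g')

echo "$longest"
```

The match spans from the first g to the final g ("go! go! Let's g"), which shows that ".*" is greedy rather than stopping at the nearest g.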
What about the lines that contain a number of any length? A number consists only of digits, so write:
[root@RHEL7-1 ~]# grep -n '[0-9][0-9]*' /root/sample.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
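The leading [0-9] matters: on its own, [0-9]* also accepts zero digits and therefore matches every line. The sketch below shows this on a hypothetical three-line file and then uses GNU grep's -o option (an assumption about the grep in use) to print just the numbers.

```shell
# Hypothetical sample: one line without digits, two lines with a number.
tmp=$(mktemp)
printf '%s\n' 'no digits here' 'about $ 3183 dollars' 'the no. 1' > "$tmp"

star_only=$(grep -c '[0-9]*' "$tmp")       # zero digits allowed: every line matches
one_plus=$(grep -c '[0-9][0-9]*' "$tmp")   # at least one digit required
numbers=$(grep -o '[0-9][0-9]*' "$tmp")    # print only the matched digit runs

echo "star_only=$star_only one_plus=$one_plus"
echo "$numbers"
rm -f "$tmp"
```

Here [0-9]* counts all three lines, while [0-9][0-9]* counts only the two lines that really contain a number and extracts 3183 and 1.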
(5) Limiting the number of consecutive RE characters with {}.
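What if the number of repetitions should be limited to a range? For example, how do we find a run of 2 to 5 consecutive o's? This is what the interval braces {} are for. In the basic regular expressions that grep uses by default, however, an unescaped "{" or "}" is an ordinary character, so each brace must be preceded by the escape character "\" to act as an interval (with grep -E, which uses extended regular expressions, the braces are written without the backslash).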
As a first exercise, to find the lines that contain two consecutive o's:
[root@RHEL7-1 ~]# grep -n 'o\{2\}' /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
This looks no different from ooo*, because line 19, with its many o's, still appears. Let's try another search string. To find g followed by 2 to 5 o's and then another g:
[root@RHEL7-1 ~]# grep -n 'go\{2,5\}g' /root/sample.txt
18:google is the best tools for search keyword.
Line 19 is not selected, because it has 6 o's. And what about 2 or more o's, as in goooo....g? Besides gooo*g, this also works:
[root@RHEL7-1 ~]# grep -n 'go\{2,\}g' /root/sample.txt
18:google is the best tools for search keyword.
19:goooooogle yes!
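For comparison, the same interval can be written without backslashes when grep is told to use extended regular expressions with -E. The sketch below runs both notations on three hypothetical words; it also shows that in a basic regular expression the unescaped braces are taken literally and match nothing here.

```shell
# Hypothetical words with 1, 2 and 6 o's between the two g's.
tmp=$(mktemp)
printf '%s\n' 'gog' 'google' 'goooooogle' > "$tmp"

bre=$(grep -n 'go\{2,5\}g' "$tmp")    # basic RE: braces escaped
ere=$(grep -nE 'go{2,5}g' "$tmp")     # extended RE (-E): braces unescaped
# Without -E the unescaped braces are literal characters, so no line matches.
literal=$(grep -c 'go{2,5}g' "$tmp")

echo "bre=$bre"
echo "ere=$ere"
echo "literal=$literal"
rm -f "$tmp"
```

Both interval forms select only "google"; which one to use is a matter of whether the rest of the pattern is written in basic or extended syntax.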
A typical use of the null device /dev/null is to discard the error messages produced by commands such as find or grep:
[root@RHEL7-1 ~]# grep delegate /etc/* 2>/dev/null
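The effect of the redirection can be observed by searching a file that does not exist. In the sketch below the path is hypothetical and chosen so that grep fails; the error text is captured once for comparison and discarded the second time.

```shell
# Hypothetical path that is assumed not to exist.
missing=/nonexistent/hypothetical-file

# Capture the error message by sending stderr to stdout.
err=$(grep delegate "$missing" 2>&1)
# Discard the error message instead; nothing is left to capture.
out=$(grep delegate "$missing" 2>/dev/null)
status=$?

echo "with 2>&1:       $err"
echo "with 2>/dev/null: [$out] (exit status $status)"
```

The exit status is still greater than 1, so a script can detect the failure even though the message itself has been thrown away.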