Regular Expressions in sed


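Basic regular expressions (BRE) and extended regular expressions (ERE) in GNU sed differ in a few ways. Extended syntax is often used to simplify a pattern, but it can also make some simple patterns more complicated. It is therefore worth having a basic understanding of the differences and details.

Consider the following cases:

Case 1: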

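Input: echo "abcwwwdc456" | gsed -E 's/(abc)([\w]+)(.*456)/\1REPLACED\3/g'
Output: abcREPLACEDdc456
Explanation: `[\w]+` matched only the run of `w` characters, not the `dc` that follows, so `dc` was not replaced.
Reason: the bracket expression is really the set of the two characters `\` and `w`, not **digits, letters, and underscore**. `\` has no special meaning inside `[ ]`, so `\w` cannot be formed there.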
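
A POSIX bracket class gives the effect that `[\w]` was meant to have. The following is a minimal sketch, assuming `sed` is GNU sed (spelled `gsed` on macOS, as above); the `REPLACED` marker is only a placeholder string.

```shell
# [[:alnum:]_] is a bracket-safe spelling of "letters, digits, underscore".
fixed=$(echo "abcwwwdc456" | sed -E 's/(abc)([[:alnum:]_]+)(.*456)/\1REPLACED\3/g')
echo "$fixed"   # abcREPLACED456
```

With this class the second group also consumes `dc`, so only `456` is left for the third group.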

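Case 2:

Input: echo "abcwwwdc456" | gsed -E 's/(abc)(\w+)(.*456)/\1REPLACED\3/g'
Output: abcREPLACED456
Explanation: `\w` is recognized as the class of **digits, letters, and underscore**, so it matches the `w` characters and the `dc` after them.
Reason: outside a bracket expression, the `s` command of GNU sed supports escapes such as `\w` (word character) and `\b` (word boundary).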
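
The two escapes can be tried on their own. This is a small sketch that assumes GNU sed is installed as `sed`; the sample strings are made up for illustration.

```shell
# \b anchors at a word boundary, so the "cat" inside "concat" is left alone.
bounded=$(echo "cat concat cat" | sed -E 's/\bcat\b/dog/g')
echo "$bounded"   # dog concat dog

# \w+ consumes a whole run of letters, digits, and underscores.
words=$(echo "foo_1 bar-2" | sed -E 's/\w+/W/g')
echo "$words"     # W W-W
```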

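Case 3:

Input: echo "abcwwwdc456" | ggrep -o -P --color=auto  '(abc)([\w]+)'
Output: abcwwwdc456
Explanation: with `-P`, `grep` allows `\w` inside `[]` to define a new character set. This is closer to the more common regex dialects and feels more natural to write.
Reason: this `grep` invocation uses Perl-style regular expressions, which support this.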
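
For contrast, the same pattern can be run through grep's ERE engine instead of `-P`. A minimal sketch, assuming GNU grep is installed as `grep` (the `ggrep` above on macOS):

```shell
# Without -P, [\w] is read as the two literal characters \ and w,
# so the match stops after the run of "w".
ere_match=$(echo "abcwwwdc456" | grep -oE '(abc)([\w]+)')
echo "$ere_match"   # abcwww
```

So it is the `-P` option, not grep itself, that makes `[\w]` behave as a word-character class.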

The literal plus character +

Basic syntax [ BRE ]

$ echo 'a+b=c' > foo
$ sed -n '/a+b/p' foo
a+b=c

Extended syntax [ ERE ]

$ echo 'a+b=c' > foo
$ sed -E -n '/a\+b/p' foo
a+b=c

One or more a followed by the letter b [ plus as a special metacharacter ]

Basic syntax [ BRE ]

$ echo aab > foo
$ sed -n '/a\+b/p' foo
aab

Extended syntax [ ERE ]

$ echo aab > foo
$ sed -E -n '/a+b/p' foo
aab
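
Because `\+` in BRE is a GNU extension, a script that must also run on other sed implementations can express "one or more" with constructs that are part of POSIX BRE. A short sketch under that assumption:

```shell
# POSIX BRE interval: one or more "a", then "b".
interval=$(echo aab | sed -n '/a\{1,\}b/p')
echo "$interval"   # aab

# Classic idiom: one "a" followed by zero or more "a".
star=$(echo aab | sed -n '/aa*b/p')
echo "$star"       # aab
```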

BRE syntax overview

Syntax and description:

  • `char` A single ordinary character matches itself.
  • `*` Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by `\`, a `.`, a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by `*`; for example, `a**` is equivalent to `a*`. POSIX 1003.1-2001 says that `*` stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this and portable scripts should instead use `\*` in these contexts.
  • `.` Matches any character, including newline.
  • `^` Matches the null string at the beginning of the pattern space, i.e. what appears after the circumflex must appear at the beginning of the pattern space.
  • `$` It is the same as `^`, but refers to end of pattern space. `$` also acts as a special character only at the end of the regular expression or subexpression (that is, before `\)` or `\|`), and its use at the end of a subexpression is not portable.
  • `\+` As `*`, but matches one or more. It is a GNU extension.
  • `\?` As `*`, but only matches zero or one. It is a GNU extension.
  • `\{i\}` As `*`, but matches exactly i sequences (i is a decimal integer; for portability, keep it between 0 and 255 inclusive).
  • `\{i,j\}` Matches between i and j, inclusive, sequences.
  • `\{i,\}` Matches more than or equal to i sequences.
  • `\(regexp\)` Groups the inner regexp as a whole, so that postfix operators apply to the group and back references can refer to it.
  • `regexp1\|regexp2` Matches either regexp1 or regexp2. It is a GNU extension.
  • `regexp1regexp2` Matches the concatenation of regexp1 and regexp2.
  • `\digit` Matches the digit-th `\(…\)` parenthesized subexpression in the regular expression. This is called a back reference. Subexpressions are implicitly numbered by counting occurrences of `\(` left-to-right.
  • `\n` Matches the newline character.
  • `\char` Matches char, where char is one of `$`, `*`, `.`, `[`, `\`, or `^`. Note that the only C-like backslash sequences that you can portably assume to be interpreted are `\n` and `\\`; in particular `\t` is not portable, and matches a ‘t’ under most implementations of sed, rather than a tab character.
  • `[list]` or `[^list]` Matches any single character in list; a leading `^` reverses the meaning, so that it matches any single character not in list.
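
A few of the BRE constructs above in action. This is an illustrative sketch with made-up input strings, assuming GNU sed is installed as `sed` (the `\|` line relies on the GNU extension).

```shell
# \(...\) groups and \1 refers back to it: collapse a doubled word.
dedup=$(echo "hello hello world" | sed 's/\([a-z][a-z]*\) \1/\1/')
echo "$dedup"       # hello world

# \{i\} repeats the preceding bracket expression exactly i times.
date_line=$(echo "2024-01-15" | sed -n '/^[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}$/p')
echo "$date_line"   # 2024-01-15

# \| is alternation in GNU BRE.
alt=$(echo "cat dog bird" | sed 's/cat\|dog/pet/g')
echo "$alt"         # pet pet bird
```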

ERE syntax overview

The only difference between basic and extended regular expressions is in the behavior of a few characters: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’. While basic regular expressions require these to be escaped if you want them to behave as special characters, when using extended regular expressions you must escape them if you want them to match a literal character. ‘|’ is special here because ‘\|’ is a GNU extension; standard basic regular expressions do not provide its functionality.
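
The escaping rule flips with `-E`. The sketch below, again assuming GNU sed as `sed` and using made-up inputs, shows unescaped operators acting as metacharacters and escaped ones matching literally.

```shell
# Unescaped ( ) + | are operators in ERE.
dedup=$(echo "hello hello world" | sed -E 's/([a-z]+) \1/\1/')
echo "$dedup"     # hello world
alt=$(echo "cat dog bird" | sed -E 's/cat|dog/pet/g')
echo "$alt"       # pet pet bird

# Escaped, the same characters match themselves.
literal=$(echo "f(x)+1" | sed -E 's/\(x\)\+/(y)-/')
echo "$literal"   # f(y)-1
```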

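About character sets

sed supports the usual regex metacharacters, and also some shell-style character class definitions. A brief description of both follows.

The basic metacharacters are supported.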

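  • . The most common wildcard, supported by nearly every regex dialect. It matches any character, including newline.
  • * Zero or more occurrences (>= 0) of the preceding expression.
  • ^ On its own, matches the beginning of the line (the null string at the start of the pattern space).
  • $ Matches the end of the line.
  • [list] Matches any single character in list. For example, [aeiou] matches any vowel.
  • [^list] Matches any single character that is not in list.
  • + One or more occurrences, as in most regex dialects. Written \+ in BRE (a GNU extension) and + in ERE.
  • ? Zero or one occurrence of the preceding expression. Written \? in BRE (a GNU extension) and ? in ERE.
  • {i} Exactly i occurrences. For portability, keep i between 0 and 255. Written \{i\} in BRE.
  • {i,j} Between i and j occurrences, inclusive.
  • {i,} At least i occurrences.
  • (regexp) Treats the regular expression inside the parentheses as a single group, which back references can refer to. Written \(regexp\) in BRE.
  • [a-zA-Z0-9] Matches any digit or letter.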
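
The difference between the quantifiers is easiest to see on the same input. A minimal sketch with made-up strings, assuming GNU sed as `sed`:

```shell
# * allows zero occurrences, so a bare "b" matches a*b as well.
star=$(echo "b ab aab" | sed 's/a*b/X/g')
echo "$star"    # X X X

# + needs at least one "a", so the bare "b" is untouched.
plus=$(echo "b ab aab" | sed -E 's/a+b/X/g')
echo "$plus"    # b X X

# ? allows zero or one "u".
opt=$(echo "color colour" | sed -E 's/colou?r/C/g')
echo "$opt"     # C C

# {i,j} bounds the repeat count: here 2 to 3 digits.
range=$(echo "1 22 333" | sed -E 's/\b[0-9]{2,3}\b/N/g')
echo "$range"   # 1 N N
```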

Character class definitions

  • [:alnum:] Alphanumeric characters: [:alpha:] and [:digit:]; in the C locale and ASCII character encoding, this is the same as [0-9A-Za-z].
  • [:alpha:] Alphabetic characters: [:lower:] and [:upper:]; in the C locale and ASCII character encoding, this is the same as [A-Za-z].
  • [:blank:] Blank characters: space and tab.
  • [:cntrl:] Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
  • [:digit:] Digits: 0 1 2 3 4 5 6 7 8 9.
  • [:graph:] Graphical characters: [:alnum:] and [:punct:].
  • [:lower:] Lower-case letters; in the C locale and ASCII character encoding, this is a b c d e f g h i j k l m n o p q r s t u v w x y z.
  • [:print:] Printable characters: [:alnum:], [:punct:], and space.
  • [:punct:] Punctuation characters; in the C locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
  • [:space:] Space characters: in the C locale, this is tab, newline, vertical tab, form feed, carriage return, and space.
  • [:upper:] Upper-case letters: in the C locale and ASCII character encoding, this is A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.
  • [:xdigit:] Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.

Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.
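
In practice this means a class name always appears inside a second pair of brackets. A brief sketch with made-up inputs:

```shell
# Outer [] delimit the bracket expression; inner [:space:] names the class.
squeezed=$(printf 'foo  bar\tbaz\n' | sed 's/[[:space:]]\{1,\}/_/g')
echo "$squeezed"   # foo_bar_baz

# Several classes can share one bracket expression.
masked=$(echo "id=A1b2-c3" | sed 's/[[:upper:][:digit:]]/#/g')
echo "$masked"     # id=##b#-c#
```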

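Note: most metacharacters lose their original meaning inside a bracket expression.

  • ] ends the bracket expression unless it is the first item. To include a literal ] in the set, put it first.
  • - denotes a range, unless it is the first or last character.
  • ^ negates the set when it is the first character. To match a literal ^, put it anywhere other than first.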
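
The three placement rules can be checked directly. A small sketch with made-up inputs:

```shell
# ] placed first is literal, so [][] is the set of ] and [.
brackets=$(echo "a]b[c" | sed 's/[][]/_/g')
echo "$brackets"   # a_b_c

# - placed last is literal.
dash=$(echo "a-b_c" | sed 's/[_-]/ /g')
echo "$dash"       # a b c

# ^ placed anywhere but first is literal.
caret=$(echo "a^b" | sed 's/[b^]/X/g')
echo "$caret"      # aXX
```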

The other metacharacters and escape characters also lose their special meaning there and become ordinary characters.

For example, [\*] matches either of the two characters \ and *, because \ has no special meaning here.
There are exceptions: [ becomes special again when it is immediately followed by ., =, or :.
Also, when not in POSIXLY_CORRECT mode, special escapes such as \t and \n are recognized.

The GNU sed manual puts it this way:

The characters $, *, ., [, and \ are normally not special within list. For example, [\*] matches either \ or *, because the \ is not special here.
However, strings like [.ch.], [=a=], and [:space:] are special within list and represent collating symbols, equivalence classes, and character classes, respectively, and [ is therefore special within list when it is followed by ., =, or :.
Also, when not in POSIXLY_CORRECT mode, special escapes like \n and \t are recognized within list. See Escapes.
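
Both behaviours can be observed from the command line. The sketch below assumes GNU sed as `sed`, run without POSIXLY_CORRECT set; other sed implementations may treat `[\t]` differently.

```shell
# \ is literal inside brackets, so [\*] matches a backslash or a star.
either=$(printf '%s\n' 'a\b*c' | sed 's/[\*]/_/g')
echo "$either"   # a_b_c

# GNU sed still expands \t inside brackets, so this matches a tab.
tabbed=$(printf 'a\tb\n' | sed 's/[\t]/_/')
echo "$tabbed"   # a_b
```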

‘[.’
represents the open collating symbol.

‘.]’
represents the close collating symbol.

‘[=’
represents the open equivalence class.

‘=]’
represents the close equivalence class.

‘[:’
represents the open character class symbol, and should be followed by a valid character class name.

‘:]’
represents the close character class symbol.

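Note:
As stated above, \ has no special meaning inside a bracket expression. That is why a set such as [\w-] contains only the three characters \, w, and -, and does not include the \w word-character class. The Perl-style regular expressions of commands such as GNU grep (grep -P) do treat \w inside brackets as a class. Likewise, the \d digit shorthand available there does not exist in sed. Keep this in mind.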
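
A short sketch of the pitfall and of bracket-safe replacements, assuming GNU sed as `sed`; the inputs are made up for illustration.

```shell
# [\w-] is just the three characters \ , w and - .
three=$(printf '%s\n' 'a\w-z' | sed 's/[\w-]/_/g')
echo "$three"     # a___z

# Bracket-safe spelling of "word character or hyphen".
wordish=$(echo "foo-bar_1 baz" | sed -E 's/[[:alnum:]_-]+/X/g')
echo "$wordish"   # X X

# In place of \d, use [0-9] or [[:digit:]].
digits=$(echo "tel 555" | sed -E 's/[0-9]+/N/')
echo "$digits"    # tel N
```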

References:

  1. ERE-syntax
