[分享] <转载>解读 Rob Pike 编写的正则表达式程序

247153481   2015-11-10 19:40 楼主
ADdress If the regular expression is a$at the end of the expression, then the text matches only if it too is at its end. Otherwise, if we are not at the end of the text string (that is,*text!='\0') and if the first character of the text string matches the first character of the regular expression, so far so good; we go on to test whether the next character of the regular expression matches the next character of the text by making a recursive call tomatchhere. This recursive call is the heart of the algorithm and is why the code is so compact and clean. If all of these attempts to match fail, then there can be no match at this point in the regular expression and the text, somatchherereturns 0. This code really uses C pointers. At each stage of the recursion, if something matches, the recursive call that follows uses pointer arithmetic (e.g.,regexp+1andtext+1) so that the next function is called with the next character of the regular expression and of the text. The depth of recursion is no more than the length of the pattern, which in normal use is quite short, so there is no danger of running out of space. 译者信息 [url=][/url]译者信息 袁不语 翻译于 2年前0人 [url=]顶[/url] 此译文 如果正则表达式最后一个字符是$,字符串也到了末尾,那么就匹配了。 否则,如果不是在字符串的末尾(也就是,*text!='\0'),字符串第一个字符匹配了正则表达式第一个字符,那么就递归调用matchhere检查正则表达式下一个字符是否匹配字符串下一个字符。递归调用是算法的核心也是算法如此紧凑简洁的原因。 如果所有的匹配尝试都失败了,那么正则表达式和字符串没有匹配,所有matchhere返回0。 代码使用了C指针。在递归的每个阶段,如果某部分匹配了,接下来的递归调用使用了指针运算(比如,regexp+1和text+1)使得接下来的函数调用传入正则表达式和字符串的下一个字符。递归的深度不超过模式的深度,在通常的使用中都很短,所有没有空间不足的风险。
Alternatives This is a very elegant and well-written piece of code, but it's not perfect. What might one do differently? I might rearrangematchhereto deal with$before*. Although it makes no difference here, it feels a bit more natural, and a good rule is to do easy cases before hard ones. In general, however, the order of tests is critical. For instance, in this test frommatchstar: } while (*text != '\0' && (*text++ == c || c == '.')); no matter what, we must advance over one more character of the text string, so the increment intext++must always be performed. This code is careful about termination conditions. Generally, the success of a match is determined by whether the regular expression runs out at the same time as the text does. If they do run out together, that indicates a match; if one runs out before the other, there is no match. This is perhaps most obvious in a line like if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0'; but subtle termination conditions show up in other places as well.译者信息 [url=][/url]译者信息 bigtiger02 翻译于 2年前0人 [url=]顶[/url] 此译文 其他方案 这是一段优雅的写的很好的代码,但是它并不完美。有人可能会按照其他的方式来写。在这里我可能重新组织匹配的顺序,将$匹配的探测置于*之前,尽管这两种处理方式本质上并没有区别,但是这样做看起来更自然些。在处理难题之前先做一些容易的事是一个很好的习惯。 一般情况下,匹配测试的顺序是非常重要的。比如下面的frommatchstar测试: } while (*text != '\0' && (*text++ == c || c == '.')); 无论什么时候,我们必须向后取一个字符,所以增量intext++必须每次都被执行。 这段代码对于终止条件敏感。一般情况下,匹配成功与否是由正则表达式和文本是否同时结束决定,如果正则表达式和文本一起结束,就表明匹配成功,反之如则匹配失败。这和下面的这种写法很像: if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0';但这种终止条件也可能出现在其他地方。
The version ofmatchstarthat implements leftmost longest matching begins by identifying a maximal sequence of occurrences of the input characterc. Then it usesmatchhereto try to extend the match to the rest of the pattern and the rest of the text. Each failure reduces the number ofc's by one and tries again, including the case of zero occurrences. /* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;} Consider the regular expression(.*), which matches arbitrary text within parentheses. Given the target text for (t = text; *t != '\0' && (*t == c || c == '.'); t++) a longest match from the beginning will identify the entire parenthesized expression, while a shortest match will stop at the first right parenthesis. (Of course a longest match beginning from the second left parenthesis will extend to the end of the text.)译者信息 [url=][/url]译者信息 super0555 翻译于 2年前0人 [url=]顶[/url] 此译文 matchstar的这个版本实现了左边最长匹配,它开头定义了一个最长的序列,对输入字符进行检索。之后它使用matchhere去尝试将匹配扩展到模式的其余部分,以及文本的剩余部分。每次失败将使c的数值减少一,同时它会再次尝试,其中包含了零匹配的情形。 /* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;}考虑到通常的表达式(.*),它对括号内的任意字符进行匹配。我们给出目标文本 for (t = text; *t != '\0' && (*t == c || c == '.'); t++)最长的匹配是始自于对整个带括号表达式的定义,而最短的匹配则终止于第一个右括号。(当然从第二个左括号开始的最长匹配,将扩展到文本的最末端。)
Building On It The purpose of TPOP was to teach good programming. At the time the book was written, Rob and I were still at Bell Labs, so we did not have first-hand experience of how the book would be best used in a classroom. It has been gratifying to discover that some of the material does work well in classes. I have used this code since 2000 as a vehicle for teaching important points about programming. First, it shows how recursion is useful and leads to clean code, in a new setting; it's not yet another version of Quicksort (or factorial!), nor is it some kind of tree walk. It's also a good example for performance experiments. Its performance is not very different from the system versions of grep, which shows that the recursive technique is not too costly and that it's not worth trying to tune the code. 译者信息 [url=][/url]译者信息 super0555 翻译于 2年前0人 [url=]顶[/url] 此译文 基于它创建 TPOP的目的在于传授良好的编程方法。当此书撰写时,Rob和我仍然在贝尔实验室,所以我们没有关于此书怎样能最佳的应用于课堂的第一手经验。发现其中的一些材料在课堂教学中发挥了积极作用实在是令人高兴。我自2000年开始使用这些代码,作为教授关于编程的重要观点的辅助工具。 首先,它显示出在一个新的环境中,递归是如何有用并能带来干净的代码;它不是Quicksort(或阶乘!)的另一个版本,也不是某种对树的遍历。 它同时也是性能实验的良好例子。它的性能与系统版本的grep差异不是很大,这显示出递归技术并不是非常有价值,尝试优化代码也是并不值得的。
On the other hand, it is also a fine illustration of the importance of a good algorithm. If a pattern includes several.*s, the straightforward implementation requires a lot of backtracking, and in some cases will run very slowly indeed. (The standard Unix grep has the same properties.) For example, the command grep 'a.*a.*a.*a.a' takes about 20 seconds to process a 4 MB text file on a typical machine. An implementation based on converting a non-deterministic finite automaton to a deterministic automaton, as in egrep, will have much better performance on hard cases -- the same pattern and the same input is processed in less than a tenth of a second, and in general, the running time is independent of the pattern.译者信息 [url=][/url]译者信息 Cath 翻译于 2年前0人 [url=]顶[/url] 此译文 另一方面, 它也体现了一个好算法的重要性。 如果一个正则表达式包含了多个.*s, 直观的实现需要多次回溯, 而且在某些情况执行的确实非常慢。 (标准的Unix grep具有同样的特点。) 比如,如下命令 grep 'a.*a.*a.*a.a' 在一台普通设备上处理一个4 MB的文本文件花了大约20秒。 一种基于转换不确定有限自动机为确定自动机的实现方法,比如egrep, 会在困难的情况下表现得更好 -- 同样的表达式和输入处理起来只用了不到十分之一秒,通常运行时间和表达式是相会独立的。
Extensions to the regular expression class can form the basis of a variety of assignments. For example: (1) Add other metacharacters, like+for one or more occurrences of the previous character, or?for zero or one matches. Add some way to quote metacharacters, like\$to stand for a literal occurrence of$. (2) Separate regular expression processing into a "compilation" phase and an "execution" phase. Compilation converts the regular expression into an internal form that makes the matching code simpler or such that subsequent matching runs faster. This separation is not necessary for the simple class of regular expressions in the original design, but it makes sense in grep-like applications where the class is richer and the same regular expression is used for a large number of input lines. (3) Add character classes like[abc]and[0-9], which in conventional grep notation matchaorborcand a digit respectively. This can be done in several ways, the most natural of which seems to be replacing thechar*'s of the original code with a structure: typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE; and modifying the basic code to handle an array of these instead of an array of characters. It's not strictly necessary to separate compilation from execution for this situation, but it turns out to be a lot easier. Students who follow the advice to pre-compile into such a structure invariably do better than those who try to interpret some complicated pattern data structure on the fly.译者信息 [url=][/url]译者信息 几点人 翻译于 2年前0人 [url=]顶[/url] 此译文 对正则表达式类的扩展形成各种任务的基础。例如: (1)增加其他元字符,例如+表示前面字符的一个或多个出现,或者?表示前面字符零个或一个字符匹配。还可以增加一些引用元字符的方法,例如\$表示出现了$字母。 (2)将正则表达式处理过程分成编译阶段和执行阶段。编译阶段把正则表达式转换为内部形式,使匹配代码更为简单或者使这样的后续匹配运行得更快。对于最初设计里的简单正则表达式类来说,这种拆分并不是必须的,但在类似grep的应用里,这种拆分是有意义的,因为这种类型的正则表达式要更为复杂,而且同样的正则表达式将可用在大量的输入行上。 (3)增加像[abc]和[0-9]这样的字符类,在传统的grep表示里,它分别匹配a或b或c和一个数字。这可通过几种方式实现,其中最自然的方式似乎是使用下面的机构替代原始代码中的char*: typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE;并且修改相应的代码来处理这样的结构数组而不是处理字符数组。在这种情况下,严格的来说,不需要把编译阶段从执行阶段中拆分出来,但是这证明拆分是非常简单的。遵循预编译成这样的结构建议的学生一定比哪些试图在运行过程中解释一些复杂模式的数据结构的学生做得更好。
Writing clear and unambiguous specifications for character classes is tough, and implementing them perfectly is worse, requiring a lot of tedious and uninstructive coding. I have simplified this assignment over time, and today most often ask for Perl-like shorthands such as\dfor digit and\Dfor non-digit instead of the original bracketed ranges. (4) Use an opaque type to hide the RE structure and all the implementation details. This is a good way to show object-oriented programming in C, which doesn't support much beyond this. In effect, one makes a regular expression class but with function names likeRE_new()andRE_match()for the methods instead of the syntactic sugar of an object-oriented language. (5) Modify the class of regular expressions to be like the wild cards in various shells: matches are implicitly anchored at both ends,*matches any number of characters, and?matches any single character. One can modify the algorithm or map the input into the existing algorithm. 译者信息 [url=][/url]译者信息 几点人 翻译于 2年前0人 [url=]顶[/url] 此译文 为字符类编写清晰且无歧义的规范是很困难的,而且完美地实现它们更难,这需要大量的枯燥且无意义的编码。随着时间的推移,我简化了这个任务,之后最常请求的就是像Perl那样的速记,例如\d表示数字,\D表示非数字,而不是最初的那种方括号内限定范围。 (4)使用不透明的类型来隐藏RE结构和所有实现细节。在C语言里这是展示面向对象编程的好方法,不过除此之外它不支持其他东西。实际中,你可以创建一个正则表达式类,并且成员函数的名字像RE_new( )和RE_match( )这样,而不是像面向对象语言的语法那样。 (5)修改正则表达式类,使得它更像各种shell中的通配符:这种匹配隐含地固定匹配的起始为两端,*匹配任意数量的字符,而?匹配任何单个字符。你可以修改这个算法或者把输入引入到已有的算法里。
(6) Convert the code to Java. The original code uses C pointers very well, and it's good practice to figure out the alternatives in a different language. Java versions use eitherString.charAt(indexing instead of pointers) orString.substring(closer to the pointer version). Neither seems as clear as the C code, and neither is as compact. Although performance isn't really part of this exercise, it is interesting to see that the Java implementation runs roughly six or seven times slower than the C versions. (7) Write a wrapper class that converts from regular expressions of this class to Java's Pattern and Matcher classes, which separate the compilation and matching in a quite different way. This is a good example of the Adapter or Facade pattern, which puts a different face on an existing class or set of functions. I've also used this code extensively to explore testing techniques. Regular expressions are rich enough that testing is far from trivial, but small enough that one can quickly write down a substantial collection of tests to be performed mechanically. For extensions like those just listed, I ask students to write a large number of tests in a compact language (yet another example of "notation") and use those tests on their own code; naturally I use their tests on other students' code as well. 译者信息 [url=][/url]译者信息 dodojava 翻译于 2年前0人 [url=]顶[/url] 此译文 6.将这段代码转换成Java。最初的代码很好地使用了C指针,并且把这个算法在不同的语言中实现出来是一个很好的实践过程。在Java版本的代码中将使 用String.charAt(使用索引而不是指针)或者String.subString(更接近于指针)。这两种方法都没有C代码整洁,并且都不紧凑。 虽然在这个练习中性能并不是主要的问题,但有趣的是可以发现了Java版本比C版本在运行速度上要慢6到7倍。 7.编写一个包装类把这种类型的正则表达式转换成Java的Pattern类和Matcher类,这些类将以一种与众不同的方式来拆分编译阶段和匹配阶段。 这是适配器(Adapter)模式或者外观(Façade)模式的很好示例,这两种模式用来在现有的类或者函数集合外部设置不同的接口。 我曾经大量地使用了这段代码来研究测试技术。正则表达式非常的丰富,而测试也不是无足轻重的,但正则表达式又是很短小的,程序员可以很快地写出一组测试代码来执行。对于前面所列出的各种扩展,我要求学生用一种紧凑的语言写出大量的测试代码(这是“记法”的另一种示例),并且在他们自己的代码中使用这些测试用例;自然我在其他学生的代码中也使用了他们的测试用例。
Conclusions I was amazed by how compact and elegant this code was when Rob Pike first wrote it -- it was much smaller and more powerful than I had thought possible. In hindsight, one can see a number of reasons why the code is so small. First, the features are well chosen to be the most useful and to give the most insight into implementation, without any frills. For example, the implementation of the anchored patterns^and$requires only 3 or 4 lines, but shows how to handle special cases cleanly before handling the general cases uniformly. The closure operation*is a fundamental notion in regular expressions and provides the only way to handle patterns of unspecified lengths, so it has to be present. But it would add no insight to also provide+and?so those are left as exercises. 译者信息 [url=][/url]译者信息 eastasiasnow 翻译于 2年前0人 [url=]顶[/url] 此译文 结尾 当Rob Pike把刚写的代码给我时,我惊叹于它是如此的简洁和优雅--比我能想到的还要小巧且功能强大。事后才发现 为什么代码如此短小的原因。 首先,选取最有用的特性并给出最具见解的实现,没有任何冗余。比如,定位模式“^”和“$”的实现仅用了3-4行代码,却展现了在统一处理一般情况前如何利索的处理特殊情形。闭合运算*(closure operation)是正则表达式中一个重要的概念,它提供了处理不定长度模式的唯一的方式,因此代码中必须包含该功能。但是同样提供诸如"+"和"?"的实现对正则表达式原理的理解一点用处也没有,因此这些功能留作练习供读者实现。
Second, recursion is a win. This fundamental programming technique almost always leads to smaller, cleaner and more elegant code than the equivalent written with explicit loops, and that is the case here. The idea of peeling off one matching character from the front of the regular expression and from the text, then recursing for the rest, echoes the recursive structure of the traditional factorial or string length examples, but in a much more interesting and useful setting. Rob has told me that the recursion was not so much an explicit design decision as a consequence of how he approached the problem: given a pattern and a text, write a function that looks for a match; that in turn needs a "matchhere" function; and so on.
"I have pretty vivid memories of watching the code almost write itself this way. The only challenge was getting the edge conditions right to break the recursion. Put another way, the recursion is not only the implementation method, it's also a reflection of the thought process taken when writing the code, which is partly responsible for the code's simplicity. Most important, perhaps, I didn't have a design when I started, I just began to code and saw what developed. Suddenly, I was done."
译者信息 [url=][/url]译者信息 eastasiasnow 翻译于 2年前0人 [url=]顶[/url] 此译文 其次,使用递归是一个优势。作为基本的编程技术,递归几乎总会产生更短小,更简洁和更优雅的代码,这是同样清晰的循环实现无法比拟的,在Bob的代码中同样如此。匹配的思想是从正则表达式和文本的开头剥离匹配的字符,然后对后面的文本重复该过程。该过程类似传统的求阶乘或字符串长度例程中的递归结构,但是以一种更有趣并且更有用的方式调整。 Rob告诉我递归不是那么清晰的设计决策,缘于其如何逼近问题的方式:考虑一个匹配模式和一段文本,写出匹配函数,该函数转而又需要一个“匹配这里”的函数,等等。 “我对代码本身以这样的方式写出来记忆犹新。仅有的挑战是找到正确的结束(边缘)条件(edge condition)跳出(结束)递归。换句话说,递归不仅仅是一种实现方法,当你写代码时,它还是你思维过程的反映。这样的思维过程对代码的简明性负有部分责任。可能,最重要的是,当我着手开始时并没有一个设计方案,我只是开始编码,然后看到开发的结果,突然,我就完成了。”
If the regular expression is a$at the end of the expression, then the text matches only if it too is at its end. Otherwise, if we are not at the end of the text string (that is,*text!='\0') and if the first character of the text string matches the first character of the regular expression, so far so good; we go on to test whether the next character of the regular expression matches the next character of the text by making a recursive call tomatchhere. This recursive call is the heart of the algorithm and is why the code is so compact and clean. If all of these attempts to match fail, then there can be no match at this point in the regular expression and the text, somatchherereturns 0. This code really uses C pointers. At each stage of the recursion, if something matches, the recursive call that follows uses pointer arithmetic (e.g.,regexp+1andtext+1) so that the next function is called with the next character of the regular expression and of the text. The depth of recursion is no more than the length of the pattern, which in normal use is quite short, so there is no danger of running out of space. 译者信息 [url=][/url]译者信息 袁不语 翻译于 2年前0人 [url=]顶[/url] 此译文 如果正则表达式最后一个字符是$,字符串也到了末尾,那么就匹配了。 否则,如果不是在字符串的末尾(也就是,*text!='\0'),字符串第一个字符匹配了正则表达式第一个字符,那么就递归调用matchhere检查正则表达式下一个字符是否匹配字符串下一个字符。递归调用是算法的核心也是算法如此紧凑简洁的原因。 如果所有的匹配尝试都失败了,那么正则表达式和字符串没有匹配,所有matchhere返回0。 代码使用了C指针。在递归的每个阶段,如果某部分匹配了,接下来的递归调用使用了指针运算(比如,regexp+1和text+1)使得接下来的函数调用传入正则表达式和字符串的下一个字符。递归的深度不超过模式的深度,在通常的使用中都很短,所有没有空间不足的风险。
Alternatives This is a very elegant and well-written piece of code, but it's not perfect. What might one do differently? I might rearrangematchhereto deal with$before*. Although it makes no difference here, it feels a bit more natural, and a good rule is to do easy cases before hard ones. In general, however, the order of tests is critical. For instance, in this test frommatchstar: } while (*text != '\0' && (*text++ == c || c == '.')); no matter what, we must advance over one more character of the text string, so the increment intext++must always be performed. This code is careful about termination conditions. Generally, the success of a match is determined by whether the regular expression runs out at the same time as the text does. If they do run out together, that indicates a match; if one runs out before the other, there is no match. This is perhaps most obvious in a line like if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0'; but subtle termination conditions show up in other places as well.译者信息 [url=][/url]译者信息 bigtiger02 翻译于 2年前0人 [url=]顶[/url] 此译文 其他方案 这是一段优雅的写的很好的代码,但是它并不完美。有人可能会按照其他的方式来写。在这里我可能重新组织匹配的顺序,将$匹配的探测置于*之前,尽管这两种处理方式本质上并没有区别,但是这样做看起来更自然些。在处理难题之前先做一些容易的事是一个很好的习惯。 一般情况下,匹配测试的顺序是非常重要的。比如下面的frommatchstar测试: } while (*text != '\0' && (*text++ == c || c == '.')); 无论什么时候,我们必须向后取一个字符,所以增量intext++必须每次都被执行。 这段代码对于终止条件敏感。一般情况下,匹配成功与否是由正则表达式和文本是否同时结束决定,如果正则表达式和文本一起结束,就表明匹配成功,反之如则匹配失败。这和下面的这种写法很像: if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0';但这种终止条件也可能出现在其他地方。
The version ofmatchstarthat implements leftmost longest matching begins by identifying a maximal sequence of occurrences of the input characterc. Then it usesmatchhereto try to extend the match to the rest of the pattern and the rest of the text. Each failure reduces the number ofc's by one and tries again, including the case of zero occurrences. /* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;} Consider the regular expression(.*), which matches arbitrary text within parentheses. Given the target text for (t = text; *t != '\0' && (*t == c || c == '.'); t++) a longest match from the beginning will identify the entire parenthesized expression, while a shortest match will stop at the first right parenthesis. (Of course a longest match beginning from the second left parenthesis will extend to the end of the text.)译者信息 [url=][/url]译者信息 super0555 翻译于 2年前0人 [url=]顶[/url] 此译文 matchstar的这个版本实现了左边最长匹配,它开头定义了一个最长的序列,对输入字符进行检索。之后它使用matchhere去尝试将匹配扩展到模式的其余部分,以及文本的剩余部分。每次失败将使c的数值减少一,同时它会再次尝试,其中包含了零匹配的情形。 /* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;}考虑到通常的表达式(.*),它对括号内的任意字符进行匹配。我们给出目标文本 for (t = text; *t != '\0' && (*t == c || c == '.'); t++)最长的匹配是始自于对整个带括号表达式的定义,而最短的匹配则终止于第一个右括号。(当然从第二个左括号开始的最长匹配,将扩展到文本的最末端。)
Building On It The purpose of TPOP was to teach good programming. At the time the book was written, Rob and I were still at Bell Labs, so we did not have first-hand experience of how the book would be best used in a classroom. It has been gratifying to discover that some of the material does work well in classes. I have used this code since 2000 as a vehicle for teaching important points about programming. First, it shows how recursion is useful and leads to clean code, in a new setting; it's not yet another version of Quicksort (or factorial!), nor is it some kind of tree walk. It's also a good example for performance experiments. Its performance is not very different from the system versions of grep, which shows that the recursive technique is not too costly and that it's not worth trying to tune the code. 译者信息 [url=][/url]译者信息 super0555 翻译于 2年前0人 [url=]顶[/url] 此译文 基于它创建 TPOP的目的在于传授良好的编程方法。当此书撰写时,Rob和我仍然在贝尔实验室,所以我们没有关于此书怎样能最佳的应用于课堂的第一手经验。发现其中的一些材料在课堂教学中发挥了积极作用实在是令人高兴。我自2000年开始使用这些代码,作为教授关于编程的重要观点的辅助工具。 首先,它显示出在一个新的环境中,递归是如何有用并能带来干净的代码;它不是Quicksort(或阶乘!)的另一个版本,也不是某种对树的遍历。 它同时也是性能实验的良好例子。它的性能与系统版本的grep差异不是很大,这显示出递归技术并不是非常有价值,尝试优化代码也是并不值得的。
On the other hand, it is also a fine illustration of the importance of a good algorithm. If a pattern includes several.*s, the straightforward implementation requires a lot of backtracking, and in some cases will run very slowly indeed. (The standard Unix grep has the same properties.) For example, the command grep 'a.*a.*a.*a.a' takes about 20 seconds to process a 4 MB text file on a typical machine. An implementation based on converting a non-deterministic finite automaton to a deterministic automaton, as in egrep, will have much better performance on hard cases -- the same pattern and the same input is processed in less than a tenth of a second, and in general, the running time is independent of the pattern.译者信息 [url=][/url]译者信息 Cath 翻译于 2年前0人 [url=]顶[/url] 此译文 另一方面, 它也体现了一个好算法的重要性。 如果一个正则表达式包含了多个.*s, 直观的实现需要多次回溯, 而且在某些情况执行的确实非常慢。 (标准的Unix grep具有同样的特点。) 比如,如下命令 grep 'a.*a.*a.*a.a' 在一台普通设备上处理一个4 MB的文本文件花了大约20秒。 一种基于转换不确定有限自动机为确定自动机的实现方法,比如egrep, 会在困难的情况下表现得更好 -- 同样的表达式和输入处理起来只用了不到十分之一秒,通常运行时间和表达式是相会独立的。
Extensions to the regular expression class can form the basis of a variety of assignments. For example: (1) Add other metacharacters, like+for one or more occurrences of the previous character, or?for zero or one matches. Add some way to quote metacharacters, like\$to stand for a literal occurrence of$. (2) Separate regular expression processing into a "compilation" phase and an "execution" phase. Compilation converts the regular expression into an internal form that makes the matching code simpler or such that subsequent matching runs faster. This separation is not necessary for the simple class of regular expressions in the original design, but it makes sense in grep-like applications where the class is richer and the same regular expression is used for a large number of input lines. (3) Add character classes like[abc]and[0-9], which in conventional grep notation matchaorborcand a digit respectively. This can be done in several ways, the most natural of which seems to be replacing thechar*'s of the original code with a structure: typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE; and modifying the basic code to handle an array of these instead of an array of characters. It's not strictly necessary to separate compilation from execution for this situation, but it turns out to be a lot easier. Students who follow the advice to pre-compile into such a structure invariably do better than those who try to interpret some complicated pattern data structure on the fly.译者信息 [url=][/url]译者信息 几点人 翻译于 2年前0人 [url=]顶[/url] 此译文 对正则表达式类的扩展形成各种任务的基础。例如: (1)增加其他元字符,例如+表示前面字符的一个或多个出现,或者?表示前面字符零个或一个字符匹配。还可以增加一些引用元字符的方法,例如\$表示出现了$字母。 (2)将正则表达式处理过程分成编译阶段和执行阶段。编译阶段把正则表达式转换为内部形式,使匹配代码更为简单或者使这样的后续匹配运行得更快。对于最初设计里的简单正则表达式类来说,这种拆分并不是必须的,但在类似grep的应用里,这种拆分是有意义的,因为这种类型的正则表达式要更为复杂,而且同样的正则表达式将可用在大量的输入行上。 (3)增加像[abc]和[0-9]这样的字符类,在传统的grep表示里,它分别匹配a或b或c和一个数字。这可通过几种方式实现,其中最自然的方式似乎是使用下面的机构替代原始代码中的char*: typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE;并且修改相应的代码来处理这样的结构数组而不是处理字符数组。在这种情况下,严格的来说,不需要把编译阶段从执行阶段中拆分出来,但是这证明拆分是非常简单的。遵循预编译成这样的结构建议的学生一定比哪些试图在运行过程中解释一些复杂模式的数据结构的学生做得更好。
Writing clear and unambiguous specifications for character classes is tough, and implementing them perfectly is worse, requiring a lot of tedious and uninstructive coding. I have simplified this assignment over time, and today most often ask for Perl-like shorthands such as\dfor digit and\Dfor non-digit instead of the original bracketed ranges. (4) Use an opaque type to hide the RE structure and all the implementation details. This is a good way to show object-oriented programming in C, which doesn't support much beyond this. In effect, one makes a regular expression class but with function names likeRE_new()andRE_match()for the methods instead of the syntactic sugar of an object-oriented language. (5) Modify the class of regular expressions to be like the wild cards in various shells: matches are implicitly anchored at both ends,*matches any number of characters, and?matches any single character. One can modify the algorithm or map the input into the existing algorithm. 译者信息 [url=][/url]译者信息 几点人 翻译于 2年前0人 [url=]顶[/url] 此译文 为字符类编写清晰且无歧义的规范是很困难的,而且完美地实现它们更难,这需要大量的枯燥且无意义的编码。随着时间的推移,我简化了这个任务,之后最常请求的就是像Perl那样的速记,例如\d表示数字,\D表示非数字,而不是最初的那种方括号内限定范围。 (4)使用不透明的类型来隐藏RE结构和所有实现细节。在C语言里这是展示面向对象编程的好方法,不过除此之外它不支持其他东西。实际中,你可以创建一个正则表达式类,并且成员函数的名字像RE_new( )和RE_match( )这样,而不是像面向对象语言的语法那样。 (5)修改正则表达式类,使得它更像各种shell中的通配符:这种匹配隐含地固定匹配的起始为两端,*匹配任意数量的字符,而?匹配任何单个字符。你可以修改这个算法或者把输入引入到已有的算法里。
(6) Convert the code to Java. The original code uses C pointers very well, and it's good practice to figure out the alternatives in a different language. Java versions use eitherString.charAt(indexing instead of pointers) orString.substring(closer to the pointer version). Neither seems as clear as the C code, and neither is as compact. Although performance isn't really part of this exercise, it is interesting to see that the Java implementation runs roughly six or seven times slower than the C versions. (7) Write a wrapper class that converts from regular expressions of this class to Java's Pattern and Matcher classes, which separate the compilation and matching in a quite different way. This is a good example of the Adapter or Facade pattern, which puts a different face on an existing class or set of functions. I've also used this code extensively to explore testing techniques. Regular expressions are rich enough that testing is far from trivial, but small enough that one can quickly write down a substantial collection of tests to be performed mechanically. For extensions like those just listed, I ask students to write a large number of tests in a compact language (yet another example of "notation") and use those tests on their own code; naturally I use their tests on other students' code as well. 译者信息 [url=][/url]译者信息 dodojava 翻译于 2年前0人 [url=]顶[/url] 此译文 6.将这段代码转换成Java。最初的代码很好地使用了C指针,并且把这个算法在不同的语言中实现出来是一个很好的实践过程。在Java版本的代码中将使 用String.charAt(使用索引而不是指针)或者String.subString(更接近于指针)。这两种方法都没有C代码整洁,并且都不紧凑。 虽然在这个练习中性能并不是主要的问题,但有趣的是可以发现了Java版本比C版本在运行速度上要慢6到7倍。 7.编写一个包装类把这种类型的正则表达式转换成Java的Pattern类和Matcher类,这些类将以一种与众不同的方式来拆分编译阶段和匹配阶段。 这是适配器(Adapter)模式或者外观(Façade)模式的很好示例,这两种模式用来在现有的类或者函数集合外部设置不同的接口。 我曾经大量地使用了这段代码来研究测试技术。正则表达式非常的丰富,而测试也不是无足轻重的,但正则表达式又是很短小的,程序员可以很快地写出一组测试代码来执行。对于前面所列出的各种扩展,我要求学生用一种紧凑的语言写出大量的测试代码(这是“记法”的另一种示例),并且在他们自己的代码中使用这些测试用例;自然我在其他学生的代码中也使用了他们的测试用例。
Conclusions I was amazed by how compact and elegant this code was when Rob Pike first wrote it -- it was much smaller and more powerful than I had thought possible. In hindsight, one can see a number of reasons why the code is so small. First, the features are well chosen to be the most useful and to give the most insight into implementation, without any frills. For example, the implementation of the anchored patterns^and$requires only 3 or 4 lines, but shows how to handle special cases cleanly before handling the general cases uniformly. The closure operation*is a fundamental notion in regular expressions and provides the only way to handle patterns of unspecified lengths, so it has to be present. But it would add no insight to also provide+and?so those are left as exercises. 译者信息 [url=][/url]译者信息 eastasiasnow 翻译于 2年前0人 [url=]顶[/url] 此译文 结尾 当Rob Pike把刚写的代码给我时,我惊叹于它是如此的简洁和优雅--比我能想到的还要小巧且功能强大。事后才发现 为什么代码如此短小的原因。 首先,选取最有用的特性并给出最具见解的实现,没有任何冗余。比如,定位模式“^”和“$”的实现仅用了3-4行代码,却展现了在统一处理一般情况前如何利索的处理特殊情形。闭合运算*(closure operation)是正则表达式中一个重要的概念,它提供了处理不定长度模式的唯一的方式,因此代码中必须包含该功能。但是同样提供诸如"+"和"?"的实现对正则表达式原理的理解一点用处也没有,因此这些功能留作练习供读者实现。
Second, recursion is a win. This fundamental programming technique almost always leads to smaller, cleaner and more elegant code than the equivalent written with explicit loops, and that is the case here. The idea of peeling off one matching character from the front of the regular expression and from the text, then recursing for the rest, echoes the recursive structure of the traditional factorial or string length examples, but in a much more interesting and useful setting. Rob has told me that the recursion was not so much an explicit design decision as a consequence of how he approached the problem: given a pattern and a text, write a function that looks for a match; that in turn needs a "matchhere" function; and so on.
"I have pretty vivid memories of watching the code almost write itself this way. The only challenge was getting the edge conditions right to break the recursion. Put another way, the recursion is not only the implementation method, it's also a reflection of the thought process taken when writing the code, which is partly responsible for the code's simplicity. Most important, perhaps, I didn't have a design when I started, I just began to code and saw what developed. Suddenly, I was done."
译者信息 [url=][/url]译者信息 eastasiasnow 翻译于 2年前0人 [url=]顶[/url] 此译文 其次,使用递归是一个优势。作为基本的编程技术,递归几乎总会产生更短小,更简洁和更优雅的代码,这是同样清晰的循环实现无法比拟的,在Bob的代码中同样如此。匹配的思想是从正则表达式和文本的开头剥离匹配的字符,然后对后面的文本重复该过程。该过程类似传统的求阶乘或字符串长度例程中的递归结构,但是以一种更有趣并且更有用的方式调整。 Rob告诉我递归不是那么清晰的设计决策,缘于其如何逼近问题的方式:考虑一个匹配模式和一段文本,写出匹配函数,该函数转而又需要一个“匹配这里”的函数,等等。 “我对代码本身以这样的方式写出来记忆犹新。仅有的挑战是找到正确的结束(边缘)条件(edge condition)跳出(结束)递归。换句话说,递归不仅仅是一种实现方法,当你写代码时,它还是你思维过程的反映。这样的思维过程对代码的简明性负有部分责任。可能,最重要的是,当我着手开始时并没有一个设计方案,我只是开始编码,然后看到开发的结果,突然,我就完成了。”
Third, this code really uses the underlying language to good effect. Pointers can be mis-used, of course, but here they are used to create compact expressions that naturally express the extraction of individual characters and advancing to the next character. The same effect can be achieved by array indexing or substrings, but in this code, pointers do a better job, especially when coupled with C idioms for auto-increment and implicit conversion of truth values. I don't know of another piece of code that does so much in so few lines, while providing such a rich source of insight and further ideas. 译者信息 [url=][/url]译者信息 eastasiasnow 翻译于 2年前0人 [url=]顶[/url] 此译文 最后,这段代码使用基础的编程语言获得好的效果。当然,指针可能被误用,但在这里,它们用来创建紧凑的表达式。这些表达式自然的展示了单个字符的提取并推移到下一个字符。同样的效果可通过数组索引或子串的方法获得,但在该段代码中,尤其是在与C的自增惯用法和明确的真值转换联合使用时,指针要技高一筹。 我没听说过存在另一段代码能以如此少行数完成这么多的功能,与此同时,提供如此丰富的见解和进一步想法。
本帖最后由 247153481 于 2015-11-10 19:44 编辑

回复评论

暂无评论,赶紧抢沙发吧
电子工程世界版权所有 京B2-20211791 京ICP备10001474号-1 京公网安备 11010802033920号
    写回复