From 7888439420ea51d275950f95c3ca7bda6c42ed29 Mon Sep 17 00:00:00 2001
From: Ilya Kantor
Date: Sat, 2 Mar 2019 12:17:42 +0300
Subject: [PATCH] regexp draft

---
 .../11-regexp-alternation/article.md | 29 ++++++-----------
 .../12-regexp-anchors/article.md | 2 +-
 .../article.md | 31 ++++++++++---------
 .../20-regexp-unicode/article.md | 21 ++++++++++---
 4 files changed, 42 insertions(+), 41 deletions(-)

diff --git a/5-regular-expressions/11-regexp-alternation/article.md b/5-regular-expressions/11-regexp-alternation/article.md
index a01eac0f..0caa6de0 100644
--- a/5-regular-expressions/11-regexp-alternation/article.md
+++ b/5-regular-expressions/11-regexp-alternation/article.md
@@ -20,46 +20,35 @@ alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript'
 We already know a similar thing -- square brackets. They allow to choose between multiple character, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
-Alternation works not on a character level, but on expression level. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
+Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
 For instance:
 - `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
-- `pattern:gra|ey` means "gra" or "ey".
+- `pattern:gra|ey` means `match:gra` or `match:ey`.
 To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
 ## Regexp for time
-In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time.
+In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (99 seconds is valid, but shouldn't be).
 How can we make a better one?
-We can apply more careful matching:
+We can apply more careful matching. First, the hours:
-- The first digit must be `0` or `1` followed by any digit.
-- Or `2` followed by `pattern:[0-3]`
+- If the first digit is `0` or `1`, then the next digit can by anything.
+- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`.
 As a regexp: `pattern:[01]\d|2[0-3]`.
-Then we can add a colon and the minutes part.
-
-The minutes must be from `0` to `59`, in the regexp language that means the first digit `pattern:[0-5]` followed by any other digit `\d`.
+Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit.
 Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
-We're almost done, but there's a problem. The alternation `|` is between the `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`. That's wrong, because it will match either the left or the right pattern:
+We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`.
-
-```js run
-let reg = /[01]\d|2[0-3]:[0-5]\d/g;
-
-alert("12".match(reg)); // 12 (matched [01]\d)
-```
-
-That's rather obvious, but still an often mistake when starting to work with regular expressions.
-
-We need to add parentheses to apply alternation exactly to hours: `[01]\d` OR `2[0-3]`.
+That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions.
 The correct variant:
diff --git a/5-regular-expressions/12-regexp-anchors/article.md b/5-regular-expressions/12-regexp-anchors/article.md
index b4981e09..0c2dd578 100644
--- a/5-regular-expressions/12-regexp-anchors/article.md
+++ b/5-regular-expressions/12-regexp-anchors/article.md
@@ -18,7 +18,7 @@ The pattern `pattern:^Mary` means: "the string start and then Mary".
 Now let's test whether the text ends with an email.
-To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`. It's not perfect, but mostly works.
+To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
 To test whether the string ends with the email, let's add `pattern:$` to the pattern:
diff --git a/5-regular-expressions/15-regexp-infinite-backtracking-problem/article.md b/5-regular-expressions/15-regexp-infinite-backtracking-problem/article.md
index cdb5f40e..90c1e2fb 100644
--- a/5-regular-expressions/15-regexp-infinite-backtracking-problem/article.md
+++ b/5-regular-expressions/15-regexp-infinite-backtracking-problem/article.md
@@ -10,7 +10,7 @@ That may even be a vulnerability. For instance, if JavaScript is on the server,
 So the problem is definitely worth to deal with.
-## Example
+## Introductin
 The plan will be like this:
@@ -24,23 +24,22 @@ We want to find all tags, with or without attributes -- like `subject:" href="#">` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
-Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` inside an attribute.
+Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if inside an attribute.
 ```js run
 // the match doesn't reach the end of the tag - wrong!
 alert( ''.match(/<[^>]+>/) ); // `:
+1. For the `tag` name: `pattern:\w+`,
+2. For the `key` name: `pattern:\w+`,
+3. And the `value` can be a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
-1. `pattern:<\w+` -- is the tag start,
-2. `pattern:(\s*\w+=(\w+|"[^"]*")\s*)*` -- is an arbitrary number of pairs `word=value`, where the value can be either a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
+If we substitute these into the pattern above, the full regexp is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`.
-That doesn't yet support few details of HTML grammar, for instance strings in 'single' quotes, but they can be added later, so that's somewhat close to real life. For now we want the regexp to be simple.
+That doesn't yet support all details of HTML, for instance strings in 'single' quotes. But they could be added easily, let's keep the regexp simple for now.
 Let's try it in action:
@@ -54,9 +53,11 @@ alert( str.match(reg) ); // ,
 Great, it works! It found both the long tag `match:` and the short one `match:`.
-Now let's see the problem.
+Now, that we've got a seemingly working solution, let's get to the infinite backtracking itself.
-If you run the example below, it may hang the browser (or whatever JavaScript engine runs):
+## Infinite backtracking
+
+If you run our regexp on the input below, it may hang the browser (or another JavaScript host):
 ```js run
 let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;
@@ -65,18 +66,18 @@ let str = `. ```
-There are also other derived categories, like `Alphabetic` (`Alpha`), that includes Letters `L`, plus letter numbers `Nl`, plus some other symbols `Other_Alphabetic` (`OAltpa`).
+There are also other derived categories, like:
+- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. roman numbers Ⅻ), plus some other symbols `Other_Alphabetic` (`OAltpa`).
+- `Hex_Digit` includes hexadimal digits: `0-9`, `a-f`.
+- ...Unicode is a big beast, it includes a lot of properties.
-Unicode is a big beast, it includes a lot of properties.
+For instance, let's look for a 6-digit hex number:
-One of properties is `Script` (`sc`), a collection of letters and other written signs used to represent textual information in one or more writing systems. There are about 150 scripts, including Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
+```js run
+let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is requireds
-The `Script` property needs a value, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`.
+alert("color: #123ABC".match(reg)); // 123ABC
+```
+
+There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
+
+To search for certain scripts, we should supply `Script=`, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc.
+
+### Universal \w
 Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.