regexp draft

Ilya Kantor 2019-03-02 12:17:42 +03:00
parent 65184edf76
commit 7888439420
4 changed files with 42 additions and 41 deletions

@@ -20,46 +20,35 @@ alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript'
 We already know a similar thing -- square brackets. They allow to choose between multiple character, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
-Alternation works not on a character level, but on expression level. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
+Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
 For instance:
 - `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
-- `pattern:gra|ey` means "gra" or "ey".
+- `pattern:gra|ey` means `match:gra` or `match:ey`.
 To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
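The difference between the grouped and ungrouped alternation in the list above can be checked directly. A minimal sketch, assuming a sample string chosen for illustration and `console.log` in place of the tutorial's `alert`:

```js
// The sample text is an assumption made for this sketch.
const sample = "gray grey";

// Parentheses limit the alternation to a|e, so this behaves like gr[ae]y.
const grouped = sample.match(/gr(a|e)y/g);

// Without parentheses the alternatives are the whole "gra" and the whole "ey".
const ungrouped = sample.match(/gra|ey/g);

console.log(grouped);   // [ 'gray', 'grey' ]
console.log(ungrouped); // [ 'gra', 'ey' ]
```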
 ## Regexp for time
-In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time.
+In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (99 seconds is valid, but shouldn't be).
 How can we make a better one?
-We can apply more careful matching:
+We can apply more careful matching. First, the hours:
-- The first digit must be `0` or `1` followed by any digit.
-- Or `2` followed by `pattern:[0-3]`
+- If the first digit is `0` or `1`, then the next digit can by anything.
+- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`.
 As a regexp: `pattern:[01]\d|2[0-3]`.
-Then we can add a colon and the minutes part.
+Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit.
-The minutes must be from `0` to `59`, in the regexp language that means the first digit `pattern:[0-5]` followed by any other digit `\d`.
 Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
-We're almost done, but there's a problem. The alternation `|` is between the `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`. That's wrong, because it will match either the left or the right pattern:
+We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`.
+That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions.
-```js run
-let reg = /[01]\d|2[0-3]:[0-5]\d/g;
-alert("12".match(reg)); // 12 (matched [01]\d)
-```
-That's rather obvious, but still an often mistake when starting to work with regular expressions.
-We need to add parentheses to apply alternation exactly to hours: `[01]\d` OR `2[0-3]`.
 The correct variant:
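A sketch of how a parenthesized version might look, assuming the fix is simply to wrap the hours alternation in a group. The sample strings and variable names are illustrative, and `console.log` stands in for the tutorial's `alert`:

```js
// Unparenthesized: the alternatives are [01]\d and 2[0-3]:[0-5]\d.
const loose = /[01]\d|2[0-3]:[0-5]\d/g;

// Parenthesized: the alternation covers only the hours.
const strict = /([01]\d|2[0-3]):[0-5]\d/g;

// Illustrative input mixing valid and invalid times.
const times = "00:00 10:10 23:59 25:99 1:2";

const looseOnHours = "12".match(loose);   // a bare "12" satisfies [01]\d
const strictOnHours = "12".match(strict); // no colon and minutes, so no match
const found = times.match(strict);

console.log(looseOnHours);  // [ '12' ]
console.log(strictOnHours); // null
console.log(found);         // [ '00:00', '10:10', '23:59' ]
```

Here `25:99` and `1:2` are skipped because neither the hours group nor the two-digit minutes can match them.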

@@ -18,7 +18,7 @@ The pattern `pattern:^Mary` means: "the string start and then Mary".
 Now let's test whether the text ends with an email.
-To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`. It's not perfect, but mostly works.
+To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
 To test whether the string ends with the email, let's add `pattern:$` to the pattern:
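A minimal sketch of that check, assuming the email pattern quoted above with `pattern:$` appended. The test strings are made up for illustration, and `console.log` is used instead of `alert`:

```js
// The email pattern from the text with $ appended,
// so a match has to finish exactly at the end of the string.
const endsWithEmail = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}$/;

const atEnd = endsWithEmail.test("Write to my@mail.com");
const notAtEnd = endsWithEmail.test("my@mail.com is my address");

console.log(atEnd);    // true
console.log(notAtEnd); // false
```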

@@ -10,7 +10,7 @@ That may even be a vulnerability. For instance, if JavaScript is on the server,
 So the problem is definitely worth to deal with.
-## Example
+## Introductin
 The plan will be like this:
@@ -24,23 +24,22 @@ We want to find all tags, with or without attributes -- like `subject:<a href=".
 In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
-Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` inside an attribute.
+Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if inside an attribute.
 ```js run
 // the match doesn't reach the end of the tag - wrong!
 alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
 ```
-We need the whole tag.
 To correctly handle such situations we need a more complex regular expression. It will have the form `pattern:<tag (key=value)*>`.
-In the regexp language that is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`:
+1. For the `tag` name: `pattern:\w+`,
+2. For the `key` name: `pattern:\w+`,
+3. And the `value` can be a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
-1. `pattern:<\w+` -- is the tag start,
-2. `pattern:(\s*\w+=(\w+|"[^"]*")\s*)*` -- is an arbitrary number of pairs `word=value`, where the value can be either a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
+If we substitute these into the pattern above, the full regexp is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`.
-That doesn't yet support few details of HTML grammar, for instance strings in 'single' quotes, but they can be added later, so that's somewhat close to real life. For now we want the regexp to be simple.
+That doesn't yet support all details of HTML, for instance strings in 'single' quotes. But they could be added easily, let's keep the regexp simple for now.
 Let's try it in action:
@@ -54,9 +53,11 @@ alert( str.match(reg) ); // <a test="<>" href="#">, <b>
 Great, it works! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.
-Now let's see the problem.
+Now, that we've got a seemingly working solution, let's get to the infinite backtracking itself.
-If you run the example below, it may hang the browser (or whatever JavaScript engine runs):
+## Infinite backtracking
+If you run our regexp on the input below, it may hang the browser (or another JavaScript host):
 ```js run
 let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;
@@ -65,18 +66,18 @@ let str = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
 a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
 *!*
-// The search will take a long long time
+// The search will take a long, long time
 alert( str.match(reg) );
 */!*
 ```
-Some regexp engines can handle that search, but most of them don't.
+Some regexp engines can handle that search, but most of them can't.
-What's the matter? Why a simple regular expression on such a small string "hangs"?
+What's the matter? Why a simple regular expression "hangs" on such a small string?
-Let's simplify the situation by removing the tag and quoted strings.
+Let's simplify the situation by looking only for attributes.
-Here we look only for attributes:
+Here we removed the tag and quoted strings from the regexp.
 ```js run
 // only search for space-delimited attributes

@@ -92,7 +92,7 @@ alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
 [Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by Javascript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" character belongs to, and a variety of technical details.
-In regular expressions these can be set by `\p{…}`.
+In regular expressions these can be set by `\p{…}`. And there must be flag `'u'`.
 For instance, `\p{Letter}` denotes a letter in any of language. We can also use `\p{L}`, as `L` is an alias of `Letter`, there are shorter aliases for almost every property.
@@ -121,13 +121,24 @@ You could also explore properties at [Character Property Index](http://unicode.o
 For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
 ```
-There are also other derived categories, like `Alphabetic` (`Alpha`), that includes Letters `L`, plus letter numbers `Nl`, plus some other symbols `Other_Alphabetic` (`OAltpa`).
+There are also other derived categories, like:
+- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. roman numbers Ⅻ), plus some other symbols `Other_Alphabetic` (`OAltpa`).
+- `Hex_Digit` includes hexadimal digits: `0-9`, `a-f`.
+- ...Unicode is a big beast, it includes a lot of properties.
-Unicode is a big beast, it includes a lot of properties.
+For instance, let's look for a 6-digit hex number:
-One of properties is `Script` (`sc`), a collection of letters and other written signs used to represent textual information in one or more writing systems. There are about 150 scripts, including Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
-The `Script` property needs a value, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`.
+```js run
+let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is requireds
+alert("color: #123ABC".match(reg)); // 123ABC
+```
+There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
+To search for certain scripts, we should supply `Script=<value>`, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc.
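The `Script=<value>` form can be tried on a mixed-script string. A minimal sketch, assuming a runtime that supports Unicode property escapes; the sample string is an assumption, and `console.log` replaces the tutorial's `alert`:

```js
// Illustrative input: a Cyrillic word, a Han word and a Latin word.
const text = "Привет 你好 hello";

// \p{...} only works with the 'u' flag; sc is the short alias of Script.
const cyrillic = text.match(/\p{sc=Cyrillic}+/gu);
const han = text.match(/\p{sc=Han}+/gu);

console.log(cyrillic); // [ 'Привет' ]
console.log(han);      // [ '你好' ]
```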
+### Universal \w
 Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.
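One commonly cited construction, following the Unicode regular-expression guidelines, builds such a class from `Alphabetic`, `Mark`, `Decimal_Number`, `Connector_Punctuation` and `Join_Control`. A minimal sketch under that assumption; it may not be exactly the pattern the chapter goes on to use, the sample string is illustrative, and a runtime with Unicode property escapes is assumed:

```js
// Assumed "universal word character" class, built from Unicode properties.
const word = /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/gu;

// Illustrative input mixing Cyrillic, Han, digits and Latin.
const mixed = "Привет_你好 123 hello";

const words = mixed.match(word);
// For comparison: plain \w covers only Latin letters, digits and underscore.
const asciiOnly = "Привет".match(/\w+/);

console.log(words);     // [ 'Привет_你好', '123', 'hello' ]
console.log(asciiOnly); // null
```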