merge prs

Ilya Kantor 2019-03-02 23:31:11 +03:00
parent 7888439420
commit d043b11516
26 changed files with 259 additions and 139 deletions


@ -275,7 +275,7 @@ In this example `count` is found on step `2`. When an outer variable is modifie
Here are two questions to consider: Here are two questions to consider:
1. Can we somehow reset the `counter` from the code that doesn't belong to `makeCounter`? E.g. after `alert` calls in the example above. 1. Can we somehow reset the counter `count` from the code that doesn't belong to `makeCounter`? E.g. after `alert` calls in the example above.
2. If we call `makeCounter()` multiple times -- it returns many `counter` functions. Are they independent or do they share the same `count`? 2. If we call `makeCounter()` multiple times -- it returns many `counter` functions. Are they independent or do they share the same `count`?
Try to answer them before you continue reading. Try to answer them before you continue reading.
@ -286,8 +286,8 @@ All done?
Okay, let's go over the answers. Okay, let's go over the answers.
1. There is no way. The `counter` is a local function variable, we can't access it from the outside. 1. There is no way: `count` is a local function variable, we can't access it from the outside.
2. For every call to `makeCounter()` a new function Lexical Environment is created, with its own `counter`. So the resulting `counter` functions are independent. 2. For every call to `makeCounter()` a new function Lexical Environment is created, with its own `count`. So the resulting `counter` functions are independent.
Here's the demo: Here's the demo:


@ -1,2 +0,0 @@
# Miscellaneous


@ -1,2 +0,0 @@
# Miscellaneous


@ -103,7 +103,7 @@ There are only 5 of them in JavaScript:
: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>. : Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
`y` `y`
: Sticky mode (covered in the [next chapter](info:regexp-methods#y-flag)) : Sticky mode (covered in the chapter <info:regexp-sticky>)
We'll cover all these flags further in the tutorial. We'll cover all these flags further in the tutorial.


@ -0,0 +1,99 @@
# Lookahead and lookbehind
Sometimes we need to match a pattern only if followed by another pattern. For instance, we'd like to get the price from a string like `subject:1 turkey costs 30€`.
We need a number (let's say a price has no decimal point) followed by the `subject:€` sign.
That's what lookahead is for.
## Lookahead
The syntax is: `pattern:x(?=y)`, it means "match `pattern:x` only if followed by `pattern:y`".
The euro sign is often written after the amount, so the regexp will be `pattern:\d+(?=€)` (assuming the price has no decimal point):
```js run
let str = "1 turkey costs 30€";
alert( str.match(/\d+(?=€)/) ); // 30 (correctly skipped the sole number 1)
```
Or, if we wanted a quantity, then a negative lookahead can be applied.
The syntax is: `pattern:x(?!y)`, it means "match `pattern:x` only if not followed by `pattern:y`".
```js run
let str = "2 turkeys cost 60€";
alert( str.match(/\d+(?!€)/) ); // 2 (correctly skipped the price)
```
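One caveat is worth checking by hand: a negative lookahead only tests what comes right after the current match, so `pattern:\d+` is free to backtrack and match a shorter number. Below is a minimal sketch; the string `60€` and the stricter pattern `pattern:\d+(?![\d€])` are illustrative additions, not part of the examples above.

```js
// With only a price in the string, \d+ backtracks to "6":
// after "6" comes "0", which is not "€", so the lookahead passes.
const priceOnly = "60€";
const loose = priceOnly.match(/\d+(?!€)/);
console.log(loose[0]); // 6

// Forbidding a following digit as well makes the number match as a whole or not at all.
const strict = /\d+(?![\d€])/;
console.log(priceOnly.match(strict)); // null
console.log("2 turkeys cost 60€".match(strict)[0]); // 2
```

Whether the stricter form is needed depends on the input; for the strings used above both patterns find the quantity.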
## Lookbehind
Lookbehind allows matching a pattern only if there's something before it.
The syntax is:
- Positive lookbehind: `pattern:(?<=y)x`, matches `pattern:x`, but only if it follows after `pattern:y`.
- Negative lookbehind: `pattern:(?<!y)x`, matches `pattern:x`, but only if there's no `pattern:y` before.
For example, let's change the price to US dollars. The dollar sign is usually before the number, so to look for `$30` we'll use `pattern:(?<=\$)\d+`:
```js run
let str = "1 turkey costs $30";
alert( str.match(/(?<=\$)\d+/) ); // 30 (correctly skipped the sole number 1)
```
And for the quantity let's use a negative lookbehind `pattern:(?<!\$)\d+`:
```js run
let str = "2 turkeys cost $60";
alert( str.match(/(?<!\$)\d+/) ); // 2 (correctly skipped the price)
```
## Capture groups
Generally, what's inside the lookaround (a common name for both lookahead and lookbehind) parentheses does not become a part of the match.
But if we want to capture something, that's doable. We just need to wrap it in additional parentheses.
For instance, here the currency `pattern:(€|kr)` is captured, along with the amount:
```js run
let str = "1 turkey costs 30€";
let reg = /\d+(?=(€|kr))/;
alert( str.match(reg) ); // 30, €
```
And here's the same for lookbehind:
```js run
let str = "1 turkey costs $30";
let reg = /(?<=(\$|£))\d+/;
alert( str.match(reg) ); // 30, $
```
Please note that for lookbehind the order stays the same, even though the lookbehind parentheses come before the main pattern.
The result array always begins with the full match, and the contents of capturing groups follow it. So the match for `pattern:\d+` goes in the result first, and then the one for `pattern:(\$|£)`.
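The order of items in the result can be checked by reading them one by one. This is a small sketch using the same string and pattern as above; the variable names are only for illustration.

```js
const priceStr = "1 turkey costs $30";
const result = priceStr.match(/(?<=(\$|£))\d+/);

console.log(result[0]);    // 30 (the full match comes first)
console.log(result[1]);    // $  (the content of the capturing group)
console.log(result.index); // 16 (position of "30" in the string)
```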
## Summary
Lookahead and lookbehind (commonly referred to as "lookaround") are useful for simple regular expressions, when we'd like to match something depending on the context before/after it without taking that context into the match.
Sometimes we can do the same manually, that is: match everything and filter by context in a loop. Remember, `str.matchAll` and `reg.exec` return matches with the `.index` property, so we know exactly where in the text each match is. But generally regular expressions can do it better.
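For comparison, here is a rough sketch of the manual approach: match every number, then use `.index` to look at the character right after it. The variable names are made up for the example.

```js
const text = "2 turkeys cost 60€";

// Match every number, then keep only those directly followed by "€".
const prices = [];
for (const match of text.matchAll(/\d+/g)) {
  const end = match.index + match[0].length; // position right after the number
  if (text[end] === "€") {
    prices.push(match[0]);
  }
}

console.log(prices); // [ '60' ]
```

This does the same job as `pattern:\d+(?=€)` with the `g` flag, at the cost of a few more lines.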
Lookaround types:
| Pattern            | Type             | Matches |
|--------------------|------------------|---------|
| `pattern:x(?=y)` | Positive lookahead | `x` if followed by `y` |
| `pattern:x(?!y)` | Negative lookahead | `x` if not followed by `y` |
| `pattern:(?<=y)x` | Positive lookbehind | `x` if after `y` |
| `pattern:(?<!y)x` | Negative lookbehind | `x` if not after `y` |
Lookahead can also be used to disable backtracking. Why that may be needed -- see in the next chapter.


@ -1,3 +0,0 @@
# Lookahead (in progress)
The article is under development, will be here when it's ready.


@ -6,7 +6,9 @@ Sooner or later most developers occasionally face such behavior.
The typical situation -- a regular expression works fine sometimes, but for certain strings it "hangs" consuming 100% of CPU. The typical situation -- a regular expression works fine sometimes, but for certain strings it "hangs" consuming 100% of CPU.
That may even be a vulnerability. For instance, if JavaScript is on the server, and it uses regular expressions to process user data, then such an input may cause denial of service. The author personally saw and reported such vulnerabilities even for well-known and widely used programs. In a web-browser it kills the page. Not a good thing for sure.
For server-side JavaScript it may become a vulnerability if it uses regular expressions to process user data. Bad input will make the process hang, causing denial of service. The author has personally seen and reported such vulnerabilities, even in very well-known and widely used programs.
So the problem is definitely worth to deal with. So the problem is definitely worth to deal with.
@ -35,23 +37,23 @@ To correctly handle such situations we need a more complex regular expression. I
1. For the `tag` name: `pattern:\w+`, 1. For the `tag` name: `pattern:\w+`,
2. For the `key` name: `pattern:\w+`, 2. For the `key` name: `pattern:\w+`,
3. And the `value` can be a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`. 3. And the `value`: a quoted string `pattern:"[^"]*"`.
If we substitute these into the pattern above, the full regexp is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. If we substitute these into the pattern above and throw in some optional spaces `pattern:\s`, the full regexp becomes: `pattern:<\w+(\s*\w+="[^"]*"\s*)*>`.
That doesn't yet support all details of HTML, for instance strings in 'single' quotes. But they could be added easily, let's keep the regexp simple for now. That regexp is not perfect! It doesn't yet support all details of HTML, for instance unquoted values, and there are other ways to improve, but let's not add complexity. It will demonstrate the problem for us.
Let's try it in action: The regexp seems to work:
```js run ```js run
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g; let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
let str='...<a test="<>" href="#">... <b>...'; let str='...<a test="<>" href="#">... <b>...';
alert( str.match(reg) ); // <a test="<>" href="#">, <b> alert( str.match(reg) ); // <a test="<>" href="#">, <b>
``` ```
Great, it works! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`. Great! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.
Now, that we've got a seemingly working solution, let's get to the infinite backtracking itself. Now, that we've got a seemingly working solution, let's get to the infinite backtracking itself.
@ -60,10 +62,10 @@ Now, that we've got a seemingly working solution, let's get to the infinite back
If you run our regexp on the input below, it may hang the browser (or another JavaScript host): If you run our regexp on the input below, it may hang the browser (or another JavaScript host):
```js run ```js run
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g; let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
let str = `<tag a=b a=b a=b a=b a=b a=b a=b a=b let str = `<tag a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`; a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"`;
*!* *!*
// The search will take a long, long time // The search will take a long, long time
@ -75,16 +77,16 @@ Some regexp engines can handle that search, but most of them can't.
What's the matter? Why a simple regular expression "hangs" on such a small string? What's the matter? Why a simple regular expression "hangs" on such a small string?
Let's simplify the situation by looking only for attributes. Let's simplify the regexp by stripping the tag name and the quotes. So that we look only for `key=value` attributes: `pattern:<(\s*\w+=\w+\s*)*>`.
Here we removed the tag and quoted strings from the regexp. Unfortunately, the regexp still hangs:
```js run ```js run
// only search for space-delimited attributes // only search for space-delimited attributes
let reg = /<(\s*\w+=\w+\s*)*>/g; let reg = /<(\s*\w+=\w+\s*)*>/g;
let str = `<a=b a=b a=b a=b a=b a=b a=b a=b let str = `<a=b a=b a=b a=b a=b a=b a=b a=b
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`; a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
*!* *!*
// the search will take a long, long time // the search will take a long, long time
@ -92,11 +94,9 @@ alert( str.match(reg) );
*/!* */!*
``` ```
The same problem persists. Here we end the demo of the problem and start looking into what's going on, why it hangs and how to fix it.
Here we end the demo of the problem and start looking into what's going on and why it hangs. ## Detailed example
## Backtracking
To make an example even simpler, let's consider `pattern:(\d+)*$`. To make an example even simpler, let's consider `pattern:(\d+)*$`.
@ -110,7 +110,7 @@ So what's wrong with the regexp?
First, one may notice that the regexp is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+$`. First, one may notice that the regexp is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+$`.
Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then return to the real-life examples. Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then the previous example will become obvious.
What happen during the search of `pattern:(\d+)*$` in the line `subject:123456789z`? What happen during the search of `pattern:(\d+)*$` in the line `subject:123456789z`?
@ -120,9 +120,9 @@ What happen during the search of `pattern:(\d+)*$` in the line `subject:12345678
\d+....... \d+.......
(123456789)z (123456789)z
``` ```
2. Then it tries to apply the star around the parentheses `pattern:(\d+)*`, but there are no more digits, so it the star doesn't give anything. 2. Then it tries to apply the star quantifier, but there are no more digits, so it the star doesn't give anything.
Then the pattern has the string end anchor `pattern:$`, and in the text we have `subject:z`. 3. Then the pattern expects to see the string end `pattern:$`, and in the text we have `subject:z`, so there's no match:
``` ```
X X
@ -130,17 +130,16 @@ What happen during the search of `pattern:(\d+)*$` in the line `subject:12345678
(123456789)z (123456789)z
``` ```
No match! 4. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions (backtracks).
3. There's no match, so the greedy quantifier `pattern:+` decreases the count of repetitions (backtracks).
Now `\d+` is not all digits, but all except the last one: Now `\d+` doesn't take all digits, but all except the last one:
``` ```
\d+....... \d+.......
(12345678)9z (12345678)9z
``` ```
4. Now the engine tries to continue the search from the new position (`9`). 5. Now the engine tries to continue the search from the new position (`9`).
The start `pattern:(\d+)*` can now be applied -- it gives the number `match:9`: The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`:
``` ```
@ -156,8 +155,8 @@ What happen during the search of `pattern:(\d+)*$` in the line `subject:12345678
(12345678)(9)z (12345678)(9)z
``` ```
There's no match, so the engine will continue backtracking.
5. Now the first number `pattern:\d+` will have 7 digits, and the rest of the string `subject:89` becomes the second `pattern:\d+`: 5. There's no match, so the engine will continue backtracking, decreasing the number of repetitions for `pattern:\d+` down to 7 digits. So the rest of the string `subject:89` becomes the second `pattern:\d+`:
``` ```
X X
@ -193,38 +192,41 @@ What happen during the search of `pattern:(\d+)*$` in the line `subject:12345678
The regular expression engine goes through all combinations of `123456789` and their subsequences. There are a lot of them, that's why it takes so long. The regular expression engine goes through all combinations of `123456789` and their subsequences. There are a lot of them, that's why it takes so long.
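To get a feeling for "a lot", we can count the ways a run of digits can be cut into consecutive non-empty chunks, each chunk being one repetition of `pattern:\d+`. The function below is a simplified model written for this illustration (`countSplits` is a hypothetical helper): it counts only splits of the whole run, so the real number of engine steps differs in detail, but it grows in the same exponential way.

```js
// Counts the ways to cut a run of n digits into consecutive non-empty chunks,
// i.e. the ways (\d+)* can cover the whole run.
function countSplits(n) {
  const ways = [1]; // ways[0] = 1: an empty run has exactly one (empty) split
  for (let len = 1; len <= n; len++) {
    let total = 0;
    // the first chunk takes `first` digits,
    // the remaining digits can be split in ways[len - first] ways
    for (let first = 1; first <= len; first++) {
      total += ways[len - first];
    }
    ways[len] = total;
  }
  return ways[n];
}

console.log(countSplits(9));  // 256
console.log(countSplits(20)); // 524288
```

Each extra digit doubles the count, which is why a string only a few dozen characters long is already too much.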
A smart guy can say here: "Backtracking? Let's turn on the lazy mode -- and no more backtracking!". What to do?
Let's replace `pattern:\d+` with `pattern:\d+?` and see if it works (careful, can hang the browser) Should we turn on the lazy mode?
Unfortunately, it doesn't: if we replace `pattern:\d+` with `pattern:\d+?`, that still hangs:
```js run ```js run
// sloooooowwwwww // sloooooowwwwww
alert( '12345678901234567890123456789123456789z'.match(/(\d+?)*$/) ); alert( '12345678901234567890123456789123456789z'.match(/(\d+?)*$/) );
``` ```
No, it doesn't. Lazy quantifiers actually do the same, but in the reverse order.
Lazy quantifiers actually do the same, but in the reverse order. Just think about how the search engine would work in this case. Just think about how the search engine would work in this case.
Some regular expression engines have tricky built-in checks to detect infinite backtracking or other means to work around them, but there's no universal solution. Some regular expression engines have tricky built-in checks to detect infinite backtracking or other means to work around them, but there's no universal solution.
## Back to tags
In the example above, when we search `pattern:<(\s*\w+=\w+\s*)*>` in the string `subject:<a=b a=b a=b a=b` -- the similar thing happens. In the example above, when we search `pattern:<(\s*\w+=\w+\s*)*>` in the string `subject:<a=b a=b a=b a=b` -- the similar thing happens.
The string has no `>` at the end, so the match is impossible, but the regexp engine does not know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`: The string has no `>` at the end, so the match is impossible, but the regexp engine doesn't know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`:
``` ```
(a=b a=b a=b) (a=b) (a=b a=b a=b) (a=b)
(a=b a=b) (a=b a=b) (a=b a=b) (a=b a=b)
(a=b) (a=b a=b a=b)
... ...
``` ```
## How to fix? ## How to fix?
The problem -- too many variants in backtracking even if we don't need them. The backtracking checks many variants that are an obvious fail for a human.
For instance, in the pattern `pattern:(\d+)*$` we (people) can easily see that `pattern:(\d+)` does not need to backtrack. For instance, in the pattern `pattern:(\d+)*$` a human can easily see that `pattern:(\d+)*` does not need to backtrack `pattern:+`. There's no difference between one or two `\d+`:
Decreasing the count of `pattern:\d+` can not help to find a match, there's no matter between these two:
``` ```
\d+........ \d+........
@ -234,40 +236,58 @@ Decreasing the count of `pattern:\d+` can not help to find a match, there's no m
(1234)(56789)z (1234)(56789)z
``` ```
Let's get back to more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can). There's no need in backtracking here. Let's get back to more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can).
In other words, if it found many `name=value` pairs and then can't find `>`, then there's no need to decrease the count of repetitions. Even if we match one pair less, it won't give us the closing `>`: What we would like to do is to forbid backtracking.
There's totally no need to decrease the number of repetitions.
In other words, if it found three `name=value` pairs and then can't find `>` after them, then there's no need to decrease the count of repetitions. There is definitely no `>` after the first two: that's where the third `name=value` pair, the one we backtracked from, is:
```
(name=value) name=value
```
Modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy, but don't backtrack at all. Pretty simple, they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbid backtracking inside parentheses. Modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy, but don't backtrack at all. Pretty simple, they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbid backtracking inside parentheses.
Unfortunately, but both these features are not supported by JavaScript. Unfortunately, but both these features are not supported by JavaScript.
Although we can get a similar affect using lookahead. There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups). ### Lookahead to the rescue
We can forbid backtracking using lookahead.
The pattern to take as much repetitions as possible without backtracking is: `pattern:(?=(a+))\1`. The pattern to take as much repetitions as possible without backtracking is: `pattern:(?=(a+))\1`.
In other words, the lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position. And then they are "consumed into the result" by the backreference `pattern:\1`. In other words:
- The lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position.
- And then they are "consumed into the result" by the backreference `pattern:\1` (`pattern:\1` corresponds to the content of the second parentheses, that is `pattern:a+`).
There will be no backtracking, because lookahead does not backtrack. If it found like 5 times of `pattern:a+` and the further match failed, then it doesn't go back to 4. There will be no backtracking, because lookahead does not backtrack. If it found like 5 times of `pattern:a+` and the further match failed, then it doesn't go back to 4.
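The difference is easy to observe on a tiny input. In this sketch (the string `aaab` and the trailing `ab` are chosen only for the demonstration) an ordinary greedy `pattern:a+` gives one character back so that the rest can match, while the lookahead variant never does:

```js
const sample = "aaab";

// Ordinary greedy quantifier: a+ first takes "aaa", then gives one "a" back
// so that the trailing "ab" can match.
console.log(sample.match(/a+ab/)[0]); // aaab

// Lookahead + backreference: the captured run of "a" is not given back,
// so the trailing "ab" can never match.
console.log(sample.match(/(?=(a+))\1ab/)); // null
```

Here the missing match is the point: the pattern commits to the longest run of `a` at every position it tries.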
```smart
There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups).
```
So this trick makes the problem disappear.
Let's fix the regexp for a tag with attributes from the beginning of the chapter`pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs: Let's fix the regexp for a tag with attributes from the beginning of the chapter`pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs:
```js run ```js run
// regexp to search name=value // regexp to search name=value
let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/ let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/
// use it inside the regexp for tag // use new RegExp to nicely insert its source into (?=(a+))\1
let reg = new RegExp('<\\w+(?=(' + attrReg.source + '*))\\1>', 'g'); let fixedReg = new RegExp(`<\\w+(?=(${attrReg.source}*))\\1>`, 'g');
let good = '...<a test="<>" href="#">... <b>...'; let goodInput = '...<a test="<>" href="#">... <b>...';
let bad = `<tag a=b a=b a=b a=b a=b a=b a=b a=b let badInput = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`; a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
alert( good.match(reg) ); // <a test="<>" href="#">, <b> alert( goodInput.match(fixedReg) ); // <a test="<>" href="#">, <b>
alert( bad.match(reg) ); // null (no results, fast!) alert( badInput.match(fixedReg) ); // null (no results, fast!)
``` ```
Great, it works! We found a long tag `match:<a test="<>" href="#">` and a small one `match:<b>` and didn't hang the engine. Great, it works! We found both a long tag `match:<a test="<>" href="#">` and a small one `match:<b>`, and (!) didn't hang the engine on the bad input.
Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another. Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another.
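A short sketch of the same idea in isolation; the two small patterns and their names are invented for the example:

```js
const keyPart = /\w+/;
const valuePart = /"[^"]*"/;

// .source holds the pattern text without the slashes and flags
console.log(keyPart.source); // \w+

// Combine both sources into one regexp with the "g" flag
const pairReg = new RegExp(`${keyPart.source}=${valuePart.source}`, "g");
console.log('a="1" b="2"'.match(pairReg)); // [ 'a="1"', 'b="2"' ]
```

Building the pattern from `.source` strings avoids escaping every backslash a second time, which a plain string literal would require.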


@ -1,5 +1,5 @@
# Unicode: flag "u", character properties "\\p" # Unicode: flag "u"
The unicode flag `/.../u` enables the correct support of surrogate pairs. The unicode flag `/.../u` enables the correct support of surrogate pairs.
@ -87,81 +87,3 @@ Using the `u` flag makes it work right:
```js run ```js run
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴 alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
``` ```
## Unicode character properies
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by Javascript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" character belongs to, and a variety of technical details.
In regular expressions these can be set by `\p{…}`. And there must be flag `'u'`.
For instance, `\p{Letter}` denotes a letter in any of language. We can also use `\p{L}`, as `L` is an alias of `Letter`, there are shorter aliases for almost every property.
Here's the main tree of properties:
- Letter `L`:
- lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo`
- Number `N`:
- decimal digit `Nd`, letter number `Nl`, other `No`:
- Punctuation `P`:
- connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po`
- Mark `M` (accents etc):
- spacing combining `Mc`, enclosing `Me`, non-spacing `Mn`
- Symbol `S`:
- currency `Sc`, modifier `Sk`, math `Sm`, other `So`
- Separator `Z`:
- line `Zl`, paragraph `Zp`, space `Zs`
- Other `C`:
- control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`.
```smart header="More information"
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
```
There are also other derived categories, like:
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. roman numbers Ⅻ), plus some other symbols `Other_Alphabetic` (`OAltpa`).
- `Hex_Digit` includes hexadimal digits: `0-9`, `a-f`.
- ...Unicode is a big beast, it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
```js run
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is requireds
alert("color: #123ABC".match(reg)); // 123ABC
```
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)").
To search for certain scripts, we should supply `Script=<value>`, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc.
### Universal \w
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.
```
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
```
Let's decipher. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`.
So the character set includes:
- `Alphabetic` for letters,
- `Mark` for accents, as in Unicode accents may be represented by separate code points,
- `Decimal_Number` for numbers,
- `Connector_Punctuation` for the `'_'` character and alike,
- `Join_Control` - two special code points with hex codes `200c` and `200d`, used in ligatures e.g. in arabic.
Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)):
```js run
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
```


@ -0,0 +1,86 @@
# Unicode character properties \p
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
In regular expressions these can be matched with `\p{…}`, and the flag `'u'` is required.
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`; there are shorter aliases for almost every property.
Here's the main tree of properties:
- Letter `L`:
- lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo`
- Number `N`:
- decimal digit `Nd`, letter number `Nl`, other `No`
- Punctuation `P`:
- connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po`
- Mark `M` (accents etc):
- spacing combining `Mc`, enclosing `Me`, non-spacing `Mn`
- Symbol `S`:
- currency `Sc`, modifier `Sk`, math `Sm`, other `So`
- Separator `Z`:
- line `Zl`, paragraph `Zp`, space `Zs`
- Other `C`:
- control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`.
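A few of these categories in action. This is only a sketch; the sample string is made up for the illustration:

```js
const sampleText = "Hello Привет, 30€ or $35";

console.log(sampleText.match(/\p{L}+/gu));  // [ 'Hello', 'Привет', 'or' ]
console.log(sampleText.match(/\p{Lu}/gu));  // [ 'H', 'П' ]
console.log(sampleText.match(/\p{Nd}+/gu)); // [ '30', '35' ]
console.log(sampleText.match(/\p{Sc}/gu));  // [ '€', '$' ]
```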
```smart header="More information"
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
```
There are also other derived categories, like:
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. Roman numerals like Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`.
- ...Unicode is a big beast, it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
```js run
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
alert("color: #123ABC".match(reg)); // 123ABC
```
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
To search for certain scripts, we should supply `Script=<value>`, e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc:
```js run
let regexp = /\p{sc=Han}+/gu; // get Chinese words
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // 你好
```
## Building multi-language \w
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.
```js
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
```
Let's decipher. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`.
So the character set includes:
- `Alphabetic` for letters,
- `Mark` for accents, as in Unicode accents may be represented by separate code points,
- `Decimal_Number` for numbers,
- `Connector_Punctuation` for the `'_'` character and alike,
- `Join_Control` - two special code points with hex codes `200c` and `200d`, used in ligatures e.g. in Arabic.
Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)):
```js run
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
```


@ -1,5 +1,5 @@
# "Sticky" flag `y`, searching at position [#y-flag] # Sticky flag "y", searching at position
To grasp the use case of `y` flag, and see how great it is, let's explore a practical use case. To grasp the use case of `y` flag, and see how great it is, let's explore a practical use case.