From d043b115166e3257ad37a757a5b39bd0ef78fbbd Mon Sep 17 00:00:00 2001 From: Ilya Kantor Date: Sat, 2 Mar 2019 23:31:11 +0300 Subject: [PATCH] merge prs --- .../03-closure/article.md | 6 +- 1-js/99-js-miscellaneous/index.md | 2 - 2-ui/99-ui-miscellaneous/index.md | 2 - .../01-popup-windows/article.md | 0 .../03-cross-window-communication/article.md | 0 .../postmessage.view/iframe.html | 0 .../postmessage.view/index.html | 0 .../sandbox.view/index.html | 0 .../sandbox.view/sandboxed.html | 0 .../06-clickjacking/article.md | 0 .../clickjacking-visible.view/facebook.html | 0 .../clickjacking-visible.view/index.html | 0 .../clickjacking.view/facebook.html | 0 .../clickjacking.view/index.html | 0 .../protector.view/iframe.html | 0 .../06-clickjacking/protector.view/index.html | 0 .../top-location.view/iframe.html | 0 .../top-location.view/index.html | 0 .../index.md | 0 .../01-regexp-introduction/article.md | 2 +- .../14-regexp-lookahead-lookbehind/article.md | 99 +++++++++++++++ .../14-regexp-lookahead/article.md | 3 - .../article.md | 116 ++++++++++-------- .../20-regexp-unicode/article.md | 80 +----------- .../21-regexp-unicode-properties/article.md | 86 +++++++++++++ .../article.md | 2 +- 26 files changed, 259 insertions(+), 139 deletions(-) delete mode 100644 1-js/99-js-miscellaneous/index.md delete mode 100644 2-ui/99-ui-miscellaneous/index.md rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/01-popup-windows/article.md (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/03-cross-window-communication/article.md (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/03-cross-window-communication/postmessage.view/iframe.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/03-cross-window-communication/postmessage.view/index.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/03-cross-window-communication/sandbox.view/index.html (100%) rename {2-ui/5-frames-and-windows => 
4-frames-and-windows}/03-cross-window-communication/sandbox.view/sandboxed.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/article.md (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/clickjacking-visible.view/facebook.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/clickjacking-visible.view/index.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/clickjacking.view/facebook.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/clickjacking.view/index.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/protector.view/iframe.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/protector.view/index.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/top-location.view/iframe.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/06-clickjacking/top-location.view/index.html (100%) rename {2-ui/5-frames-and-windows => 4-frames-and-windows}/index.md (100%) create mode 100644 5-regular-expressions/14-regexp-lookahead-lookbehind/article.md delete mode 100644 5-regular-expressions/14-regexp-lookahead/article.md create mode 100644 5-regular-expressions/21-regexp-unicode-properties/article.md rename 5-regular-expressions/{21-regexp-sticky => 22-regexp-sticky}/article.md (98%) diff --git a/1-js/06-advanced-functions/03-closure/article.md b/1-js/06-advanced-functions/03-closure/article.md index 21408d78..ac0e44c6 100644 --- a/1-js/06-advanced-functions/03-closure/article.md +++ b/1-js/06-advanced-functions/03-closure/article.md @@ -275,7 +275,7 @@ In this example `count` is found on step `2`. When an outer variable is modifie Here are two questions to consider: -1. Can we somehow reset the `counter` from the code that doesn't belong to `makeCounter`? E.g. 
after `alert` calls in the example above. +1. Can we somehow reset the counter `count` from the code that doesn't belong to `makeCounter`? E.g. after `alert` calls in the example above. 2. If we call `makeCounter()` multiple times -- it returns many `counter` functions. Are they independent or do they share the same `count`? Try to answer them before you continue reading. @@ -286,8 +286,8 @@ All done? Okay, let's go over the answers. -1. There is no way. The `counter` is a local function variable, we can't access it from the outside. -2. For every call to `makeCounter()` a new function Lexical Environment is created, with its own `counter`. So the resulting `counter` functions are independent. +1. There is no way: `count` is a local function variable, we can't access it from the outside. +2. For every call to `makeCounter()` a new function Lexical Environment is created, with its own `count`. So the resulting `counter` functions are independent. Here's the demo: diff --git a/1-js/99-js-miscellaneous/index.md b/1-js/99-js-miscellaneous/index.md deleted file mode 100644 index 79cd72fe..00000000 --- a/1-js/99-js-miscellaneous/index.md +++ /dev/null @@ -1,2 +0,0 @@ - -# Miscellaneous diff --git a/2-ui/99-ui-miscellaneous/index.md b/2-ui/99-ui-miscellaneous/index.md deleted file mode 100644 index 79cd72fe..00000000 --- a/2-ui/99-ui-miscellaneous/index.md +++ /dev/null @@ -1,2 +0,0 @@ - -# Miscellaneous diff --git a/2-ui/5-frames-and-windows/01-popup-windows/article.md b/4-frames-and-windows/01-popup-windows/article.md similarity index 100% rename from 2-ui/5-frames-and-windows/01-popup-windows/article.md rename to 4-frames-and-windows/01-popup-windows/article.md diff --git a/2-ui/5-frames-and-windows/03-cross-window-communication/article.md b/4-frames-and-windows/03-cross-window-communication/article.md similarity index 100% rename from 2-ui/5-frames-and-windows/03-cross-window-communication/article.md rename to 
4-frames-and-windows/03-cross-window-communication/article.md diff --git a/2-ui/5-frames-and-windows/03-cross-window-communication/postmessage.view/iframe.html b/4-frames-and-windows/03-cross-window-communication/postmessage.view/iframe.html similarity index 100% rename from 2-ui/5-frames-and-windows/03-cross-window-communication/postmessage.view/iframe.html rename to 4-frames-and-windows/03-cross-window-communication/postmessage.view/iframe.html diff --git a/2-ui/5-frames-and-windows/03-cross-window-communication/postmessage.view/index.html b/4-frames-and-windows/03-cross-window-communication/postmessage.view/index.html similarity index 100% rename from 2-ui/5-frames-and-windows/03-cross-window-communication/postmessage.view/index.html rename to 4-frames-and-windows/03-cross-window-communication/postmessage.view/index.html diff --git a/2-ui/5-frames-and-windows/03-cross-window-communication/sandbox.view/index.html b/4-frames-and-windows/03-cross-window-communication/sandbox.view/index.html similarity index 100% rename from 2-ui/5-frames-and-windows/03-cross-window-communication/sandbox.view/index.html rename to 4-frames-and-windows/03-cross-window-communication/sandbox.view/index.html diff --git a/2-ui/5-frames-and-windows/03-cross-window-communication/sandbox.view/sandboxed.html b/4-frames-and-windows/03-cross-window-communication/sandbox.view/sandboxed.html similarity index 100% rename from 2-ui/5-frames-and-windows/03-cross-window-communication/sandbox.view/sandboxed.html rename to 4-frames-and-windows/03-cross-window-communication/sandbox.view/sandboxed.html diff --git a/2-ui/5-frames-and-windows/06-clickjacking/article.md b/4-frames-and-windows/06-clickjacking/article.md similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/article.md rename to 4-frames-and-windows/06-clickjacking/article.md diff --git a/2-ui/5-frames-and-windows/06-clickjacking/clickjacking-visible.view/facebook.html 
b/4-frames-and-windows/06-clickjacking/clickjacking-visible.view/facebook.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/clickjacking-visible.view/facebook.html rename to 4-frames-and-windows/06-clickjacking/clickjacking-visible.view/facebook.html diff --git a/2-ui/5-frames-and-windows/06-clickjacking/clickjacking-visible.view/index.html b/4-frames-and-windows/06-clickjacking/clickjacking-visible.view/index.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/clickjacking-visible.view/index.html rename to 4-frames-and-windows/06-clickjacking/clickjacking-visible.view/index.html diff --git a/2-ui/5-frames-and-windows/06-clickjacking/clickjacking.view/facebook.html b/4-frames-and-windows/06-clickjacking/clickjacking.view/facebook.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/clickjacking.view/facebook.html rename to 4-frames-and-windows/06-clickjacking/clickjacking.view/facebook.html diff --git a/2-ui/5-frames-and-windows/06-clickjacking/clickjacking.view/index.html b/4-frames-and-windows/06-clickjacking/clickjacking.view/index.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/clickjacking.view/index.html rename to 4-frames-and-windows/06-clickjacking/clickjacking.view/index.html diff --git a/2-ui/5-frames-and-windows/06-clickjacking/protector.view/iframe.html b/4-frames-and-windows/06-clickjacking/protector.view/iframe.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/protector.view/iframe.html rename to 4-frames-and-windows/06-clickjacking/protector.view/iframe.html diff --git a/2-ui/5-frames-and-windows/06-clickjacking/protector.view/index.html b/4-frames-and-windows/06-clickjacking/protector.view/index.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/protector.view/index.html rename to 4-frames-and-windows/06-clickjacking/protector.view/index.html diff --git 
a/2-ui/5-frames-and-windows/06-clickjacking/top-location.view/iframe.html b/4-frames-and-windows/06-clickjacking/top-location.view/iframe.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/top-location.view/iframe.html rename to 4-frames-and-windows/06-clickjacking/top-location.view/iframe.html diff --git a/2-ui/5-frames-and-windows/06-clickjacking/top-location.view/index.html b/4-frames-and-windows/06-clickjacking/top-location.view/index.html similarity index 100% rename from 2-ui/5-frames-and-windows/06-clickjacking/top-location.view/index.html rename to 4-frames-and-windows/06-clickjacking/top-location.view/index.html diff --git a/2-ui/5-frames-and-windows/index.md b/4-frames-and-windows/index.md similarity index 100% rename from 2-ui/5-frames-and-windows/index.md rename to 4-frames-and-windows/index.md diff --git a/5-regular-expressions/01-regexp-introduction/article.md b/5-regular-expressions/01-regexp-introduction/article.md index b7b31641..cc50ba8e 100644 --- a/5-regular-expressions/01-regexp-introduction/article.md +++ b/5-regular-expressions/01-regexp-introduction/article.md @@ -103,7 +103,7 @@ There are only 5 of them in JavaScript: : Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter . `y` -: Sticky mode (covered in the [next chapter](info:regexp-methods#y-flag)) +: Sticky mode (covered in the chapter ) We'll cover all these flags further in the tutorial. diff --git a/5-regular-expressions/14-regexp-lookahead-lookbehind/article.md b/5-regular-expressions/14-regexp-lookahead-lookbehind/article.md new file mode 100644 index 00000000..481bdd63 --- /dev/null +++ b/5-regular-expressions/14-regexp-lookahead-lookbehind/article.md @@ -0,0 +1,99 @@ +# Lookahead and lookbehind + +Sometimes we need to match a pattern only if followed by another pattern. For instance, we'd like to get the price from a string like `subject:1 turkey costs 30€`. 
+ +We need a number (let's say a price has no decimal point) followed by `subject:€` sign. + +That's what lookahead is for. + +## Lookahead + +The syntax is: `pattern:x(?=y)`, it means "match `pattern:x` only if followed by `pattern:y`". + +The euro sign is often written after the amount, so the regexp will be `pattern:\d+(?=€)` (assuming the price has no decimal point): + +```js run +let str = "1 turkey costs 30€"; + +alert( str.match(/\d+(?=€)/) ); // 30 (correctly skipped the sole number 1) +``` + +Or, if we wanted a quantity, then a negative lookahead can be applied. + +The syntax is: `pattern:x(?!y)`, it means "match `pattern:x` only if not followed by `pattern:y`". + +```js run +let str = "2 turkeys cost 60€"; + +alert( str.match(/\d+(?!€)/) ); // 2 (correctly skipped the price) +``` + +## Lookbehind + +Lookbehind allows to match a pattern only if there's something before. + +The syntax is: +- Positive lookbehind: `pattern:(?<=y)x`, matches `pattern:x`, but only if it follows after `pattern:y`. +- Negative lookbehind: `pattern:(?`. +If we substitute these into the pattern above and throw in some optional spaces `pattern:\s`, the full regexp becomes: `pattern:<\w+(\s*\w+="[^"]*"\s*)*>`. -That doesn't yet support all details of HTML, for instance strings in 'single' quotes. But they could be added easily, let's keep the regexp simple for now. +That regexp is not perfect! It doesn't yet support all details of HTML, for instance unquoted values, and there are other ways to improve, but let's not add complexity. It will demonstrate the problem for us. -Let's try it in action: +The regexp seems to work: ```js run -let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g; +let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g; let str='...... ...'; alert( str.match(reg) ); // , ``` -Great, it works! It found both the long tag `match:` and the short one `match:`. +Great! It found both the long tag `match:` and the short one `match:`. 
Now, that we've got a seemingly working solution, let's get to the infinite backtracking itself. @@ -60,10 +62,10 @@ Now, that we've got a seemingly working solution, let's get to the infinite back If you run our regexp on the input below, it may hang the browser (or another JavaScript host): ```js run -let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g; +let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g; -let str = ``. -Here we removed the tag and quoted strings from the regexp. +Unfortunately, the regexp still hangs: ```js run // only search for space-delimited attributes let reg = /<(\s*\w+=\w+\s*)*>/g; let str = `` in the string `subject:` at the end, so the match is impossible, but the regexp engine does not know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`: +The string has no `>` at the end, so the match is impossible, but the regexp engine doesn't know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`: ``` (a=b a=b a=b) (a=b) (a=b a=b) (a=b a=b) +(a=b) (a=b a=b a=b) ... ``` ## How to fix? -The problem -- too many variants in backtracking even if we don't need them. +The backtracking checks many variants that are an obvious fail for a human. -For instance, in the pattern `pattern:(\d+)*$` we (people) can easily see that `pattern:(\d+)` does not need to backtrack. - -Decreasing the count of `pattern:\d+` can not help to find a match, there's no matter between these two: +For instance, in the pattern `pattern:(\d+)*$` a human can easily see that `pattern:(\d+)*` does not need to backtrack `pattern:+`. There's no difference between one or two `\d+`: ``` \d+........ @@ -234,40 +236,58 @@ Decreasing the count of `pattern:\d+` can not help to find a match, there's no m (1234)(56789)z ``` -Let's get back to more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can). There's no need in backtracking here. 
+Let's get back to a more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can). -In other words, if it found many `name=value` pairs and then can't find `>`, then there's no need to decrease the count of repetitions. Even if we match one pair less, it won't give us the closing `>`: +What we would like to do is to forbid backtracking. + +There's totally no need to decrease the number of repetitions. + +In other words, if it found three `name=value` pairs and then can't find `>` after them, then there's no need to decrease the count of repetitions. There is definitely no `>` after those two (we backtracked one `name=value` pair, it's there): + +``` +(name=value) name=value +``` Modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy, but don't backtrack at all. Pretty simple, they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbid backtracking inside parentheses. Unfortunately, but both these features are not supported by JavaScript. -Although we can get a similar affect using lookahead. There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups). +### Lookahead to the rescue + +We can forbid backtracking using lookahead. The pattern to take as many repetitions as possible without backtracking is: `pattern:(?=(a+))\1`. -In other words, the lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position. And then they are "consumed into the result" by the backreference `pattern:\1`. +In other words: +- The lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position. 
+- And then they are "consumed into the result" by the backreference `pattern:\1` (`pattern:\1` corresponds to the content of the second pair of parentheses, that is `pattern:a+`). There will be no backtracking, because lookahead does not backtrack. If it found like 5 times of `pattern:a+` and the further match failed, then it doesn't go back to 4. +```smart +There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups). +``` + +So this trick makes the problem disappear. + Let's fix the regexp for a tag with attributes from the beginning of the chapter`pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs: ```js run // regexp to search name=value -let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/ +let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/; -// use it inside the regexp for tag -let reg = new RegExp('<\\w+(?=(' + attrReg.source + '*))\\1>', 'g'); +// use new RegExp to nicely insert its source into (?=(a+))\1 +let fixedReg = new RegExp(`<\\w+(?=(${attrReg.source}*))\\1>`, 'g'); -let good = '...... ...'; +let goodInput = '...... ...'; -let bad = `, -alert( bad.match(reg) ); // null (no results, fast!) +alert( goodInput.match(fixedReg) ); // , +alert( badInput.match(fixedReg) ); // null (no results, fast!) ``` -Great, it works! We found a long tag `match:` and a small one `match:` and didn't hang the engine. +Great, it works! We found both a long tag `match:` and a small one `match:`, and (!) didn't hang the engine on the bad input. Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another. 
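Reviewer note (not part of the patch): the `pattern:(?=(a+))\1` trick described in the hunk above can be checked in isolation. The snippet below is a minimal sketch, assuming a standard JavaScript engine; the names `greedy`, `atomic`, `digitsOnly` and `longInput` are made up for illustration and do not appear in the article.

```js
// Greedy quantifier: a+ first takes "aaa", then gives one "a" back
// so that the trailing `a` in the pattern can still match.
let greedy = "aaa".match(/a+a/);

// Lookahead + backreference: (?=(a+)) captures "aaa" and \1 consumes it.
// The engine cannot shrink that capture afterwards, so the trailing `a`
// has nothing left to match at any starting position.
let atomic = "aaa".match(/(?=(a+))\1a/);

console.log(greedy && greedy[0]); // aaa
console.log(atomic);              // null

// The same idea applied to the (\d+)*$ pattern from the article:
// each repetition grabs a whole run of digits and never gives part of it back.
let digitsOnly = /^(?:(?=(\d+))\1)*$/;
let longInput = "1".repeat(30) + "z"; // hypothetical "bad" input

console.log(digitsOnly.test("123456789")); // true
console.log(digitsOnly.test(longInput));   // false, and it returns quickly
```

The first pair of calls shows the behavioural difference: the greedy version succeeds by backtracking, while the lookahead version refuses to. That refusal is what keeps the last call from exploring exponentially many splits of the digit run.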
diff --git a/5-regular-expressions/20-regexp-unicode/article.md b/5-regular-expressions/20-regexp-unicode/article.md index 487682b8..68eebca8 100644 --- a/5-regular-expressions/20-regexp-unicode/article.md +++ b/5-regular-expressions/20-regexp-unicode/article.md @@ -1,5 +1,5 @@ -# Unicode: flag "u", character properties "\\p" +# Unicode: flag "u" The unicode flag `/.../u` enables the correct support of surrogate pairs. @@ -87,81 +87,3 @@ Using the `u` flag makes it work right: ```js run alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴 ``` - -## Unicode character properies - -[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by Javascript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" character belongs to, and a variety of technical details. - -In regular expressions these can be set by `\p{…}`. And there must be flag `'u'`. - -For instance, `\p{Letter}` denotes a letter in any of language. We can also use `\p{L}`, as `L` is an alias of `Letter`, there are shorter aliases for almost every property. - -Here's the main tree of properties: - -- Letter `L`: - - lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo` -- Number `N`: - - decimal digit `Nd`, letter number `Nl`, other `No`: -- Punctuation `P`: - - connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po` -- Mark `M` (accents etc): - - spacing combining `Mc`, enclosing `Me`, non-spacing `Mn` -- Symbol `S`: - - currency `Sc`, modifier `Sk`, math `Sm`, other `So` -- Separator `Z`: - - line `Zl`, paragraph `Zp`, space `Zs` -- Other `C`: - - control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`. - -```smart header="More information" -Interested to see which characters belong to a property? There's a tool at for that. - -You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp). 
- -For the full Unicode Character Database in text format (along with all properties), see . -``` - -There are also other derived categories, like: -- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. roman numbers Ⅻ), plus some other symbols `Other_Alphabetic` (`OAltpa`). -- `Hex_Digit` includes hexadimal digits: `0-9`, `a-f`. -- ...Unicode is a big beast, it includes a lot of properties. - -For instance, let's look for a 6-digit hex number: - -```js run -let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is requireds - -alert("color: #123ABC".match(reg)); // 123ABC -``` - -There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)"). - -To search for certain scripts, we should supply `Script=`, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc. - -### Universal \w - -Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl. - -``` -/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u -``` - -Let's decipher. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`. - -So the character set includes: - -- `Alphabetic` for letters, -- `Mark` for accents, as in Unicode accents may be represented by separate code points, -- `Decimal_Number` for numbers, -- `Connector_Punctuation` for the `'_'` character and alike, -- `Join_Control` -– two special code points with hex codes `200c` and `200d`, used in ligatures e.g. in arabic. 
- -Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)): - -```js run -let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu; - -let str = `Hello Привет 你好 123_456`; - -alert( str.match(regexp) ); // Hello,Привет,你好,123_456 -``` diff --git a/5-regular-expressions/21-regexp-unicode-properties/article.md b/5-regular-expressions/21-regexp-unicode-properties/article.md new file mode 100644 index 00000000..eb79d5d2 --- /dev/null +++ b/5-regular-expressions/21-regexp-unicode-properties/article.md @@ -0,0 +1,86 @@ + +# Unicode character properties \p + +[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details. + +In regular expressions these can be matched with `\p{…}`, and the flag `'u'` is required. + +For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`; there are shorter aliases for almost every property. + +Here's the main tree of properties: + +- Letter `L`: + - lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo` +- Number `N`: + - decimal digit `Nd`, letter number `Nl`, other `No` +- Punctuation `P`: + - connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po` +- Mark `M` (accents etc): + - spacing combining `Mc`, enclosing `Me`, non-spacing `Mn` +- Symbol `S`: + - currency `Sc`, modifier `Sk`, math `Sm`, other `So` +- Separator `Z`: + - line `Zl`, paragraph `Zp`, space `Zs` +- Other `C`: + - control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`. + +```smart header="More information" +Interested to see which characters belong to a property? There's a tool at for that. 
+ +You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp). + +For the full Unicode Character Database in text format (along with all properties), see . +``` + +There are also other derived categories, like: +- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. Roman numerals such as Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`). +- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`. +- ...Unicode is a big beast, it includes a lot of properties. + +For instance, let's look for a 6-digit hex number: + +```js run +let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required + +alert("color: #123ABC".match(reg)); // 123ABC +``` + +There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)). + +To search for certain scripts, we should supply `Script=`, e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc: + +```js run +let regexp = /\p{sc=Han}+/gu; // get Chinese words + +let str = `Hello Привет 你好 123_456`; + +alert( str.match(regexp) ); // 你好 +``` + +## Building multi-language \w + +Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl. + +```js +/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u +``` + +Let's decipher. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`. + +So the character set includes: + +- `Alphabetic` for letters, +- `Mark` for accents, as in Unicode accents may be represented by separate code points, +- `Decimal_Number` for numbers, +- `Connector_Punctuation` for the `'_'` character and the like, +- `Join_Control` -- two special code points with hex codes `200c` and `200d`, used in ligatures e.g. in Arabic. 
+ +Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)): + +```js run +let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu; + +let str = `Hello Привет 你好 123_456`; + +alert( str.match(regexp) ); // Hello,Привет,你好,123_456 +``` diff --git a/5-regular-expressions/21-regexp-sticky/article.md b/5-regular-expressions/22-regexp-sticky/article.md similarity index 98% rename from 5-regular-expressions/21-regexp-sticky/article.md rename to 5-regular-expressions/22-regexp-sticky/article.md index ea4512fb..3799aa66 100644 --- a/5-regular-expressions/21-regexp-sticky/article.md +++ b/5-regular-expressions/22-regexp-sticky/article.md @@ -1,5 +1,5 @@ -# "Sticky" flag `y`, searching at position [#y-flag] +# Sticky flag "y", searching at position To grasp the use case of `y` flag, and see how great it is, let's explore a practical use case.