regexp draft

2019-03-02 01:02:01 +03:00 · 2019-03-02 01:02:01 +03:00 · 65184edf76
commit 65184edf76
parent 1369332661
11 changed files with 730 additions and 399 deletions
--- a/5-regular-expressions/03-regexp-character-classes/article.md
+++ b/5-regular-expressions/03-regexp-character-classes/article.md
@ -1,12 +1,14 @@
 # Character classes

-Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to find all digits in that string. Other characters do not interest us.
+Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`.

-A character class is a special notation that matches any symbol from the set.
+To do so, we can find and remove anything that's not a digit. Character classes can help with that.

-For instance, there's a "digit" class. It's written as `\d`. We put it in the pattern, and during the search any digit matches it.
+A character class is a special notation that matches any symbol from a certain set.

-For instance, the regexp `pattern:/\d/` looks for a single digit:
+To start, let's explore the "digit" class. It's written as `\d` and, when placed in a pattern, means "any single digit".
+
+For instance, let's find the first digit in the phone number:

 ```js run
 let str = "+7(903)-123-45-67";
@ -16,9 +18,9 @@ let reg = /\d/;
 alert( str.match(reg) ); // 7
 ```

-The regexp is not global in the example above, so it only looks for the first match.
+Without the `g` flag, the regular expression looks only for the first match, that is, the first digit `\d`.

-Let's add the `g` flag to look for all digits:
+Let's add the `g` flag to find all digits:

 ```js run
 let str = "+7(903)-123-45-67";
@ -26,9 +28,9 @@ let str = "+7(903)-123-45-67";
 let reg = /\d/g;

 alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
-```

-## Most used classes: \d \s \w
+alert( str.match(reg).join('') ); // 79031234567
+```
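One caveat, sketched here as a side note rather than taken from the article: with the `g` flag, `match` returns `null` when nothing matches, so calling `.join` on the result directly would throw. The helper name `digitsOnly` is hypothetical, and `console.log` is used instead of `alert` so the snippet also runs outside a browser.

```js
// Hypothetical helper (not from the article): keep only the digits of a string.
function digitsOnly(str) {
  // With the g flag, match() returns null when there are no matches,
  // so fall back to an empty array before joining.
  const digits = str.match(/\d/g) || [];
  return digits.join('');
}

const phone = digitsOnly("+7(903)-123-45-67");
const none = digitsOnly("no digits here");

console.log(phone);                // 79031234567
console.log(JSON.stringify(none)); // ""
```

The `|| []` fallback is only one possible guard; an explicit `if` check on the result works just as well.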

 That was a character class for digits. There are other character classes as well.

@ -43,9 +45,9 @@ Most used are:
 `\w` ("w" is from "word")
 : A "wordly" character: either a letter of English alphabet or a digit or an underscore. Non-english letters (like cyrillic or hindi) do not belong to `\w`.

-For instance, `pattern:\d\s\w` means a digit followed by a space character followed by a wordly character, like `"1 Z"`.
+For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
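A small sketch of that three-class pattern in action. The sample strings are made up for illustration, and `console.log` stands in for `alert`.

```js
// Each class in \d\s\w consumes exactly one character of the match.
const found = "room 1 a, floor 2".match(/\d\s\w/);

console.log(found[0]);    // 1 a
console.log(found.index); // 5

// \s also covers tabs, not only the plain space character.
const tabbed = /\d\s\w/.test("7\tx");
// Without a space character between them, there is no match.
const glued = /\d\s\w/.test("1a");

console.log(tabbed); // true
console.log(glued);  // false
```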

-A regexp may contain both regular symbols and character classes.
+**A regexp may contain both regular symbols and character classes.**

 For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:

@ -68,7 +70,7 @@ The match (each character class corresponds to one result character):

 ## Word boundary: \b

-The word boundary `pattern:\b` -- is a special character class.
+A word boundary `pattern:\b` is a special character class.

 It does not denote a character, but rather a boundary between characters.

@ -79,32 +81,39 @@ alert( "Hello, Java!".match(/\bJava\b/) ); // Java
 alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
 ```

-The boundary has "zero width" in a sense that usually a character class means a character in the result (like a wordly or a digit), but not in this case.
+The boundary has "zero width" in the sense that a character class usually corresponds to a character in the result (like a wordly character or a digit), but not in this case.

 The boundary is a test.

 When regular expression engine is doing the search, it's moving along the string in an attempt to find the match. At each string position it tries to find the pattern.

-When the pattern contains `pattern:\b`, it tests that the position in string fits one of the conditions:
+When the pattern contains `pattern:\b`, it tests that the position in the string is a word boundary, that is, one of three variants:

- String start, and the first string character is `\w`.
- String end, and the last string character is `\w`.
- Inside the string: from one side is `\w`, from the other side -- not `\w`.
+- Immediately before is `\w`, and immediately after -- not `\w`, or vice versa.
+- At string start, and the first string character is `\w`.
+- At string end, and the last string character is `\w`.

 For instance, in the string `subject:Hello, Java!` the following positions match `\b`:

 ![](hello-java-boundaries.png)
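One way to see those positions without the picture is to insert a visible marker wherever `\b` matches. This is an illustrative sketch; the `|` marker is an arbitrary choice.

```js
// \b matches a position, not a character, so replacing it
// inserts the marker without removing anything from the string.
const marked = "Hello, Java!".replace(/\b/g, "|");

console.log(marked); // |Hello|, |Java|!
```

There is no marker after `!`, because the exclamation sign is not a wordly character and the string ends right after it.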

-So it matches `pattern:\bHello\b` and `pattern:\bJava\b`, but not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
+So it matches `pattern:\bHello\b`, because:
+
+1. At the beginning of the string the first `\b` test matches.
+2. Then the word `Hello` matches.
+3. Then `\b` matches, as we're between `o` and a space.
+
+The pattern `pattern:\bJava\b` also matches, but `pattern:\bHell\b` does not (because there's no word boundary after `l`), and neither does `pattern:Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
+

 ```js run
 alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
 alert( "Hello, Java!".match(/\bJava\b/) );  // Java
-alert( "Hello, Java!".match(/\bHell\b/) );  // null
-alert( "Hello, Java!".match(/\bJava!\b/) ); // null
+alert( "Hello, Java!".match(/\bHell\b/) );  // null (no match)
+alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
 ```

-Once again let's note that `pattern:\b` makes the searching engine to test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.
+Once again let's note that `pattern:\b` makes the search engine test for the boundary, so `pattern:Java\b` finds `match:Java` only when it is followed by a word boundary, but it does not add a letter to the result.

 Usually we use `\b` to find standalone English words. So that if we want `"Java"` language then `pattern:\bJava\b` finds exactly a standalone word and ignores it when it's a part of `"JavaScript"`.

@ -119,9 +128,9 @@ The word boundary check `\b` tests for a boundary between `\w` and something els
 ```


-## Reverse classes
+## Inverse classes

-For every character class there exists a "reverse class", denoted with the same letter, but uppercased.
+For every character class there exists an "inverse class", denoted with the same letter, but uppercased.

 The "reverse" means that it matches all other characters, for instance:

@ -137,7 +146,9 @@ The "reverse" means that it matches all other characters, for instance:
 `\B`
 : Non-boundary: a test reverse to `\b`.
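A short sketch of the uppercased classes side by side. The sample string `"1 a_!"` is made up for illustration, and `console.log` stands in for `alert`.

```js
const sample = "1 a_!";

const nonDigits = sample.match(/\D/g); // everything except "1"
const nonSpaces = sample.match(/\S/g); // everything except the space
const nonWordly = sample.match(/\W/g); // only the space and "!"

console.log(nonDigits.join("")); // " a_!"
console.log(nonSpaces.join("")); // "1a_!"
console.log(nonWordly.join("")); // " !"

// \B is a position test, like \b: it matches where there is NO word boundary.
const inner = "Hello".replace(/\B/g, "|");
console.log(inner); // H|e|l|l|o
```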

-In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`. Let's get a "pure" phone number from the string:
+In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`.
+
+One way was to match all digits and join them:

 ```js run
 let str = "+7(903)-123-45-67";
@ -145,7 +156,7 @@ let str = "+7(903)-123-45-67";
 alert( str.match(/\d/g).join('') ); // 79031234567
 ```

-An alternative way would be to find non-digits and remove them from the string:
+An alternative, shorter way is to find non-digits `\D` and remove them from the string:


 ```js run
@ -156,11 +167,9 @@ alert( str.replace(/\D/g, "") ); // 79031234567

 ## Spaces are regular characters

-Please note that regular expressions may include spaces. They are treated like regular characters.  
-
 Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.

-But if a regexp does not take spaces into account, it won' work.
+But if a regexp doesn't take spaces into account, it may fail to work.

 Let's try to find digits separated by a dash:

@ -168,23 +177,25 @@ Let's try to find digits separated by a dash:
 alert( "1 - 5".match(/\d-\d/) ); // null, no match!
 ```

-Here we fix it by adding spaces into the regexp:
+Here we fix it by adding spaces into the regexp `pattern:\d - \d`:

 ```js run
 alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
 ```

-Of course, spaces are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
+**A space is a character, equal in importance to any other character.**
+
+Of course, spaces in a regexp are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:

 ```js run
 alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces
 ```

-In other words, in a regular expression all characters matter. Spaces too.
+In other words, in a regular expression all characters matter, spaces too.

 ## A dot is any character

-The dot `"."` is a special character class that matches *any character except a newline*.
+The dot `"."` is a special character class that matches "any character except a newline".

 For instance:

@ -208,19 +219,47 @@ Please note that the dot means "any character", but not the "absense of a charac
 alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
 ```

+### The dotall "s" flag
+
+Usually a dot doesn't match a newline character.
+
+For instance, this doesn't match:
+
+```js run
+alert( "A\nB".match(/A.B/) ); // null (no match)
+
+// a space character would match
+// or a letter, but not \n
+```
+
+Sometimes that's inconvenient: we really want "any character", newline included.
+
+That's what the `s` flag does. If a regexp has it, then the dot `"."` matches literally any character:
+
+```js run
+alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
+```
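The sketch below compares the three cases. The `[\s\S]` form is a commonly used workaround for engines that lack the `s` flag; it is an addition for illustration, not something this draft covers, and it relies on square-bracket sets, which are explained elsewhere.

```js
const text = "A\nB";

const withoutS = text.match(/A.B/);      // null: the dot does not match \n
const withS = text.match(/A.B/s);        // matches thanks to the s flag
const fallback = text.match(/A[\s\S]B/); // "a space or a non-space", i.e. any character

console.log(withoutS);             // null
console.log(withS[0] === text);    // true
console.log(fallback[0] === text); // true

// The flag is also visible as a property of the regexp object.
console.log(/A.B/s.dotAll);        // true
```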
+

 ## Summary

-We covered character classes:
+The following character classes exist:

- `\d` -- digits.
- `\D` -- non-digits.
- `\s` -- space symbols, tabs, newlines.
- `\S` -- all but `\s`.
- `\w` -- English letters, digits, underscore `'_'`.
- `\W` -- all but `\w`.
- `'.'` -- any character except a newline.
+- `pattern:\d` -- digits.
+- `pattern:\D` -- non-digits.
+- `pattern:\s` -- space symbols, tabs, newlines.
+- `pattern:\S` -- all but `pattern:\s`.
+- `pattern:\w` -- English letters, digits, underscore `'_'`.
+- `pattern:\W` -- all but `pattern:\w`.
+- `pattern:.` -- any character when the regexp has the `'s'` flag, otherwise any character except a newline.

-If we want to search for a character that has a special meaning like a backslash or a dot, then we should escape it with a backslash: `pattern:\.`
+...But that's not all!

-Please note that a regexp may also contain string special characters such as a newline `\n`. There's no conflict with character classes, because other letters are used for them.
+Modern JavaScript also lets us look for characters by their Unicode properties, for instance:
+
+- A Cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
+- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{Pd}`.
+- A currency symbol: `pattern:\p{Currency_Symbol}` or `pattern:\p{Sc}`.
+- ...And much more. Unicode has a lot of character categories that we can select from.
+
+These patterns require the `'u'` regexp flag to work. More about that in the chapter [](info:regexp-unicode).
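A minimal sketch of such property escapes, assuming an engine with ES2018 support. The sample strings are made up. Property and value names are case-sensitive in JavaScript, so the short General_Category forms are written `Pd` and `Sc`.

```js
// All three patterns need the u flag; g collects every match.
const cyrillic = "Привет, world".match(/\p{sc=Cyrillic}/gu).join("");
const dashes = "a-b\u2014c".match(/\p{Pd}/gu);   // hyphen and long dash (U+2014)
const currency = "$5, €7, ¥9".match(/\p{Sc}/gu); // currency symbols only

console.log(cyrillic);        // Привет
console.log(dashes.length);   // 2
console.log(currency.join("")); // $€¥
```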