regexp draft

Ilya Kantor 2019-03-02 01:02:01 +03:00
parent 1369332661
commit 65184edf76
11 changed files with 730 additions and 399 deletions


@ -1,12 +1,14 @@
# Character classes
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to find all digits in that string. Other characters do not interest us.
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`.
A character class is a special notation that matches any symbol from the set.
To do so, we can find and remove anything that's not a number. Character classes can help with that.
For instance, there's a "digit" class. It's written as `\d`. We put it in the pattern, and during the search any digit matches it.
A character class is a special notation that matches any symbol from a certain set.
For instance, the regexp `pattern:/\d/` looks for a single digit:
To start, let's explore the "digit" class. It's written as `\d`. When we put it in the pattern, it means "any single digit".
For instance, let's find the first digit in the phone number:
```js run
let str = "+7(903)-123-45-67";
@ -16,9 +18,9 @@ let reg = /\d/;
alert( str.match(reg) ); // 7
```
The regexp is not global in the example above, so it only looks for the first match.
Without the flag `g`, the regular expression only looks for the first match, that is the first digit `\d`.
Let's add the `g` flag to look for all digits:
Let's add the `g` flag to find all digits:
```js run
let str = "+7(903)-123-45-67";
@ -26,9 +28,9 @@ let str = "+7(903)-123-45-67";
let reg = /\d/g;
alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
```
## Most used classes: \d \s \w
alert( str.match(reg).join('') ); // 79031234567
```
That was a character class for digits. There are other character classes as well.
@ -43,9 +45,9 @@ Most used are:
`\w` ("w" is from "word")
: A "wordly" character: either a letter of the English alphabet, a digit, or an underscore. Non-English letters (like Cyrillic or Hindi) do not belong to `\w`.
For instance, `pattern:\d\s\w` means a digit followed by a space character followed by a wordly character, like `"1 Z"`.
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
A regexp may contain both regular symbols and character classes.
**A regexp may contain both regular symbols and character classes.**
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
@ -68,7 +70,7 @@ The match (each character class corresponds to one result character):
## Word boundary: \b
The word boundary `pattern:\b` -- is a special character class.
A word boundary `pattern:\b` is a special character class.
It does not denote a character, but rather a boundary between characters.
@ -79,32 +81,39 @@ alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
```
The boundary has "zero width" in a sense that usually a character class means a character in the result (like a wordly or a digit), but not in this case.
The boundary has "zero width" in the sense that a character class usually corresponds to a character in the result (like a wordly character or a digit), but not in this case.
The boundary is a test.
When the regular expression engine is doing the search, it moves along the string in an attempt to find a match. At each string position it tries to find the pattern.
When the pattern contains `pattern:\b`, it tests that the position in string fits one of the conditions:
When the pattern contains `pattern:\b`, it tests that the position in the string is a word boundary, that is, one of three variants:
- String start, and the first string character is `\w`.
- String end, and the last string character is `\w`.
- Inside the string: from one side is `\w`, from the other side -- not `\w`.
- Immediately before is `\w`, and immediately after -- not `\w`, or vice versa.
- At string start, and the first string character is `\w`.
- At string end, and the last string character is `\w`.
For instance, in the string `subject:Hello, Java!` the following positions match `\b`:
![](hello-java-boundaries.png)
So it matches `pattern:\bHello\b` and `pattern:\bJava\b`, but not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
So it matches `pattern:\bHello\b`, because:
1. At the beginning of the string the first `\b` test matches.
2. Then the word `Hello` matches.
3. Then `\b` matches, as we're between `o` and a space.
The pattern `pattern:\bJava\b` also matches. But `pattern:\bHell\b` does not (because there's no word boundary after `l`), and neither does `pattern:Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
```js run
alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, Java!".match(/\bHell\b/) ); // null
alert( "Hello, Java!".match(/\bJava!\b/) ); // null
alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match)
alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
```
Once again let's note that `pattern:\b` makes the searching engine to test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.
Once again, note that `pattern:\b` makes the search engine test for the boundary, so `pattern:Java\b` finds `match:Java` only when it is followed by a word boundary, but it does not add a character to the result.
Usually we use `\b` to find standalone English words. So if we want the `"Java"` language, `pattern:\bJava\b` finds exactly the standalone word and ignores it when it's part of `"JavaScript"`.
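
To make the three boundary conditions concrete, here is a minimal sketch that checks them by hand for every position in a string. The helper `wordBoundaryPositions` is a hypothetical name introduced only for this illustration; it is not how the regexp engine is actually implemented.

```js
// Hypothetical helper (not from the original text): returns every position
// in `str` where the zero-width test \b would succeed.
function wordBoundaryPositions(str) {
  // A missing character (before the start or after the end) counts as "not \w".
  const isWordly = (ch) => ch !== undefined && /\w/.test(ch);
  const positions = [];
  for (let i = 0; i <= str.length; i++) {
    // A boundary is a place where exactly one neighbour is a \w character.
    if (isWordly(str[i - 1]) !== isWordly(str[i])) positions.push(i);
  }
  return positions;
}

const boundaries = wordBoundaryPositions("Hello, Java!");
console.log(boundaries.join(",")); // 0,5,7,11
```

The positions are: before `H`, after `o`, before `J`, and after the last `a`. There is no boundary after `!`, which is why `pattern:Java!\b` cannot match.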
@ -119,9 +128,9 @@ The word boundary check `\b` tests for a boundary between `\w` and something els
```
## Reverse classes
## Inverse classes
For every character class there exists a "reverse class", denoted with the same letter, but uppercased.
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
The "reverse" means that it matches all other characters, for instance:
@ -137,7 +146,9 @@ The "reverse" means that it matches all other characters, for instance:
`\B`
: Non-boundary: a test reverse to `\b`.
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`. Let's get a "pure" phone number from the string:
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`.
One way was to match all digits and join them:
```js run
let str = "+7(903)-123-45-67";
@ -145,7 +156,7 @@ let str = "+7(903)-123-45-67";
alert( str.match(/\d/g).join('') ); // 79031234567
```
An alternative way would be to find non-digits and remove them from the string:
An alternative, shorter way is to find non-digits `\D` and remove them from the string:
```js run
@ -156,11 +167,9 @@ alert( str.replace(/\D/g, "") ); // 79031234567
## Spaces are regular characters
Please note that regular expressions may include spaces. They are treated like regular characters.
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
But if a regexp does not take spaces into account, it won' work.
But if a regexp doesn't take spaces into account, it may fail to work.
Let's try to find digits separated by a dash:
@ -168,23 +177,25 @@ Let's try to find digits separated by a dash:
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
```
Here we fix it by adding spaces into the regexp:
Here we fix it by adding spaces into the regexp `pattern:\d - \d`:
```js run
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
```
Of course, spaces are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
**A space is a character, equal in importance to any other character.**
Of course, spaces in a regexp are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
```js run
alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces
```
In other words, in a regular expression all characters matter. Spaces too.
In other words, in a regular expression all characters matter, spaces too.
## A dot is any character
The dot `"."` is a special character class that matches *any character except a newline*.
The dot `"."` is a special character class that matches "any character except a newline".
For instance:
@ -208,19 +219,47 @@ Please note that the dot means "any character", but not the "absense of a charac
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
```
### The dotall "s" flag
Usually a dot doesn't match a newline character.
For instance, this doesn't match:
```js run
alert( "A\nB".match(/A.B/) ); // null (no match)
// a space character would match
// or a letter, but not \n
```
Sometimes that's inconvenient: we really want "any character", newline included.
That's what the `s` flag does. If a regexp has it, then the dot `"."` matches literally any character:
```js run
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
```
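
The `s` flag was added to the language in ES2018, so older engines may not accept it. The sketch below compares the dot with and without the flag. It also shows one commonly used workaround, `[\s\S]` ("a space character or a non-space character", that is, anything). That workaround is an addition for illustration and is not part of the text above.

```js
const text = "A\nB";

const plainDot = /A.B/;      // without flags the dot does not match \n
const dotAll = /A.B/s;       // with the s flag the dot matches \n too
const fallback = /A[\s\S]B/; // assumption: workaround for engines without the s flag

console.log(plainDot.test(text)); // false
console.log(dotAll.test(text));   // true
console.log(dotAll.dotAll);       // true (the property reflects the s flag)
console.log(fallback.test(text)); // true
```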
## Summary
We covered character classes:
The following character classes exist:
- `\d` -- digits.
- `\D` -- non-digits.
- `\s` -- space symbols, tabs, newlines.
- `\S` -- all but `\s`.
- `\w` -- English letters, digits, underscore `'_'`.
- `\W` -- all but `\w`.
- `'.'` -- any character except a newline.
- `pattern:\d` -- digits.
- `pattern:\D` -- non-digits.
- `pattern:\s` -- space symbols, tabs, newlines.
- `pattern:\S` -- all but `pattern:\s`.
- `pattern:\w` -- English letters, digits, underscore `'_'`.
- `pattern:\W` -- all but `pattern:\w`.
- `pattern:.` -- any character when the regexp has the `'s'` flag, otherwise any character except a newline.
If we want to search for a character that has a special meaning like a backslash or a dot, then we should escape it with a backslash: `pattern:\.`
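
As a small illustration of escaping (the sample strings here are made up for the example), compare an unescaped dot, which is the "any character" class, with an escaped one, which matches only a literal dot:

```js
const anyChar = /\d.\d/;    // the dot matches any character between the digits
const literalDot = /\d\.\d/; // the escaped dot matches only "."

const looseMatch = "5-1".match(anyChar);     // matches "5-1"
const strictMatch = "5.1".match(literalDot); // matches "5.1"
const strictMiss = "5-1".match(literalDot);  // null, "-" is not a dot

// A backslash itself is escaped the same way: \\ in the pattern.
const backslash = "1\\2".match(/\\/);        // matches a single backslash

console.log(looseMatch[0]);  // 5-1
console.log(strictMatch[0]); // 5.1
console.log(strictMiss);     // null
console.log(backslash[0]);   // \
```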
...But that's not all!
Please note that a regexp may also contain string special characters such as a newline `\n`. There's no conflict with character classes, because other letters are used for them.
Modern JavaScript also allows looking for characters by their Unicode properties, for instance:
- A Cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{Pd}`.
- A currency symbol: `pattern:\p{Currency_Symbol}` or `pattern:\p{Sc}`.
- ...And much more. Unicode has a lot of character categories that we can select from.
These patterns require the `'u'` regexp flag to work. More about that in the chapter [](info:regexp-unicode).
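
A brief sketch of Unicode property classes in use, assuming an engine that supports them (they were added in ES2018). The sample strings are made up for the example. Note that the short category names are case-sensitive: `Pd` and `Sc` with a capital first letter, while `sc=` is the short form of `Script=`.

```js
const sample = "Привет, world! Price: $5 - €7";

// All three patterns need the u flag.
const cyrillic = sample.match(/\p{Script=Cyrillic}/gu); // the six letters of "Привет"
const currency = sample.match(/\p{Sc}/gu);              // "$" and "€"
const dashes = "1-2\u20143".match(/\p{Pd}/gu);          // hyphen and long dash (U+2014)

// Without the u flag, \p is not treated as a property class.
const withoutFlag = "$".match(/\p{Sc}/);                // null

console.log(cyrillic.join("")); // Привет
console.log(currency.join("")); // $€
console.log(dashes.length);     // 2
console.log(withoutFlag);       // null
```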