205 lines
6.8 KiB
Markdown
205 lines
6.8 KiB
Markdown
# Character classes
|
|
|
|
Consider a practical task -- we have a phone number like `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79035419441`.
|
|
|
|
To do so, we can find and remove anything that's not a number. Character classes can help with that.
|
|
|
|
A *character class* is a special notation that matches any symbol from a certain set.
|
|
|
|
For the start, let's explore the "digit" class. It's written as `pattern:\d` and corresponds to "any single digit".
|
|
|
|
For instance, the let's find the first digit in the phone number:
|
|
|
|
```js run
|
|
let str = "+7(903)-123-45-67";
|
|
|
|
let regexp = /\d/;
|
|
|
|
alert( str.match(regexp) ); // 7
|
|
```
|
|
|
|
Without the flag `pattern:g`, the regular expression only looks for the first match, that is the first digit `pattern:\d`.
|
|
|
|
Let's add the `pattern:g` flag to find all digits:
|
|
|
|
```js run
|
|
let str = "+7(903)-123-45-67";
|
|
|
|
let regexp = /\d/g;
|
|
|
|
alert( str.match(regexp) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
|
|
|
|
// let's make the digits-only phone number of them:
|
|
alert( str.match(regexp).join('') ); // 79035419441
|
|
```
|
|
|
|
That was a character class for digits. There are other character classes as well.
|
|
|
|
Most used are:
|
|
|
|
`pattern:\d` ("d" is from "digit")
|
|
: A digit: a character from `0` to `9`.
|
|
|
|
`pattern:\s` ("s" is from "space")
|
|
: A space symbol: includes spaces, tabs `\t`, newlines `\n` and few other rare characters, such as `\v`, `\f` and `\r`.
|
|
|
|
`pattern:\w` ("w" is from "word")
|
|
: A "wordly" character: either a letter of Latin alphabet or a digit or an underscore `_`. Non-Latin letters (like cyrillic or hindi) do not belong to `pattern:\w`.
|
|
|
|
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", such as `match:1 a`.
|
|
|
|
**A regexp may contain both regular symbols and character classes.**
|
|
|
|
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
|
|
|
|
```js run
|
|
let str = "Is there CSS4?";
|
|
let regexp = /CSS\d/
|
|
|
|
alert( str.match(regexp) ); // CSS4
|
|
```
|
|
|
|
Also we can use many character classes:
|
|
|
|
```js run
|
|
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5'
|
|
```
|
|
|
|
The match (each regexp character class has the corresponding result character):
|
|
|
|

|
|
|
|
## Inverse classes
|
|
|
|
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
|
|
|
|
The "inverse" means that it matches all other characters, for instance:
|
|
|
|
`pattern:\D`
|
|
: Non-digit: any character except `pattern:\d`, for instance a letter.
|
|
|
|
`pattern:\S`
|
|
: Non-space: any character except `pattern:\s`, for instance a letter.
|
|
|
|
`pattern:\W`
|
|
: Non-wordly character: anything but `pattern:\w`, e.g a non-latin letter or a space.
|
|
|
|
In the beginning of the chapter we saw how to make a number-only phone number from a string like `subject:+7(903)-123-45-67`: find all digits and join them.
|
|
|
|
```js run
|
|
let str = "+7(903)-123-45-67";
|
|
|
|
alert( str.match(/\d/g).join('') ); // 79031234567
|
|
```
|
|
|
|
An alternative, shorter way is to find non-digits `pattern:\D` and remove them from the string:
|
|
|
|
```js run
|
|
let str = "+7(903)-123-45-67";
|
|
|
|
alert( str.replace(/\D/g, "") ); // 79031234567
|
|
```
|
|
|
|
## A dot is "any character"
|
|
|
|
A dot `pattern:.` is a special character class that matches "any character except a newline".
|
|
|
|
For instance:
|
|
|
|
```js run
|
|
alert( "Z".match(/./) ); // Z
|
|
```
|
|
|
|
Or in the middle of a regexp:
|
|
|
|
```js run
|
|
let regexp = /CS.4/;
|
|
|
|
alert( "CSS4".match(regexp) ); // CSS4
|
|
alert( "CS-4".match(regexp) ); // CS-4
|
|
alert( "CS 4".match(regexp) ); // CS 4 (space is also a character)
|
|
```
|
|
|
|
Please note that a dot means "any character", but not the "absense of a character". There must be a character to match it:
|
|
|
|
```js run
|
|
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
|
|
```
|
|
|
|
### Dot as literally any character with "s" flag
|
|
|
|
By default, a dot doesn't match the newline character `\n`.
|
|
|
|
For instance, the regexp `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline `\n`:
|
|
|
|
```js run
|
|
alert( "A\nB".match(/A.B/) ); // null (no match)
|
|
```
|
|
|
|
There are many situations when we'd like a dot to mean literally "any character", newline included.
|
|
|
|
That's what flag `pattern:s` does. If a regexp has it, then a dot `pattern:.` matches literally any character:
|
|
|
|
```js run
|
|
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
|
|
```
|
|
|
|
````warn header="Not supported in Firefox, IE, Edge"
|
|
Check <https://caniuse.com/#search=dotall> for the most recent state of support. At the time of writing it doesn't include Firefox, IE, Edge.
|
|
|
|
Luckily, there's an alternative, that works everywhere. We can use a regexp like `pattern:[\s\S]` to match "any character".
|
|
|
|
```js run
|
|
alert( "A\nB".match(/A[\s\S]B/) ); // A\nB (match!)
|
|
```
|
|
|
|
The pattern `pattern:[\s\S]` literally says: "a space character OR not a space character". In other words, "anything". We could use another pair of complementary classes, such as `pattern:[\d\D]`, that doesn't matter.
|
|
|
|
This trick works everywhere. Also we can use it if we don't want to set `pattern:s` flag, in cases when we want a regular "no-newline" dot too in the pattern.
|
|
|
|
Also worth mentioning is, besides `[\s\S]`, there is another regular expression that can match any character, which is `[^]` -- it means match any character except nothing, and that means to match any character without exception. `[^]` and `[\s\S]` are the two typical regular expressions to solve the missing of `s` flag problem.
|
|
````
|
|
|
|
````warn header="Pay attention to spaces"
|
|
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
|
|
|
|
But if a regexp doesn't take spaces into account, it may fail to work.
|
|
|
|
Let's try to find digits separated by a hyphen:
|
|
|
|
```js run
|
|
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
|
|
```
|
|
|
|
Let's fix it adding spaces into the regexp `pattern:\d - \d`:
|
|
|
|
```js run
|
|
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
|
|
// or we can use \s class:
|
|
alert( "1 - 5".match(/\d\s-\s\d/) ); // 1 - 5, also works
|
|
```
|
|
|
|
**A space is a character. Equal in importance with any other character.**
|
|
|
|
We can't add or remove spaces from a regular expression and expect to work the same.
|
|
|
|
In other words, in a regular expression all characters matter, spaces too.
|
|
````
|
|
|
|
## Summary
|
|
|
|
There exist following character classes:
|
|
|
|
- `pattern:\d` -- digits.
|
|
- `pattern:\D` -- non-digits.
|
|
- `pattern:\s` -- space symbols, tabs, newlines.
|
|
- `pattern:\S` -- all but `pattern:\s`.
|
|
- `pattern:\w` -- Latin letters, digits, underscore `'_'`.
|
|
- `pattern:\W` -- all but `pattern:\w`.
|
|
- `pattern:.` -- any character if with the regexp `'s'` flag, otherwise any except a newline `\n`.
|
|
|
|
...But that's not all!
|
|
|
|
Unicode encoding, used by JavaScript for strings, provides many properties for characters, like: which language the letter belongs to (if it's a letter) it is it a punctuation sign, etc.
|
|
|
|
We can search by these properties as well. That requires flag `pattern:u`, covered in the next article.
|