Ilya Kantor 2019-05-09 11:48:14 +03:00
parent 7f5008e18a
commit 5f9597d0c5

@ -43,7 +43,7 @@ Most used are:
: A space symbol: that includes spaces, tabs, newlines.
`\w` ("w" is from "word")
: A "wordly" character: either a letter of English alphabet or a digit or an underscore. Non-english letters (like cyrillic or hindi) do not belong to `\w`.
: A "wordly" character: either a letter of the English alphabet, a digit, or an underscore. Non-Latin letters (like Cyrillic or Hindi) do not belong to `\w`.
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
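As a minimal sketch of how these three classes combine, the snippet below runs `pattern:\d\s\w` against a few sample strings. The variable names and sample strings are made up for illustration; only the pattern comes from the text above.

```js
// \d = digit, \s = space symbol (space, tab, newline), \w = "wordly" character
const digitSpaceWord = /\d\s\w/;

// A digit, a space, a Latin letter: matches
const plain = "1 a".match(digitSpaceWord);
console.log(plain[0]); // 1 a

// A tab counts as \s, an underscore counts as \w: matches too
const tabbed = "7\t_".match(digitSpaceWord);
console.log(tabbed !== null); // true

// A Cyrillic letter is not \w, so there is no match
const cyrillicTail = "1 я".match(digitSpaceWord);
console.log(cyrillicTail); // null
```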
@ -115,7 +115,7 @@ alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
Once again, let's note that `pattern:\b` makes the search engine test for a boundary, so `pattern:Java\b` finds `match:Java` only when it is followed by a word boundary, but it does not add a character to the result.
Usually we use `\b` to find standalone English words. So that if we want `"Java"` language then `pattern:\bJava\b` finds exactly a standalone word and ignores it when it's a part of `"JavaScript"`.
Usually we use `\b` to find standalone English words. So if we want the `"Java"` language, then `pattern:\bJava\b` finds exactly the standalone word and ignores it when it's a part of another word, e.g. it won't match `match:Java` in `subject:JavaScript`.
Another example: a regexp `pattern:\b\d\d\b` looks for standalone two-digit numbers. In other words, it requires that before and after `pattern:\d\d` there is a symbol different from `\w` (or the beginning/end of the string).
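To illustrate the "symbol different from `\w` or beginning/end of the string" rule, here is a small sketch with hypothetical sample strings. Note that a letter or an underscore next to the digits prevents the match, because both are `\w`.

```js
// Standalone two-digit numbers
const twoDigits = /\b\d\d\b/g;

// The string start and end count as boundaries; "345" is too long
const atEdges = "12 345 67".match(twoDigits);
console.log(atEdges); // 12,67 when shown with alert

// "a12": the letter "a" is \w, so there is no boundary before "1"
// "3_4": the underscore is \w and it is not two adjacent digits anyway
// "56!": a space before and "!" after, both non-\w, so it matches
const mixed = "a12 3_4 56!".match(twoDigits);
console.log(mixed); // 56
```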
@ -125,6 +125,8 @@ alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78
```warn header="Word boundary doesn't work for non-Latin alphabets"
The word boundary check `\b` tests for a boundary between `\w` and something else. But `\w` means an English letter (or a digit or an underscore), so the test won't work for other characters (like Cyrillic or hieroglyphs).
Later we'll come across Unicode character classes that allow us to solve a similar task for different languages.
```
@ -223,13 +225,14 @@ alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for
Usually a dot doesn't match a newline character.
For instance, this doesn't match:
For instance, `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline.
This doesn't match:
```js run
alert( "A\nB".match(/A.B/) ); // null (no match)
// a space character would match
// or a letter, but not \n
// a space character would match, or a letter, but not \n
```
Sometimes that's inconvenient: we really want "any character", newline included.
@ -240,7 +243,6 @@ That's what `s` flag does. If a regexp has it, then the dot `"."` match literall
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
```
## Summary
There exist the following character classes:
@ -255,7 +257,9 @@ There exist following character classes:
...But that's not all!
Modern JavaScript also allows to look for characters by their Unicode properties, for instance:
The Unicode encoding, used by JavaScript for strings, provides many properties for characters, like: which language the letter belongs to (if it is a letter), whether it is a punctuation sign, etc.
Modern JavaScript allows us to use these properties in regexps (with the `u` flag) to look for characters, for instance:
- A Cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{Pd}`.
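As a rough sketch of these two properties in action (the sample string and variable names are assumptions for illustration), note that `\p{...}` only works when the regexp has the `u` flag, and that property names and values are case-sensitive:

```js
const sample = "Привет - Hi — ok";

// Every Cyrillic letter in the string
const cyrillicLetters = sample.match(/\p{sc=Cyrillic}/gu);
console.log(cyrillicLetters.join("")); // Привет

// Every dash: both the small hyphen and the long dash
const dashes = sample.match(/\p{Dash_Punctuation}/gu);
console.log(dashes.length); // 2
```

Without the `u` flag, `pattern:\p` is not treated as a Unicode property escape, so the same patterns would not find these characters.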