regexp draft

Ilya Kantor 2019-03-02 12:17:42 +03:00
parent 65184edf76
commit 7888439420
4 changed files with 42 additions and 41 deletions


@ -92,7 +92,7 @@ alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
In regular expressions these can be set by `\p{…}`.
In regular expressions these can be set by `\p{…}`. The flag `'u'` is required.
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`; there are shorter aliases for almost every property.
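As a minimal sketch of the point above (the sample string and variable names are assumptions made for illustration), `\p{L}` with the `u` flag picks out letters from different alphabets, while the same pattern without the flag is treated as plain text:

```js
// Illustrative sample string (an assumption, not from the original text).
const sample = "A ბ ㄱ 1 _";

// With the 'u' flag, \p{L} matches a letter from any language.
const letters = sample.match(/\p{L}/gu);
console.log(letters); // the three letters: A, ბ, ㄱ

// Without the 'u' flag, \p loses its special meaning,
// so the pattern matches the literal text "p{L}" instead.
const matchesLiteral = /\p{L}/.test("p{L}");
console.log(matchesLiteral); // true
```

The digit and the underscore are skipped because they are not in the `Letter` category.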
@ -121,13 +121,24 @@ You could also explore properties at [Character Property Index](http://unicode.o
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
```
There are also other derived categories, like `Alphabetic` (`Alpha`), which includes Letters `L`, plus letter numbers `Nl`, plus some other symbols `Other_Alphabetic` (`OAlpha`).
There are also other derived categories, like:
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. Roman numerals such as Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`.
- ...Unicode is a big beast; it includes a lot of properties.
Unicode is a big beast; it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
One of the properties is `Script` (`sc`), a collection of letters and other written signs used to represent textual information in one or more writing systems. There are about 150 scripts, including Cyrillic, Greek, Arabic, Han (Chinese), etc.; the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
```js run
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
The `Script` property needs a value, e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`.
alert("color: #123ABC".match(reg)); // 123ABC
```
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese), etc.; the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
To search for certain scripts, we should supply `Script=<value>`, e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc.
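A short sketch of searching by script, assuming a made-up sample text that mixes Cyrillic, Han and Latin (the variable names are illustrative only):

```js
// Illustrative sample text mixing three scripts (an assumption for this sketch).
const text = "Привет, 你好, hello";

// One or more consecutive characters of the given script; 'u' is required.
const cyrillicWords = text.match(/\p{sc=Cyrillic}+/gu);
const hanWords = text.match(/\p{sc=Han}+/gu);

console.log(cyrillicWords); // ["Привет"]
console.log(hanWords); // ["你好"]
```

The commas and spaces are left out because they are shared between writing systems and do not belong to either script.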
### Universal \w
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with Unicode-aware regexps, e.g. Perl.