en.javascript.info/5-regular-expressions/20-regexp-unicode/article.md
2019-03-02 23:36:53 +03:00

# Unicode: flag "u", character properties "\\p"
The unicode flag `/.../u` enables correct support of surrogate pairs.
Surrogate pairs are explained in the chapter <info:string>.
Let's briefly recall them here. In short, characters are normally encoded with 2 bytes. That gives us a maximum of 65536 characters. But there are more characters in the world.
So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).
Here are the unicode values to compare:
| Character | Unicode | Bytes |
|------------|---------|--------|
| `a` | 0x0061 | 2 |
| `≈` | 0x2248 | 2 |
|`𝒳`| 0x1d4b3 | 4 |
|`𝒴`| 0x1d4b4 | 4 |
|`😄`| 0x1f604 | 4 |
So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.
Unicode is designed in such a way that the 4-byte characters only have a meaning as a whole.
In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that there are two characters here:
```js run
alert('😄'.length); // 2
alert('𝒳'.length); // 2
```
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (a so-called "surrogate pair").
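As a small sketch of the difference, the snippet below compares `length` with methods that work on whole code points. It uses `console.log` instead of `alert` so that it also runs outside the browser; the variable name `smile` is only for illustration.

```js
const smile = '😄';

// length counts 2-byte code units, so the surrogate pair is counted twice
console.log(smile.length); // 2

// string iteration (used by the spread syntax) goes by whole code points
console.log([...smile].length); // 1

// charCodeAt reads one half of the pair, codePointAt reads the pair as a whole
console.log(smile.charCodeAt(0).toString(16));  // d83d
console.log(smile.charCodeAt(1).toString(16));  // de04
console.log(smile.codePointAt(0).toString(16)); // 1f604
```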
Normally, regular expressions also treat "long characters" as two 2-byte ones.
That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")
```
The result is wrong, because by default the regexp engine does not understand surrogate pairs.
So, it thinks that `[𝒳𝒴]` are not two, but four characters:
1. the left half of `𝒳` `(1)`,
2. the right half of `𝒳` `(2)`,
3. the left half of `𝒴` `(3)`,
4. the right half of `𝒴` `(4)`.
We can list them like this:
```js run
for (let i = 0; i < '𝒳𝒴'.length; i++) {
  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}
```
So it finds only the "left half" of `𝒳`.
In other words, the search works like `'12'.match(/[1234]/)`: only `1` is returned.
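To make the "half-character" visible, we can inspect what was actually matched. This is only an illustrative check (the name `result` is arbitrary, and `console.log` is used so the snippet also runs outside the browser):

```js
const result = '𝒳'.match(/[𝒳𝒴]/);

// the match is a single 2-byte unit: the left half of 𝒳
console.log(result[0].length);        // 1
console.log(result[0].charCodeAt(0)); // 55349

// so it is not equal to the full character
console.log(result[0] === '𝒳');       // false
```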
## The "u" flag
The `/.../u` flag fixes that.
It enables surrogate pairs in the regexp engine, so the result is correct:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
```
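Character sets are not the only place where the flag matters. As a rough sketch, here is how the dot and a quantifier behave with and without `u` (the strings are chosen just for illustration):

```js
// without "u" the dot matches one 2-byte unit, but the string has two of them
console.log(/^.$/.test('😄'));  // false
console.log(/^.$/u.test('😄')); // true

// without "u" the quantifier {2} applies only to the right half of the pair
console.log(/^😄{2}$/.test('😄😄'));  // false
console.log(/^😄{2}$/u.test('😄😄')); // true
```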
Let's see one more example.
If we forget the `u` flag and accidentally use surrogate pairs, then we can get an error:
```js run
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
```
Normally, regexps understand `[a-z]` as a "range of characters with codes between the codes of `a` and `z`".
But without the `u` flag, a surrogate pair is assumed to be a "pair of independent characters", so `[𝒳-𝒴]` is like `[<55349><56499>-<55349><56500>]` (each surrogate pair is replaced with the codes of its two halves). Now we can clearly see that the range `56499-55349` is unacceptable, as the left border of a range must not be greater than the right one.
Using the `u` flag makes it work right:
```js run
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
```
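If a pattern comes from a variable, the same error can be observed with `try..catch`. The helper `compiles` below is hypothetical, written only for this illustration; the exact error message differs between engines, so the sketch checks only the error type.

```js
// Hypothetical helper: returns true if the pattern compiles with the given flags
function compiles(source, flags = '') {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    // an invalid pattern throws SyntaxError; anything else is unexpected
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(compiles('[𝒳-𝒴]'));      // false: the range 56499-55349 is out of order
console.log(compiles('[𝒳-𝒴]', 'u')); // true: a range between two whole characters
```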
## Unicode character properties
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
In regular expressions these can be matched with `\p{…}`. The flag `u` is required.
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`. There are shorter aliases for almost every property.
Here's the main tree of properties:
- Letter `L`:
  - lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo`
- Number `N`:
  - decimal digit `Nd`, letter number `Nl`, other `No`
- Punctuation `P`:
  - connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po`
- Mark `M` (accents etc):
  - spacing combining `Mc`, enclosing `Me`, non-spacing `Mn`
- Symbol `S`:
  - currency `Sc`, modifier `Sk`, math `Sm`, other `So`
- Separator `Z`:
  - line `Zl`, paragraph `Zp`, space `Zs`
- Other `C`:
  - control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`
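A few of the categories from the tree can be tried out like this. The sample characters are picked only for illustration, and `console.log` is used so the snippet also runs outside the browser:

```js
console.log(/\p{Lu}/u.test('A'));      // true: uppercase letter
console.log(/\p{Lu}/u.test('a'));      // false: "a" is a lowercase letter, Ll
console.log(/\p{L}/u.test('я'));       // true: a letter in any language
console.log(/\p{Nd}/u.test('\u0968')); // true: Devanagari digit two is a decimal digit
console.log(/\p{Sc}/u.test('€'));      // true: currency symbol
console.log(/\p{Sm}/u.test('+'));      // true: math symbol
console.log(/\p{Pd}/u.test('-'));      // true: dash punctuation
console.log(/\p{Zs}/u.test(' '));      // true: space separator
```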
```smart header="More information"
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
```
There are also other derived categories, like:
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. Roman numerals such as Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`.
- ...Unicode is a big beast, it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
```js run
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
alert("color: #123ABC".match(reg)); // 123ABC
```
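The difference between `Alphabetic` and `Letter` can be checked on the Roman numeral mentioned above. A minimal sketch (the variable name `twelve` is only for illustration):

```js
const twelve = 'Ⅻ'; // Roman numeral twelve, a single code point

console.log(/\p{L}/u.test(twelve));     // false: it is not a letter
console.log(/\p{Nl}/u.test(twelve));    // true: it is a "letter number"
console.log(/\p{Alpha}/u.test(twelve)); // true: Alphabetic includes Nl
```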
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
To search for certain scripts, we should supply `Script=<value>` (or the shorter alias `sc=<value>`), e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc.
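For example, here is a short sketch that picks words of different scripts out of one string (the sample string is chosen just for illustration):

```js
const str = 'Hello Привет 你好';

console.log(str.match(/\p{sc=Cyrillic}+/u)[0]);  // Привет
console.log(str.match(/\p{sc=Han}+/u)[0]);       // 你好
console.log(str.match(/\p{Script=Latin}+/u)[0]); // Hello
```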
### Universal \w
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.
```
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
```
Let's decipher it. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`.
So the character set includes:
- `Alphabetic` for letters,
- `Mark` for accents, as in Unicode accents may be represented by separate code points,
- `Decimal_Number` for numbers,
- `Connector_Punctuation` for the `'_'` character and the like,
- `Join_Control` - two special code points with hex codes `200c` and `200d`, used in ligatures e.g. in Arabic.
Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)):
```js run
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
```
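For comparison, here is a sketch of how the plain `pattern:\w` behaves on the same string, and on a word with an accent stored as a separate code point. The names `universalWord` and `decomposed` are only for illustration.

```js
const universalWord = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;
const str = 'Hello Привет 你好 123_456';

// plain \w knows only Latin letters, digits and the underscore
console.log(str.match(/\w+/g).join(','));        // Hello,123_456
console.log(str.match(universalWord).join(',')); // Hello,Привет,你好,123_456

// "café" written as "e" followed by a combining acute accent (code 0301)
const decomposed = 'cafe\u0301';
console.log(decomposed.match(/\w+/g)[0].length);        // 4: the accent is left out
console.log(decomposed.match(universalWord)[0].length); // 5: Mark covers the accent
```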