en.javascript.info/5-regular-expressions/20-regexp-unicode/article.md
2019-03-02 23:36:53 +03:00

6.5 KiB
Raw Blame History

Unicode: flag "u", character properties "\p"

The unicode flag /.../u enables the correct support of surrogate pairs.

Surrogate pairs are explained in the chapter info:string.

Let's briefly remind them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world.

So certain rare characters are encoded with 4 bytes, like 𝒳 (mathematical X) or 😄 (a smile).

Here are the unicode values to compare:

Character Unicode Bytes
a 0x0061 2
0x2248 2
𝒳 0x1d4b3 4
𝒴 0x1d4b4 4
😄 0x1f604 4

So characters like a and occupy 2 bytes, and those rare ones take 4.

The unicode is made in such a way that the 4-byte characters only have a meaning as a whole.

In the past JavaScript did not know about that, and many string methods still have problems. For instance, length thinks that here are two characters:

alert('😄'.length); // 2
alert('𝒳'.length); // 2

...But we can see that there's only one, right? The point is that length treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair").

Normally, regular expressions also treat "long characters" as two 2-byte ones.

That leads to odd results, for instance let's try to find pattern:[𝒳𝒴] in the string subject:𝒳:

alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")

The result is wrong, because by default the regexp engine does not understand surrogate pairs.

So, it thinks that [𝒳𝒴] are not two, but four characters:

  1. the left half of 𝒳 (1),
  2. the right half of 𝒳 (2),
  3. the left half of 𝒴 (3),
  4. the right half of 𝒴 (4).

We can list them like this:

for(let i=0; i<'𝒳𝒴'.length; i++) {
  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
};

So it finds only the "left half" of 𝒳.

In other words, the search works like '12'.match(/[1234]/): only 1 is returned.

The "u" flag

The /.../u flag fixes that.

It enables surrogate pairs in the regexp engine, so the result is correct:

alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳

Let's see one more example.

If we forget the u flag and occasionally use surrogate pairs, then we can get an error:

'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class

Normally, regexps understand [a-z] as a "range of characters with codes between codes of a and z.

But without u flag, surrogate pairs are assumed to be a "pair of independant characters", so [𝒳-𝒴] is like [<55349><56499>-<55349><56500>] (replaced each surrogate pair with code points). Now we can clearly see that the range 56499-55349 is unacceptable, as the left range border must be less than the right one.

Using the u flag makes it work right:

alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴

Unicode character properies

Unicode, the encoding format used by Javascript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" character belongs to, and a variety of technical details.

In regular expressions these can be set by \p{…}. And there must be flag 'u'.

For instance, \p{Letter} denotes a letter in any of language. We can also use \p{L}, as L is an alias of Letter, there are shorter aliases for almost every property.

Here's the main tree of properties:

  • Letter L:
    • lowercase Ll, modifier Lm, titlecase Lt, uppercase Lu, other Lo
  • Number N:
    • decimal digit Nd, letter number Nl, other No:
  • Punctuation P:
    • connector Pc, dash Pd, initial quote Pi, final quote Pf, open Ps, close Pe, other Po
  • Mark M (accents etc):
    • spacing combining Mc, enclosing Me, non-spacing Mn
  • Symbol S:
    • currency Sc, modifier Sk, math Sm, other So
  • Separator Z:
    • line Zl, paragraph Zp, space Zs
  • Other C:
    • control Cc, format Cf, not assigned Cn, private use Co, surrogate Cs.
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.

You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).

For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.

There are also other derived categories, like:

  • Alphabetic (Alpha), includes Letters L, plus letter numbers Nl (e.g. roman numbers Ⅻ), plus some other symbols Other_Alphabetic (OAltpa).
  • Hex_Digit includes hexadimal digits: 0-9, a-f.
  • ...Unicode is a big beast, it includes a lot of properties.

For instance, let's look for a 6-digit hex number:

let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is requireds

alert("color: #123ABC".match(reg)); // 123ABC

There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the list is long.

To search for certain scripts, we should supply Script=<value>, e.g. to search for cyrillic letters: \p{sc=Cyrillic}, for Chinese glyphs: \p{sc=Han}, etc.

Universal \w

Let's make a "universal" regexp for pattern:\w, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.

/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u

Let's decipher. Remember, pattern:\w is actually the same as pattern:[a-zA-Z0-9_].

So the character set includes:

  • Alphabetic for letters,
  • Mark for accents, as in Unicode accents may be represented by separate code points,
  • Decimal_Number for numbers,
  • Connector_Punctuation for the '_' character and alike,
  • Join_Control - two special code points with hex codes 200c and 200d, used in ligatures e.g. in arabic.

Or, if we replace long names with aliases (a list of aliases here):

let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;

let str = `Hello Привет 你好 123_456`;

alert( str.match(regexp) ); // Hello,Привет,你好,123_456