en.javascript.info/9-regular-expressions/21-regexp-unicode-properties/article.md
Ilya Kantor 7d6d4366a3 minor
2019-05-21 18:05:46 +03:00


Unicode character properties \p

Unicode, the character standard behind JavaScript strings, defines many properties for characters (or, technically, code points). They describe which "categories" a character belongs to, along with a variety of technical details.

In regular expressions, characters can be matched by these properties using \p{…}. The regular expression must have the 'u' flag.

For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.
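To make the role of the flag concrete, here is a minimal sketch. The sample string is made up for illustration and is not part of the original article.

```javascript
// Illustrative sample: a Latin, a Cyrillic and a Han letter, plus two digits.
const sample = "A б 你 1 2";

// With the 'u' flag, \p{L} matches a letter from any script.
const letters = sample.match(/\p{L}/gu);
console.log(letters); // [ 'A', 'б', '你' ]

// The long name gives the same result as the alias.
const lettersLong = sample.match(/\p{Letter}/gu);
console.log(lettersLong); // [ 'A', 'б', '你' ]

// Without 'u', \p is not a property escape:
// the pattern matches the literal text "p{L}" instead.
const literalMatch = /\p{L}/.test("p{L}");
const letterMatch = /\p{L}/.test("A");
console.log(literalMatch, letterMatch); // true false
```

Forgetting the flag does not raise an error, so the pattern silently stops matching letters.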

Here's the main tree of properties:

  • Letter L:
    • lowercase Ll, modifier Lm, titlecase Lt, uppercase Lu, other Lo
  • Number N:
    • decimal digit Nd, letter number Nl, other No
  • Punctuation P:
    • connector Pc, dash Pd, initial quote Pi, final quote Pf, open Ps, close Pe, other Po
  • Mark M (accents etc):
    • spacing combining Mc, enclosing Me, non-spacing Mn
  • Symbol S:
    • currency Sc, modifier Sk, math Sm, other So
  • Separator Z:
    • line Zl, paragraph Zp, space Zs
  • Other C:
    • control Cc, format Cf, not assigned Cn, private use Co, surrogate Cs
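The subcategory aliases from the tree can be used in the same way as \p{L}. A small sketch, with a sample string invented for illustration:

```javascript
// Illustrative sample mixing Latin and Cyrillic letters, digits and symbols.
const text = "Total: 25€, Anna-Мария";

// Lu: uppercase letters only (Latin T and A, Cyrillic М).
const upper = text.match(/\p{Lu}/gu);
console.log(upper); // [ 'T', 'A', 'М' ]

// Nd: decimal digits.
const digits = text.match(/\p{Nd}/gu);
console.log(digits); // [ '2', '5' ]

// Sc: currency symbols.
const currency = text.match(/\p{Sc}/gu);
console.log(currency); // [ '€' ]

// Pd: dash punctuation.
const dashes = text.match(/\p{Pd}/gu);
console.log(dashes); // [ '-' ]
```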
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.

You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).

For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.

There are also other derived categories, like:

  • Alphabetic (Alpha), includes Letters L, plus letter numbers Nl (e.g. the Roman numeral Ⅻ), plus some other symbols Other_Alphabetic (OAlpha).
  • Hex_Digit includes hexadecimal digits: 0-9, a-f, A-F (and their fullwidth forms).
  • ...Unicode is a big beast, it includes a lot of properties.

For instance, let's look for a 6-digit hex number:

let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required

alert("color: #123ABC".match(reg)); // 123ABC
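The derived properties can be checked the same way. The sketch below uses the Roman numeral Ⅻ mentioned above to show how Alphabetic differs from Letter; the variable names are illustrative only.

```javascript
// Ⅻ (U+216B) is a letter number (Nl), not a Letter (L).
const roman = "Ⅻ";

const isLetter = /\p{L}/u.test(roman);          // false
const isLetterNumber = /\p{Nl}/u.test(roman);   // true
const isAlphabetic = /\p{Alpha}/u.test(roman);  // true
console.log(isLetter, isLetterNumber, isAlphabetic); // false true true

// Hex_Digit accepts both letter cases, but nothing past "f".
const hexMatches = "a F g 9".match(/\p{Hex_Digit}/gu);
console.log(hexMatches); // [ 'a', 'F', '9' ]
```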

There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the list is long.

To search for characters in certain scripts ("alphabets"), we should supply Script=<value>, or the shorter alias sc=<value>. For example, \p{sc=Cyrillic} matches Cyrillic letters, \p{sc=Han} matches Chinese characters, and so on:

let regexp = /\p{sc=Han}+/gu; // get Chinese words

let str = `Hello Привет 你好 123_456`;

alert( str.match(regexp) ); // 你好
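As a sketch of how the same string splits by script, the example below runs three script patterns over it. It also uses the long form Script= to show that it behaves like sc=.

```javascript
const str = "Hello Привет 你好 123_456";

// Short alias sc= ...
const cyrillic = str.match(/\p{sc=Cyrillic}+/gu);
console.log(cyrillic); // [ 'Привет' ]

// ... and the long form Script= are interchangeable.
const han = str.match(/\p{Script=Han}+/gu);
console.log(han); // [ '你好' ]

// Digits and the underscore belong to no particular script,
// so only "Hello" is matched here.
const latin = str.match(/\p{sc=Latin}+/gu);
console.log(latin); // [ 'Hello' ]
```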

Building multi-language \w

The pattern \w means "word characters", but it doesn't work for languages that use non-Latin alphabets, such as Cyrillic and others. It's just a shorthand for [a-zA-Z0-9_], so \w+ won't find any Chinese words etc.

Let's make a "universal" regexp that looks for word characters in any language. That's easy to do using Unicode properties:

/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u

Let's decipher. Just as \w is the same as [a-zA-Z0-9_], we're making a set of our own that includes:

  • Alphabetic for letters,
  • Mark for accents, as in Unicode accents may be represented by separate code points,
  • Decimal_Number for decimal digits,
  • Connector_Punctuation for the '_' character and the like,
  • Join_Control - two special code points with hex codes 200c and 200d, used in ligatures, e.g. in Arabic.

Or, if we replace long names with aliases (a list of aliases here):

let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;

let str = `Hello Привет 你好 123_456`;

alert( str.match(regexp) ); // Hello,Привет,你好,123_456
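To see the difference from the built-in \w, the sketch below wraps the same character set in a small helper. The names WORD and extractWords are hypothetical, chosen only for this example.

```javascript
// Same character set as above; WORD and extractWords are hypothetical names.
const WORD = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;

// Returns all "words" in the text, or an empty array if there are none
// (String.prototype.match returns null when a global regexp finds nothing).
function extractWords(text) {
  return text.match(WORD) || [];
}

const str = "Hello Привет 你好 123_456";

// Plain \w only understands Latin letters, digits and the underscore.
const asciiWords = str.match(/\w+/g);
console.log(asciiWords); // [ 'Hello', '123_456' ]

// The Unicode-aware set finds words in every script.
const allWords = extractWords(str);
console.log(allWords); // [ 'Hello', 'Привет', '你好', '123_456' ]

// No word characters at all: the helper returns an empty array.
const none = extractWords("... !?");
console.log(none); // []
```

Returning an empty array instead of null is a design choice of this sketch; it lets callers iterate over the result without a null check.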