Unicode character properties \p
Unicode, the character standard used by JavaScript strings, defines many properties for each character (or, technically, code point). They describe which "categories" a character belongs to, along with a variety of technical details.
In regular expressions a property is matched with \p{…}, and the regexp must have the flag 'u'.
For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.
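A minimal sketch of both points, the alias and the required flag. The sample string and variable names are illustrative assumptions, and console.log is used instead of alert so the snippet also runs outside a browser:

```javascript
// Illustrative sample: a Latin, a Georgian and a Korean letter, plus a digit.
const sample = "A ბ ㄱ 1";

// With the 'u' flag, the long name and its alias match the same characters.
const longForm = sample.match(/\p{Letter}/gu);
const shortForm = sample.match(/\p{L}/gu);
console.log(longForm);  // [ 'A', 'ბ', 'ㄱ' ]
console.log(shortForm); // [ 'A', 'ბ', 'ㄱ' ]

// Without 'u', \p is not treated as a property escape,
// so nothing in this sample matches.
const withoutFlag = sample.match(/\p{L}/g);
console.log(withoutFlag); // null
```

The digit is skipped in both cases because it belongs to the Number category, not Letter.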
Here's the main tree of properties:
- Letter (L): lowercase (Ll), modifier (Lm), titlecase (Lt), uppercase (Lu), other (Lo)
- Number (N): decimal digit (Nd), letter number (Nl), other (No)
- Punctuation (P): connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)
- Mark (M), accents etc: spacing combining (Mc), enclosing (Me), non-spacing (Mn)
- Symbol (S): currency (Sc), modifier (Sk), math (Sm), other (So)
- Separator (Z): line (Zl), paragraph (Zp), space (Zs)
- Other (C): control (Cc), format (Cf), not assigned (Cn), private use (Co), surrogate (Cs).
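To make the subcategories concrete, here is a small sketch that picks a few of them out of one string. The sample text and variable names are assumptions chosen for illustration:

```javascript
// Illustrative sample with Latin and Cyrillic words, digits and currency signs.
const text = "Price Цена: €25 or $30";

// Lu: uppercase letters, regardless of the script.
const uppercase = text.match(/\p{Lu}/gu);
// Nd: decimal digits.
const digits = text.match(/\p{Nd}/gu);
// Sc: currency symbols.
const currency = text.match(/\p{Sc}/gu);

console.log(uppercase); // [ 'P', 'Ц' ]
console.log(digits);    // [ '2', '5', '3', '0' ]
console.log(currency);  // [ '€', '$' ]
```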
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
There are also other derived categories, like:
- Alphabetic (Alpha), includes Letters L, plus letter numbers Nl (e.g. roman numbers Ⅻ), plus some other symbols Other_Alphabetic (OAlpha).
- Hex_Digit includes hexadecimal digits: 0-9, a-f, A-F.
- ...
Unicode is a big beast, it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
alert("color: #123ABC".match(reg)); // 123ABC
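The same kind of check shows how the derived Alphabetic property differs from Letter. This is a minimal sketch using the roman numeral mentioned earlier; the variable names are illustrative:

```javascript
// U+216B ROMAN NUMERAL TWELVE: its general category is Nl (letter number).
const roman = "Ⅻ";

const isLetter = /\p{L}/u.test(roman);
const isLetterNumber = /\p{Nl}/u.test(roman);
const isAlphabetic = /\p{Alphabetic}/u.test(roman);

console.log(isLetter);       // false: not in the Letter category
console.log(isLetterNumber); // true
console.log(isAlphabetic);   // true: Alphabetic also covers Nl
```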
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the list is long.
To search for characters in a given script, we need to supply Script=<value> (or the alias sc=<value>), e.g. for Cyrillic letters: \p{sc=Cyrillic}, for Chinese hieroglyphs: \p{sc=Han}, etc:
let regexp = /\p{sc=Han}+/gu; // get Chinese words
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // 你好
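For comparison, a short sketch running other Script values against the same string, using both the long name Script and the alias sc. The variable names are illustrative:

```javascript
const str = "Hello Привет 你好 123_456";

// Runs of Cyrillic letters.
const cyrillic = str.match(/\p{sc=Cyrillic}+/gu);
// Script= is the long form of sc=.
const latin = str.match(/\p{Script=Latin}+/gu);
// No Greek characters in the string, so match returns null.
const greek = str.match(/\p{sc=Greek}+/gu);

console.log(cyrillic); // [ 'Привет' ]
console.log(latin);    // [ 'Hello' ]
console.log(greek);    // null
```

The digits and the underscore are not picked up by any of these, as they are shared between scripts rather than belonging to Latin, Cyrillic or Greek.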
Building multi-language \w
Let's make a "universal" regexp for \w that works for any language. That task has a standard solution in many programming languages with Unicode-aware regexps, e.g. Perl.
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
Let's decipher. Remember, \w is actually the same as [a-zA-Z0-9_].
So the character set includes:
- Alphabetic for letters,
- Mark for accents, as in Unicode accents may be represented by separate code points,
- Decimal_Number for numbers,
- Connector_Punctuation for the '_' character and the like,
- Join_Control: two special code points with hex codes 200c and 200d, used in ligatures, e.g. in Arabic.
Or, if we replace long names with aliases (a list of aliases here):
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
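To see what this buys us over the built-in class, here is a minimal sketch that wraps the pattern in a helper and compares it with plain \w. The names UNICODE_WORD and unicodeWords are hypothetical, introduced only for this illustration:

```javascript
// The character class from the text, without the capturing group.
const UNICODE_WORD = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;

// Hypothetical helper: returns all "words" in the text, or an empty array.
function unicodeWords(text) {
  return text.match(UNICODE_WORD) ?? [];
}

const str = "Hello Привет 你好 123_456";

// Plain \w only knows Latin letters, digits and the underscore.
const asciiWords = str.match(/\w+/g);
const allWords = unicodeWords(str);

console.log(asciiWords); // [ 'Hello', '123_456' ]
console.log(allWords);   // [ 'Hello', 'Привет', '你好', '123_456' ]

// "é" written as "e" plus a combining accent (U+0301) stays inside one word
// because the class includes Mark.
const decomposed = unicodeWords("cafe\u0301 bar");
console.log(decomposed.length); // 2
```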