Unicode character properties \p
Unicode, the character standard used by JavaScript strings, defines many properties for each character (or, technically, code point). They describe which "categories" a character belongs to, along with a variety of technical details.
In regular expressions, these properties can be matched with \p{…}. The regexp must have the 'u' flag for that to work.
For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.
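As a minimal sketch of the two points above (the sample strings are made up for illustration, and console.log is used instead of alert so the snippet also runs outside a browser):

```javascript
// \p{L} (alias of \p{Letter}) matches letters from any script when the 'u' flag is set.
const letters = "A б ゐ 7 _".match(/\p{L}/gu);
console.log(letters); // [ 'A', 'б', 'ゐ' ]

// Without the 'u' flag, \p is not a property escape:
// the pattern just looks for the literal text "p{L}", which is not in the string.
const withoutFlag = "A б ゐ".match(/\p{L}/g);
console.log(withoutFlag); // null
```

The digit and the underscore are skipped in the first call because they belong to the Number and Punctuation categories, not Letter.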
Here's the main tree of properties:
- Letter (L): lowercase (Ll), modifier (Lm), titlecase (Lt), uppercase (Lu), other (Lo)
- Number (N): decimal digit (Nd), letter number (Nl), other (No)
- Punctuation (P): connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)
- Mark (M) (accents etc): spacing combining (Mc), enclosing (Me), non-spacing (Mn)
- Symbol (S): currency (Sc), modifier (Sk), math (Sm), other (So)
- Separator (Z): line (Zl), paragraph (Zp), space (Zs)
- Other (C): control (Cc), format (Cf), not assigned (Cn), private use (Co), surrogate (Cs)
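To get a feel for the tree, here is a small sketch that picks a few categories out of one string. The sample text and variable names are only illustrative:

```javascript
const sample = "Price: 25 €, Ⅻ items!";

const upper = sample.match(/\p{Lu}/gu);         // uppercase letters
const digits = sample.match(/\p{Nd}/gu);        // decimal digits
const letterNumbers = sample.match(/\p{Nl}/gu); // letter numbers, e.g. roman numerals
const currency = sample.match(/\p{Sc}/gu);      // currency symbols
const punctuation = sample.match(/\p{P}/gu);    // any punctuation subcategory

console.log(upper);         // [ 'P' ]
console.log(digits);        // [ '2', '5' ]
console.log(letterNumbers); // [ 'Ⅻ' ]
console.log(currency);      // [ '€' ]
console.log(punctuation);   // [ ':', ',', '!' ]
```

A parent category such as P matches every character in any of its subcategories, so there is no need to list Pc, Pd and the rest one by one.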
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
There are also other derived categories, like:
- Alphabetic (Alpha) includes Letters (L), plus letter numbers (Nl) (e.g. the roman numeral Ⅻ), plus some other symbols, Other_Alphabetic (OAlpha).
- Hex_Digit includes hexadecimal digits: 0-9, a-f, A-F (and their fullwidth forms).
- ...

Unicode is a big beast, it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
alert("color: #123ABC".match(reg)); // 123ABC
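The difference between Letter and the derived Alphabetic category can be checked the same way. This is a small sketch using the roman numeral Ⅻ as the test character, with console.log in place of alert:

```javascript
const twelve = "Ⅻ"; // U+216B ROMAN NUMERAL TWELVE, general category Nl

const isLetter = /\p{L}/u.test(twelve);         // Nl is not part of Letter
const isLetterNumber = /\p{Nl}/u.test(twelve);  // it is a letter number
const isAlphabetic = /\p{Alpha}/u.test(twelve); // Alphabetic includes Nl

console.log(isLetter, isLetterNumber, isAlphabetic); // false true true
```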
There are also properties with a value. For instance, the Unicode "Script" property (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese), and so on; the list is long.
To search for characters in a certain script, we supply Script=<value> (or the short form sc=<value>), e.g. \p{sc=Cyrillic} for Cyrillic letters, \p{sc=Han} for Chinese characters, etc:
let regexp = /\p{sc=Han}+/gu; // get chinese words
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // 你好
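The same approach works for other scripts, and the long form Script= behaves like the sc= alias. A brief sketch, with illustrative sample strings and console.log instead of alert:

```javascript
const text = "Hello Привет 你好 123_456";

// Short alias: sc=<value>
const cyrillicWords = text.match(/\p{sc=Cyrillic}+/gu);
console.log(cyrillicWords); // [ 'Привет' ]

// Long form: Script=<value>
const greekLetters = "α and β".match(/\p{Script=Greek}/gu);
console.log(greekLetters); // [ 'α', 'β' ]
```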
Building multi-language \w
Let's make a "universal" version of \w that works for any language. This task has a standard solution in many programming languages with Unicode-aware regexps, e.g. Perl:
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
Let's decipher. Remember, \w is actually the same as [a-zA-Z0-9_].
So the character set includes:
- Alphabetic for letters,
- Mark for accents, as in Unicode accents may be represented by separate code points,
- Decimal_Number for numbers,
- Connector_Punctuation for the '_' character and alike,
- Join_Control for two special code points with hex codes 200c and 200d, used in ligatures, e.g. in Arabic.
Or, if we replace the long names with their aliases (the Unicode Character Database linked above lists them):
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
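For comparison, here is a sketch that runs plain \w and the Unicode-aware set side by side. The variable names and the second sample string are illustrative; "cafe\u0301" spells "café" with the accent as a separate combining code point:

```javascript
const asciiWord = /\w+/g;
const unicodeWord = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;

const str = "Hello Привет 你好 123_456";
const asciiMatches = str.match(asciiWord);
const unicodeMatches = str.match(unicodeWord);
console.log(asciiMatches);   // [ 'Hello', '123_456' ]
console.log(unicodeMatches); // [ 'Hello', 'Привет', '你好', '123_456' ]

// The combining accent U+0301 is a Mark (Mn), so \p{M} keeps it attached to the word.
const accented = "cafe\u0301";
const asciiAccented = accented.match(asciiWord);
const unicodeAccented = accented.match(unicodeWord);
console.log(asciiAccented);   // [ 'cafe' ]
console.log(unicodeAccented); // the whole word, including the combining accent
```

Plain \w stops at the first non-ASCII character, while the property-based set treats the Cyrillic and Chinese words, and the accented word, as single matches.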