Unicode: flag "u", character properties "\p"
The unicode flag /.../u enables the correct support of surrogate pairs.
Surrogate pairs are explained in the chapter info:string.
Let's briefly recall them here. In short, characters are normally encoded with 2 bytes. That gives us 65536 characters at most. But there are more characters in the world.
So certain rare characters are encoded with 4 bytes, like 𝒳 (mathematical X) or 😄 (a smile).
Here are the unicode values to compare:
Character | Unicode | Bytes |
---|---|---|
a | 0x0061 | 2 |
≈ | 0x2248 | 2 |
𝒳 | 0x1d4b3 | 4 |
𝒴 | 0x1d4b4 | 4 |
😄 | 0x1f604 | 4 |
So characters like a and ≈ occupy 2 bytes, and those rare ones take 4.
Unicode is designed in such a way that the 4-byte characters only have a meaning as a whole.
In the past JavaScript did not know about that, and many string methods still have problems. For instance, length reports two characters here:
alert('😄'.length); // 2
alert('𝒳'.length); // 2
...But we can see that there's only one, right? The point is that length treats 4 bytes as two 2-byte characters. That's incorrect, because the two halves must be considered only together (a so-called "surrogate pair").
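As a small sketch of the difference, we can count whole characters (code points) instead of 2-byte units. The helper name countCodePoints is made up for this illustration; it relies on the string iterator, which walks a string by code points rather than by 2-byte units.

```javascript
// Hypothetical helper: counts code points, not 2-byte units.
// Spreading a string uses its iterator, which keeps surrogate pairs together.
function countCodePoints(str) {
  return [...str].length;
}

// console.log is used instead of alert so the sketch also runs outside a browser
console.log('😄'.length);                       // 2
console.log(countCodePoints('😄'));             // 1
console.log(countCodePoints('a≈𝒳'));            // 3
console.log('𝒳'.codePointAt(0).toString(16));   // 1d4b3
```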
Normally, regular expressions also treat "long characters" as two 2-byte ones.
That leads to odd results. For instance, let's try to find pattern:[𝒳𝒴] in the string subject:𝒳:
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")
The result is wrong, because by default the regexp engine does not understand surrogate pairs.
So, it thinks that [𝒳𝒴]
are not two, but four characters:
- the left half of 𝒳 (1),
- the right half of 𝒳 (2),
- the left half of 𝒴 (3),
- the right half of 𝒴 (4).
We can list them like this:
for(let i=0; i<'𝒳𝒴'.length; i++) {
alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
};
So it finds only the "left half" of 𝒳.
In other words, the search works like '12'.match(/[1234]/): only 1 is returned.
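To make the "half-character" visible, a small sketch can inspect the match directly. The variable name half is illustrative only.

```javascript
// Without the u flag the character class holds four 2-byte units,
// so the engine matches only the first unit of 𝒳.
const half = '𝒳'.match(/[𝒳𝒴]/)[0];

console.log(half.length);         // 1 (a lone surrogate, not a full character)
console.log(half.charCodeAt(0));  // 55349, the "left half" of 𝒳
console.log(half === '𝒳');        // false
```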
The "u" flag
The /.../u flag fixes that.
It enables surrogate pairs in the regexp engine, so the result is correct:
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
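The flag affects more than character classes. As a minimal sketch, the dot and quantifiers also operate on whole characters only when u is set (the variable names are illustrative):

```javascript
// The dot matches one 2-byte unit without u, one whole character with u.
const dotPlain = /^.$/.test('😄');     // false: the string has two units
const dotUnicode = /^.$/u.test('😄');  // true

// A quantifier applies to the last unit without u, to the whole character with u.
const twicePlain = /^𝒳{2}$/.test('𝒳𝒳');    // false
const twiceUnicode = /^𝒳{2}$/u.test('𝒳𝒳');  // true

console.log(dotPlain, dotUnicode, twicePlain, twiceUnicode);
```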
Let's see one more example.
If we forget the u flag and accidentally use surrogate pairs, then we can get an error:
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
Normally, regexps understand [a-z] as a "range of characters with codes between the codes of a and z".
But without the u flag, surrogate pairs are assumed to be a "pair of independent characters", so [𝒳-𝒴] is like [<55349><56499>-<55349><56500>] (each surrogate pair is replaced with its two codes). Now we can clearly see that the range 56499-55349 is unacceptable, as the left border of a range must not be greater than the right one.
Using the u flag makes it work right:
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
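A regexp literal with an invalid range fails when the script is parsed, so nothing else in it would run. To observe the error safely, a sketch can build the pattern with the RegExp constructor instead. The helper name tryRange is hypothetical.

```javascript
// Hypothetical helper: returns the matched text, or the error name
// if the pattern cannot be compiled with the given flags.
function tryRange(flags) {
  try {
    const regexp = new RegExp('[𝒳-𝒵]', flags);
    const match = '𝒴'.match(regexp);
    return match ? match[0] : null;
  } catch (err) {
    return err.name;
  }
}

console.log(tryRange(''));   // SyntaxError
console.log(tryRange('u'));  // 𝒴
```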
Unicode character properties
Unicode, the standard that JavaScript strings are based on, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
In regular expressions these can be matched with \p{…}.
For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.
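A minimal sketch of \p{L} in use. It assumes an engine that supports Unicode property escapes (they were added in ES2018), and the u flag is required for them to work. The sample string is made up.

```javascript
const text = 'a Б 你 1 _';

// \p{L} matches any letter; \p{Letter} is the long form of the same property.
const shortForm = text.match(/\p{L}/gu);
const longForm = text.match(/\p{Letter}/gu);

console.log(shortForm);  // a,Б,你 (as an array)
console.log(longForm);   // same result

// \P{...} (capital P) is the negation: anything that is not a letter.
console.log('a1'.match(/\P{L}/u)[0]);  // 1
```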
Here's the main tree of properties:
- Letter L:
  - lowercase Ll, modifier Lm, titlecase Lt, uppercase Lu, other Lo
- Number N:
  - decimal digit Nd, letter number Nl, other No
- Punctuation P:
  - connector Pc, dash Pd, initial quote Pi, final quote Pf, open Ps, close Pe, other Po
- Mark M (accents etc):
  - spacing combining Mc, enclosing Me, non-spacing Mn
- Symbol S:
  - currency Sc, modifier Sk, math Sm, other So
- Separator Z:
  - line Zl, paragraph Zp, space Zs
- Other C:
  - control Cc, format Cf, not assigned Cn, private use Co, surrogate Cs
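As an illustration of the subcategories, the sketch below picks characters of several categories out of one string. The sample string and variable names are made up.

```javascript
const sample = 'Price: 25€ or $30, Well-done!';

const upper = sample.match(/\p{Lu}/gu);     // uppercase letters
const digits = sample.match(/\p{Nd}/gu);    // decimal digits
const currency = sample.match(/\p{Sc}/gu);  // currency symbols
const dashes = sample.match(/\p{Pd}/gu);    // dash punctuation

console.log(upper.join(''));     // PW
console.log(digits.join(''));    // 2530
console.log(currency.join(''));  // €$
console.log(dashes.join(''));    // -
```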
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
There are also other derived categories, like Alphabetic (Alpha), that includes Letters L, plus letter numbers Nl, plus some other symbols Other_Alphabetic (OAlpha).
Unicode is a big beast, it includes a lot of properties.
One of the properties is Script (sc), a collection of letters and other written signs used to represent textual information in one or more writing systems. There are about 150 scripts, including Cyrillic, Greek, Arabic, Han (Chinese) etc, the list is long.
The Script property needs a value, e.g. to search for Cyrillic letters: \p{sc=Cyrillic}.
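A short sketch of searching by script. The sample string is made up; sc and Script are interchangeable here.

```javascript
const str = 'Hello Привет 你好';

// sc is the short alias of Script; the long form works as well.
const cyrillic = str.match(/\p{sc=Cyrillic}+/u)[0];
const han = str.match(/\p{Script=Han}+/u)[0];
const latin = str.match(/\p{sc=Latin}+/u)[0];

console.log(cyrillic);  // Привет
console.log(han);       // 你好
console.log(latin);     // Hello
```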
Let's make a "universal" regexp for pattern:\w, for any language. That task has a standard solution in many programming languages with Unicode-aware regexps, e.g. Perl.
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
Let's decipher. Remember, pattern:\w is actually the same as pattern:[a-zA-Z0-9_].
So the character set includes:
- Alphabetic for letters,
- Mark for accents, as in Unicode accents may be represented by separate code points,
- Decimal_Number for numbers,
- Connector_Punctuation for the '_' character and alike,
- Join_Control: two special code points with hex codes 200c and 200d, used in ligatures, e.g. in Arabic.
Or, if we replace long names with aliases (a list of aliases here):
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
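For comparison, here is a short sketch of how the plain \w behaves on the same string. The function name findWords is hypothetical; the pattern inside it is the aliased one shown above.

```javascript
// Hypothetical helper built on the "universal \w" pattern.
function findWords(text) {
  return text.match(/[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu) || [];
}

const input = 'Hello Привет 你好 123_456';
const asciiWords = input.match(/\w+/g);

console.log(asciiWords);        // Hello,123_456 (non-Latin words are skipped)
console.log(findWords(input));  // Hello,Привет,你好,123_456
```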