Strings
In JavaScript, textual data is stored as strings. There is no separate type for a single character.
The internal format for strings is always UTF-16; it is not tied to the page encoding.
Quotes
Let's recall the kinds of quotes.
Strings can be enclosed in single quotes, double quotes or backticks:
let single = 'single-quoted';
let double = "double-quoted";
let backticks = `backticks`;
Single and double quotes are essentially the same. Backticks, however, allow us to embed any expression into the string, including function calls:
function sum(a, b) {
return a + b;
}
alert(`1 + 2 = ${sum(1, 2)}.`); // 1 + 2 = 3.
Another advantage of using backticks is that they allow a string to span multiple lines:
let guestList = `Guests:
* John
* Pete
* Mary
`;
alert(guestList); // a list of guests, multiple lines
If we try to use single or double quotes the same way, there will be an error:
let guestList = "Guests: // SyntaxError (the exact message depends on the engine)
* John";
Single and double quotes date back to the early days of the language, when the need for multiline strings was not taken into account. Backticks appeared much later and thus are more versatile.
Backticks also allow us to specify a "template function" before the first backtick. The syntax is: func`string`. The function func is called automatically, receives the string and embedded expressions and can process them. You can read more in the docs. This is called "tagged templates". The feature makes it easier to wrap strings into custom templating or other functionality, but it is rarely used.
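As a minimal sketch of how a tagged template receives its arguments, consider the example below. The tag function shout is hypothetical (it is not part of the language), and console.log is used instead of alert so that the code also runs outside the browser.

```javascript
// Hypothetical tag function: upper-cases every embedded value.
// "strings" holds the literal chunks, "values" holds the evaluated expressions.
function shout(strings, ...values) {
  return strings.reduce((result, chunk, i) => {
    // There is always one more literal chunk than there are values.
    const value = i < values.length ? String(values[i]).toUpperCase() : "";
    return result + chunk + value;
  }, "");
}

const name = "John";
const greeting = shout`Hello, ${name}! You have ${3} new messages.`;

console.log(greeting); // Hello, JOHN! You have 3 new messages.
```

The literal parts of the string arrive untouched in the first argument, so the function can decide how to combine them with the embedded values.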
Special characters
It is still possible to create multiline strings with single and double quotes, using a so-called "newline character", written as \n, which denotes a line break:
let guestList = "Guests:\n * John\n * Pete\n * Mary";
alert(guestList); // a multiline list of guests
For example, these two calls show the same text:
alert( "Hello\nWorld" ); // two lines using a "newline symbol"
// two lines using a normal newline and backticks
alert( `Hello
World` );
There are other, less common "special" characters as well, here's the list:
Character | Description |
---|---|
\b | Backspace |
\f | Form feed |
\n | New line |
\r | Carriage return |
\t | Tab |
\uNNNN | A Unicode symbol with the hex code NNNN, for instance \u00A9 is the Unicode for the copyright symbol ©. Must be exactly 4 hex digits. |
\u{NNNNNNNN} | A Unicode symbol with the hex code given in braces (the value must not exceed 10FFFF). Some rare characters are encoded with two 16-bit units, taking 4 bytes. This long form requires the braces. |
Examples with unicode:
alert( "\u00A9" ); // ©
alert( "\u{20331}" ); // 𠌱, a rare Chinese hieroglyph (long unicode)
alert( "\u{1F60D}" ); // a smiling face symbol (another long unicode)
All special characters start with a backslash character \. It is also called an "escape character".
We should also use it if we want to insert a quote into the string.
For instance:
alert( 'I\'m the Walrus!' ); // I'm the Walrus!
As you can see, we have to prepend the inner quote with the backslash \', because otherwise it would indicate the end of the string.
Of course, that applies only to quotes that are the same as the enclosing ones. So, as a more elegant solution, we could switch to double quotes or backticks instead:
alert( `I'm the Walrus!` ); // I'm the Walrus!
Note that the backslash \ serves for the correct reading of the string by JavaScript, then disappears. The in-memory string has no \. You can clearly see that in the alert calls from the examples above.
But what if we need to show an actual backslash \ within the string?
That's possible, but we need to double it like \\:
alert( `The backslash: \\` ); // The backslash: \
String length
The length property has the string length:
alert( `My\n`.length ); // 3
Note that \n is a single "special" character, so the length is indeed 3.
length is a property
People with a background in some other languages sometimes mistype by calling str.length() instead of just str.length. That doesn't work.
Please note that str.length is a numeric property, not a function. There is no need to add parentheses after it.
Accessing characters
To get a character at position pos, use square brackets [pos] or call the method str.charAt(pos). The first character starts from the zero position:
let str = `Hello`;
// the first character
alert( str[0] ); // H
alert( str.charAt(0) ); // H
// the last character
alert( str[str.length - 1] ); // o
Square brackets are the modern way of getting a character, while charAt exists mostly for historical reasons.
The only difference between them is that if no character is found, [] returns undefined, and charAt returns an empty string:
let str = `Hello`;
alert( str[1000] ); // undefined
alert( str.charAt(1000) ); // '' (an empty string)
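Besides [] and charAt, characters can also be visited one by one with for..of, and the str.at(pos) method (available in newer engines) accepts negative positions counted from the end. The sketch below assumes an engine that supports str.at. It uses console.log instead of alert so that it also runs outside the browser.

```javascript
const word = "Hello";

// Iterate over the characters one by one.
const chars = [];
for (const char of word) {
  chars.push(char);
}
console.log(chars.join(",")); // H,e,l,l,o

// at() allows negative positions, counted from the end of the string.
const last = word.at(-1);
console.log(last); // o

// Like [], at() returns undefined when there is no such position.
console.log(word.at(1000)); // undefined
```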
Strings are immutable
Strings can't be changed in JavaScript. It is impossible to change a character.
Let's try it to show that it doesn't work:
let str = 'Hi';
str[0] = 'h'; // TypeError in strict mode, silently ignored otherwise
alert( str[0] ); // H, the string is unchanged
The usual workaround is to create a whole new string and assign it to str instead of the old one.
For instance:
let str = 'Hi';
str = 'h' + str[1]; // replace the string
alert( str ); // hi
In the following sections we'll see more examples of that.
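The workaround can be wrapped into a small helper. The function replaceAt below is a hypothetical name used only for illustration. It relies on str.slice, which is covered later in this chapter, and always returns a new string.

```javascript
// Hypothetical helper: returns a copy of str with the character
// at position index replaced by replacement.
function replaceAt(str, index, replacement) {
  return str.slice(0, index) + replacement + str.slice(index + 1);
}

const original = "Hi";
const changed = replaceAt(original, 0, "h");

console.log(changed);  // hi
console.log(original); // Hi, the original string is untouched
```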
Changing the case
Methods toLowerCase() and toUpperCase() change the case:
alert( 'Interface'.toUpperCase() ); // INTERFACE
alert( 'Interface'.toLowerCase() ); // interface
Or, if we want a single character lowercased:
alert( 'Interface'[0].toLowerCase() ); // 'i'
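A common use of these methods is comparing strings regardless of case. The helper equalsIgnoreCase below is a hypothetical name, and the approach is a simple sketch: it does not cover language-specific casing rules.

```javascript
// Hypothetical helper: compares two strings, ignoring letter case.
function equalsIgnoreCase(a, b) {
  return a.toLowerCase() === b.toLowerCase();
}

console.log("Widget" === "WIDGET");                // false
console.log(equalsIgnoreCase("Widget", "WIDGET")); // true
```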
Searching for a substring
There are multiple ways to look for a substring in a string.
str.indexOf
The first method is str.indexOf(substr, pos).
It looks for substr in str, starting from the given position pos, and returns the position where the match was found, or -1 if nothing is found.
For instance:
let str = 'Widget with id';
alert( str.indexOf('Widget') ); // 0, because 'Widget' is found at the beginning
alert( str.indexOf('widget') ); // -1, not found, the search is case-sensitive
alert( str.indexOf("id") ); // 1, "id" is found at the position 1 (..idget with id)
The optional second parameter allows us to start the search from the given position.
For instance, the first occurrence of "id" is at position 1. To look for the next occurrence, let's start the search from position 2:
let str = 'Widget with id';
alert( str.indexOf('id', 2) ) // 12
If we're interested in all occurrences, we can run indexOf in a loop. Every new call is made with the position after the previous match:
let str = 'As sly as a fox, as strong as an ox';
let target = 'as'; // let's look for it
let pos = 0;
while (true) {
let foundPos = str.indexOf(target, pos);
if (foundPos == -1) break;
alert( `Found at ${foundPos}` );
pos = foundPos + 1; // continue the search from the next position
}
The same algorithm can be written more concisely:
let str = "As sly as a fox, as strong as an ox";
let target = "as";
let pos = -1;
while ((pos = str.indexOf(target, pos + 1)) != -1) {
  alert( pos );
}
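The loop can also collect the positions instead of showing them one by one. The function findAll below is a hypothetical helper built on the same indexOf loop. The last two lines use str.lastIndexOf, which performs the same kind of search from the end.

```javascript
// Hypothetical helper: returns the positions of all occurrences of target in str.
function findAll(str, target) {
  const positions = [];
  // An empty target matches at every position, so the loop would never terminate.
  if (target === "") return positions;

  let pos = -1;
  while ((pos = str.indexOf(target, pos + 1)) !== -1) {
    positions.push(pos);
  }
  return positions;
}

const text = "As sly as a fox, as strong as an ox";

console.log(findAll(text, "as")); // [ 7, 17, 27 ]

// Searching backwards: the last match, then the last match at or before position 26.
console.log(text.lastIndexOf("as"));     // 27
console.log(text.lastIndexOf("as", 26)); // 17
```

Note that the capitalized "As" at position 0 is not in the list, because the search is case-sensitive.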
str.lastIndexOf(substr, position)
There is also a similar method str.lastIndexOf(substr, position) that searches from the end of the string to its beginning.
It would list the occurrences in the reverse order.
There is a slight inconvenience with indexOf in the if test. We can't put it in the if like this:
let str = "Widget with id";
if (str.indexOf("Widget")) {
alert("We found it"); // doesn't work!
}
The alert in the example above doesn't show, because str.indexOf("Widget") returns 0 (meaning that it found the match at the starting position). That's correct, but if considers 0 to be false.
So, we should actually check for -1, like this:
let str = "Widget with id";
if (str.indexOf("Widget") != -1) {
  alert("We found it"); // works now!
}
One of the old tricks used here is the [bitwise NOT](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Bitwise_Operators#Bitwise_NOT) `~` operator. It converts the number to a 32-bit integer (removes the decimal part if it exists) and then reverses all bits in its binary representation.
For 32-bit integers the call `~n` means exactly the same as `-(n+1)` (due to the two's complement representation of integers).
For instance:
```js run
alert( ~2 ); // -3, the same as -(2+1)
alert( ~1 ); // -2, the same as -(1+1)
alert( ~0 ); // -1, the same as -(0+1)
*!*
alert( ~-1 ); // 0, the same as -(-1+1)
*/!*
```
As we can see, `~n` is zero only if `n == -1`.
So, the test `if ( ~str.indexOf("...") )` is truthy only if the result of `indexOf` is not `-1`. In other words, when there is a match.
People use it to shorten `indexOf` checks:
```js run
let str = "Widget";
if (~str.indexOf("Widget")) {
alert( 'Found it!' ); // works
}
```
It is usually not recommended to use language features in a non-obvious way, but this particular trick is widely used in old code, so we should understand it.
Just remember: `if (~str.indexOf(...))` reads as "if found".
includes, startsWith, endsWith
The more modern method str.includes(substr) returns true/false depending on whether str contains substr.
It's the right choice if we need to test for a match but don't need its position:
alert( "Widget with id".includes("Widget") ); // true
alert( "Hello".includes("Bye") ); // false
The methods str.startsWith and str.endsWith do exactly what they say:
alert( "Widget".startsWith("Wid") ); // true, "Widget" starts with "Wid"
alert( "Widget".endsWith("get") ); // true, "Widget" ends with "get"
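All three methods also accept an optional second argument. For includes and startsWith it is the position to start from. For endsWith it is the length of the part of the string to consider. A short sketch, using console.log instead of alert so that it also runs outside the browser:

```javascript
const word = "Widget";

console.log(word.includes("id"));      // true
console.log(word.includes("id", 3));   // false, the search starts at position 3 ("get")

console.log(word.startsWith("id", 1)); // true, "idget" starts with "id"
console.log(word.endsWith("Wid", 3));  // true, only the first 3 characters are considered
```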
Getting a substring
There are 3 methods in JavaScript to get a substring: substring, substr and slice.
str.slice(start [, end])
- Returns the part of the string from start to (but not including) end.
For instance:
let str = "stringify";
alert( str.slice(0, 5) ); // 'strin', the substring from 0 to 5 (not including 5)
alert( str.slice(0, 1) ); // 's', from 0 to 1, but not including 1, so only the character at 0
If there is no second argument, then slice goes till the end of the string:
let str = "stringify";
alert( str.slice(2) ); // ringify, from the 2nd position till the end
Negative values for start/end are also possible. They mean the position is counted from the string end:
let str = "stringify";
// start at the 4th position from the right, end at the 1st from the right
alert( str.slice(-4, -1) ); // gif
str.substring(start [, end])
- Returns the part of the string between start and end.
Almost the same as slice, but allows start to be greater than end.
For instance:
let str = "stringify";
// these are the same for substring
alert( str.substring(2, 6) ); // "ring"
alert( str.substring(6, 2) ); // "ring"
// ...but not for slice:
alert( str.slice(2, 6) ); // "ring" (the same)
alert( str.slice(6, 2) ); // "" (an empty string)
Negative arguments are (unlike slice) not supported; they are treated as 0.
str.substr(start [, length])
- Returns the part of the string from start, with the given length.
In contrast with the previous methods, this one allows us to specify the length instead of the ending position:
let str = "stringify";
alert( str.substr(2, 4) ); // ring, from the 2nd position get 4 characters
The first argument may be negative, to count from the end:
let str = "stringify";
alert( str.substr(-4, 2) ); // gi, from the 4th position from the end get 2 characters
Let's recap the methods to avoid any confusion:
method | selects... | negatives |
---|---|---|
slice(start, end) | from start to end (not including end) | allows negatives |
substring(start, end) | between start and end | negative values mean 0 |
substr(start, length) | from start get length characters | allows negative start |
All of them can do the job. The author finds himself using `slice` almost all the time.
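To see the recap in action, the sketch below extracts the same part of a string with each method and then shows how they differ on a negative argument. The truncate function at the end is a hypothetical helper, included only as one typical use of slice.

```javascript
const sample = "stringify";

// Three ways to get the same part:
console.log(sample.slice(2, 6));     // ring
console.log(sample.substring(2, 6)); // ring
console.log(sample.substr(2, 4));    // ring

// A negative start is handled differently:
console.log(sample.slice(-3));       // ify, counted from the end
console.log(sample.substring(-3));   // stringify, the negative value is treated as 0
console.log(sample.substr(-3));      // ify, counted from the end

// Hypothetical helper: shortens str to at most maxLength characters,
// replacing the cut-off tail with an ellipsis.
function truncate(str, maxLength) {
  return str.length > maxLength ? str.slice(0, maxLength - 1) + "…" : str;
}

console.log(truncate(sample, 6));  // strin…
console.log(truncate(sample, 20)); // stringify
```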
Comparing strings
As we know from the chapter on comparisons, strings are compared character by character, in alphabetical order.
However, there are some oddities.
- A lowercase letter is always greater than an uppercase one:
alert( 'a' > 'Z' ); // true
- Letters with diacritical marks are "out of order":
alert( 'Österreich' > 'Zealand' ); // true
That may lead to strange results if we sort country names. Usually people would expect Zealand to come after Österreich in the list.
To understand what happens, let's review the internal representation of strings in JavaScript.
All strings are encoded using UTF-16. That is: each character has a corresponding numeric code. There are special methods that allow us to get the character for a code and back.
str.codePointAt(pos)
- Returns the code for the character at position pos:
// different case letters have different codes
alert( "z".codePointAt(0) ); // 122
alert( "Z".codePointAt(0) ); // 90
String.fromCodePoint(code)
- Creates a character by its numeric code:
alert( String.fromCodePoint(90) ); // Z
We can also add Unicode characters by their codes using \u followed by the hex code:
// 90 is 5a in the hexadecimal system
alert( '\u005a' ); // Z
Now let's see the characters with codes 65..220 (the Latin alphabet and a little bit extra) by making a string of them:
let str = '';
for (let i = 65; i <= 220; i++) {
str += String.fromCodePoint(i);
}
alert( str );
// ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
// ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜ
See? Capital characters go first, then a few special ones, then lowercase characters.
Now it becomes obvious why a > Z.
The characters are compared by their numeric code. A greater code means that the character is greater. The code for a (97) is greater than the code for Z (90).
- All lowercase letters go after uppercase letters, because their codes are greater.
- Some letters like Ö stand apart from the main alphabet. Here, its code is greater than anything from a to z.
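The comparison by codes can be spelled out as a function. The sketch below is a hypothetical compareByCode helper, not how the engine is implemented internally, but it should give the same ordering as the > and < operators on strings.

```javascript
// Hypothetical helper: compares two strings by the numeric codes of their
// characters. Returns 1 if a is greater, -1 if a is less, 0 if they are equal.
function compareByCode(a, b) {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    const diff = a.charCodeAt(i) - b.charCodeAt(i);
    if (diff !== 0) return Math.sign(diff); // the first difference decides
  }
  // All shared characters are equal: the longer string is greater.
  return Math.sign(a.length - b.length);
}

console.log(compareByCode("a", "Z"));                // 1, because 97 > 90
console.log(compareByCode("Österreich", "Zealand")); // 1, because 214 > 90
console.log(compareByCode("Be", "Bee"));             // -1, "Be" is a prefix of "Bee"
console.log(compareByCode("Z", "Z"));                // 0
```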
Correct comparisons
The "right" algorithm to do string comparisons is more complex than it may seem, because alphabets are different for different languages. The same-looking letter may be located differently in different alphabets.
So, the browser needs to know the language to compare.
Luckily, all modern browsers (IE10 and older require the additional library Intl.JS) support the internationalization standard ECMA 402.
It provides a special method to compare strings in different languages, following their rules.
The call str.localeCompare(str2):
- Returns a positive number (usually 1) if str is greater than str2 according to the language rules.
- Returns a negative number (usually -1) if str is less than str2.
- Returns 0 if they are equal.
For instance:
alert( 'Österreich'.localeCompare('Zealand') ); // -1
The method actually has two additional arguments specified in the documentation, which allow us to specify the language (by default taken from the environment) and set up additional rules like case sensitivity or whether "a" and "á" should be treated as the same.
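As a sketch of how this helps with the country names from above, the example below sorts the same list twice: by character codes and with localeCompare. The locale "de" and the option sensitivity: "base" are choices made for this illustration, and the result assumes an environment with internationalization support.

```javascript
const countries = ["Zealand", "Österreich", "Andorra"];

// Default sort compares by character codes, so Ö ends up last.
const byCode = [...countries].sort();
console.log(byCode.join(", "));   // Andorra, Zealand, Österreich

// Sorting with localeCompare follows the language rules.
const byLocale = [...countries].sort((a, b) => a.localeCompare(b, "de"));
console.log(byLocale.join(", ")); // Andorra, Österreich, Zealand

// With sensitivity "base", letters that differ only in accents or case compare as equal.
const sameBase = "a".localeCompare("á", "de", { sensitivity: "base" });
console.log(sameBase); // 0
```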
Internal encoding
This section goes deeper into string internals. The knowledge will be useful if you plan to deal with emoji, rare mathematical or hieroglyphic characters, or other rare symbols.
You can skip the section if you don't plan to support them.
Surrogate pairs
Most symbols have a 2-byte code. Letters of most European languages, numbers, and even most hieroglyphs have a 2-byte representation.
But 2 bytes only allow 65536 combinations, and that's not enough for every possible symbol. So rare symbols are encoded with a pair of 2-byte characters called "a surrogate pair".
The length of such symbols is 2:
alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY
alert( '𩷶'.length ); // 2, a rare chinese hieroglyph
Note that surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!
We actually have a single symbol in each of the strings above, but the length property shows 2.
String.fromCodePoint and str.codePointAt are notable exceptions that deal with surrogate pairs correctly. They appeared in the language recently. Before them, there were only String.fromCharCode and str.charCodeAt. These methods are actually the same as fromCodePoint/codePointAt, but don't work with surrogate pairs.
But, for instance, getting a symbol can be tricky, because surrogate pairs are treated as two characters:
alert( '𩷶'[0] ); // some strange symbols
alert( '𝒳'[0] ); // pieces of the surrogate pair
Note that pieces of the surrogate pair have no meaning without each other. So, the alerts in the example above actually display garbage.
How to solve this problem? First, let's make sure you have it. Not every project deals with surrogate pairs.
But if you do, then search the internet for libraries which implement surrogate-aware versions of slice, indexOf and other functions. Technically, surrogate pairs are detectable by their codes: the first character has a code in the interval 0xD800..0xDBFF, while the second is in 0xDC00..0xDFFF. So if we see a character with the code, say, 0xD801, then the next one must be the second part of the surrogate pair. Libraries rely on that to split strings correctly. Unfortunately, there's no single well-known library to advise yet.
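The detection by code intervals can be sketched in a few lines. The function splitSymbols below is a hypothetical helper written for illustration, not a replacement for a well-tested library. For comparison, the built-in Array.from (like for..of) also walks a string by whole symbols, keeping surrogate pairs together.

```javascript
// Hypothetical helper: splits a string into symbols, keeping surrogate pairs together.
function splitSymbols(str) {
  const symbols = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    const isFirstHalf = code >= 0xD800 && code <= 0xDBFF;
    if (isFirstHalf && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        symbols.push(str[i] + str[i + 1]); // both halves form one symbol
        i++;                               // skip the second half
        continue;
      }
    }
    symbols.push(str[i]);
  }
  return symbols;
}

// "a", MATHEMATICAL SCRIPT CAPITAL X, FACE WITH TEARS OF JOY, "b"
const mixed = "a\u{1D4B3}\u{1F602}b";

console.log(mixed.length);               // 6, counted in 2-byte units
console.log(splitSymbols(mixed).length); // 4, counted in symbols
console.log(Array.from(mixed).length);   // 4, the built-in way
```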
Diacritical marks
In many languages there are symbols that are composed of a base character and a mark above/under it.
For instance, the letter a can be the base character for: àáâäãåā. Most common "composite" characters have their own code in the UTF-16 table. But not all of them, because there are too many possible combinations.
To support arbitrary compositions, UTF-16 allows us to use several Unicode characters: the base character and one or many "mark" characters that "decorate" it.
For instance, if we have S followed by the special "dot above" character (code \u0307), it is shown as Ṡ.
alert( 'S\u0307' ); // Ṡ
If we need one more mark over the letter (or below it), no problem, just add the necessary mark character.
For instance, if we append a character "dot below" (code \u0323), then we'll have "S with dots above and below": Ṩ.
The example:
alert( 'S\u0307\u0323' ); // Ṩ
This leads to great flexibility, but also an interesting problem: the same symbol visually can be represented with different unicode compositions.
For instance:
alert( 'S\u0307\u0323' ); // Ṩ, S + dot above + dot below
alert( 'S\u0323\u0307' ); // Ṩ, S + dot below + dot above
alert( 'S\u0307\u0323' == 'S\u0323\u0307' ); // false
To solve this, there exists a "unicode normalization" algorithm that brings each string to a single "normal" form.
It is implemented by str.normalize().
alert( "S\u0307\u0323".normalize() == "S\u0323\u0307".normalize() ); // true
It's rather funny that in this particular situation normalize() actually brings a sequence of 3 characters together into one: \u1e68 (S with two dots).
alert( "S\u0307\u0323".normalize().length ); // 1
alert( "S\u0307\u0323".normalize() == "\u1e68" ); // true
In reality, this is not always the case. It's just that the symbol Ṩ is "common enough", so UTF-16 creators included it in the main table and gave it a code.
If you want to learn more about normalization rules and variants, they are described in the appendix to the Unicode standard: Unicode Normalization Forms. For most practical purposes, though, the information from this section is enough.
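In practice, normalization is mostly needed before comparing strings that come from different sources. The helper sameText below is a hypothetical name for such a check. The last lines use the "NFD" form, one of the variants from the Unicode appendix, which does the opposite of the default: it splits a composite character into its parts.

```javascript
// Hypothetical helper: compares two strings after bringing both to the normal form.
function sameText(a, b) {
  return a.normalize() === b.normalize();
}

console.log("S\u0307\u0323" === "S\u0323\u0307");         // false
console.log(sameText("S\u0307\u0323", "S\u0323\u0307")); // true

// The default form composes characters where possible,
// while "NFD" decomposes them into the base character and marks.
const composed = "\u1e68";
const decomposed = composed.normalize("NFD");

console.log(composed.length);   // 1
console.log(decomposed.length); // 3
console.log(sameText(composed, decomposed)); // true
```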
Summary
- There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions.
- Strings in JavaScript are encoded using UTF-16.
- We can use special characters like \n and insert letters by their Unicode code using the \u... notation.
- To get a character: use [].
- To get a substring: use slice or substr/substring.
- To lowercase/uppercase a string: use toLowerCase/toUpperCase.
- To look for a substring: use indexOf, or includes/startsWith/endsWith for simple checks.
- To compare strings according to the language, use localeCompare, otherwise they are compared by character codes.
There are several other helpful methods in strings, like str.trim(), which removes ("trims") spaces from the beginning and end of the string; see the manual for them.
Also strings have methods for doing search/replace with regular expressions. But that topic deserves a separate chapter, so we'll return to that later.