# Strings In JavaScript, the textual data is stored as strings. There is no separate type for a single character. The internal format for strings is always [UTF-16](https://en.wikipedia.org/wiki/UTF-16), it is not tied to the page encoding. [cut] ## Quotes Let's remember the kinds of quotes. Strings can be enclosed either with the single, double quotes or in backticks: ```js let single = 'single-quoted'; let double = "double-quoted"; let backticks = `backticks`; ``` Single and double quotes are essentially the same. Backticks allow to embed any expression into the string, including function calls: ```js run function sum(a, b) { return a + b; } alert(`1 + 2 = ${sum(1, 2)}.`); // 1 + 2 = 3. ``` Another advantage of using backticks is that they allow a string to span multiple lines: ```js run let guestList = `Guests: * John * Pete * Mary `; alert(guestList); // a list of guests, multiple lines ``` If we try to use single or double quotes the same way, there will be an error: ```js run let guestList = "Guests: // Error: Unexpected token ILLEGAL * John"; ``` That's because they come from ancient times of language creation, and the need for multiline strings was not taken into account. Backticks appeared much later. ````smart header="Template function" The advanced feature of backticks is the ability to specify a "template function" at the beginning that would get the string and it's `${…}` components and can convert them. The syntax is: ```js function f(...) { /* the function to postprocess he string */ } let str = f`my string``; ``` We'll get back to this advanced stuff later, because it's rarely used and we won't need it any time soon. ```` ## Special characters It is still possible to create multiline strings with single quotes, using a so-called "newline character" written as `\n`, that denotes a line break: ```js run let guestList = "Guests:\n * John\n * Pete\n * Mary"; alert(guestList); // a list of guests, multiple lines, same as with backticks above ``` So to speak, these two lines describe the same: ```js run alert( "Hello\nWorld" ); // two lines, just like below alert( `Hello World` ); ``` There are other, less common "special" characters as well, here's the list: | Character | Description | |-----------|-------------| |`\b`|Backspace| |`\f`|Form feed| |`\n`|New line| |`\r`|Carriage return| |`\t`|Tab| |`\uNNNN`|A unicode symbol with the hex code `NNNN`, for instance `\u00A9` -- is a unicode for the copyright symbol `©`. Must be exactly 4 hex digits. | |`\u{NNNNNNNN}`|Some rare characters are encoded with two unicode symbols, taking up to 4 bytes. The long unicode requires braces around.| For example: ```js run alert( "\u00A9" ); // © alert( "\u{20331}" ); // 𠌱, a rare chinese hieroglyph ``` As we can see, all special characters start with a backslash character `\`. It is also called an "escaping character". Another use of it is an insertion of the enclosing quote into the string. For instance: ```js run alert( 'I*!*\'*/!*m the Walrus!' ); // *!*I'm*/!* the Walrus! ``` See, we have to prepend the inner quote by the backslash `\'`, because otherwise it would mean the string end. As a more elegant solution, we could wrap the string in double quotes or backticks instead: ```js run alert( `I'm the Walrus!` ); // I'm the Walrus! ``` Most of time when we know we're going to use this or that kind of quotes inside of the string, we can choose non-conflicting quotes to enclose it. Note that the backslash `\` serves for the correct reading of the string by JavaScript, then disappears. The in-memory string has no `\`. You can clearly see that in `alert` from the examples above. But what if we need exactly a backslash `\` in the string? That's possible, but we need to double it like `\\`: ```js run alert( `The backslash: \\` ); // The backslash: \ ``` ## The length and characters - The `length` property has the string length: ```js run alert( `My\n`.length ); // 3 ``` Note that `\n` is a single "special" character, so the length is indeed `3`. - To get a character, use square brackets `[position]` or the method [str.charAt(position)](mdn:String/charAt). The first character starts from the zero position: ```js run let str = `Hello`; // the first character alert( str[0] ); // H alert( str.charAt(0) ); // H // the last character alert( str[str.length - 1] ); // o ``` The square brackets is a modern way of getting a character, while `charAt` exists mostly for historical reasons. The only difference between them is that if no character found, `[]` returns `undefined`, and `charAt` returns an empty string: ```js run let str = `Hello`; alert( str[1000] ); // undefined alert( str.charAt(1000) ); // '' (an empty string) ``` ```warn header="`length` is a property" Please note that `str.length` is a numeric property, not a function. There is no need to add brackets after it. The call `str.length()` won't work, must use bare `str.length`. ``` ## Strings are immutable Strings can't be changed in JavaScript. It is impossible to change a character. Let's try to see that it doesn't work: ```js run let str = 'Hi'; str[0] = 'h'; // error alert( str[0] ); // doesn't work ``` The usual workaround is to create a whole new string and assign it to `str` instead of the old one. For instance: ```js run let str = 'Hi'; str = 'h' + str[1]; // replace the string alert( str ); // hi ``` In the following sections we'll see more examples of that. ## Changing the case Methods [toLowerCase()](mdn:String/toLowerCase) and [toUpperCase()](mdn:String/toUpperCase) change the case: ```js run alert( 'Interface'.toUpperCase() ); // INTERFACE alert( 'Interface'.toLowerCase() ); // interface ``` Or, if we want a single character lowercased: ```js alert( 'Interface'[0].toLowerCase() ); // 'i' ``` ## Finding substrings There are multiple ways to look for a substring in a string. ### str.indexOf The first method is [str.indexOf(substr, pos)](mdn:String/indexOf). It looks for the `substr` in `str`, starting from the given position `pos`, and returns the position where the match was found or `-1` if nothing found. For instance: ```js run let str = 'Widget with id'; alert( str.indexOf('Widget') ); // 0, because 'Widget' is found at the beginning alert( str.indexOf('widget') ); // -1, not found, the search is case-sensitive alert( str.indexOf("id") ); // 1, "id" is found at the position 1 (..idget with id) ``` The optional second parameter allows to search starting from the given position. For instance, the first occurence of `"id"` is at the position `1`. To look for the next occurence, let's start the search from the position `2`: ```js run let str = 'Widget with id'; alert( str.indexOf('id', 2) ) // 12 ``` If we're interested in all occurences, we can run `indexOf` in a loop. Every new call is made with the position after the previous match: ```js run let str = 'As sly as a fox, as strong as an ox'; let target = 'as'; // let's look for it let pos = 0; while (true) { let foundPos = str.indexOf(target, pos); if (foundPos == -1) break; alert( `Found at ${foundPos}` ); pos = foundPos + 1; // continue the search from the next position } ``` The same algorithm can be layed out shorter: ```js run let str = "As sly as a fox, as strong as an ox"; let target = "as"; *!* let pos = -1; while ((pos = str.indexOf(target, pos + 1)) != -1) { alert( pos ); } */!* ``` ```smart header="`str.lastIndexOf(pos)`" There is also a similar method [str.lastIndexOf(pos)](mdn:String/lastIndexOf) that searches from the end of the string to its beginning. It would list the occurences in the reverse way. ``` The inconvenience with `indexOf` is that we can't put it "as is" into an `if` check: ```js run let str = "Widget with id"; if (str.indexOf("Widget")) { alert("We found it"); // won't work } ``` That's because `str.indexOf("Widget")` returns `0` (found at the starting position). Right, but `if` considers that `false`. So, we should actualy check for `-1`, like that: ```js run let str = "Widget with id"; *!* if (str.indexOf("Widget") != -1) { */!* alert("We found it"); // works now! } ``` ````smart header="The bitwise NOT trick" One of the old tricks used here is the [bitwise NOT](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Bitwise_Operators#Bitwise_NOT) `~` operator. For 32-bit integers the call `~n` is the same as `-(n+1)`. For instance: ```js run alert( ~2 ); // -(2+1) = -3 alert( ~1 ); // -(1+1) = -2 alert( ~0 ); // -(0+1) = -1 *!* alert( ~-1 ); // -(-1+1) = 0 */!* ``` As we can see, `~n` is zero only if `n == -1`. So, `if ( ~str.indexOf("...") )` means that the `indexOf` result is different from `-1`. People use it to shorten `indexOf` checks: ```js run let str = "Widget"; if (~str.indexOf("Widget")) { alert( 'Found it!' ); // works } ``` It is usually not recommended to use language features in a non-obvious way, but this particular trick is widely used, generally JavaScript programmers understand it. Just remember: `if (~str.indexOf(...))` reads as "if found". ```` ### includes, startsWith, endsWith The more modern method [str.includes(substr)](mdn:String/includes) returns `true/false` depending on whether `str` has `substr` as its part. That's usually a simpler way to go if we don't need the exact position: ```js run alert( "Widget with id".includes("Widget") ); // true alert( "Hello".includes("Bye") ); // false ``` The methods [str.startsWith](mdn:String/startsWith) and [str.endsWith](mdn:String/endsWith) do exactly what they promise: ```js run alert( "Widget".startsWith("Wid") ); // true, "Widget" starts with "Wid" alert( "Widget".endsWith("get") ); // true, "Widget" ends with "get" ``` ## Getting a substring There are 3 methods in JavaScript to get a substring: `substring`, `substr` and `slice`. `str.slice(start [, end])` : Returns the part of the string from `start` to, but not including, `end`. For instance: ```js run let str = "stringify"; alert( str.slice(0,5) ); // 'string', the substring from 0, but not including 5 alert( str.slice(0,1) ); // 's', the substring from 0, but not including 1 ``` If there is no `end` argument, then `slice` goes till the end of the string: ```js run let str = "st*!*ringify*/!*"; alert( str.slice(2) ); // ringify, from the 2nd position till the end ``` Negative values for `start/end` are also possible. They mean the position is counted from the string end: ```js run let str = "strin*!*gif*/!*y"; // start at the 4th position from the right, end at the 1st from the right alert( str.slice(-4, -1) ); // gif ``` `str.substring(start [, end])` : Returns the part of the string *between* `start` and `end`. Almost the same as `slice`, but allows `start` greater than `end`. For instance: ```js run let str = "st*!*ring*/!*ify"; alert( str.substring(2, 6) ); // "ring" alert( str.substring(6, 2) ); // "ring" // compare with slice: alert( str.slice(2, 6) ); // "ring" (the same) alert( str.slice(6, 2) ); // "" (an empty string) ``` Negative arguments are treated as `0`. `str.substr(start [, length])` : Returns the part of the string from `start`, with the given `length`. In contrast with the previous methods, this one allows to specify the `length` instead of the ending position: ```js run let str = "st*!*ring*/!*ify"; alert( str.substr(2, 4) ); // ring, from the 2nd position get 4 characters ``` The first argument may be negative, to count from the end: ```js run let str = "strin*!*gi*/!*fy"; alert( str.substr(-4, 2) ); // gi, from the 4th position get 2 characters ``` Let's recap the methods to avoid any confusion: | method | selects... | negatives | |--------|-----------|-----------| | `slice(start, end)` | from `start` to `end` | allows negatives | | `substring(start, end)` | between `start` and `end` | negative values mean `0` | | `substr(start, length)` | from `start` get `length` characters | allows negative `start` | ```smart header="Which one to choose?" All of them can do the job. The author of this chapter finds himself using `slice` almost all the time. ``` ## Comparing strings As we know from the chapter , strings are compared character-by-character, in the alphabet order. Although, there are some oddities. 1. A lowercase letter is always greater than the uppercase: ```js run alert( 'a' > 'Z' ); // true ``` 2. Letters with diacritical marks are "out of the alphabet": ```js run alert( 'Österreich' > 'Zealand' ); // true ``` That may give strange results if we sort country names. Usually people would await for `Zealand` to be after `Österreich` in the list. To understand the reasoning behind that, let's review the internal representaion of strings in JavaScript. All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). That is: each character has a corresponding numeric code. There are special methods that allow to get the character for the code and back. `str.codePointAt(pos)` : Returns the code for the character at position `pos`: ```js run // different case letters have different codes alert( "z".codePointAt(0) ); // 122 alert( "Z".codePointAt(0) ); // 90 ``` `String.fromCodePoint(code)` : Creates a character by its numeric `code` ```js run alert( String.fromCodePoint(90) ); // Z ``` We can also add unicode charactes by their codes using `\u` followed by the hex code: ```js run // 90 is 5a in hexadecimal system alert( '\u005a' ); // Z ``` Now let's make the string from the characters with codes `65..220` (the latin alphabet and a little bit extra): ```js run let str = ''; for (let i = 65; i <= 220; i++) { str += String.fromCodePoint(i); } alert( str ); // ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„ // ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜ ``` Now it becomes obvious why `a > Z`. The characters are compared by their numeric code. The greater code means that the character is greater. And we can easily see that: 1. Lowercase letters go after uppercase letters, their codes are greater. 2. Some letters like `Ö` stand apart from the main alphabet. Here, it's code is greater than anything from `a` to `z`. ### The correct way The "right" comparisons are more complex than it may seem. Because the alphabets are different for different languages. The same letter may be located differently in different alphabets. Luckily, all modern browsers (IE10- requires the additional library [Intl.JS](https://github.com/andyearnshaw/Intl.js/)) support the internationalization standard [ECMA 402](http://www.ecma-international.org/ecma-402/1.0/ECMA-402.pdf). It provides a special method to compare strings in different languages, following their rules. [str.localeCompare(str2)](mdn:String/localeCompare): - Returns `1` if `str` is greater than `str2` according to the language rules. - Returns `-1` if `str` is less than `str2`. - Returns `0` if they are equal. For instance: ```js run alert( 'Österreich'.localeCompare('Zealand') ); // -1 ``` The method actually has two additional arguments, allowing to specify the language (by default taken from the environment) and setup additional rules like case sensivity or should `a` and `á` be treated as the same etc. See the manual for details when you need them. ## Encoding ```warn header="Advanced knowledge" The section goes deeper into string internals. The knowledge will be useful for you if you plan to deal with emoji, rare math of hieroglyphs characters and such. You can skip the section if all you need is common letters and digits. ``` ### Surrogate pairs Most symbols have a 2-byte code. Letters of most european languages, numbers, even most hieroglyphs have a 2-byte representation. But 2 bytes only allow 65536 combinations that's not enough for every possible symbol. So rare symbols are encoded with a pair of 2-byte characters called "a surrogate pair". Examples of symbols encoded this way: ```js run alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY alert( '𩷶'.length ); // 2, a rare chinese hieroglyph ``` Note that surrogate pairs are incorrectly processed by the language most of the time. We actually have a single symbol in each of the strings above, but the `length` shows the length of `2`. `String.fromCodePoint` and `str.codePointAt` are notable exceptions that deal with surrogate pairs right. They recently appeared in the language. Before them, there were only [String.fromCharCode](mdn:String/fromCharCode) and [str.charCodeAt](mdn:String/charCodeAt) that do the same, but don't work with surrogate pairs. Getting a symbol can also be tricky, because most functions treat surrogate pairs as two characters: ```js run alert( '𩷶'[0] ); // some strange symbols alert( '𝒳'[0] ); // pieces of the surrogate pair ``` Note that pieces of the surrogate pair have no meaning without each other. So, the alerts actually display garbage. How to solve this problem? First, let's make sure you have it. Not every project deals with surrogate pairs. But if you do, then there are libraries in the net which implement surrogate-aware versions of `slice`, `indexOf` and other functions. Surrogate pairs are detectable by their codes: the first character has the code in the interval of `0xD800..0xDBFF`, while the second is in `0xDC00..0xDFFF`. So if we see a character with the code, say, `0xD801`, then the next one must be the second part of the surrogate pair. ### Diacritical marks In many languages there are symbols that are composed of the base character and a mark above/under it. For instance, letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the UTF-16 table. But not all of them. To generate arbitrary compositions, several unicode characters are used: the base character and one or many "mark" characters. For instance, if we have `S` followed by "dot above" character (code `\u0307`), it is shown as Ṡ. ```js run alert( 'S\u0307' ); // Ṡ ``` If we need a one more mark over the letter (or below it) -- no problems, just add the necessary mark character. For instance, if we append a character "dot below" (code `\u0323`), then we'll have "S with dots above and below": `Ṩ`. The example: ```js run alert( 'S\u0307\u0323' ); // Ṩ ``` This leads to great flexibility, but also an interesting problem: the same symbol visually can be represented with different unicode compositions. For instance: ```js run alert( 'S\u0307\u0323' ); // Ṩ, S + dot above + dot below alert( 'S\u0323\u0307' ); // Ṩ, S + dot below + dot above alert( 'S\u0307\u0323' == 'S\u0323\u0307' ); // false ``` To solve it, there exists a "unicode normalization" algorithm that brings each string to the single "normal" form. It is implemented by [str.normalize()](mdn:String/normalize). ```js run alert( "S\u0307\u0323".normalize() == "S\u0323\u0307".normalize() ); // true ``` It's rather funny that in that exactly situation `normalize()` brings a sequence of 3 characters to one: `\u1e68` (S with two dots). ```js run alert( "S\u0307\u0323".normalize().length ); // 1 alert( "S\u0307\u0323".normalize() == "\u1e68" ); // true ``` In real, that is not always so, but the symbol `Ṩ` was considered "common enough" by UTF-16 creators to include it into the main table. For most practical tasks that information is enough, but if you want to learn more about normalization rules and variants -- they are described in the appendix to the Unicode standard: [Unicode Normalization Forms](http://www.unicode.org/reports/tr15/). ## Summary - There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions. - Strings in JavaScript are encoded using UTF-16. - We can use special characters like `\n` and insert letters by their unicode using `\u...`. - To get a character: use `[]`. - To get a substring: use `slice` or `substr/substring`. - To lowercase/uppercase a string: use `toLowerCase/toUpperCase`. - To look for a substring: use `indexOf`, or `includes/startsWith/endsWith` for simple checks. - To compare strings according to the language, use `localeCompare`, otherwise they are compared by character codes. There are several other helpful methods in strings, like `str.trim()` that removes ("trims") spaces from the beginning and end of the string, see the [manual](mdn:String) for them. Also strings have methods for doing search/replace with regular expressions. But that topic deserves a separate chapter, so we'll return to that later.