regexp draft

Ilya Kantor 2019-03-02 01:02:01 +03:00
parent 1369332661
commit 65184edf76
11 changed files with 730 additions and 399 deletions


@ -8,15 +8,13 @@ Let's take the following task as an example.
We have a text and need to replace all quotes `"..."` with guillemet marks: `«...»`. They are preferred for typography in many countries.
For instance: `"Hello, world"` should become `«Hello, world»`.
For instance: `"Hello, world"` should become `«Hello, world»`. Some countries prefer other quotes, like `„Witam, świat!”` (Polish) or `「你好,世界」` (Chinese), but for our task let's choose `«...»`.
Some countries prefer `„Witam, świat!”` (Polish) or even `「你好,世界」` (Chinese) quotes. For different locales we can choose different replacements, but it all works the same way, so let's start with `«...»`.
The first thing to do is to locate quoted strings, and then we can replace them.
To make replacements we first need to find all quoted substrings.
A regular expression like `pattern:/".+"/g` (a quote, then something, then the other quote) may seem like a good fit, but it isn't!
The regular expression can look like this: `pattern:/".+"/g`. That is: we look for a quote followed by one or more characters, and then another quote.
...But if we try to apply it, even in such a simple case...
Let's try it:
```js run
let reg = /".+"/g;
@ -193,7 +191,7 @@ Please note, that this logic does not replace lazy quantifiers!
It is just different. There are times when we need one or the other.
Let's see one more example where lazy quantifiers fail and this variant works right.
**Let's see an example where lazy quantifiers fail and this variant works right.**
For instance, we want to find links of the form `<a href="..." class="doc">`, with any `href`.
@ -210,7 +208,7 @@ let reg = /<a href=".*" class="doc">/g;
alert( str.match(reg) ); // <a href="link" class="doc">
```
...But what if there are many links in the text?
It worked. But let's see what happens if there are many links in the text:
```js run
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
@ -239,14 +237,14 @@ let reg = /<a href=".*?" class="doc">/g;
alert( str.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
```
Now it works; there are two matches:
Now it seems to work; there are two matches:
```html
<a href="....." class="doc"> <a href="....." class="doc">
<a href="link1" class="doc">... <a href="link2" class="doc">
```
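To check the difference directly, here is a minimal sketch that runs both quantifiers on the same two-link text. The variable names (`greedy`, `lazy`, `greedyMatches`, `lazyMatches`) are chosen for illustration only, and `console.log` is used instead of `alert` so the snippet also runs outside the browser.

```js
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';

// Greedy: .* runs to the end of the string, then backtracks
// to the LAST occurrence of " class="doc">
let greedy = /<a href=".*" class="doc">/g;

// Lazy: .*? expands one character at a time
// and stops at the FIRST occurrence of " class="doc">
let lazy = /<a href=".*?" class="doc">/g;

let greedyMatches = str.match(greedy);
let lazyMatches = str.match(lazy);

console.log(greedyMatches.length); // 1 (both links swallowed into one match)
console.log(lazyMatches.length);   // 2 (one match per link)
```

The greedy version returns a single match that stretches from the first `<a` to the end of the second link, while the lazy version stops at the nearest closing part each time.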
Why it works should be obvious after all the explanations above. So let's not dwell on the details, but try one more text:
...But let's test it on one more text input:
```js run
let str = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
@ -256,24 +254,24 @@ let reg = /<a href=".*?" class="doc">/g;
alert( str.match(reg) ); // <a href="link1" class="wrong">... <p style="" class="doc">
```
We can see that the regexp matched not just a link, but also a lot of text after it, including `<p...>`.
Now it fails. The match includes not just a link, but also a lot of text after it, including `<p...>`.
Why does it happen?
Why?
That's what's going on:
1. First the regexp finds a link start `match:<a href="`.
2. Then it looks for `pattern:.*?`: takes one character (lazily!), checks whether there's a match for `pattern:" class="doc">` (none).
3. Then it takes another character into `pattern:.*?`, and so on... until it finally reaches `match:" class="doc">`.
2. Then it looks for `pattern:.*?`: it takes one character, checks whether there's a match for the rest of the pattern, then takes another one...
But the problem is: that's already beyond the link, in another tag `<p>`. Not what we want.
The quantifier `pattern:.*?` consumes characters until it meets `match:class="doc">`.
Here's the picture of the match aligned with the text:
...And where can it find it? If we look at the text, we can see that the only `match:class="doc">` is beyond the link, in the tag `<p>`.
3. So we have a match:
```html
<a href="..................................." class="doc">
<a href="link1" class="wrong">... <p style="" class="doc">
```
```html
<a href="..................................." class="doc">
<a href="link1" class="wrong">... <p style="" class="doc">
```
So the laziness did not work for us here.
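A possible way out can be sketched as follows, assuming the "variant" mentioned above is a negated character class such as `pattern:[^"]*` in place of `pattern:.*?` (this is an assumption, as the pattern itself is not shown in this excerpt). The variable names are illustrative, and `console.log` stands in for `alert`.

```js
let str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
let str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';

// Lazy quantifier: may run past the closing quote of href
let lazy = /<a href=".*?" class="doc">/g;

// Assumed alternative: [^"]* cannot cross a quote,
// so the href value always ends at its own closing quote
let negated = /<a href="[^"]*" class="doc">/g;

let lazyOnStr1 = str1.match(lazy);       // one over-long match reaching into <p ...>
let negatedOnStr1 = str1.match(negated); // null: the link has class="wrong"
let negatedOnStr2 = str2.match(negated); // both links, matched separately

console.log(lazyOnStr1);
console.log(negatedOnStr1);
console.log(negatedOnStr2);
```

Because `pattern:[^"]*` stops at the first quote, the engine has no way to extend the match into the next tag: either the text right after the `href` value is `" class="doc">`, or there is no match at that position.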