Substitution and Yet More Regex Power


Basic changes
Suppose you want to replace bits of a string. For example, 'us' with 'them'. 
$_='Us ? The bus usually waits for us, unless the driver forgets us.';

print "$_\n";

s/Us/them/;   # operates on $_, otherwise you need $foo=~s/Us/them/;

print "$_\n";

What happens here is that the string 'Us' is searched for, and when a match is found it is replaced with the right side of the expression, in this case 'them'. Simple. 
You'll notice that only one substitution was made. To match globally use /g which runs through the entire string, changing wherever it can. Try: 

s/Us/them/g;


which fails to change any of the lowercase 'us' strings. This is because regexes are, by default, case-sensitive. So: 
s/us/them/ig;


would be a better bet. Now, everything is changed. A little too much, but one problem at a time. Everything you have learned about regexes so far can be used with s/// , like parens, character classes [ ] , greedy and stingy matching and much more. Deleting things is easy too. Just specify nothing as the replacement string, like so: s/Us//; . 
So we can use some of that knowledge to fix this problem. We need to make sure that a space precedes the 'us'. What about: 

s/ us/them/g;


A small improvement. The first 'Us' is now no longer changed, but one problem at a time ! We'll first consider the problem of the regex changing 'usually' and other words with 'us' in them. 
What we are looking for is a space, then 'us', then a comma, period or space. We know how to specify one of a number of options - the character class. 

s/ us[. ,]/them/g;


Another tiny step. Unfortunately, that step wasn't really in the right direction, more on the slippery slope to Poor Programming Practice. Why ? Because we are limiting ourselves. Suppose someone wrote ' send it to us; when we get it'. 
You can't think of all the possible permutations. It is often easier, and safer, to simply state what must not follow the match. In this case, it can be anything except a letter. We can define that as a-z. So we can add that to the regex. 

s/ us[^a-z]/ them/g;


The caret ^ negates the character class, and a-z represents every lowercase letter from a to z inclusive. A space has been added to the substitution part - as the original space was matched, it should be replaced to maintain readability. 



\w
What would be more useful is to use a-zA-Z instead, which is what we need whenever /i is not in use. Perl provides an easy shorthand for a closely related and very common construct: 
s/ us[^\w]/ them/g;


The \w construct actually means 'word' - equivalent to a-zA-Z_0-9 . So we'll use that instead. 
To negate a construct like this, simply capitalise it: 

s/ us[\W]/ them/g;


and of course we don't need the negating caret now. In fact, we don't even need the character class ! 
s/ us\W/ them/g;
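
As a quick check, the three forms above can be run against the same sentence. This is only a sketch: the variable names are made up for the illustration, and the sample sentence is the one from this section without its leading 'Us ?'.

```perl
use strict;
use warnings;

# Sample sentence from this section, minus the leading 'Us ?'.
my $text = 'The bus usually waits for us, unless the driver forgets us.';

# Copy the string three times and apply one form of the regex to each copy.
(my $with_caret = $text) =~ s/ us[^\w]/ them/g;   # negated \w inside a class
(my $with_class = $text) =~ s/ us[\W]/ them/g;    # \W inside a class
(my $bare       = $text) =~ s/ us\W/ them/g;      # \W on its own

print "$with_caret\n$with_class\n$bare\n";
```

All three lines print the same thing. The comma and the final period are gone from the output, because the non-word character is part of the match and is replaced along with 'us'.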


So far, so good. Matching the first 'us' is going to be difficult though. Fortunately, there is an easy solution. We've seen Perl's definition of a word - \w . Between each word is a boundary. You can match this with \b . 
s/\bus\W/ them/ig;


(that's \b followed by 'us', not 'bus' :-)
Now, we require a word boundary before 'us'. As there is a 'nothing' at the start of the string, we have a match. There is a space after the first 'Us', so the match is successful. You might notice an extra space has crept in - that's the space we added earlier. The match doesn't include the space any more - it matches on the word boundary, that is just before the word begins. The space doesn't count. 
Did you notice the final period and the comma are replaced ? They are part of the match - it is the \W that matches them. We can't avoid that. 


Replacing with what was found
We can however put back that part of the match. 
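s/\bus(\W)/them$1/ig;


We start with capturing whatever the \W matches, using parens. Then, we add it to the replacement string. The capture is of course in $1 . Inside the regex itself you would refer to it as \1 , and \1 is also accepted in the replacement for historical reasons, but $1 is the preferred form there. 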
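
To see how far things have come, here is a small side-by-side run of the very first attempt and the capture-and-put-back version. It is a sketch under two assumptions: it uses $1 in the replacement, and it uses /i so that the leading 'Us' is caught. The variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# First attempt from the start of this section: replaces every 'us' it sees.
(my $naive = $text) =~ s/us/them/ig;

# Word boundary in front, one captured non-word character put back behind.
(my $careful = $text) =~ s/\bus(\W)/them$1/ig;

print "$naive\n$careful\n";
```

The first line mangles 'bus' and 'usually'; the second leaves them alone and keeps the comma and period, but the sentence now starts with a lowercase 'them'.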
The final problem is of course capitalising the replacement string when appropriate. In old versions of the tutorial I left this as an exercise to the reader, having run out of motivation. A reader by the name of Paul Trafford duly solved the problem, and I have inserted his excellent explanation for the elucidation of all concerned: 


#         Solution to the us/them problem...
#
#   The program works through the text assigning the 
#   variable $1 to 'U' or 'u' for any words where this 
#   letter is followed by 's' and then by non 'word' 
#   characters.   The latter is assigned to variable $2.
#
#   For each such matching occurrence, $1 is replaced by 
#   the letter that precedes it in the alphabet using 
#   operations 'ord' and 'chr' that return the ASCII value 
#   of a character and the character corresponding to a 
#   given natural number.  After this 'hem' is tacked on 
#   followed by $2, to retain the shape of the original 
#   sentence.  The '/e' switch is used for evaluation.
#
#   NOTES
#   1. This solution will not replace US (short for 
#   United States) with Them or them.
#
#   2. If a 'magical' decrement operator '--' existed for 
#   strings then the solution could be simplified for we 
#   wouldn't need to use the 'chr' and 'ord' operators.


$_='Us ? The bus usually waits for us, unless the driver forgets us.';

print "$_\n";

s/\b([Uu])s(\W)/chr(ord($1)-1).'hem'.$2/eg;   # 'hem' quoted so it is not a bareword

print "$_\n";

An excellent solution, thanks Paul. 
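
For comparison, here is a different way to get the same result, still relying on /e. This is not Paul's method, just an alternative sketch: it picks the replacement with a conditional instead of doing arithmetic on the character code, and it uses a trailing \b rather than capturing the following character.

```perl
use strict;
use warnings;

$_ = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# /e makes the right-hand side Perl code whose result becomes the replacement.
# \b after the 's' asserts the end of the word without consuming a character.
s/\b([Uu])s\b/$1 eq 'U' ? 'Them' : 'them'/eg;

print "$_\n";
```

Like the original solution, this leaves 'US' untouched, because the 's' in the pattern is matched case-sensitively.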

There are several more constructs. We'll take a quick look at \d which means anything that is a digit, that is 0-9 . First we'll use the negated form, \D , which is anything except 0-9 : 

print "Enter a number :";
chomp ($input=<STDIN>);   # chomp removes only a trailing newline; chop removes any last character

if ($input=~/\D/) {
        print "Not a number !!!!\n";
} else {
        print 'Your answer is ',$input x 3,"\n";

}

This checks that there are no non-number characters in $input . It's not perfect because it'll choke on decimal points, but it's just an example. Writing your own number-checker is actually quite difficult, but it is an interesting exercise. Try it, and see how accurate yours is. 
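
If you want something to compare your own attempt against, here is one possible sketch. The rules it enforces are an assumption made for this example (optional sign, digits, at most one decimal point, no exponents), and looks_like_decimal is a hypothetical helper name, not a Perl builtin.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string looks like a plain decimal number.
# Assumed rules: optional sign, digits, optional decimal point, no exponent.
sub looks_like_decimal {
    my ($input) = @_;
    return $input =~ /\A[+-]?(?:\d+\.?\d*|\.\d+)\z/ ? 1 : 0;
}

foreach my $candidate ('42', '-3.14', '.5', '1.2.3', 'abc', '') {
    my $verdict = looks_like_decimal($candidate) ? 'a number' : 'not a number';
    print "'$candidate' is $verdict\n";
}
```

The \A and \z anchors tie the pattern to the whole string, so a stray letter anywhere makes the match fail. It still rejects things you might want to accept, such as 1e5, which is part of why the exercise is harder than it looks.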



x 
I hope you trusted me and typed the above in exactly as it is shown (or pasted it), because the x is not a mistake, it is a feature. If you were too smart and changed it to a * or something, change it back and see what it does. 
Of course, there is another way to do it : 

unless ($input=~/\D/) {
        print 'Your answer is ',$input x 3,"\n";
} else {
        print "Not a number !!!!\n";
}

which reverses the logic with an unless statement. 
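
In case the difference between x and * is still unclear, this small sketch puts the two side by side. The value 12 is just an arbitrary stand-in for whatever was typed at the prompt.

```perl
use strict;
use warnings;

my $input = '12';

my $repeated = $input x 3;   # repetition: the string three times over
my $product  = $input * 3;   # multiplication: the arithmetic result

print "x gives $repeated, * gives $product\n";
print '-' x 20, "\n";        # handy for drawing a rule of 20 dashes
```

So x treats its left side as a string and repeats it, while * treats it as a number.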

More Matching
Assume we have: 
$_='HTML <I>munging</I> time is here <I>again</I> !.';

and we want to find all the italic words. We know that /g will match globally, so surely this will work : 
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

$match=/<i>(.*?)<\/i>/ig;

print "$match\n";

except it returns 1, and there were definitely two matches. In scalar context the match operator returns true or false, not the number of matches. So you can test it for truth with constructs like if, while and unless. Incidentally, the s/// operator does return the number of substitutions. 
To return what is matched, you need to supply a list. 

($match) = /<i>(.*?)<\/i>/i;

which handily puts the first match into $match . Note that an = is used (for assignment), as opposed to =~ (to point the regex at a variable other than $_ ). 

The parens on the left force a list context in this case. There is just the one element in the list, but it is still a list. In list context the match returns whatever was captured by the parens inside the regex, and that is what gets assigned. Try adding another variable to the list: 

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

($word1, $word2) = /<i>(.*?)<\/i>/ig;

print "Word 1 is $word1 and Word 2 is $word2\n";

In the example above notice /g has been added so a global match is done - this means perl carries on matching even after it finds the first match. Of course, you might not know how many matches there will be, so you can just use an array, or any other type of list: 
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

@words = /<i>(.*?)<\/i>/ig;

foreach $word (@words) {
        print "Found $word\n";
}

and @words will be grown to the appropriate size for the matches. You really can supply what you like to be assigned to: 
($word1, @words[2..3], $last) = /<i>(.*?)<\/i>/ig;

You'll need more italics in the string for that last one to work. It was only a demonstration. 
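
To pull the last few points together, the sketch below runs the same pattern in scalar and list context and then shows the count that s/// hands back. The variable names are invented for the example.

```perl
use strict;
use warnings;

$_ = 'HTML <I>munging</I> time is here <I>again</I> !';

my $scalar  = /<i>(.*?)<\/i>/i;      # scalar context: true or false
my ($first) = /<i>(.*?)<\/i>/i;      # list context: the first capture
my @all     = /<i>(.*?)<\/i>/ig;     # list context with /g: every capture
my $changed = s/<\/?i>//ig;          # s/// returns how many changes it made

print "scalar: $scalar, first: $first, all: @all, changed: $changed\n";
print "$_\n";
```

The substitution strips the four italic tags, so $changed ends up as 4 and the string is left as plain text.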
There is another trick worth knowing. Because a regex returns true each time it matches, we can test that and do something every time it returns true. The ideal construct is while which means 'do something as long as the condition I'm testing is true'. In this case, we'll print out the match every time it is true. 

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

while (/<(.*?)>(.*?)<\/\1>/g) {
        print "Found the HTML tag $1 which has $2 inside\n";
}

So the while operator runs the regex, and if it is true, carries out the statements inside the block. 
Try running the program above without the /g . Notice how it loops forever ? That's because the expression always evaluates to true. By using the /g we force the match to move on until it eventually fails. 
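
One way to watch /g moving on is Perl's built-in pos function, which reports the offset at which the next /g match on a string will start. The sketch below is an illustration only; @found and @offsets are names invented here.

```perl
use strict;
use warnings;

$_ = 'HTML <I>munging</I> time is here <I>again</I> !';

my (@found, @offsets);
while (/<i>(.*?)<\/i>/ig) {
    push @found,   $1;
    push @offsets, pos();   # offset in $_ where the next attempt will begin
    print "Found $1, next search starts at offset $offsets[-1]\n";
}
```

Each pass through the loop starts searching where the previous match ended, which is why the loop finishes once the string runs out of italic tags.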

Now we know this, an easy way to find the number of matches is: 

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

$found++ while /<i>.*?<\/i>/ig;...