Saturday, July 25, 2009

adventures in ignorance: making do with the new \d

So, I have given up hope of \d meaning [0-9] in Perl 5. Even if it gets changed back in 5.12, it will be unsafe to assume it means [0-9] for a long time, since it will still be wrong on 5.8 and 5.10 and we will have to wait for those interpreters to leave the ecosystem. By the time it would be safe to assume \d means [0-9], Perl 6 will be here, and the current Perl 6 policy is that \d will continue to match any Unicode digit.

In light of this surrender, it would be nice if there were a simple way of matching a digit with a specific value. Right now, you see regexes like
    /(?<!\d) ( 03 (?: \d\d-\d{7} | \d{9} ) ) (?!\d)/gx
The hardcoded 03 in that regex causes a problem in the brave new world where we happily deal with digits other than [0-9]: every \d accepts any Unicode digit, but the literal 03 only accepts the ASCII characters. To live in that world, I can see three choices:
  1. add new regex syntax to handle this
  2. use \d and a code block to check the character's value
  3. create a character class of every 0 character and every 3 character
I am not certain what option 1 would look like (maybe \p{0}, \p{1}, etc.), but I am not holding my breath. Option 2 is dangerous because (?{}) is marked as experimental, and it is ugly (if it even works; I spent a couple of hours trying to make it work this morning to no avail). Option 3 is probably the most likely choice (if only because I can do it for myself) and the safest (a new module won't have to worry about backwards compatibility or Unicode adding a numeric name).
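For what it is worth, the idea behind option 2 can be approximated without (?{}) at all: match loosely with \d, then check the captured digits in ordinary Perl after the match. What follows is only a minimal sketch under that assumption. The names digit_value and find_mobile are made up for illustration; charinfo is the documented function from the core Unicode::UCD module.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: the decimal value (0-9) of a digit character,
# or undef if the character is not a decimal digit.
sub digit_value {
    my ($char) = @_;
    my $info = charinfo(ord $char) or return;
    my $decimal = $info->{decimal};
    return (defined $decimal && length $decimal) ? $decimal : undef;
}

# Hypothetical helper: return the first number in $text that has the
# right shape and whose first two digits have the values 0 and 3.
sub find_mobile {
    my ($text) = @_;
    while ($text =~ /(?<!\d) ( (\d)(\d) (?: \d\d-\d{7} | \d{9} ) ) (?!\d)/gx) {
        my ($number, $first, $second) = ($1, $2, $3);
        my $v1 = digit_value($first);
        my $v2 = digit_value($second);
        return $number
            if defined $v1 && defined $v2 && $v1 == 0 && $v2 == 3;
    }
    return;    # nothing in $text qualified
}

binmode STDOUT, ':encoding(UTF-8)';
my $thai = "\x{0e50}\x{0e53}" . ("\x{0e53}" x 9);
for my $case ("0312-1234567", $thai, "0212-1234567") {
    my $found = find_mobile($case);
    print "$case ", (defined $found ? "matches" : "doesn't match"), "\n";
}
```

The cost is that the check no longer lives inside the regex, so it cannot be dropped into a larger pattern the way a character class can.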

The problem is that between Perl 5.8 and 5.10, new digit characters were added to Unicode (this, by the way, is why Unicode::Digits is failing one of its tests on 5.8), so we can't use a static list; we must build it dynamically. Luckily there is a file, at least in Perl 5.8.0 through Perl 5.10.0, named unicore/To/Digit.pl in one of the lib directories, that maps digit characters to their decimal values. This makes it easy to build the character classes we need:
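    #!/usr/bin/perl

    use perl5i;

    # $digits[N] collects a \x{...} escape for every character with value N
    my @digits;
    for (split "\n", require "unicore/To/Digit.pl") {
        my ($ord, $val) = split;    # hex code point, decimal value
        $digits[$val] .= "\\x{$ord}";
    }
    # turn each list of escapes into a compiled character class
    @digits = map { qr/[$_]/ } @digits;

    # same regex as before, with the literal 03 replaced by digit classes
    my $mobile = qr{
        (?<!\d) ( $digits[0] $digits[3] (?: \d\d-\d{7} | \d{9} ) ) (?!\d)
    }x;

    my $thai = "\x{0e53}" x 9;    # nine THAI DIGIT THREE characters

    my @cases = (
        "0312-1234567",
        "03123456789",
        "03$thai",
        "\x{0e50}\x{0e53}$thai",    # THAI DIGIT ZERO, THAI DIGIT THREE
        "0212-1234567",
    );

    for my $case (@cases) {
        say "$case ", $case =~ /$mobile/ ? "matches" : "doesn't match";
    }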
Which outputs:
    0312-1234567 matches
    03123456789 matches
    03๓๓๓๓๓๓๓๓๓ matches
    ๐๓๓๓๓๓๓๓๓๓๓ matches
    0212-1234567 doesn't match
If unicore/To/Digit.pl is supported (I have a question out on the Perl 5 Porters list at the moment), I will probably create a nice interface to it and the other files. Once I have that interface, I can build a better, more efficient version of Unicode::Digits and have better tests for it (i.e., ones that won't break depending on the version of Perl).
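In case the file turns out not to be something code should depend on, roughly the same table can be built from documented pieces only. This is a slow but self-contained sketch, not the interface itself. It assumes that every character \d matches has a decimal value reported by Unicode::UCD's charinfo, and digit_classes is a hypothetical name.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical interface: returns ten compiled character classes, where
# element N matches every character whose decimal digit value is N.
sub digit_classes {
    my @members = ('') x 10;
    no warnings 'utf8';    # some perls warn about non-characters from chr
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        next unless chr($cp) =~ /\d/;              # let perl decide what a digit is
        my $info = charinfo($cp) or next;
        my $value = $info->{decimal};
        next unless defined $value && $value =~ /\A[0-9]\z/;
        $members[$value] .= sprintf '\\x{%04X}', $cp;
    }
    return map { qr/[$_]/ } @members;
}

my @digits = digit_classes();

# Same shape as the regex above, built without the internal file
my $mobile = qr{
    (?<!\d) ( $digits[0] $digits[3] (?: \d\d-\d{7} | \d{9} ) ) (?!\d)
}x;

my $thai = "\x{0e50}\x{0e53}" . ("\x{0e53}" x 9);
print "thai number matches\n" if $thai =~ $mobile;
```

Scanning every code point takes a noticeable moment, so the result would want to be computed once and cached. The upside is that the list follows whatever version of Unicode the running perl ships with.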

Maybe this new world won't be as bad as I thought.
