Laurent Rosenfeld Weekly Review: Challenge - 005

Saturday, Nov 23, 2019| Tags: Raku

Raku Solutions Weekly Review


Task #1: Anagrams of a Word

This is derived in part from my blog post made in answer to the Week 5 of the Perl Weekly Challenge organized by Mohammad S. Anwar as well as answers made by others to the same challenge.

The challenge reads as follows:

Write a program which prints out all anagrams for a given word. For more information about anagrams, please check this Wikipedia page.

I’m not sure I fully grasp the question, but my understanding is that we are looking for existing words which are anagrams of each other. For example, “pots”, “spot”, “tops”, and “post” are anagrams of each other because they have exactly the same letters rearranged in a different order.

For the purpose of this challenge, I will use a words.txt file containing 113,809 lower-case English words usually accepted for crossword puzzles and other word games. The words.txt file can be found on my Github repository. The original list was contributed to the public domain by Internet activist Grady Ward in the context of the Moby Project. This word list is also mirrored at Project Gutenberg.

For the purpose of testing the programs below, the words.txt file is located in my current directory.

My Solutions

Normalized Strings: the is-anagram Subroutine

We can start by creating an is-anagram subroutine that takes two words as parameters and return a true value if the words are anagrams of each other and false otherwise.

sub is-anagram (Str $word1, Str $word2) {
    return False if $word1.chars != $word2.chars;
    return $word1.comb.sort eq $word2.comb.sort;
}
for <ban bane post stop pots stop post pots pots taps> -> $w1, $w2 {
    say "$w1 $w2:\t", is-anagram $w1, $w2;
}

The is-anagram subroutine first returns a false value if the input words don’t have the same length. Words of different lengths cannot be anagrams and it is my belief that checking the length of the input words is significantly faster than the subsequent processing of their individual letters (a couple of tests seemed to confirm that idea, but I must admit that haven’t seriously benchmarked that). Therefore, it should speed up processing when we will use this subroutine repeatedly for a large number of word combinations. When words have the same length, then we just split the words into individual letters, sort the letters, and produce new “normalized” or “canonical form” strings with the input letters; if the new normalized strings are equal, then the words were anagrams of each other.

This displays the following output:

$ perl6 is-anagram.p6
ban bane:       False
post stop:      True
pots stop:      True
post pots:      True
pots taps:      False

Note that this works correctly because the eq operator in the is-anagram subroutine coerces its arguments into strings, so we don’t need to explicitly stringify the letter lists generated by the comb function after they have been sorted.

Using a Bag

Another approach is to use one of a variety of Perl 6 data structure types called Set, Bag and Mix. They are immutable collections of unique and weighed items.

You might construct a Set as follows:

> my $s = set <banana apple orange orange banana pear apple>;
set(apple banana orange pear)

As you can see, duplicates have been removed. Sets only tell us whether at least one item of a given name has been encountered.

A bag, by contrast, also keeps track of how many of each item have been seen, as shown in these tests under the REPL:

> my $b = bag <banana apple orange orange banana pear apple orange>;
bag(banana(2), orange(3), pear, apple(2))
> say $b{'banana'}
2

Mixes are similar to bags, except that the elements’ weights don’t have to be integers.

One interesting thing about these collections is that they can use many set operators commonly used in mathematics, such as the (elem) (or ) set membership operator or the (<) (or ) subset operator, illustrated here under the REPL:

> say "Found it!" if 'apple' `∈` $s;
Found it!
> say "It is a subset" if qw/orange banana/ ⊂ $s
It is a subset
> say "Found several oranges" if  qw/orange orange/ ⊂ $b
Found several oranges

We can now try the following alternate is-anagram subroutine using bags:

sub is-anagram (Str $word1, Str $word2) {
    return $word1.comb.Bag === $word2.comb.Bag;
}
for <ban bane post stop pots stop post pots pots taps> -> $w1, $w2 {
    say "$w1 $w2:\t", is-anagram $w1, $w2;
}

The === value identity operator used between two bags returns True if the bags are identical. This program displays the same output as before.

Creating an Anagram Operator

Just a bit of fun: rather than creating an is-anagram subroutine, we could create the infix ana operator:

sub infix:<ana> (Str $word1, Str $word2) {
    return $word1.comb.Bag === $word2.comb.Bag;
}
for <ban bane post stop pots stop post pots pots taps> -> $w1, $w2 {
    say "$w1 $w2:\t", $w1 ana $w2;
}

This prints the same result. Note, however that this feels significantly slower. My guess is that it is mainly compilation time. Adding the following line at the end of the previous script:

say now - INIT now;

shows that the run time is less than 0.02 seconds:

perl6 is-anagram.p6
ban bane:       False
post stop:      True
pots stop:      True
post pots:      True
pots taps:      False
0.0156261

I might undertake a serious benchmark one day, but it really seems that the perceived response time really has to do with compile time.

Finding All Anagrams of a Word From a Word List

The other possible understanding of the challenge is that we are given a word and a word list, and should output the words of the list that are anagrams of the input word. We can use the same is-anagram subroutine as follows and use directly the words.txt file mentioned earlier:

my $post-bag = "post".comb.Bag;
sub is-anagram (Str $word) {
    return $word.comb.Bag === $post-bag;
}
for "words.txt".IO.lines -> $line {
    say $line if is-anagram $line;
}

This program displays the words of the word list that are anagrams of the word “post”:

perl6 is-anagram.p6
opts
post
pots
spot
stop
tops

Alternative Solutions

Arne Sommer understood the challenge to mean that we were supposed to look up anagrams of a given word in a list or dictionary (what I did in the last section of my solutions). His approach is to read all the words of the dictionary and store them in a Set (done in the get-dictionary subroutine). Then, the program uses the permutations built-in function to find all the letter permutations of the input word:

unit sub MAIN (Str $word is copy where $word !~~ /\W/,
  :$dictionary where $dictionary.IO.r = "/usr/share/dict/british-english");

$word .= lc;

my $dict = get-dictionary($dictionary);

print "Anagrams:";

for $word.comb.permutations>>.join.unique -> $candidate
{
  # next if $candidate eq $word;
  print " $candidate" if $dict{$candidate};
}
print "\n";

sub get-dictionary ($file where $file.IO.r) is export
{
  return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set;
}

Doug Schrag understood the challenge the same way, also used a set to store the words of the dictionary, and also used the permutations built-in function:

subset Filename of Str;
sub MAIN(Str $word, Filename :$word-file) {
    my $file = .IO with $word-file;
    my Set $words = Set.new(.lines.sort) with $file;
    if $words.defined {
        .say for anagrams($word, -> $w { $w (elem) $words });
    }
    else {
        note 'Please supply word dictionary using --word-file option';
        say 'All permutations:';
        .say for anagrams($word);
    }
}

sub anagrams ($word, &is-word = -> $w { True }) {
    gather
    for $word.comb.permutations.unique(:with(&[eqv])) {
        with .join {
            .take if .&is-word
        }
    }
}

Francis J. Whittle also understood that the task was to find anagrams of a word in given file. He used Bags of letters to check whether words are anagrams of each other.

unit sub MAIN(
  Str $file #= file containing list of words
          where { given .IO { .r && ( .l || .f) or die "Cannot read from $_" } },
  $word #= word to find anagrams of
);

my $word-bag := $word.lc.comb(/ \w /).Bag;

my @words = $file.IO.lines.unique.hyper.grep(*.chars > 2)
                 .map: { .lc.comb(/ \w /).Bag => $_ };

@words.race.grep(*.key === $word-bag)».value.unique(with => *.lc eq *.lc).join(', ').put;

Jaldhar H. Vyas normalized the target word and the words of the dictionary by rearranging their letters in alphabetical order much in the same way as in my original is-anagram subroutine. His normalized word dictionary is stored in a hash.

my $search = $word.comb.sort.join;
my %dictionary = $list.IO.lines.map({ $_ => $_.comb.sort.join });

my %anagrams;
for %dictionary.kv -> $key, $value {
    %anagrams{$value}.push($key);
}

if (%anagrams{$search}:exists) {
    say %anagrams{$search}.grep({ $_ !~~ $word }).sort.join(' ');
}

Jo-Christian Oterhals stored the words of the /usr/share/dict/words standard Unix dictionary into a Set and used the permutations of the input words to check for anagrams:

my $dict = "/usr/share/dict/words".IO.lines()>>.lc.Set;
for @*ARGS ?? @*ARGS !! ! $*IN.t ?? lines() !! '' -> $w {
    $w.lc.comb.permutations>>.join.grep({ $dict{$_} and $_ ne $w }).map({ "$w\t$_\n" }).unique.join.say;
}

Joelle Maslak used Bags to store the letters of the input word and analyze the dictionary words.

sub MAIN(Str:D $letters, Str:D $filename = '/usr/share/dict/words') {
    my $matchbag = Bag.new($letters.comb);
    my SetHash $dedupe = SetHash.new;  # To store matches we gave back

    for $filename.IO.lines -> $word {
        my $fcword = $word.fc;

        my $bag = Bag.new($fcword.comb);

        if $bag ~~ $matchbag {
            next if $fcword ∈ $dedupe;
            $dedupe{$fcword}++;
            say $fcword;
        }
    }
}

Mark Senn normalized words of the word list much in the same way as my original is-anagram solution (i.e. sorting the letters) and stored the result in a hash. He then used a hash

my Array %hash;

# Construct the array.
for (@word)
{
    # The key for "family" is "amfily".
    my $key = .comb(/./).sort.join;
    # Add the current word to the hash.
    %hash{$key}.push($_);
}
my $word = @word.pick;

# Convert $word to $key.
my $key = $word.comb(/./).sort.join;

for (%hash{$key}.Array.sort)
{
    ($word eq $_)  or  print " $_";
}
print "\n";

Rob4t used Bags to check for anagrams:

sub MAIN(Str $word, Str $file where *.IO.r = '/usr/share/dict/words') {
    my $word_bag = $word.lc.comb.Bag;

    my @found_words = $file.IO.lines.grep: {
        # not the same word
        .lc ne $word.lc
        and
        # look for words AND phrases
        .lc.words.map({.comb}).flat.Bag eqv $word_bag
    };

    .say for @found_words;
}

Ruben Westerberg wrote an anagram module implementing a findAnangrams (sic) subroutine:

unit module anagram;
sub findAnangrams(@letters, @words) is export {
    my @a=sort @letters;
    my @anagrams;
    my $test;
    while * {
        $test=@a.join("");
        @anagrams.append(@words.hyper(degree=>4,batch=>50000).grep: {$_ eq $test});
        my int $k=0;
        my int $l=0;
        my $tmp;
        my $flag=False;
        my int $maxK=-1;
        my int $maxL=-1;

        while ($k < @a.elems-1) {
            $maxK= $k if ((@a[$k] lt @a[$k+1]) && ($k > $maxK));
            $k++;
        }
        last unless $maxK >= 0;
        $maxL=$maxK+1;
        while ($l < @a.elems) {
            $maxL= $l if ((@a[$maxK] lt @a[$l]) && $l>$maxL);
            $l++;
        }
        $tmp=@a[$maxK];
        @a[$maxK]=@a[$maxL];
        @a[$maxL]=$tmp;
        my @b=@a.splice($maxK+1).reverse;
        @a.=append(@b);
    }
    @anagrams;
}

His program using this module then looks like this:

use lib $?FILE.IO.dirname;
use anagram;
my $wordsFile=  $?FILE.IO.dirname ~ "/../words_alpha.txt";

my @word-list=$wordsFile.IO.lines;
my $input= @*ARGS[0];
my @letters=$input.trim.lc.split("")[1..*-2];
say "Anagrams: ",findAnangrams(@letters,@word-list).join: ' ';

This probably works well, but, although the idea of creating a module for anagrams is probably good, I have the feeling that this implementation is a bit over-engineered. My solution held in 7 code lines.

Simon Proctor also wrote an Anagrams module:

unit package Anagrams;

sub normal ( Str \word ) is pure {
    samemark( lc( word ), "a" )
}
sub order-string ( Str \word ) is export is pure {
    normal( word ).comb.sort( { $^a cmp $^b } ).join;
}
multi sub is-anagram-of( Str \target, Str \word where * eq target ) is export is pure { False; }
multi sub is-anagram-of( Str \target, Str \word where *.codes != target.codes ) is export is pure { False; }
multi sub is-anagram-of( Str \target, Str \word ) is export is pure {
    normal( target ) ne normal( word ) && order-string( target ) ~~ order-string( word );
}

His program using this module then looks like this:

use v6;
use lib 'lib';
use Anagrams;

my %*SUB-MAIN-OPTS = :named-anywhere;

sub USAGE { say $*USAGE }

subset FilePath of Str where *.IO.f;

#| Display Help file
multi sub MAIN ( Bool :h($help) where *.so ) { USAGE(); }

#| Find the anagrams for a given word
multi sub MAIN (
    Str $word, #= Word to check for. Case insensitive
    FilePath :$dict = "/etc/dictionaries-common/words" #= Dictionary file to use. Defaults to "/etc/dictionaries-common/words"
) {
    $dict.IO.words.grep( { is-anagram-of( $word, $_ ) } )>>.say;
}

Again, this seems to be a bit over-engineering to me.

SEE ALSO

Not less than six blog posts this time:

Wrapping up

Please let me know if I forgot any of the challengers or if you think my explanation of your code misses something important (send me an e-mail or just raise an issue against this GitHub page).

SO WHAT DO YOU THINK ?

If you have any suggestions or ideas then please do share with us.

Contact with me