Writing a haiku-detecting bot for Slack

Sep 21, 2016

Filed under:

Data Science

At Metal Toad, we have several bots integrated into Slack. Some are more useful (TicketBot, which detects mentions of JIRA tickets and provides links) and some are more whimsical (plusplus, which lets everyone give their coworkers points for whatever reason). I wanted to get in on this, so I decided to add to the latter category and write a bot that would detect when someone inadvertently wrote a haiku. Here's how I did it; maybe it will inspire you to write something too.

The first step was to find a way to convert messages into syllable counts. I didn't find any readily available data with the number of syllables in English words, but I did find the CMU Pronouncing Dictionary. This is intended for speech recognition and synthesis applications, and maps words to phonemes. For example:

QUIZZICAL  K W IH1 Z AH0 K AH0 L

The numbers indicate the stresses on the syllables, so by counting the number of tokens that end with a digit, we can get the number of syllables in a word. I wrote a Python script to output a JSON file mapping words to syllable counts:

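import codecs, json, string

# comment lines in the dictionary start with ';', so leave them out
with codecs.open('cmudict-0.7b.txt', 'r', 'iso-8859-1') as source:
    lines = [line.strip() for line in source if not line.startswith(';')]

syllables = {}
digits = tuple(string.digits)

for line in lines:
    # split() collapses the double space between the word and its phonemes
    tokens = line.split()
    if not tokens:
        continue  # skip blank lines
    # vowel phonemes end in a stress digit, so each one marks a syllable
    count = len([token for token in tokens[1:] if token.endswith(digits)])
    syllables[tokens[0]] = count

with codecs.open('syllables.json', 'w', 'utf-8') as output:
    json.dump(syllables, output, ensure_ascii=False, indent=0, sort_keys=True)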
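
As a quick check of the counting rule, here is a minimal sketch that applies it to the single entry shown earlier. The helper name count_syllables is made up for illustration and is not part of the script above.

```python
import string


def count_syllables(entry):
    """Return (word, syllable_count) for one CMU dictionary line."""
    # split() with no argument collapses the double space after the word
    word, *phonemes = entry.split()
    # vowel phonemes carry a stress digit; each one marks a syllable
    count = sum(1 for phoneme in phonemes if phoneme[-1] in string.digits)
    return word, count


print(count_syllables("QUIZZICAL  K W IH1 Z AH0 K AH0 L"))  # ('QUIZZICAL', 3)
```

Only IH1, AH0 and AH0 end in a digit, so the word is counted as three syllables.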

With the necessary data in place, I started on the bot itself. Our bots run inside Hubot, so I used its native language, CoffeeScript (anything else that ends up as JavaScript would have worked too). I needed to write a custom listener that would read every message in the channel and, if it matched the 5/7/5 syllable format of a haiku, output a message celebrating the accidental artistry of the author. Hubot's robot.listen works as follows:

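module.exports = (robot) ->
    robot.listen(
        # matcher: called with every message; return true to trigger the callback
        (message) ->
            is_haiku message
        # callback: only runs when the matcher returned true
        (response) ->
            response.send ":leaves: Haiku detected! :fallen_leaf:"
    )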

If the first function, which is passed the message, returns true, the second function is called and sends a message to the Slack channel. That's it! Except for writing is_haiku, of course. Here's how I did that:

    is_haiku = (message) ->
        if not message.text
            return false
        words = message.text.split ' '
        start = 0
        for line in [5, 7, 5]
            result = starts_with words[start..], line
            if result == false
                return false
            start += result
        start == words.length

We split the message text into an array of words, then see if the array starts with words totalling five, then seven, then five syllables. If we have consumed all of the words after that, then the message matches the haiku pattern, and we return true. starts_with is the part that actually uses the syllables data:

    starts_with = (words, count) ->
        consumed = 0
        re = /\W*\b(.+)\b\W*/
        for word in words
            # replace smart quotes/dashes with plain ones
            word = word.toUpperCase()
                .replace(/[\u2018\u2019]/g, "'")
                .replace(/[\u201C\u201D]/g, '"')
                .replace(/\u2014/g, '-')
            matches = word.match(re)
            if matches == null
                # no word characters, skip this word
                consumed += 1
                continue
            word = matches[1]
            if word of custom_words
                count -= custom_words[word]
            else if word of syllables
                count -= syllables[word]
            else
                # unknown word
                return false

            consumed += 1
            if count == 0
                return consumed
            if count < 0
                return false
        return false

This works by starting at the first word and sanitising it so that smart quotes won't prevent us from finding it in our syllables data. If the word has no word characters in it, we skip it; otherwise we look up how many syllables it has. If it's an unknown word, we can't count its syllables, so we have to return false, denying that the message is a haiku. Otherwise, we subtract its syllables from the number still needed for this line (five or seven to begin with). The remainder can be: greater than zero, in which case we do the same thing with the next word; less than zero, meaning we've overshot our target and the message is not a haiku; or exactly zero, meaning we have encountered just the right number of words for this line of the haiku. In that last case we return the number of words consumed, so that the next call to starts_with can be passed new words instead of starting at the beginning of the message again.
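
For readers who would rather follow the logic in Python, here is a rough port of the two functions. It is a sketch, not the bot's actual code: the SYLLABLES table is a hypothetical stand-in for syllables.json that covers only the demo sentence, custom words are left out, and it splits on any whitespace rather than on single spaces.

```python
import re

# Hypothetical stand-in for syllables.json, covering only the demo sentence.
SYLLABLES = {
    "I": 1, "USED": 1, "TO": 1, "BE": 1, "PUNK": 1,
    "UNTIL": 2, "BROKE": 1, "MY": 1, "SKATEBOARD": 2,
    "AND": 1, "GOT": 1, "A": 1, "HAIRCUT": 2,
}

# Same pattern as the CoffeeScript version: trim non-word characters at both ends.
WORD_RE = re.compile(r"\W*\b(.+)\b\W*")


def starts_with(words, target):
    """Return how many leading words total exactly `target` syllables, else None."""
    remaining = target
    for consumed, raw in enumerate(words, start=1):
        # upper-case and swap curly apostrophes for plain ones before the lookup
        cleaned = raw.upper().replace("\u2018", "'").replace("\u2019", "'")
        match = WORD_RE.search(cleaned)
        if match is None:
            continue  # punctuation only: consumed, but adds no syllables
        word = match.group(1)
        if word not in SYLLABLES:
            return None  # unknown word, so the syllables cannot be counted
        remaining -= SYLLABLES[word]
        if remaining == 0:
            return consumed
        if remaining < 0:
            return None  # overshot the line
    return None  # ran out of words before completing the line


def is_haiku(text):
    """Check whether the words of `text` fall into lines of 5, 7 and 5 syllables."""
    words = text.split()  # assumption: any whitespace separates words
    start = 0
    for target in (5, 7, 5):
        consumed = starts_with(words[start:], target)
        if consumed is None:
            return False
        start += consumed
    return start == len(words)


print(is_haiku("I used to be punk until I broke my skateboard and got a haircut."))  # True
```

Returning None instead of false for a failed line keeps a failure distinct from a word count, which is the same job the explicit == false comparison does in the CoffeeScript.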

With this, haikubot was ready to detect gems like the following from Aaron Amstutz:

I used to be punk
until I broke my skateboard
and got a haircut.

This is all well and good, but many messages that might be a haiku would be ignored if they contain someone's name or some other word that is not in the dictionary. I wanted to add the capability to learn new words to haikubot. To do this, I used robot.hear instead of robot.listen. This allows for a regular expression to be used instead of having to write a function:

    robot.hear /haikubot learn (\S+) (\d+)/i, (response) ->
        count = parseInt(response.match[2])
        if count > 0
            custom_words[response.match[1].toUpperCase()] = count
            persist_custom_words()
            response.send "Thanks for teaching me!"
        return

The groups matched in the regex are available through the array response.match. Any message that contains haikubot learn followed by a word and then one or more digits is handled by adding the specified word and syllable count to a variable custom_words (the regex isn't anchored, so the command doesn't have to come at the start of the message). You might've noticed above that this variable is used alongside syllables in starts_with. persist_custom_words saves the custom words so that they are preserved if Hubot needs to be restarted.
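
The same matching can be tried outside Hubot. The following Python sketch mirrors the learn handler under a few assumptions: handle_learn and its words argument are hypothetical names, the store is a plain dictionary, and persistence is left out entirely.

```python
import re

# Same pattern as the robot.hear call; re.IGNORECASE mirrors the /i flag.
LEARN_RE = re.compile(r"haikubot learn (\S+) (\d+)", re.IGNORECASE)


def handle_learn(text, words):
    """Record a taught word in `words`; return the reply, or None if ignored."""
    # search() rather than match(): the pattern may occur anywhere in the text
    found = LEARN_RE.search(text)
    if found is None:
        return None
    count = int(found.group(2))
    if count <= 0:
        return None  # like the original, quietly ignore a count of zero
    words[found.group(1).upper()] = count
    return "Thanks for teaching me!"


custom_words = {}
print(handle_learn("haikubot learn Drupal 2", custom_words))  # Thanks for teaching me!
print(custom_words)  # {'DRUPAL': 2}
```

Words are stored in upper case so that they line up with the upper-cased lookups in starts_with.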

I added a few other robot.hear commands: forget, to remove custom words that were erroneously added; list, to display all the custom words that haikubot knows; and help, to tell users what commands are available. The full code is available here. There are a few other features I'd like to add some time: posting all results to a #haiku channel; highlighting results that seem especially good (perhaps those with punctuation between lines, or 'terminal' seeming words at the end of lines); and making detection more robust. Hope you have fun with it!
