Asimov’s 4th law: A robot will not tweet.

Well, that might be a bit extreme. If they do, they should at least put in a bit more effort.

Perhaps I need to explain my problem here. My complaint concerns automatic tweets – popular with bloggers and online publishers in general. These are extremely impersonal, often unhelpful snippets drawing the audience’s attention to a new article or blog entry. Here’s an example:

[news] Pepsi drinkers join the dots: Anyone buying a Pepsi Max soft drink over the next few w.. http://tinyurl.com/5qu3w3

@guardianmedia

Ok, so it’s pretty obvious what’s wrong with this tweet. The article the Guardian Media is trying to promote is about a campaign by Pepsi which uses QR codes on the side of their cans – not that you’d have known from the tweet.

The problem is they’ve used a witty headline, not a descriptive one. In itself that is fine. Like many online publishers, however, the Guardian have opted against manually tweeting and have (presumably) integrated their CMS with Twitter. More specifically, the tweet is a concatenation of the article’s title and the beginning of the text. It just so happens that neither of those blocks of text mentions QR codes.
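
As a rough illustration, a naive auto-tweeter of this kind might look something like the Python sketch below. The function name, the way the "[news]" prefix is built and the 140-character default are all my assumptions; I have no idea how the Guardian's integration is actually written.

```python
def naive_tweet(section, title, body, url, limit=140):
    """Concatenate the title and the opening of the body text, cut the body
    to whatever room is left, then append the URL. No understanding of the
    content is involved (hypothetical sketch, not any publisher's real code).
    """
    prefix = f"[{section}] {title}: "
    suffix = f".. {url}"
    room = limit - len(prefix) - len(suffix)
    # Blind truncation: the cut lands wherever the character budget runs out.
    return prefix + body[:max(room, 0)] + suffix


if __name__ == "__main__":
    # The body below is placeholder filler, not the real article text.
    filler_body = "Opening sentence of the article. " * 10
    print(naive_tweet("news", "Pepsi drinkers join the dots",
                      filler_body, "http://tinyurl.com/5qu3w3"))
```

Whatever the title and first sentence happen to say is all the reader gets, which is exactly why the QR codes never made it into the tweet.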

There is a lot to be said for automation, though. It’s not just that this system saves the author of the article or blog time. It also ensures consistency – all articles get posted. And, to be fair, most of the time these posts are okay…

…not always though. Personally, I’ve stopped following the Guardian Media on Twitter (and Scientific American) because these badly formed tweets annoy me way too much. Take the article above, for example. A human author might tweet something like this:

Pepsi launch campaign using QR codes on cans. Drinkers get access to secret content through phone browser.

That sums up the article much better, with 33 characters spare for the URL. I’d be far more likely to read the article having read that tweet, as I think QR codes are interesting (I’m a bit of a geek) and appreciate imaginative marketing.

So what’s the answer? Is there a way to achieve the normalization and efficiency of an automated system while being a good Twitterer? Well yes, I think there is.

I’ve been playing with the workflow engine in Nstein’s WCM and have written a nifty little Twitter-bot. Its secret is its ability to understand content. Nstein also produce a text mining engine (TME) which is ingrained into the WCM right down to the core. This means that semantic data about an article is always easily accessible. I’ve used this automatically extracted meta data in two ways for my bot.

Firstly, I’ve made use of the TME’s concept and entity extraction features to create hash-tags. For those who don’t know, a hash-tag is a piece of meta-data associated with a tweet. They are prefixed with a hash (#) character and are generally alphanumeric. A lot of automated tweets now use hash-tags with varying degrees of success. @northamptonrfc (the rugby team I support), for example, tags all tweets with “#rugby”. Well I never. The correct use of hash-tags (IMHO) is to:

  1. Add relevant meta data to a tweet which adds meaning.
  2. Create a trend to follow (essentially a thread across all Twitter users).

In order to meet those criteria the tag needs to be meaningful. It stands to reason. In the Pepsi example above two tags spring to mind: “#pepsi” and “#qrcode”. Including 2 spaces that makes an extra 15 characters which can (relatively) easily be fitted in before the TinyURL. Nstein’s TME would, undoubtedly, have picked these concepts out.

“QR Code” is what the TME refers to as a complex concept, that is, a phrase. “Pepsi” is an entity, specifically an organisation name. A simple regex can transform these strings into hash-tags. Using this technique the bot immediately adds a great deal of meaning to the tweet.
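
To give a flavour of that regex step, here is a minimal Python sketch. The function names are made up for illustration, and it assumes the concepts and entities have already been extracted and arrive as plain strings; it says nothing about how the TME itself hands them over.

```python
import re


def to_hashtag(term):
    """Lowercase the term and strip everything that is not a-z or 0-9."""
    return "#" + re.sub(r"[^0-9a-z]", "", term.lower())


def hashtags(terms):
    """Build a space-separated tag string, skipping empty tags and repeats."""
    tags = []
    for term in terms:
        tag = to_hashtag(term)
        if len(tag) > 1 and tag not in tags:
            tags.append(tag)
    return " ".join(tags)


if __name__ == "__main__":
    # A complex concept and an entity, as in the Pepsi example.
    print(hashtags(["Pepsi", "QR Code"]))
```

For the Pepsi article this gives "#pepsi #qrcode": 14 characters, or 15 once you add the space that separates the tags from the rest of the tweet.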

The second way in which I’ve leveraged the meta data extracted by the TME is using NSummarizer. This cartridge takes a document, splits it into sentence components, rates each component on its relevance to the article and returns the best scoring one(s) as a brief summary of the document. This is a really useful tool for getting around the issue of having a first sentence which is not (particularly) descriptive of the article as a whole.
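
I can’t show NSummarizer’s internals, but the general idea of extractive summarization can be sketched in a few lines of Python. Note the big assumption here: the scoring below is a crude word-frequency average that I have swapped in purely for illustration. It is not how NSummarizer rates sentences, and the stop-word list and function names are invented for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, far from complete).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
             "is", "it", "for", "that", "with", "by"}


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (a rough heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    """Lowercased alphanumeric tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]


def summarize(text, count=1):
    """Return the `count` sentences whose words are most frequent overall."""
    freq = Counter(words(text))

    def score(sentence):
        tokens = words(sentence)
        # Average document-wide frequency of the sentence's words.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    return sorted(split_sentences(text), key=score, reverse=True)[:count]


if __name__ == "__main__":
    article = ("Join the dots. Pepsi prints QR codes on cans. "
               "The QR codes unlock content for Pepsi drinkers.")
    print(summarize(article))
```

The point is simply that the sentence chosen for the tweet is picked on relevance to the whole article, not on its position at the top.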

So, does it work? Well, I’ve used this blog post as a test. Here’s the resultant tweet:

I’ve made use of the TME’s concept and entity extraction features to create hash-tags. #tweet #nsteinswcm http://tinyurl.com/d3ozzn

Personally, I count that as a success.

7 Responses to “Asimov’s 4th law: A robot will not tweet.”

  1. Promising, but too small a data sample. Could you do 25 pepsi blog posts and see what happens?

  2. Chris says:

    Thanks for the comment Avi. I’m not sure about getting Pepsi blog posts but I can certainly republish some copylefted content (probably Wikipedia). Watch this space.

  3. Rob says:

    How are you assuring NSummarizer keeps the Tweet within the specific character limit of Twitter? Are you relying on Twitter to change a long URL into a TinyURL?

    • chris says:

      I’ve done a number of things. First off, NSummarizer is being asked for a single sentence fragment. Usually this is small enough to fit in. In the cases where it is still too long I apply progressively more aggressive shortening techniques: reducing conjunctions (“and” to “&”, for example); removing punctuation; etc.

      It’s true that you will always be faced with the programmatic problem that a 140 character restriction presents, but by using a more concise and relevant string you can minimize the need for brute force truncation.

  4. Sarah Bourne says:

    Brilliant! Please feel free to test your clever bot on the @massgov twitter feed. I’d love to see the results.

  5. OlegR says:

    Do you think we can have this bot work as a private API?
    I’m sure cmswire, foliomag and others would love to try it out 🙂 and see the power of text mining

  6. chris says:

    Thanks everyone for the feedback. I’ve had a bit of a fine tune and played with a few more articles (thanks @sarahbourne). See the new post here. I’m still only talking conceptually but it addresses some of the points raised.
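
The progressive shortening described in the reply to Rob above (swap “and” for “&”, then remove punctuation, with brute force truncation only as a last resort) could be sketched roughly as follows. This is a guess at the shape of the logic, not the bot’s actual code; the function name and the exact order of the steps are assumptions.

```python
import re


def shorten(text, limit):
    """Apply progressively more aggressive steps until the text fits."""
    steps = (
        lambda s: re.sub(r"\band\b", "&", s),    # swap "and" for "&"
        lambda s: re.sub(r"[^\w\s&#@]", "", s),  # drop punctuation, keep & # @
        lambda s: s[:limit].rstrip(),            # last resort: hard truncation
    )
    for step in steps:
        if len(text) <= limit:
            break  # already short enough, so leave the rest untouched
        text = step(text)
    return text


if __name__ == "__main__":
    print(shorten("salt and pepper", 13))
```

Each step only runs if the previous ones were not enough, so a sentence that already fits is left exactly as it was.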
