How long is a (piece of) string?

I recently posted an article about a workflow script I cooked up for automatically tweeting about an article when it gets published via Nstein’s WCM (here). Basically, the script described in that article leveraged data from Nstein’s Text Mining Engine (TME) to create concise but still descriptive tweets. As a brief reminder of that post, the script used a computer-generated summary and added hash-tags extracted from the text to create a micro-blog post like this:

I’ve made use of the TME’s concept and entity extraction features to create hash-tags. #tweet #nsteinswcm

It seems to be an idea which the industry finds interesting (judging by my Twitter account and the comments on the article). Sarah Bourne’s (@sarahebourne) offer – in particular – I could not pass up. Sarah, who is the Chief Technology Strategist for the Commonwealth of Massachusetts (@massgov), had suggested that I try my micro-blogging bot on some of the MassGov content from their Twitter stream. So I did…

Well, as one comment on the last entry (by “Rob”) alluded to, no matter how relevant my tweet is, it still needs to comply with the 140-character limit set by Twitter. This presented some problems with the MassGov content. A big part of the problem was that the subjects of the Massachusetts articles were often political; they tend to have long sentences with complex subject matter and feature lots of relatively long words (“Massachusetts”, for example). So although pertinent hash-tags and relevant teasers were being generated, sometimes these were still over the limit.

The way my bot dealt with this situation was by using progressively more aggressive truncation techniques. At the light end of the scale it might swap all occurrences of “with” for “w”, “and” for “&”, etc. After each pass the tweet’s character count gets remeasured; if it’s still too long, the next truncation technique is applied. Ultimately, if all else fails, the tweet is truncated by removing words from the end until it no longer exceeds the limit.
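
The following is a minimal Python sketch of that escalation, not the bot's actual code. The first abbreviation pass uses the swaps mentioned above; the second pass, the function names and the choice to trim the teaser while keeping the trailing hash-tags are all assumptions made for illustration.

```python
import re

TWEET_LIMIT = 140  # Twitter's limit at the time of the article

# Ordered from mild to aggressive. The first pass uses the swaps named in
# the article; the second is a made-up example of a harsher pass.
ABBREVIATION_PASSES = [
    {"with": "w", "and": "&"},
    {"for": "4", "to": "2"},
]


def abbreviate(text, replacements):
    """Swap whole words only, so "and" changes but "band" does not."""
    for word, short in replacements.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, lambda match, s=short: s, text)
    return text


def compose(teaser, tags):
    """Join the teaser and its trailing hash-tags into one tweet."""
    return " ".join([teaser] + list(tags)).strip()


def shorten(teaser, tags=(), limit=TWEET_LIMIT):
    """Return (tweet, truncated); truncated is True if words were dropped."""
    for replacements in ABBREVIATION_PASSES:
        # Re-measure before every pass and stop as soon as the tweet fits.
        if len(compose(teaser, tags)) <= limit:
            return compose(teaser, tags), False
        teaser = abbreviate(teaser, replacements)

    truncated = False
    while len(compose(teaser, tags)) > limit and " " in teaser:
        # Last resort: remove the final word of the teaser.
        teaser = teaser.rsplit(" ", 1)[0]
        truncated = True
    return compose(teaser, tags), truncated
```

For instance, `shorten("Bread with butter and jam", ["#food"], limit=24)` returns `("Bread w butter & #food", True)`: the abbreviations alone were not enough, so the last word was dropped. A one-word teaser that is still too long is returned as-is, over the limit, which a real bot would need to handle separately.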

Obviously, this can lead to the very problem the original post was discussing: ending up with automatically generated tweets which do not describe the article they are plugging. Now, the bot I created makes this situation far less common, no doubt – but not impossible. Adding hash-tags guarantees a level of meaning which would otherwise be impossible to achieve with an automated system, and that makes up for truncated sentences to some extent. However, I was not satisfied. Here’s an example of a tweet which was too long:

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim compensation Assistance. #massachusetts #compensation

In fact it’s 9 characters too long. Now the bot would have truncated it to this:

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim compensation. #massachusetts #compensation

As it turns out, that wasn’t too destructive, but I might not have been so lucky.

That tweet had given me an idea, though. The inspiration? TinyURL.

I don’t use TinyURL when I’m tweeting. These days who does? Twhirl (or Seesmic) is my Twitter client and when I want to shorten a URL it offers me a list of services to use. I always make the same choice: “”. The reason is pretty obvious – their domain name is 6 characters shorter.

Okay, so a bit of a no-brainer there then. Switch my bot’s shortening service to “”, save at least 6 characters per tweet. But that wasn’t really the point. I would never have used TinyURL, so why had I programmed my bot to? What was I thinking?

Well the truth of the matter is this: I wasn’t. I’d used the TinyURL API before and so just stuck it into the code. So I started thinking about what else I might have done wrong. Or, more specifically, I started to think about how I tweeted (in the flesh, as it were) and if my bot was doing as good a job.

Once I started down that train of thought one big difference struck me: where possible I use inline hash-tags. If the keyword you are tagging already exists in the post then you are not adding meaning, per se. You may be emphasizing that word and you may also be starting a trend for replies and retweets. Therefore, it stands to reason that you can use the hash-tag inline and not waste space by duplicating the word.
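
As a rough illustration of inline tagging, here is a short Python sketch. It is not the bot's code: the function names are hypothetical, and the tag format (lowercased, spaces removed) is an assumption inferred from how the bot's tweets look.

```python
import re


def tag_for(keyword):
    """Build a hash-tag by lowercasing and removing spaces (assumed format)."""
    return "#" + re.sub(r"\s+", "", keyword.lower())


def apply_hashtags(teaser, keywords):
    """Tag keywords inline where they occur; append the rest at the end."""
    trailing = []
    for keyword in keywords:
        # Match the keyword as a whole word that is not already tagged.
        pattern = re.compile(
            r"(?<![#\w])" + re.escape(keyword) + r"\b", re.IGNORECASE
        )
        if pattern.search(teaser):
            # The word is already in the teaser: tag its first occurrence.
            teaser = pattern.sub(
                lambda match, k=keyword: tag_for(k), teaser, count=1
            )
        else:
            # The word adds meaning the teaser lacks: keep it as a suffix.
            trailing.append(tag_for(keyword))
    return " ".join([teaser] + trailing)


teaser = ("Attorney General Martha Coakley Sponsors Legislation "
          "to Enhance Victim compensation Assistance.")
print(apply_hashtags(teaser, ["massachusetts", "compensation"]))
```

Each keyword that is tagged inline saves its own length plus two characters (the separating space and the extra “#”) compared with appending it, which is 14 characters for “compensation”.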

So, having made those changes to the program, I republished the MassGov article. This time my bot tweeted:

Initiative will help municipalities pursue clean #energy projects make best use of federal stimulus funds. #massachusetts

Much better. It actually transpires that (perhaps unsurprisingly) these inline tags occur pretty frequently in the tweets. I’ve now republished a selection; here are the tweets:

Officials “flex” highway stimulus funds to support “net zero” transit center. #transportation #greenfield

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim #compensation Assistance. #massachusetts

Patrick Administration Credits Dropout Prevention Efforts for Improvement. #student #malden

#patrickadministration Receives $1 Million Grant to Support Expanded Services for People with #traumaticbraininjuries.

Welcome to DCR Park Server Day. #volunteer #capecod

Costs to Employers ThirdLowest #oregon Survey Reports Under Patrick Administration Rates Have. #compensationrates

The results there are, I think, pretty good. Out of the seven articles I’ve republished, only the last one has needed to be truncated.

My bot isn’t perfect and it won’t create faultless tweets every time; however, it is a huge improvement over the traditional blind truncation. My conclusion – from the previous post, the discussion around it and the experiments I have carried out – is that Twitter automation has too many benefits for it not to be used by online publishers, but it will (probably) never be perfect 100% of the time. What we’ve accomplished here, so far, is a much higher and more consistent level of readability and relevancy and a much reduced frequency of the need to truncate teasers. I’m sure there are many techniques I could implement to improve the results (and I may do in the future) but for now there is just one more change I’m going to make…

As I mentioned at the beginning of this article (and in the previous one), this experiment has been done using the workflow engine in Nstein’s WCM. It’s a scripted state transition engine, so when I published articles they were also passed to the Twitter-bot for it to create a tweet. The change I am going to make is this: create a new “Needs tweeting” workflow state. Then, in the minority of cases where the bot cannot tweet about an article without truncating the teaser, it passes the responsibility on to a human twitterer.
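
The routing decision could look something like the Python sketch below. It does not use the WCM workflow API: the “Tweeted” state name, the function names and the `post_tweet` callable (a stand-in for whatever Twitter client is in use) are all hypothetical. Only the “Needs tweeting” state comes from the plan described above.

```python
TWEET_LIMIT = 140

# State names are illustrative; only "Needs tweeting" comes from the article.
STATE_TWEETED = "Tweeted"
STATE_NEEDS_TWEETING = "Needs tweeting"


def next_state(tweet, limit=TWEET_LIMIT):
    """Pick the workflow state for an article from its candidate tweet."""
    if len(tweet) <= limit:
        return STATE_TWEETED
    return STATE_NEEDS_TWEETING


def on_publish(article_id, tweet, post_tweet):
    """Post the tweet if it fits, otherwise hand the article to a person.

    post_tweet is a caller-supplied callable standing in for a Twitter client.
    """
    state = next_state(tweet)
    if state == STATE_TWEETED:
        post_tweet(tweet)
    return article_id, state
```

Here `tweet` is the candidate after the non-destructive steps (abbreviations and inline tags) but before any words are dropped, so an over-length candidate is never posted in truncated form.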

There are a huge (really, really huge) number of things that we can accomplish with the TME. Some of the key ones, like SEO, have already been taken to very high standards, but we are only scratching the surface of possible uses. Ideas and experiments, such as this one, are key to our industry’s growth. From my point of view, accomplishing automation in 85% of cases and a high level of quality in 100% would be a fantastic accomplishment. Let’s face it: in this day and age information has been commoditized, so quality becomes the only differentiator between publishers. Quality is what attracts an audience and certainly what keeps them… even on Twitter.

4 Responses to “How long is a (piece of) string?”

  1. OlegR says:

    here are some suggestions for the bot that I learned from my personal experience:
    1. Twitter (or clients) do not like signs like & / | ” etc. – these come in often as yada yada and auto truncate meaningful part of the tweet with this kind of rubbish (for ex. #34w%)

    2. You might want to abbreviate states (#MA instead of #massachusetts, or #QC instead of #Quebec). I know that Nstein’s text mining engine comes with a geographic location cartridge that has all these abbreviations built in.

    3. When you hash-tag a long name, please use an underscore as the space between words. Smashing all the words together becomes very geek-compatible, but not human-being-compatible if you know what I mean 🙂

    4. I’ve run across a dictionary of SMS word truncations – you might want to integrate those to save some Twitter real estate.


    Oleg (twitter – OlegR)

    • admin says:

      Great comments @OlegR. I particularly like the idea of a dictionary of SMS word truncations. No doubt there are many other things you could do to squeeze extra characters out once you have a decent semantic understanding of the content at the core.

  2. Sarah Bourne says:

    Well, that’s impressive!

    I like your idea of kicking it back to the human, but for a slightly different reason. Far too much of our content is written far above the 8th-grade reading level recommended for web publishing. If the bot can’t fit it into 140 characters, maybe it’s because you used too many big words.

    Of course, we’re stuck with “Massachusetts”, but in similar situations, we shorten it to “Mass.” or the postal code “MA”. It might be good to let users set abbreviations for certain words or phrases.

  3. Very interesting set of posts on Twitterbots. Publishers right now are at a critical stage with Twitter and because of the short form of the medium aren’t putting enough effort into their Tweet strategy.

    “Quality is what attracts an audience and certainly what keeps them… even on Twitter.”

    Right on! And if your content on Twitter sucks it will ultimately detract from what you are trying to do. You’ve shown that automation can be used to produce decent robot Tweets that actually meet a consumer need.
