Wait till I come!
Random notes by Chris Heilmann

Making twitter multilingual with a hack of the Google Translation API

After helping to fix the Yahoo search result pages with the correct language attributes to make them accessible for screen reader users, I was wondering how this could be done with user-generated content. The easiest option, of course, would be to ask users to state their language in their profile, but if you are bilingual like me you actually write in different languages. The other option would be to make me as the user pick the language every time I type something, which is annoying.

I then stumbled across Google’s Ajax Translation API and thought it should be very easy to marry it with, for example, the JSON output of the twitter API to add the correct lang attributes on the fly.

Alas, this was not as easy as I thought. On the surface it is very easy to use Google’s API to tell me what language a certain text is likely to be:


var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    // map the language code (e.g. "es") back to its name in google.language.Languages
    var language = 'unknown';
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    // show the text and the detected language name
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language;
  }
});

However, if you want to use this in a loop you are out of luck. The google.language.detect method fires off an internal XHR call, and the result set only gives you an error code, the confidence level, an isReliable boolean and the language code. This is a lot, but there is no way to tell the function that gets the results which text was analyzed. It would be great if the API repeated the text or at least allowed you to set a unique ID for the current XHR request.

As Ajax requests do not return in any guaranteed order, there is no way of telling which result belongs to which text, so I was stuck.

Enter Firebug. Analyzing the requests going through, I realized there is a REST URL being called by the internal methods of google.language. In the case of language detection this is:


http://www.google.com/uds/GlangDetect?callback={CALLBACK_METHOD}&context={NUMBER}&q={URL_ENCODED_TEXT}&key=notsupplied&v=1.0

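You can use the number and your own callback method to create SCRIPT nodes in the document that get these results back. The number is handed back as the first argument of the callback, so each result can be matched to its text. The return call is: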


CALLBACK_METHOD('NUMBER',{"language" : "es","isReliable" : true,"confidence" : 0.24716422},200,null,200)
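A rough sketch of that SCRIPT node idea in JavaScript, assuming the endpoint behaves as described above. The context number is simply the index of the text in an array, so the callback can look the text up again. The names buildDetectUrl, makeDetectCallback and detectAll are made up for this illustration and are not part of Google's API.

```javascript
// Hypothetical helper: builds the detection URL shown above.
// The context number is echoed back as the callback's first argument.
function buildDetectUrl(text, context, callbackName) {
  return 'http://www.google.com/uds/GlangDetect' +
         '?callback=' + encodeURIComponent(callbackName) +
         '&context=' + context +
         '&q=' + encodeURIComponent(text) +
         '&key=notsupplied&v=1.0';
}

// Hypothetical helper: returns a callback that uses the context number
// to find the text a result belongs to, whatever order results arrive in.
function makeDetectCallback(texts, onResult) {
  return function (context, result) {
    var index = parseInt(context, 10);
    onResult(texts[index], result ? result.language : null);
  };
}

// Browser-only part: one SCRIPT node per text. Assumes a global
// function named callbackName exists to receive the results.
function detectAll(texts, callbackName) {
  for (var i = 0; i < texts.length; i++) {
    var script = document.createElement('script');
    script.src = buildDetectUrl(texts[i], i, callbackName);
    document.getElementsByTagName('head')[0].appendChild(script);
  }
}
```

In a page you would assign the function returned by makeDetectCallback to a global such as window.feedresult and then call detectAll(texts, 'feedresult'). Only detectAll touches the DOM, so the other two helpers can be tried out on their own.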

However, as I am already using PHP to pull information from another service, I ended up using curl for the whole proof of concept to make twitter speak in natural language:


    <ul>
    <?php
      // curl the twitter feed
      $url = 'http://twitter.com/statuses/public_timeline.rss';
      $ch = curl_init(); 
      curl_setopt($ch, CURLOPT_URL, $url); 
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
      $twitterdata = curl_exec($ch); 
      curl_close($ch); 
      // get all the descriptions
      // "#" as the delimiter, so the "/" in the closing tag needs no escaping
      preg_match_all('#<description>([^<]+)</description>#msi', $twitterdata, $descs);
      // skip the main feed description
      foreach($descs[1] as $key=>$d){
        if($key===0){
          continue;
        }
        // assemble REST call and curl the result
        $url = 'http://www.google.com/uds/GlangDetect?callback=' .  
               'feedresult&context=' . $key . '&q=' . urlencode($d) .
               '&key=notsupplied&v=1.0';
        $ch = curl_init(); 
        curl_setopt($ch, CURLOPT_URL, $url); 
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
        $langcode = curl_exec($ch); 
        curl_close($ch);
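        // get the language (the response puts spaces around the colon)
        preg_match('/"language"\s*:\s*"([^"]+)"/', $langcode, $res);
        // write out the list item, leaving out lang if detection failed
        $lang = isset($res[1]) ? ' lang="' . htmlspecialchars($res[1]) . '"' : '';
        echo '<li' . $lang . '>' . $d . '</li>';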
      }
    ?>
    </ul>
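The PHP above only pulls the language code out of the response. As a sketch of how the other fields could be kept as well, here is a small JavaScript function that unwraps a response body of the shape shown earlier. The name parseDetectResponse is hypothetical, and the function assumes the response always looks like the one example call in this post.

```javascript
// Hypothetical helper: parses a raw response body such as
// feedresult('3',{"language" : "es","isReliable" : true,"confidence" : 0.24},200,null,200)
// Returns null when the body does not look like that.
function parseDetectResponse(body) {
  // Capture the quoted context and the first {...} object literal.
  var match = /^\s*[\w.$]+\(\s*'([^']*)'\s*,\s*(\{[^}]*\})/.exec(body);
  if (!match) {
    return null;
  }
  var data;
  try {
    data = JSON.parse(match[2]);
  } catch (e) {
    return null; // the object literal was not valid JSON after all
  }
  return {
    context: match[1],
    language: data.language,
    isReliable: data.isReliable,
    confidence: data.confidence
  };
}
```

With the confidence and isReliable values at hand, a page could for example only set the lang attribute when the detection is marked as reliable.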

Check out the result: Public twitter feed with natural language support

I will do some pure JavaScript solutions soon, too. This could be a great chance to make UGC a lot more accessible.

Thanks to Mark Thomas and Tim Huegdon for bouncing around ideas about how to work around the XHR issue.

2 Responses to “Making twitter multilingual with a hack of the Google Translation API”

  1. Richard Says:

    Nice work. Definitely very cool.

    It incorrectly identified Chinese as Italian, and Japanese as Slovak (I think; sk?) though. Which seems like a pretty strange mistake to make, especially having heard that Google's translator is quite good at translating Chinese (and Arabic) in particular. Then again, all that code means nothing to me, though (the language bit is what interested me) so it might be some limitation with how the API parses the text? Or that might not even make any sense. It's a shame the demo page doesn't output the confidence values too. I'd be interested to see that.

    Another potential issue might be correctly identifying tweets in one language, that happen to contain words of another language. For example, it showed this tweet: "GIF Sundsvall vs. Helsingborgs IF (Allsvenskan): Match has finished. Result: 0:3" (URL from tweet removed, lest I annoy your spam filters) as Swedish, despite obviously being English.

    The code means very little to me, but the idea is definitely awesome.

    It's a shame automatic machine translation comes with so many inherent downfalls. Though I think Google's is one of the better ones, with quite a shallow approach to disambiguation, which would be the biggest hurdle (if you even intended to advance it to actually translating the given text).

  2. Mark Neigh Says:

    Hello, This is very cool. I am working on a paper about mobile social media and its effects on global English. Do you have any stats on %s of languages being used on Twitter? Or any other interesting bits you could share?
