July 25, 2011

Language Identification for Text Mining and NLP

The Tower of Babel and ships in a large marine landscape.

Introduction

Language identification is a key task in the text mining process. Analyzing extracted text with natural language processing (NLP), or using it to train machine learning models, requires a good language identification algorithm: if the language is misidentified, every subsequent step runs on a wrong assumption and its results are worthless. NLP algorithms must be adjusted to different corpora and to the grammar of each language, and certain NLP software is best suited to certain languages. For example, NLTK is the most popular natural language processing package for English under Python, while FreeLing is best for Spanish. The efficiency of language processing depends on many factors.
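One way to picture this dependency is as a dispatch step: the detected language code decides which pipeline runs. The following is only a minimal sketch. The two pipeline functions and the `PIPELINES` table are hypothetical placeholders, not real NLTK or FreeLing calls.

```python
# Hypothetical sketch: route text to a language-specific pipeline.
# The pipeline functions are stand-ins; a real system would call
# NLTK, FreeLing, or another toolkit at these points.

def process_english(text):
    return ("en", text.lower().split())

def process_spanish(text):
    return ("es", text.lower().split())

# Assumed mapping from language code to pipeline.
PIPELINES = {"en": process_english, "es": process_spanish}

def route(text, language_code):
    """Run the pipeline registered for language_code, or fail loudly."""
    try:
        pipeline = PIPELINES[language_code]
    except KeyError:
        raise ValueError("no pipeline for language %r" % language_code)
    return pipeline(text)

if __name__ == "__main__":
    print(route("Hola Mundo", "es"))
```

Failing loudly on an unknown code is a deliberate choice here: silently running the wrong pipeline is exactly the error that a bad identification would cause.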

A very high-level model for text analysis includes the following tasks:

Text Extraction
Text can be extracted by: scraping a web site, importing it in a specific format, getting it from a database, or accessing it via an API.

Text Identification
Text identification is the process of separating the interesting text from other content or formatting that adds noise to the analysis. For example, a blog page can include advertising, menus, and other information besides the main content.
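As a rough illustration of text identification, the sketch below keeps the text of an HTML page while skipping elements that usually hold menus, scripts, and similar noise. It uses only the standard library `html.parser` module. The `SKIP_TAGS` set is an assumption, and real pages generally need a more robust content extractor.

```python
from html.parser import HTMLParser

# Assumption: tags that usually wrap menus, ads, scripts, and other non-content.
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}

class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

if __name__ == "__main__":
    page = ("<html><body><nav>Home | About</nav>"
            "<p>Language identification matters.</p>"
            "<footer>Ads</footer></body></html>")
    print(extract_main_text(page))
```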

NLP
NLP is a set of algorithms to aid in the processing of different languages. See links to NLP software packages and articles here.

Machine Learning
Machine learning is a necessary step for tasks such as collaborative filtering, sentiment analysis and clustering.

Software Alternatives

There is a lot of language identification software available on the web. NLTK uses Crúbadán, while Gate includes TextCat. At Data Big Bang, we like to use the Google Language API because it is very accurate, even for a single word. Its response also includes a confidence measure.
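To give an idea of how an offline identifier of the TextCat kind can work, here is a minimal sketch of the rank-order character n-gram method described by Cavnar and Trenkle. It is not TextCat's actual code: the training samples are toy data, and the `TOP` and `max_n` values are assumptions chosen to keep the example small.

```python
from collections import Counter

TOP = 300  # profile size; also the penalty for an n-gram missing from a profile

# Toy training text, far too small for real use; real profiles come from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
}

def ngram_profile(text, max_n=3, top=TOP):
    """Return the most frequent character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so ties rank deterministically.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]

def out_of_place(doc_profile, lang_profile, penalty=TOP):
    """Sum of rank differences; n-grams absent from lang_profile cost `penalty`."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(abs(i - rank[gram]) if gram in rank else penalty
               for i, gram in enumerate(doc_profile))

PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

def identify(text, profiles=PROFILES):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))

if __name__ == "__main__":
    print(identify("the dog sleeps in the house"))
```

With profiles this small the result for unseen text is unreliable, which is part of why very short inputs are hard for this family of methods.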

Sadly, Google has deprecated the Google Language API family, and we have added these APIs to our “Google NoAPI” list. They can be used until they are shut down.

Example Including an API Key

Google highly recommends including an API key with the API request. You can get one at http://code.google.com/apis/loader/signup.html or with the new Google API Console https://code.google.com/apis/console/. Use it as follows:

language-identification.py

#!/usr/bin/env python3

# Language detection using the Google Language API: http://code.google.com/apis/language/translate/v2/getting_started.html
# It can handle Unicode text. You need to add your own exception/error handling.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://www.googleapis.com/language/translate/v2/detect"
KEY = ""  # Insert your key here. Get it from: https://code.google.com/apis/console/

def detect_language(text):
    # urlencode percent-encodes str values as UTF-8 by default
    query_field = urllib.parse.urlencode({'key': KEY, 'q': text})
    parsed_url = urllib.parse.urlparse(ENDPOINT)
    url = urllib.parse.urlunparse(parsed_url._replace(query=query_field))

    # Fetch the JSON response and keep the first detection for the first query
    with urllib.request.urlopen(url) as http_response:
        data = json.loads(http_response.read().decode('utf-8'))
    response = data['data']['detections'][0][0]

    return response  # it answers: {'isReliable': , 'confidence': , 'language': }

if __name__ == '__main__':
    text = input("Text? ")
    response = detect_language(text)

    print(response)

The Google Language API for language identification is very easy to use. It used to be very permissive in terms of usage limits, but now the rate limit status can be found in the console.
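The script above leaves error handling to the caller. One possible way to consume the response defensively is sketched below. It assumes the response shape shown in the script's comment (`data` → `detections` → a `language` and `confidence` pair). The `min_confidence` threshold and the "und" (undetermined) fallback are illustrative assumptions, and the sample JSON uses made-up values rather than a real API reply.

```python
import json

def pick_language(raw_json, min_confidence=0.5, fallback="und"):
    """Return the detected language code, or `fallback` when the
    response is malformed or the confidence is below the threshold."""
    try:
        data = json.loads(raw_json)
        detection = data["data"]["detections"][0][0]
    except (ValueError, KeyError, IndexError, TypeError):
        return fallback
    if detection.get("confidence", 0.0) < min_confidence:
        return fallback
    return detection.get("language", fallback)

if __name__ == "__main__":
    # Made-up response body, shaped like the fields named in the script above.
    sample = '{"data": {"detections": [[{"language": "es", "isReliable": false, "confidence": 0.92}]]}}'
    print(pick_language(sample))
```

Returning a fallback code instead of raising keeps a batch job running when a single request fails, at the cost of having to filter out the undetermined items later.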

Benchmarking 

Different language identification algorithms can easily be benchmarked against Google’s results. Testing with single words and short sentences is a good indicator, especially if the algorithms will be used for services like Twitter, where the texts are very short.
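Such a benchmark can be sketched as an agreement score between a candidate detector and a set of reference labels, reported separately for single words and short sentences. Everything below is illustrative: `SAMPLES` is hand-written placeholder data standing in for labels collected from the reference service, and `toy_detector` is a deliberately naive stand-in for the algorithm under test.

```python
from collections import defaultdict

def benchmark(detector, labelled_samples):
    """Return the fraction of samples where detector(text) matches the
    reference label, split into single words and short sentences."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, reference in labelled_samples:
        bucket = "single word" if len(text.split()) == 1 else "short sentence"
        totals[bucket] += 1
        if detector(text) == reference:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}

# Placeholder (text, reference language) pairs, written by hand for this sketch.
SAMPLES = [
    ("hello", "en"),
    ("gracias", "es"),
    ("thanks", "en"),
    ("good morning to you", "en"),
    ("buenos días a todos", "es"),
]

def toy_detector(text):
    # Naive stand-in: guesses Spanish for anything ending in "s".
    return "es" if text.endswith("s") else "en"

if __name__ == "__main__":
    for bucket, score in benchmark(toy_detector, SAMPLES).items():
        print(bucket, round(score, 2))
```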

Resources

  1. Google Scholar search on language identification
  2. Google language detection
  3. Lingua Identify for Perl
  4. A language detection library for Java
  5. Language identification addition for NLTK
  6. Sentiment analysis and language processing tools
  7. Balie language identification
  8. Gate
  9. NLTK
  10. FreeLing
  11. TextCat and TextCat under Gate
  12. LingPipe