Introduction
Language identification is a key task in the text mining process. Successful analysis of extracted text with natural language processing or machine learning requires a good language identification algorithm: if the language is misidentified, the error propagates and invalidates all subsequent processing. NLP algorithms must be tuned to different corpora and to the grammar of each language, and certain NLP software is best suited to certain languages. For example, NLTK is the most popular natural language processing package for English under Python, while FreeLing is best for Spanish. The efficiency of language processing depends on many factors.
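To make the idea concrete, here is a toy sketch of one common approach, character n-gram profiles (the family of techniques behind tools like TextCat). The two training sentences are hypothetical stand-ins; a real system would build profiles from large corpora.

```python
# Toy character-trigram language identifier (illustration only).
# Real systems train profiles on large corpora; the tiny samples
# below are hypothetical stand-ins just to show the technique.
from collections import Counter

def trigrams(text):
    """Extract overlapping character trigrams from lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Hypothetical training samples; real profiles would use far more text.
PROFILES = {
    'en': Counter(trigrams("the quick brown fox jumps over the lazy dog and the cat")),
    'es': Counter(trigrams("el rapido zorro marron salta sobre el perro perezoso y el gato")),
}

def detect(text):
    """Return the language whose trigram profile overlaps the input most."""
    grams = Counter(trigrams(text))
    scores = {
        lang: sum(min(count, profile[g]) for g, count in grams.items())
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)
```

Even with such minimal profiles, `detect("el perro y el gato")` prefers Spanish, but accuracy on single words is poor with small profiles, which is exactly why services like the Google Language API are attractive.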
A very high-level model for text analysis includes the following tasks:
Text Extraction
Text can be extracted by scraping a web site, importing it in a specific file format, fetching it from a database, or accessing it via an API.
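One of the extraction paths above, pulling text from a database, can be sketched with an in-memory SQLite store. The table and column names here are hypothetical:

```python
# Sketch of text extraction from a database, using an in-memory
# SQLite store. The 'documents' table and its columns are hypothetical.
import sqlite3

def fetch_documents(conn):
    """Return all stored document bodies, ordered by insertion id."""
    cur = conn.execute("SELECT body FROM documents ORDER BY id")
    return [row[0] for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [("First article text.",), ("Second article text.",)],
)
docs = fetch_documents(conn)
```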
Text Identification
Text identification separates the text of interest from other content or formatting that adds noise to the analysis. For example, a blog page can include advertising, menus, and other information besides the main content.
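A minimal sketch of this separation: keep text inside an article element and drop navigation and advertising blocks. The tag choices are assumptions; real pages need site-specific rules or a dedicated content-extraction library.

```python
# Minimal content-vs-noise separation sketch using the standard
# library HTML parser. Keeps text inside <article>, drops text in
# <nav>/<aside>/<script>/<style>. Tag choices are assumptions.
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    NOISE_TAGS = {"nav", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.in_article = 0   # nesting depth inside <article>
        self.in_noise = 0     # nesting depth inside noise tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in self.NOISE_TAGS:
            self.in_noise += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article -= 1
        elif tag in self.NOISE_TAGS:
            self.in_noise -= 1

    def handle_data(self, data):
        # Keep only text that is inside the article and outside noise.
        if self.in_article > 0 and self.in_noise == 0 and data.strip():
            self.chunks.append(data.strip())

extractor = ContentExtractor()
extractor.feed(
    "<html><nav>Home | About</nav>"
    "<article>Main post text.<aside>Ad: buy now!</aside></article>"
    "<footer>Copyright</footer></html>"
)
content = " ".join(extractor.chunks)
```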
NLP
NLP is a set of algorithms to aid in the processing of different languages. See the Resources section below for links to NLP software packages and articles.
Machine Learning
Machine learning is a necessary step for tasks such as collaborative filtering, sentiment analysis and clustering.
Software Alternatives
There is a lot of language identification software available on the web. NLTK uses Crúbadán, while Gate includes TextCat. At Data Big Bang, we like to use the Google Language API because it is very accurate even for a single word. It also includes an accuracy measure in the response.
Sadly, Google has deprecated the Google Language API family, and we have added it to our “Google NoAPI” list. Deprecated APIs can still be used until they are shut down.
Example Including an API Key
Google highly recommends including an API key with each API request. You can get one at http://code.google.com/apis/loader/signup.html or with the new Google API Console https://code.google.com/apis/console/. Use it as follows:
language-identification.py
#!/usr/bin/python
# Language Detection using Google Language API:
# http://code.google.com/apis/language/translate/v2/getting_started.html
# It can handle unicode texts. You need to add your own exception/error handling.

import sys
import urllib
import urlparse
import simplejson

ENDPOINT = "https://www.googleapis.com/language/translate/v2/detect"
KEY = ""  # Insert your key here. Get it from: https://code.google.com/apis/console/

def detect_language(text):
    utf8_encoded_text = text.encode('utf-8')
    query_field = urllib.urlencode({'key': KEY, 'q': utf8_encoded_text})
    parsed_url = urlparse.urlparse(ENDPOINT)
    url = urlparse.urlunparse((parsed_url[0], parsed_url[1], parsed_url[2],
                               parsed_url[3], query_field, parsed_url[5]))
    data = simplejson.loads(urllib.urlopen(url).read())
    response = data['data']['detections'][0][0]
    return response  # Answers: {'isReliable': ..., 'confidence': ..., 'language': ...}

if __name__ == '__main__':
    terminal_encoding = sys.stdin.encoding
    text = raw_input("Text? ")
    unicode_text = text.decode(terminal_encoding)
    response = detect_language(unicode_text)
    print response
The Google Language API for language identification is very easy to use and was very permissive in its usage limits; the current rate limit status can be found in the API console.
Benchmarking
Different language identification algorithms can easily be benchmarked against Google's. Testing with single words and short sentences is a good indicator, especially if the algorithms will be used on services like Twitter, where messages are very short.
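A benchmark of this kind reduces to computing accuracy over a labeled test set. The sketch below uses hypothetical placeholder detectors and samples; in practice the reference labels would come from Google's responses and the test set from real short texts.

```python
# Benchmarking sketch: score a detector against reference labels.
# The detector and the labeled samples below are hypothetical
# placeholders for real systems and real test data.
def accuracy(detector, labeled_samples):
    """Fraction of samples where detector(text) matches the reference label."""
    hits = sum(1 for text, label in labeled_samples if detector(text) == label)
    return hits / len(labeled_samples)

# Hypothetical short-text test set, in the spirit of Twitter-length input.
SAMPLES = [("hello world", "en"), ("hola mundo", "es"), ("bonjour", "fr")]

def naive_detector(text):
    """Placeholder detector that always guesses English."""
    return "en"

score = accuracy(naive_detector, SAMPLES)  # 1/3 for this test set
```

Running several detectors through the same `accuracy` harness on the same samples gives a directly comparable score for each.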
Resources
- Google Scholar search on language identification
- Google language detection
- Lingua Identify for Perl
- A language detection library for Java
- Language identification addition for NLTK
- Sentiment analysis and language processing tools
- Balie language identification
- Gate
- NLTK
- FreeLing
- TextCat and TextCat under Gate
- LingPipe
Another language detection web service is http://www.whatlanguage.net . It has an easy-to-use API that returns a JSON or XML object with the detected language. It supports more than 100 languages and detects the language of texts, websites, and documents.