Dyslexia is a surprisingly common problem. A 2005 study estimated that between 5 and 17% of the U.S. population is affected by dyslexia and unable to read and write adequately. While there is no known cure for the underlying cause of dyslexia, Lingit is trying to help dyslexia sufferers cope.
Lingit knew that in order to build a statistically accurate language model, there was one fundamental rule: the more data you have, the better. To build this tool, Lingit turned to Atbrox, a Norway-based company focusing on data mining, data analysis and cloud-based solutions.
Since they were dealing with text corpora (large, structured sets of texts) of up to a terabyte in size, Atbrox created a solution using Amazon Web Services. Lingit uploads its data to S3, starts the extraction process on an arbitrary number of compute instances using Amazon Elastic MapReduce parallel processing, and finally downloads the resulting files from S3 once the MapReduce job has finished.
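The upload, run and download cycle described above could look roughly like the following sketch. It is written against today's boto3 SDK, which is an assumption and not necessarily what Atbrox used. The function name, bucket layout (input/, code/, output/), script names, release label, instance types and IAM role names are all hypothetical placeholders.

```python
import os


def run_ngram_job(s3, emr, bucket, corpus_path, instance_count):
    """Upload a corpus, run one transient EMR cluster, download the results.

    `s3` and `emr` are expected to be boto3 clients, for example
    boto3.client("s3") and boto3.client("emr"). All keys, names and
    instance settings below are illustrative assumptions.
    """
    # 1. Upload the raw corpus to S3 (hypothetical key layout).
    s3.upload_file(corpus_path, bucket, "input/corpus.txt")

    # 2. Start a cluster that shuts down once its single step has finished.
    response = emr.run_job_flow(
        Name="ngram-extraction",
        ReleaseLabel="emr-6.15.0",
        Instances={
            "InstanceCount": instance_count,
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "extract-ngrams",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files",
                    f"s3://{bucket}/code/mapper.py,s3://{bucket}/code/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", f"s3://{bucket}/input/",
                    "-output", f"s3://{bucket}/output/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    cluster_id = response["JobFlowId"]

    # Block until the cluster has terminated, i.e. the job is done.
    emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)

    # 3. Download the result files (a single listing call returns at most
    #    1000 keys; a larger output would need pagination).
    listing = s3.list_objects_v2(Bucket=bucket, Prefix="output/")
    keys = [o["Key"] for o in listing.get("Contents", []) if not o["Key"].endswith("/")]
    for key in keys:
        s3.download_file(bucket, key, os.path.basename(key))
    return keys
```

Passing the clients in as arguments keeps the sketch free of credentials and makes the instance count a plain parameter, which mirrors the "arbitrary number of compute instances" idea.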
Amund Tveit, founder of Atbrox, describes the solution: “The job is divided into four different phases, each phase having a map and a reduce operation. First, the raw data is tokenized, meaning that the text is split into single words or tokens. Next, certain tokens, such as dates and phone numbers, are normalized into a standard notation according to Lingit’s requirements. The third phase is where the tokens are grouped into sentences based on a set of rules that looks at special tokens such as abbreviations and punctuation. Finally, the n-grams are extracted and written as a set of files to S3.”
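To make the four phases concrete, here is a minimal single-machine sketch in Python. It is not Atbrox's implementation: the token pattern, the date placeholder, the abbreviation list and the choice to count n-grams in the reduce step are all assumptions made for illustration, and the real job runs each phase as a distributed map and reduce.

```python
import re
from collections import Counter

# Hypothetical rules; Lingit's actual requirements are not public here.
DATE_RE = re.compile(r"\d{1,2}-\d{1,2}-\d{4}")
TOKEN_RE = re.compile(r"\d{1,2}-\d{1,2}-\d{4}|\w+|[^\w\s]")
ABBREVIATIONS = {"prof", "dr", "mr"}
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Phase 1: split raw text into date, word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def normalize(tokens):
    """Phase 2: lowercase tokens and map dates to one assumed notation."""
    return ["<DATE>" if DATE_RE.fullmatch(t) else t.lower() for t in tokens]


def split_sentences(tokens):
    """Phase 3: end a sentence at . ! ? unless the period follows an abbreviation."""
    sentences, current = [], []
    for tok in tokens:
        if tok in SENTENCE_END:
            after_abbrev = tok == "." and bool(current) and current[-1] in ABBREVIATIONS
            if current and not after_abbrev:
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences


def map_ngrams(sentence, n):
    """Phase 4, map step: emit every n-gram inside one sentence."""
    return [tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]


def reduce_counts(ngram_lists):
    """Phase 4, reduce step: sum the occurrences of each n-gram."""
    counts = Counter()
    for ngrams in ngram_lists:
        counts.update(ngrams)
    return counts


def extract_ngrams(text, n=2):
    """Run all four phases in order and return n-gram counts."""
    sentences = split_sentences(normalize(tokenize(text)))
    return reduce_counts(map_ngrams(s, n) for s in sentences)


counts = extract_ngrams("Prof. Nordgård spoke on 28-08-2013. It went well!")
```

Because n-grams are built per sentence, no n-gram spans a sentence boundary, which is one plausible reason for grouping tokens into sentences before extraction.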
For Lingit, the cost and time savings of using Elastic MapReduce are substantial. Building their own infrastructure for getting similar results in the same amount of time would require considerable upfront investment. What is more, the purchased hardware would sit idle most of the time, as language model building is a fairly infrequent operation.
“Cloud computing is a perfect solution for a company such as ours,” said Prof. Torbjørn Nordgård, CEO of Lingit. “With Amazon Web Services, we can experiment with different approaches towards statistical language model building and get results in a short amount of time. It doesn’t cost a lot either. Ultimately, this helps us to innovate and to keep improving our products.”
Amazon Case Study: Atbrox and Lingit
Date: 28-08-2013