Normalization algorithms for SMS messages
Research Methods Assignment 2
Normalization algorithms in SMS text messages
The Short Message Service (SMS) is a text-based mobile communication system in which users can send each other messages of up to 160 characters wirelessly. The culture surrounding text message conversation tolerates relaxed spelling and grammar and encourages extensive word abbreviation.
This informal style of communication differs from post-edited, text-based messages such as letters, news feeds and, to some extent, email. Those forms are much more formal, and the author is expected to take care that the text is free of errors.
The motivation behind the study of normalization of SMS text messages is not to alter the culture or habits of the people who send SMS messages, but to aid the processing of SMS textual data in computerized systems, which will lead to the development of the applications outlined in this paper.
The applications for normalized SMS are numerous, and those described below are only a sample of what is possible.
2.1 SMS to voicemail
An application outlined by Farrujia (2003) was the ability of mobile devices to send SMS text messages to landline handsets which had audio capabilities only. The message would be converted into audio, and played down the phone line once the receiver picked up the phone, or the telephone system accepted it as voicemail.
A similar application is described by Ghini et al. (2000), although their work focuses on the higher-level software architecture rather than the underlying implementation. Furthermore, Farrujia concentrates more on the accessibility features of the system than on the data retrieval possibilities discussed by Ghini et al.
This could handle text messages being sent accidentally to landline numbers. In other cases, it could be useful where the sender cannot send a verbal message due to a noisy environment or speech disability.
2.2 SMS for the visually impaired
Furthering the application of SMS to audio conversion, it would also be useful for people with eyesight problems, whereby they could press a button to have any message they received spoken out loud.
It would also be useful for people not familiar with the idiosyncrasies of text message abbreviations. A message such as “ru ok 2nite? Wld u like 2 c me?” would be expanded to “Are you ok tonight? Would you like to see me?”
2.3 SMS Text Compression
SMS text messages are currently limited to 160 characters. However, text can be compressed very effectively with lossless techniques such as Lempel-Ziv and Huffman coding. This could potentially expand the transmittable message to 320 characters within one message.
In order to avail of techniques such as dictionary compression, the majority of words in the message must be in the English dictionary. To explain: the English language has approximately 20,000 words in common usage. Two 8-bit characters give 65,536 distinct codes, enough to cover those words plus common proper names and place names, and potentially an additional language other than English.
Using dictionary compression, the message could be compressed to half its size or less by replacing whole words with 2-character codes. However, this is not possible if a large percentage of the words in the message are outside the dictionary. SMS normalization could correct this.
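The idea of replacing whole words with 2-character codes can be sketched as follows. This is a minimal illustration, not a production codec: the word list WORDS is a hypothetical toy dictionary, the function names are invented for this example, and it is assumed that sender and receiver share the same list and that words are separated by single spaces.

```python
import struct

# Hypothetical toy dictionary (assumption). A real system would share a list
# of tens of thousands of words between sender and receiver.
WORDS = ["are", "you", "ok", "tonight", "would", "like", "to", "see", "me"]
WORD_TO_CODE = {word: code for code, word in enumerate(WORDS)}


def compress(message):
    """Replace each dictionary word with a 2-byte big-endian code.

    Raises KeyError if a word is not in the dictionary, which is why the
    message must be normalized first.
    """
    codes = [WORD_TO_CODE[word] for word in message.lower().split()]
    return struct.pack(">%dH" % len(codes), *codes)


def decompress(payload):
    """Map each 2-byte code back to its dictionary word."""
    count = len(payload) // 2
    codes = struct.unpack(">%dH" % count, payload)
    return " ".join(WORDS[code] for code in codes)
```

With this toy dictionary, the 18-character message “are you ok tonight” becomes 8 bytes, while an unnormalized message such as “ru ok 2nite” cannot be encoded at all.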
2.4 SMS-based natural language interfaces
SMS-based interfaces are very simplistic at present. They usually consist of messages such as “WIN A” or “WIN B” instead of natural phrasing such as “I want contestant A to win”. Because SMS-based interfaces lack decent language processing, whenever an SMS-based service is advertised, it must include explicit instructions on exactly how to send the message.
With SMS normalization, it should be possible to tokenize the message being sent and to extract key phrases or expressions to determine the sender’s intentions. Tokenization is dealt with in more detail in the next section.
3.0 Tokenization
Tokenization is the process of dividing a block of text into sub units, such as sentences. These sub units can be further tokenized into smaller sub units such as words. In English, sentences are delimited by the period or question mark, but in more informal text a sentence end can also be marked by a block of white space, such as a new line character, or by a hyphen. Words in English are delimited within sentences by the space character or by a hyphen. Tokenization is required for message normalization so that each word can be analyzed independently.
Where message space is limited, as in the case of SMS messages, tokenization becomes more difficult, because people will intentionally or unintentionally omit spaces and full stops for the sake of brevity. In some other languages, such as Chinese, the notion of word delimiters is much vaguer, as words and phrases can be grouped together more freely. German causes similar problems. For instance, the word for SMS message is often written as a composite of two nouns, “SMSSprüche”, which would need to be tokenized into separate words. A technique based on the Viterbi algorithm was developed to assist with this problem.
As discussed in Clark (2003), this technique analyzes each letter in a word or word composite with a view to the probability of the following letter completing the word. For instance, where a space has been omitted in the string “whatif”, no letter can complete the string “whati” to a valid word, but “what” on its own is a complete word. The technique allows for misspellings within word composites, which makes it particularly powerful for this application.
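The splitting of “whatif” can be illustrated with a Viterbi-style dynamic programme. The sketch below is a simplification and not Clark's model: it scores whole words rather than letters, it does not handle misspellings, and the probabilities in WORD_PROB are hypothetical values chosen for the example.

```python
import math

# Hypothetical unigram word probabilities (assumption). A real system would
# estimate these from a corpus.
WORD_PROB = {"what": 0.30, "if": 0.30, "i": 0.20, "hat": 0.10,
             "see": 0.05, "you": 0.05}


def segment(text, max_word_len=12):
    """Return the most probable split of `text` into known words."""
    n = len(text)
    # best[i] holds (log-probability, start index of the last word)
    # for the best segmentation of text[:i].
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in WORD_PROB and best[start][0] > -math.inf:
                score = best[start][0] + math.log(WORD_PROB[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [text]  # no valid split found, so leave the token untouched
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

Here segment("whatif") yields ["what", "if"], because no dictionary word covers “whati” while “what” followed by “if” accounts for the whole string.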
The process of tokenization is carried out mainly with regular expressions (regex). A regex uses a symbolic pattern to describe how a string should be parsed. As a simple example, the pattern “\s+” splits text on white space, while “\w+” matches the individual words.
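A minimal sketch of regex-based tokenization, using Python's standard re module, is shown below. The function names and the choice of sentence delimiters are assumptions made for illustration; hyphens and other informal delimiters are not handled.

```python
import re


def split_sentences(text):
    """Split on white space that follows ., ? or !, and on new lines."""
    parts = re.split(r"(?<=[.?!])\s+|\n+", text)
    return [part.strip() for part in parts if part.strip()]


def split_words(sentence):
    """Return the runs of letters and digits in a sentence."""
    return re.findall(r"\w+", sentence)
```

For example, split_sentences("ru ok 2nite? Wld u like 2 c me?") returns two sentences, and split_words("ru ok 2nite?") returns ["ru", "ok", "2nite"].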
4.0 Spelling correction
Spelling correction is, quite simply, the conversion of words which are not in a larger set of allowable words (a dictionary) into similar words which are. Care must be taken when converting words not to lose the original meaning of the sentence. For instance, the word “John” is not in the dictionary, and it could be a misspelling of “Join”. However, the capital “J” would point to a proper name, assuming it was not the first word in the sentence. Furthermore, if “John” is followed by a verb, it is likely to be a proper name, but if it is followed by a determiner or numeral such as “a”, “two” or “three”, it is most likely a misspelling of “Join”. This contextual spelling correction is described towards the end of this section.
Spelling errors arise in one of two ways: from mistyping a word, or from the author genuinely not knowing the correct spelling. When a word is mistyped, the most common mistake is for two letters to be exchanged, as in “teh” for “the”. Letters can also be omitted, or replaced by letters that are adjacent on the keyboard or, on a mobile phone, on the same key.
Where a message author misspells a word through not knowing its correct spelling, certain patterns emerge. The most common mistake is for phonetically similar letter groups such as “ant” and “ent” to be interchanged. For instance, “apparent” is often misspelled as “apparant”. However, where the word would be pronounced differently, the interchange is less likely, as with “applicant” and “applicent”. Message authors tend to make the same error consistently in subsequent messages, so a case could be made for some user tracking and error recording to help resolve future errors faster.
A line must be drawn between what is a misspelling and what is quite obviously not a word. For instance, in a message containing “My password is ujikol”, it is futile to try to convert “ujikol” to an English word, since it is too far removed from anything similar. Two measures can be used to judge similarity between words. One is known as “Soundex”, where the phonetic representations of the words are compared rather than the letters. Another, the Levenshtein distance, counts the substitutions, deletions and insertions required to turn one word into another. In this way, the Levenshtein distance between “Camel” and “Hippopotamus” is so high that it is obvious that a user who typed “Camel” really is talking about camels and did not mean to type “Hippopotamus”.
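The Levenshtein distance can be computed with a standard dynamic programme, sketched below. The suggest helper, its max_distance threshold of 2 and the sample dictionary are assumptions added for illustration of how a cut-off separates misspellings from non-words.

```python
def levenshtein(a, b):
    """Count the substitutions, deletions and insertions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or None if nothing is close enough.

    The threshold of 2 is a hypothetical choice for this sketch.
    """
    best = min(dictionary, key=lambda candidate: levenshtein(word, candidate))
    return best if levenshtein(word, best) <= max_distance else None
```

Note that plain Levenshtein distance has no transposition operation, so the exchange in “teh” counts as two edits. With the dictionary ["the", "apparent", "join"], suggest("teh", ...) returns “the”, while suggest("ujikol", ...) returns None because no entry is within two edits.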
Brill and Moore (2000), whilst working for Microsoft Research, developed a probabilistic mechanism for measuring the chance of misspellings within a word, and thus provided a more accurate way of estimating correct spellings than Levenshtein distance. It was based on computing the probability of letter groups becoming interchanged and then extrapolating this upwards to measure the similarity between words. For instance, “f” and “ph” are often interchanged because of their phonetic similarity, as are “ant” and “ent”. Therefore, while a Levenshtein-based system would judge “elefent” and “elephant” too far apart to be the same word, Brill and Moore’s system correctly identifies the similarity. The shortcoming of their system, as pointed out by Clark (2003), is that it does not attempt to correct words that are mistyped as other words which, by chance, appear in the dictionary.
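The effect of treating letter groups as single edits can be illustrated with a weighted edit distance. This is a simplified sketch and not Brill and Moore's actual model, which learns probabilities from data: the costs in CHEAP_EDITS are hypothetical, and ordinary single-letter edits are given a fixed cost of 1.

```python
# Hypothetical costs for commonly confused letter groups (assumption).
# A lower cost means the confusion is considered more likely.
CHEAP_EDITS = {("f", "ph"): 0.2, ("ph", "f"): 0.2,
               ("ent", "ant"): 0.2, ("ant", "ent"): 0.2}


def weighted_distance(typed, intended):
    """Edit distance in which listed letter-group swaps are cheap."""
    rows, cols = len(typed) + 1, len(intended) + 1
    dist = [[float("inf")] * cols for _ in range(rows)]
    dist[0][0] = 0.0
    for i in range(rows):
        for j in range(cols):
            if i > 0:  # delete one letter
                dist[i][j] = min(dist[i][j], dist[i - 1][j] + 1.0)
            if j > 0:  # insert one letter
                dist[i][j] = min(dist[i][j], dist[i][j - 1] + 1.0)
            if i > 0 and j > 0:  # match or substitute one letter
                step = 0.0 if typed[i - 1] == intended[j - 1] else 1.0
                dist[i][j] = min(dist[i][j], dist[i - 1][j - 1] + step)
            # Replace a whole letter group at a reduced cost.
            for (src, dst), cost in CHEAP_EDITS.items():
                if typed[:i].endswith(src) and intended[:j].endswith(dst):
                    previous = dist[i - len(src)][j - len(dst)]
                    dist[i][j] = min(dist[i][j], previous + cost)
    return dist[-1][-1]
```

Under these assumed costs, “elefent” and “elephant” are 0.4 apart (one “f” to “ph” swap and one “ent” to “ant” swap), compared with three unit-cost edits under plain Levenshtein distance.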
An issue that has been overlooked so far is the speed at which these dictionary searches are made. In principle, every misspelled word could be compared with every word in the English dictionary to compute its Levenshtein distance. However, scanning 20,000 words for every misspelling is excessive. According to Clark (2003), and popular opinion, the fastest way to search this volume is to use what is known as a prefix tree acceptor (a trie). This means that when looking for the word “Xylophone”, you do not start reading from “Aardvark”. Instead, you start at “X”, cutting the search to roughly one twenty-sixth of the dictionary, then narrow it to “XY”, and so forth.
The more advanced form of spelling correction, more commonly known as grammar correction, handles words that have been misspelled as other valid words, such as “from” and “form”. In this case, grammatical rules can be used to check the correct order of nouns, verbs, adjectives and so on.
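A prefix tree can be sketched as nested dictionaries keyed by letter, so that a lookup follows one branch per letter instead of scanning the whole word list. The class and method names below are assumptions for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a letter to the next node
        self.is_word = False  # True if a dictionary word ends here


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def _walk(self, prefix):
        """Follow the prefix letter by letter; return None if it falls off."""
        node = self.root
        for letter in prefix:
            node = node.children.get(letter)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None
```

Looking up “xylophone” in Trie(["aardvark", "xylophone"]) visits only the “x” branch, then “xy”, and so on, taking at most one step per letter of the word.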
5.0 Punctuation correction
Informal communications such as SMS text messages, personal email and instant message conversations are prone to excessive punctuation. Clark (2003) compared text extracted from news feeds and other formal, post-edited sources with text from a corpus of USENET postings, which are much less formal. He found huge numbers of repeated exclamation marks and periods in the USENET text, used to express either excitement or suspense. Both features were nonexistent in the news feeds.
Although excessive punctuation does not cause a problem for the applications proposed in this paper, it could be used to set the tone of the text to speech engine. So if a “= )” symbol is found, instead of reading “equals, close parenthesis”, the speech engine should say the preceding sentence in a lilting, upbeat fashion.
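One way to use punctuation as a tone hint is sketched below: emoticons are removed and mapped to a tone label, and runs of repeated punctuation are collapsed. The emoticon table, the tone labels and the function name are hypothetical choices for this example, not part of any particular speech engine.

```python
import re

# Hypothetical emoticon-to-tone table (assumption).
EMOTICON_TONES = {"=)": "upbeat", ":)": "upbeat", ":(": "downbeat"}


def normalize_punctuation(text):
    """Return (cleaned text, tone label) for a raw message."""
    tone = "neutral"
    for emoticon, label in EMOTICON_TONES.items():
        # Allow optional white space inside the emoticon, as in "= )".
        pattern = r"\s*".join(re.escape(ch) for ch in emoticon)
        if re.search(pattern, text):
            tone = label
            text = re.sub(pattern, "", text)
    if tone == "neutral" and re.search(r"!{2,}", text):
        tone = "excited"
    # Collapse runs such as "!!!" or "...." to a single mark.
    text = re.sub(r"([!?.])\1+", r"\1", text)
    return text.strip(), tone
```

For instance, normalize_punctuation("See you soon!!! = )") gives the text “See you soon!” with the tone “upbeat”, which a speech engine could then use to choose its intonation.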
6.0 Text to speech
Text to speech processing (TTS) is a process by which textual data is converted into a synthesized human voice. This process involves extracting the phonemes within words and looking them up in a database of audio equivalents.
Care must be taken when converting letter groups to audio, since their pronunciation sometimes depends on the containing word. For instance, “ight” is generally said as “yte”, as in “Bright”, “Fight”, “Height”, “Light”, “Might” and “Night”, except in the case of “eight”, where it becomes “ate”.
Other problems occur where the word changes pronunciation based on the context of the containing sentence. For example, “We use polish to clean our floors”, and “There are many polish people in Germany” etc.
Word expansion is a technique discussed by Farrujia (2003), whereby acronyms and shortened words such as “cu soon” would be expanded to “see you soon”. Also, numerical information such as “12:30” would need to be expanded to “twelve thirty”.
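Word expansion of this kind can be sketched as a table lookup plus a rule for clock times. The sketch below is not Farrujia's implementation: the abbreviation table is a hypothetical sample, the lookup ignores context (so “2” always becomes “to”), and expanded words are emitted in lower case.

```python
import re

# Hypothetical sample of abbreviations (assumption).
ABBREVIATIONS = {"cu": "see you", "ru": "are you", "2nite": "tonight",
                 "wld": "would", "u": "you", "2": "to", "c": "see"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty"]


def number_to_words(n):
    """Spell out an integer from 0 to 59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def expand_time(match):
    """Turn a matched clock time such as 12:30 into spoken words."""
    hours, minutes = int(match.group(1)), int(match.group(2))
    if minutes == 0:
        return number_to_words(hours) + " o'clock"
    if minutes < 10:
        return number_to_words(hours) + " oh " + number_to_words(minutes)
    return number_to_words(hours) + " " + number_to_words(minutes)


def expand(message):
    """Expand clock times first, then known abbreviations."""
    message = re.sub(r"\b(\d{1,2}):([0-5]\d)\b", expand_time, message)
    return re.sub(r"\w+",
                  lambda m: ABBREVIATIONS.get(m.group(0).lower(), m.group(0)),
                  message)
```

With this table, expand("cu soon") gives “see you soon” and expand("cu at 12:30") gives “see you at twelve thirty”. Deciding whether “2” means “to”, “too” or “two” would need the contextual techniques discussed under spelling correction.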
A criticism of the work by Farrujia is that no reference is made to existing TTS systems, which would permit most of this functionality to be implemented without reinventing the wheel. Systems such as the Microsoft Speech Application Programming Interface are outlined in Reid (2004).
It is expected that the 160-character limit of SMS messages will shortly be supplanted by the ability to send email from mobile phones. Email is not only unlimited in length but also more compatible with existing IT infrastructure. With more space to write messages, the use of abbreviations will decline. However, people will still use them to speed up the process of composing messages on a limited keypad. The falling cost of sending SMS messages will also contribute to this decline, although it may actually lead people to send shorter messages more often.
An emerging technology in mobile phones is T9 predictive text. The approach was initially developed for people with physical disabilities who did not have the dexterity to use the standard 100-key keyboard and needed to form words with four keys. It was found that the English dictionary could be covered effectively with a set of 9 keys, and thus T9 was born. T9 reduces the number of misspellings sent in text messages, but can lead to unusual errors. For instance, “Stale” and “Quake” have the same key combination, yet their Levenshtein distance is so high that standard spell checkers would fail to correct the error.
Mobile devices are also coming equipped with new input devices, such as retractable keyboards and handwriting recognition, which should allow people to write longer and more correct SMS messages without taking too much time to compose them. Handwriting recognition may, however, introduce its own variety of errors, with similar letters such as “j” and “i” becoming interchanged.
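The collision between “Stale” and “Quake” can be checked with a small sketch of the standard telephone keypad layout. The function names and the sample word list are assumptions for illustration; this is not the actual T9 implementation.

```python
# Letters printed on keys 2 to 9 of a standard telephone keypad.
T9_KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
           "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {letter: key
                 for key, letters in T9_KEYS.items()
                 for letter in letters}


def t9_sequence(word):
    """Return the key presses needed to enter `word` with predictive text."""
    return "".join(LETTER_TO_KEY[letter] for letter in word.lower())


def t9_candidates(sequence, dictionary):
    """Return every dictionary word entered with the given key sequence."""
    return [word for word in dictionary if t9_sequence(word) == sequence]
```

Both t9_sequence("Stale") and t9_sequence("Quake") give “78253”, so a normalizer aware of the keypad could treat words sharing a key sequence as candidate corrections for one another, regardless of their edit distance.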
The offshoots of all of these advances are more SMS enabled services and products for the consumer. Eventually, you will be able to text “Please renew my car insurance” to your bank, and have an automated agent settle your bills for you.
Clark, A. (2003). Pre-processing very noisy text. Geneva, Switzerland.
Brill, E. and Moore, R.C. (2000). An improved error model for noisy channel spelling correction. Proceedings of ACL 2000.
Farrujia, P.J. (2003). Text to Speech Technologies for Mobile Technology Services. University of Malta.
Ghini, V., Pau, G. and Salomoni, P. (2000). Integrating notification services in computer network and mobile telephony. Bologna, Italy.
Reid, F. (2004). Network programming in .NET, with C# and Visual Basic .NET. Oxford, UK.