Archive for November, 2005

Aggregated XML based dissemination of Travel Industry Services

Research Methods

Assignment 3: Developing a Research Proposal



Aggregated XML based dissemination of Travel Industry Services



            Author: Fiach Reid

            Course: Msc Computers and International Business

            Module: Research Methods

            Date: 28/11/2005







The travel industry was one of the earliest adopters of electronic commerce. Since the 1980s, global distribution systems (GDS) have been in place to enable availability checks and remote reservation of flights and hotels over modem connections. Unfortunately, much of this legacy technology is still extremely prevalent, despite being outdated.


The overall objective of this proposed research is to introduce a more modern alternative to the legacy GDS systems available today. This alternative system is based on the notion of service aggregation, which provides backwards compatibility with older systems together with the flexibility to be expanded to cover any conceivable travel service.


It is proposed that both qualitative empirical measurements and quantitative surveys would be carried out in order to assess customer acceptance of the new system once deployed.



1.0   Introduction


The basis of this research is the idea of data aggregation. In essence, this means that data streams of various formats are combined into one source of data in a single format.


Information aggregation could be used, for instance, to combine data from various stock exchanges, such as the NASDAQ and FTSE, so that a customer could check prices of shares from both UK and US companies in the same place. In this context, we envisage the use of an aggregation service for travel agency GDS systems, aggregated with data extracted from websites. This would mean that a customer could check prices from Ryanair and Air France in the same place.


The benefits of such a service to the travel industry may not be immediately apparent, but it fundamentally affects the way in which people book flights, how travel agents sell flights, and how travel service providers such as airlines and hotel chains disseminate their availability information.


From a customer’s perspective, it would mean a faster and more streamlined holiday booking process. They could visit one website, and be sure that they were fully informed of their flight options before booking. Currently, if you visit a travel website, you will oftentimes be given the option to book from one airline only, or at best from one GDS network. The latter would not include airlines that are not part of that particular GDS network, which would notably exclude low-cost airlines such as Ryanair and easyJet.


Current GDS systems offer an option to add a margin to offered flights, from which travel agents make their profit. The proposed aggregated solution would also provide a mechanism for agents to add either a percentage or a flat-rate markup to travel products. It should also make the sales process more streamlined for travel agents because, as discussed later in this proposal, there would be a function to calculate multi-leg journeys combining flights from different operators. With current technology, travel agents would need to manually calculate multi-leg combinations if the airline data was coming from different sources (i.e. website and GDS).
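The two markup styles mentioned above could be sketched as follows. This is only an illustration: the function name `apply_markup` and its parameters are assumptions made for this example, and are not taken from any GDS interface.

```python
from decimal import Decimal, ROUND_HALF_UP


def apply_markup(net_fare, percent=0, flat=0):
    """Return the selling price for a net fare.

    percent is a percentage markup (5 means 5%) and flat is a fixed
    amount in the same currency. Both names are hypothetical. Pass
    strings or integers rather than floats to keep the arithmetic exact.
    """
    net_fare = Decimal(net_fare)
    multiplier = Decimal(1) + Decimal(percent) / Decimal(100)
    price = net_fare * multiplier + Decimal(flat)
    # Round to two decimal places, as for most currencies.
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

An agent could then apply a 5% margin with `apply_markup("100.00", percent=5)` or a flat fee with `apply_markup("80", flat="7.50")`.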


Travel service providers such as airlines and hotels could also benefit from this system, through an easier technical integration process with their own internal booking systems. Since the proposed solution is based on modern web service technology which is natively supported in many programming languages, it should be easier to integrate than older legacy systems, which rely on proprietary protocols.


1.1    Problem statement


Aggregation systems do currently exist and, to be fair, GDS systems themselves are made up of data feeds from proprietary booking systems run by airlines. Amadeus GDS comprises availability information from 742 airlines, of which 441 [1] can be sold on-line via the Amadeus interface. This is by no means all of the airlines in the world.


Other GDS systems exist, such as Galileo, Sabre, and WorldSpan. These all have proprietary interfaces, and are not particularly interoperable. Efforts have been made to create modern XML interfaces to these systems, and this has resulted in a plethora of new standards and technologies, none of which has become the industry standard. The proposed system would be designed to be backwards compatible, so that it could facilitate a more gradual migration for customers.


In an ideal world, this information would be free. However, all of the above GDS systems are quite expensive to run, and are generally charged on a per-query basis. This means that every time a potential customer asks for flight information in a travel agent, the travel agent is charged to perform a GDS lookup regardless of whether a sale is made. The system outlined in this document may also require payment for transactions made, especially if the back-end system were making a lookup to a pay-per-query GDS system. However, a distinction could be made between free-to-query and paid-query searches.


GDS providers not only charge their travel agents, but airlines also pay a set-up fee to be integrated with their service. This fee is prohibitive to low-cost airlines, and thus many opt not to be listed, and sell direct to the consumer via their website. In this system, a facility would be provided for the airline to provide its data freely to the service, or else allow third parties to provide middleware between the airline’s website and the aggregation service.


1.2    Proposed research


The research proposed is the design and implementation of a prototype aggregation system which would offer all the functionality of the existing GDS systems. This would include:

                     Flight availability & booking

                     Hotel availability & booking

                     Negotiated fares (Bulk, or otherwise discounted fares privately negotiated between agents and suppliers)

                     Holiday package reservation

                     Miscellaneous leisure services (car hire, exhibitions, concerts, etc.)


Being able to pay for flight and hotel reservations via the system would be a key feature. This could be implemented either by associating each transaction with an account and settling accounts through standard processes, or by carrying out transactions via the system using credit card payments or direct debit authorizations.


The system would also require functionality to automatically combine flights, so for instance if somebody wanted to fly from Belfast to Toronto, the system could automatically choose a route via London, based on price and time availability.
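A rough sketch of how flights might be combined automatically is given below. It is an exhaustive search over a small flight list, not a proposal for the production algorithm. The `Flight` fields, the 60-minute minimum connection time and the three-leg limit are all assumptions made for illustration.

```python
from typing import NamedTuple


class Flight(NamedTuple):
    origin: str
    destination: str
    depart: int   # minutes since midnight (simplified clock, same day)
    arrive: int
    price: float


def cheapest_route(flights, origin, destination, max_legs=3, min_connection=60):
    """Try every combination of up to max_legs connecting flights.

    Returns (total_price, [Flight, ...]) for the cheapest feasible
    itinerary, or None if the destination cannot be reached.
    """
    best = None

    def extend(airport, ready_at, legs, total, visited):
        nonlocal best
        if airport == destination and legs:
            if best is None or total < best[0]:
                best = (total, list(legs))
            return
        if len(legs) == max_legs:
            return
        for f in flights:
            if f.origin != airport or f.destination in visited:
                continue
            if f.depart < ready_at:
                continue  # the connection would be missed
            legs.append(f)
            extend(f.destination, f.arrive + min_connection, legs,
                   total + f.price, visited | {f.destination})
            legs.pop()

    extend(origin, 0, [], 0.0, {origin})
    return best
```

Because every combination is tried, this only suits small data sets. A real system would need a proper shortest-path search that also handles dates and time zones.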

Expandability would need to be a key feature of the system, to allow different types of miscellaneous travel services to be booked. This would allow new forms of transport, such as trains, buses and taxis, to be added as the system gained popularity.


In order to ease the transition for agents to the new system, backwards compatibility would be a key issue. This would mean that software which was designed for use with Amadeus could be easily patched to work with the new system, without major overhaul, or investment in re-training staff or purchasing of new hardware or software.


Security and authentication would also be necessary, since certain aspects of the service may be pay-per-query operations. Also, since some agents could buy flights and be invoiced at a later date, it would be necessary to ensure that unauthorized persons could not book flights on another’s behalf.


Continuing the theme of authentication, it would be necessary to ensure that the system was purged of any bogus data. This could occur when a travel service provider accidentally or purposefully feeds data into the system which contains inaccurate information. It would be envisaged that rules could be applied to the data as it enters the system, and if the information fell outside the normal parameters it could be quarantined. For instance, it is impossible to fly from London to New York in one hour, and it will probably cost somewhere in the region of £100 to £1,000. Anything outside of these parameters could be sent back to the travel service provider (airline) to be cross checked manually.
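The quarantine rule described above could look something like the following sketch. The price range comes from the London to New York example in the text. The six-hour minimum duration, the route codes and all function and field names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FareRecord:
    origin: str
    destination: str
    duration_minutes: int
    price_gbp: float


# Hypothetical per-route limits; real limits would come from historical data.
ROUTE_RULES = {
    ("LON", "NYC"): {"min_minutes": 360, "min_price": 100.0, "max_price": 1000.0},
}


def check_record(record, rules=ROUTE_RULES):
    """Return a list of rule violations; an empty list means the record passes."""
    rule = rules.get((record.origin, record.destination))
    if rule is None:
        return []  # no rule known for this route, so nothing to check
    problems = []
    if record.duration_minutes < rule["min_minutes"]:
        problems.append("duration too short")
    if not rule["min_price"] <= record.price_gbp <= rule["max_price"]:
        problems.append("price outside expected range")
    return problems


def partition(records, rules=ROUTE_RULES):
    """Split incoming records into (accepted, quarantined)."""
    accepted, quarantined = [], []
    for record in records:
        target = quarantined if check_record(record, rules) else accepted
        target.append(record)
    return accepted, quarantined
```

Quarantined records would then be returned to the supplier for manual cross-checking, as described above.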


Since the data would be fed live from many different sources with varying latencies, it would be necessary for the system to respond in an asynchronous manner. This would permit any “slow” sources to return data later in the feed than “faster” sources. The XML would be structured so that partial functionality would be available, even with incomplete feeds. This would also allow for degraded functionality in the event of corrupted feeds.
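The asynchronous behaviour could be prototyped along the following lines. Here `query_source` is only a stand-in that simulates a supplier feed with a delay; the source names, delays and the timeout value are invented for the example.

```python
import asyncio


async def query_source(name, delay, fares):
    """Stand-in for one supplier feed; delay simulates network latency."""
    await asyncio.sleep(delay)
    return name, fares


async def aggregate(sources, timeout=1.0):
    """Collect results in completion order; sources that time out are skipped."""
    tasks = [asyncio.create_task(query_source(*s)) for s in sources]
    results = []
    try:
        for finished in asyncio.as_completed(tasks, timeout=timeout):
            # A real service could stream each result to the client here.
            results.append(await finished)
    except asyncio.TimeoutError:
        pass  # keep the partial results gathered so far
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished
    return results
```

Fast sources appear first in the result, and a source that never answers within the timeout simply leaves the response incomplete rather than blocking it.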


A call back feature would also be useful. In this way it would be possible for airlines to notify Travel Agents of upcoming price reductions, and other special offers.

2.0   Background


Currently the most common aggregation systems for flight and travel information are GDS systems. There are four major GDS systems available: WorldSpan, Amadeus, Galileo, and Sabre.


Sabre is the largest GDS system in America by market share (Wiig, 2004), and drives many high-profile websites. The system operates over either a proprietary protocol named Sabre SDS (Structured Data Stream) or the older X.25 protocol.


Galileo is the second most important GDS system after Sabre, with a market share of 22% as measured by the number of travel agencies having access to the system. Galileo has acquired a travel website which is well known worldwide. The system operates using either EDIFACT (Electronic Data Interchange for Administration, Commerce and Transport) or X.25.


Worldspan (Granados, 2003) provides the GDS system behind many high-profile travel websites, most notably, a travel service established by Microsoft. Worldspan is jointly owned by Delta, Northwest and American Airlines. Worldspan operates over MATIP (Mapping of Airline Traffic over Internet Protocol).


Amadeus is viewed as the European counterpart for Worldspan accounting for 60% market share in Europe. Amadeus is jointly owned by Air France, Iberia and Lufthansa. This GDS system operates over MATIP or Telnet.


Such a diverse range of protocols and communication types creates difficulties in standardising the various systems. Some of the protocols mentioned above are not Internet based (X.25, for instance), and thus lose out on many of the advancements in Internet connectivity, such as broadband DSL (Digital Subscriber Line) and wireless networks. Proprietary protocols also have difficulty passing through corporate firewalls, which will by default not recognise the traffic and block it.


There is a general trend of modern communications systems to leverage standard Internet technologies. In the field of business to business communications, XML (Extensible Markup Language) is becoming increasingly prevalent. The XML format is widely accepted in the industry, and is used for all types of transactions, from financial services to travel services.


Some initiatives do exist to aggregate the existing GDS systems to XML. The most notable of these are SITA (Société Internationale de Télécommunications Aéronautiques), OTA (Open Travel Alliance) and TORIX (Tour Operator Reservations in XML).


SITA is an aggregator for the four main GDS suppliers (Amadeus, Galileo, WorldSpan and Sabre). It also supplies data from smaller CRS (computer reservation systems) such as TravelSky and SHARES. It is based upon SOAP (Simple Object Access Protocol), which is the XML-based technology behind Web Services.


The OTA has a much wider range of members than SITA, including the four main GDS suppliers, and hundreds of other well-known airlines (American Airlines, Continental Airlines, Delta Airlines etc.) and hotel chains. They published their XML specification in 2002, and it is well known within the industry.


TORIX is a small Irish based initiative to aggregate data feeds into XML from tour operators. It currently runs on the TUI/Thomson network, and supplies this data to travel agents and retailers.


2.1    Literature review


When undertaking a project such as this, it is worthwhile looking at how other companies have avoided or mitigated the issue of GDS fees, and how they have dealt with the plethora of legacy technologies which comes along with them.


Orbitz is a good case in point. This company was set up jointly in 2001 by American, Continental, Delta, Northwest, and United (Granados, 2003). In order to avoid reliance on legacy system infrastructures and high GDS and CRS fees, they used a joint system, collecting flight schedules and flight tariffs separately.


They collected fares directly from the Airline Tariff Publishing Company, which collects and distributes fares from airlines worldwide, and schedules from OAG.


This is not an ideal situation, since it makes it more complex to collect airfares, and any inconsistency or latency issue between the two feeds could lead to incorrect information.


3.0   Proposed work


In order to provide such a system, the first step would be to sign up to each of the four GDS suppliers, in order to obtain live data feeds in whatever format they provided. These feeds would be read via their own proprietary protocol, reformatted, and then sent out to the client as XML over HTTP (Hypertext Transfer Protocol). It would be envisaged that Web Services would be used to provide the interface, to permit ease of use by clients.
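As a rough illustration of the reformatting step, the sketch below turns fare records that have already been read from a supplier into a single XML document. The element and attribute names (`Availability`, `Flight`, `Price` and so on) are placeholders invented for this example. They are not taken from the OTA specification or any other existing schema.

```python
import xml.etree.ElementTree as ET


def fares_to_xml(fares):
    """Serialise a list of fare dictionaries into one XML document (a str).

    Each dictionary is assumed to have the keys source, origin,
    destination, currency and price.
    """
    root = ET.Element("Availability")
    for fare in fares:
        flight = ET.SubElement(root, "Flight", source=fare["source"])
        ET.SubElement(flight, "Origin").text = fare["origin"]
        ET.SubElement(flight, "Destination").text = fare["destination"]
        price = ET.SubElement(flight, "Price", currency=fare["currency"])
        price.text = "%.2f" % fare["price"]
    return ET.tostring(root, encoding="unicode")
```

Because each record carries its own `source` attribute, results read from a GDS and results read from an airline website can sit side by side in the same response.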


Since GDS systems incur fees, and they do not include many budget airlines, it would be necessary to expand the system such that data from low-cost-airlines could also be collected, and used transparently with the rest of the system. This would be done by adding modules to the Web Service, so that where a request was received that contained a route that was serviced by an operator not included in GDS, it could make a request to that airline’s website, and parse the results returned.


The web site integration as described above could either be carried out by the airline itself, or by a third party, for a fee. This particular style of integration may cause difficulties as it depends on the airline’s website remaining static, and not changing structurally, or else the integration would need to be performed again.


In order to assess the success of such a system, it would be necessary to carry out customer satisfaction surveys and supplier acceptance surveys. These could be used to gauge whether the benefit to the consumer outweighs the inconvenience to the travel suppliers and agents, who would need to upgrade certain aspects of their IT infrastructure.


3.1    Aims & Objectives


The overriding aim of this project would be to provide a system which would encourage primary travel providers (airlines/hotels) to use modern protocols, instead of relying on “lowest-common-denominator” distribution systems.


The overall effect of this would be that it would reduce the monopoly position of the four major GDS systems. Increased competition would reduce costs for consumers and travel agents alike, and produce a more streamlined service for all parties involved.

3.2    Hypothesis & Rationale


The hypothesis behind this research is that the added benefit afforded by the functionality of an XML-based system would compensate for the inconvenience of the clients’ inevitable changeover between heterogeneous systems.


Software companies rarely produce applications that are backwards compatible with computers that are 10 years old. Although this would open the market to a slightly larger fraction of the population, it would fail to leverage the added functionality inherent in modern computers. By embracing new technology, the consumer base will upgrade to the new technology gradually, and will appreciate the extra functionality over time.


In the same way, primary travel industry providers need not rely on legacy systems to distribute their pricing and availability data to retailers and agents. Travel agents will also upgrade to the new technology.


To support the changeover, travel service providers should provide a dual system for a period of time, allowing customers a grace period to upgrade their systems. Support for the legacy system could be withdrawn gradually, and eventually terminated at a forewarned time.

3.3    Methodology


Evaluating a completed system would be a crucial part of the process. This would be better carried out while the system was still in its infancy, and not deployed over multiple sites worldwide. The buy-in or acceptance rate of travel service providers, and agents could be measured empirically. If no airline is interested in publishing its flight pricing and availability data through this system, it is not going to be of any added benefit to the consumer. Likewise, if no travel agent agrees to evaluate the service, it would be equally pointless, as the consumer would not be able to buy flights through the system.


Customer acceptance could also be measured through surveys, asking people what they thought of the system, and how easy they found it to use and to integrate with their existing IT systems. This information could be used to look for ways to improve the service, through features suggested in the surveys.

3.4    Preliminary Design of System


The system would be based on Web Services. These are software modules which reside on Web servers, which communicate with other software modules via XML over HTTP (Reid, 2003). The back-end of these software modules would contain interfaces to the four main GDS systems, plus be expandable to include custom modules, such as website-integrated airlines. 


In order to support the vast number of concurrent users who could potentially use the system, it is envisaged that a network of load-balanced servers would be used to handle the volume of traffic. Load-balanced servers are a cluster of machines that automatically delegate work between each other. A distributed database could also be used to maintain state information between these servers.


An important aspect of the system is its backwards compatibility with legacy GDS systems. The purpose of this backwards compatibility is so that software based on GDS could still continue to operate with minimal change. It would be envisaged that this would be implemented by encapsulating the GDS commands (protocol) in XML, so that the client application would simply wrap the outgoing requests in a predefined XML enclosure. Although this mode of operation would not provide the full functionality of the system, it would permit for ease of integration with legacy systems.
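The encapsulation idea can be sketched as follows. The envelope element names and the `gds` and `session` attributes are invented for this example, and the command is treated as an opaque string; no real GDS command syntax is assumed.

```python
import xml.etree.ElementTree as ET


def wrap_legacy_command(command, gds, session_id):
    """Wrap an unmodified legacy GDS command string in an XML envelope."""
    root = ET.Element("LegacyRequest", gds=gds, session=session_id)
    # The command is carried as text; special characters are escaped for us.
    ET.SubElement(root, "Command").text = command
    return ET.tostring(root, encoding="unicode")


def unwrap_legacy_command(document):
    """Server side: recover the target GDS name and the original command."""
    root = ET.fromstring(document)
    return root.get("gds"), root.findtext("Command")
```

A legacy client would only need to call something like `wrap_legacy_command` on its outgoing requests, while the server would unwrap the command and forward it to the relevant back-end module.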

3.5    Preliminary results

The author of this report is working closely with a company named Cheap Flights International to develop a means of aggregating integrated website scanning modules together. This product has a limited web service interface, which currently provides functionality to add flights to the distributed database, and to perform some basic utilities such as converting airport names to IATA (International Air Transport Association) codes.


Currently this service covers something near thirty different airlines, and has successfully retrieved details for over 30 million flights. Although this is nowhere near the scope of a typical GDS system with 400+ airlines, nor provides functionality to book flights, it does not incur GDS fees to operate.


Further research and development work would be required to provide the full functionality as outlined in section 1.2.

4.0   Summary


In summary, a typical travel agent that one would find in any high street would be using a GDS system which has changed little since the 1980s. These systems are outdated and fail to leverage modern advances in Internet technology such as broadband or wireless Internet. Furthermore, they are expensive to operate, since the companies running these GDS systems have a monopoly on the market. The costs are so prohibitive that some budget airlines opt not to be included in GDS networks, making it more difficult for agents and consumers to see the broader picture of all available flight options.


The solution proposed in this paper is the development of an aggregated XML system, which would combine all the features of the existing GDS networks with information collected freely from other sources, including the airlines’ websites themselves. The advantage of this system is that it is more modern, and thus takes advantage of recent developments in Internet technology and, hopefully, lowers costs for agents and consumers.

4.1    Significance of the research


It is inevitable that the travel industry will migrate away from legacy GDS systems. With so many initiatives to move to XML afoot, it seems likely that this is the way the industry will turn.


If the XML specification developed and demonstrated in this research was proven to be more encompassing or simpler to implement than other proposals, then it could be adopted worldwide.


The knock-on effects of this system could include quite significant development costs for larger operators, and widespread upgrades of legacy software at many travel agent outlets. However, the benefits would be immediately apparent. For instance, a typical X.25 connection runs at 64 kbps, whereas a DSL connection, which can be operated for the same cost, runs at speeds up to 4,000 kbps (roughly 60 times faster). This would mean that travel agents could search for and book flights more quickly, and thus have shorter queues of people waiting to be served.


4.2    Original contribution


This differs from many of the XML-based proposals now on the drawing board in that the system does not focus only on the integration of existing GDS networks, but also allows for free inclusion of airlines and operators who are not currently part of GDS. This would be achieved by reading the fares directly from the operator’s website. Research carried out by the author at Cheap Flights has already shown that this is possible and cost effective. With the inclusion of each new operator costing approximately £100, a GDS-scale deployment of around 440 airlines could be achieved for approximately £44,000, which is within the budget of medium-sized companies. This cost could be recouped through booking fees, sign-up fees or advertising.


Another original aspect of the system is that it is backwards compatible with the legacy GDS systems, to allow for gradual upgrades of legacy software. This is achieved through XML encapsulation of the underlying GDS protocols.

5.0   References


The following Internet resources were used for reference while writing this document. It should be noted that these documents may not have been peer reviewed, and thus the information may not be accurate.




United Nations – EDIFACT standard


5.1    Bibliography Section




Wiig, A. (2004). Denmark, Developing countries and the tourist industry in the internet age: The Namibian case


Reid, F. (2003), Donegal. Network programming in .NET, With C# and Visual Basic .NET

Categories: Uncategorized

Supporting the Mouse Wheel in VB6

VB6 does not natively support the mouse wheel (the roller in the middle of your mouse). Thus, any programs you develop with VB6 will not support the mouse wheel either. There is a workaround.
Some websites will show a "hacky solution", which can cause VB6 to crash every time you press the "stop debug" button. This approach is better.
1. Download the VB subclassing DLL (SSubTmr.dll)
- Add a reference to this in the project
2. Add this constant
Public Const WM_MOUSEWHEEL = &H20A
3. Add this to the top of the form
Implements ISubclass
4. Add these three functions to the code
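' Assumption: the form has registered for WM_MOUSEWHEEL, e.g. with SSubTmr's
' AttachMessage Me, Me.hWnd, WM_MOUSEWHEEL in Form_Load, and calls
' DetachMessage in Form_Unload. Check the DLL's documentation for details.
Private Property Let ISubclass_MsgResponse(ByVal RHS As EMsgResponse)
    ' Required by the interface; nothing to store.
End Property

Private Function ISubclass_WindowProc( _
    ByVal hwnd As Long, _
    ByVal iMsg As Long, _
    ByVal wParam As Long, _
    ByVal lParam As Long) As Long
    Dim rotation As Long
    ' The high word of wParam holds the wheel delta; its sign gives the direction.
    rotation = wParam / 65536
    If rotation > 0 Then SendKeys ("{UP}")
    If rotation < 0 Then SendKeys ("{DOWN}")
End Function

Private Property Get ISubclass_MsgResponse() As EMsgResponse
    ' Consume the message so it is not passed on to the default handler.
    ISubclass_MsgResponse = emrConsume
End Property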
And that’s it.

emailing a web page as plain text

I was looking for a way to send a web page as an email from a Windows Forms application. Quite easily done:

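// Uses System.Web.Mail (MailMessage, SmtpMail) and mshtml (HTMLDocument).
MailMessage email = new MailMessage();
HTMLDocument htmlPage = (HTMLDocument)WebBrowser.Document;
email.BodyFormat = MailFormat.Html;
email.Body = htmlPage.body.innerHTML;
email.From = "";
email.To = "";
email.Subject = "subject";
SmtpMail.SmtpServer = settings.SMTPServer;
SmtpMail.Send(email);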



Then I looked at how to send it as plain text, for which I used htmlPage.body.innerText. However, this rendered the links unusable, simply saying "Click here" with no address to follow. So I decided to render each link as its text followed by the URL in brackets, so that it could still be used. To do this, I could use extensive parsing and regular expressions, or this nifty piece of code:


// Append the target URL to the text of every link in the page.
foreach (IHTMLElement htmElement in htmlPage.links)
{
    IHTMLAnchorElement htmLink = (IHTMLAnchorElement)htmElement;
    htmElement.innerHTML = htmElement.innerText + " (" + htmLink.href + ")";
}

email.BodyFormat = MailFormat.Text;
email.Body = htmlPage.body.innerText;
SmtpMail.Send(email);


Nice eh?



SQLClient vs OleDb

I’ve read that SqlClient is more efficient than OleDb, but I was curious to know how much of a performance increase it would actually give in a real-world environment.
I plucked 40 SQL statements from a trace at random, and ran them through equivalent code for OleDb and SqlClient respectively:

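private void button1_Click(object sender, System.EventArgs e)
{
    // Native
    string strDSN = "Data Source=;Initial Catalog=CFSpain;User ID=;Password=;Connection Reset=FALSE;";
    SqlConnection conDb = new SqlConnection(strDSN);
    DateTime dtStart = DateTime.Now;

    // Split the pasted trace into batches on the "go" separator lines.
    string[] strSqlBenchmarks = Regex.Split(this.tbSQL.Text, "go\r\n");
    foreach (string strSqlBenchmark in strSqlBenchmarks)
    {
        try
        {
            SqlDataAdapter daDb = new SqlDataAdapter(strSqlBenchmark, conDb);
            DataSet dsDb = new DataSet();
            // Assumed: the call that ran the statement was lost from the listing.
            daDb.Fill(dsDb);
        }
        catch (Exception ex)
        {
            // Original handler not shown; failures are skipped so the run continues.
        }
    }

    // End time minus start time, so the elapsed value is positive.
    TimeSpan tsLength = DateTime.Now - dtStart;
    MessageBox.Show("Time elapsed:" + tsLength.TotalSeconds);
}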

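private void btnGeneric_Click(object sender, System.EventArgs e)
{
    // Generic
    string strDSN = "Provider=SQLOLEDB.1;Password=;Persist Security Info=True;User ID=;Initial Catalog=CFSpain;Data Source=;Connection Reset=FALSE;";
    OleDbConnection conDb = new OleDbConnection(strDSN);
    DateTime dtStart = DateTime.Now;

    // Split the pasted trace into batches on the "go" separator lines.
    string[] strSqlBenchmarks = Regex.Split(this.tbSQL.Text, "go\r\n");
    foreach (string strSqlBenchmark in strSqlBenchmarks)
    {
        try
        {
            OleDbDataAdapter daDb = new OleDbDataAdapter(strSqlBenchmark, conDb);
            DataSet dsDb = new DataSet();
            // Assumed: the call that ran the statement was lost from the listing.
            daDb.Fill(dsDb);
        }
        catch (Exception ex)
        {
            // Original handler not shown; failures are skipped so the run continues.
        }
    }

    // End time minus start time, so the elapsed value is positive.
    TimeSpan tsLength = DateTime.Now - dtStart;
    MessageBox.Show("Time elapsed:" + tsLength.TotalSeconds);
}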



The results were quite disappointing, with either technique taking somewhere between 7 and 11 seconds over the same set of SQL statements. Adding these results up over 8 tests each, I came up with a 5% improvement with SqlClient over the OleDb adapter.


In the particular project I was working on, this performance gain was too low to justify an overhaul of the system.



When trying to connect to SQL Server from Query Analyzer, I got the error SQL_HANDLE_ENV, with some reference to a problem with my ODBC drivers.
I decided to install MDAC 2.8, and it fixed the problem!

KB 871122 – Wireless Zero Configuration service

This is a quick fix to check whenever you see your wireless network is down.
Click Administrative Tools > Services > Wireless Zero Configuration service.
Right-click it, choose Properties, set the startup type to Automatic, then press Start.
You might also set the recovery options to restart the service.

KB834707 killed my internet connection

This is a pretty weird one, but it happened twice in the space of two days, so I thought I should share it.
Whenever I install Hotfix KB834707, which comes through Windows Update, my internet connection dies, and I have to use System Restore to get it back again.

Normalization algorithms for SMS messages

Research Methods Assignment 2

Normalization algorithms in SMS text messages

1.0         Abstract


The Short Message Service (SMS) is a text-based mobile communication system, whereby users can send messages of up to 160 characters in length to each other wirelessly. The culture which surrounds text message conversation allows for relaxed spelling and grammatical constraints, and for extensive word abbreviation.


This informal form of communication differs from post-edited text based messages, such as letters, news feeds, and to some extent, email. These types of communication are much more formal, and the author is obliged to use due diligence to ensure the text is free of errors.


The motivation behind the study of normalization of SMS text messages is not to alter the culture or habits of the people who send SMS messages, but to aid the processing of SMS textual data in computerized systems, which will lead to the development of the applications outlined in this paper.

2.0         Applications


The applications for normalized SMS are virtually endless, and the examples cited below are just a sample of what is possible.

2.1    SMS to voicemail


An application outlined by Farrujia (2003) was the ability of mobile devices to send SMS text messages to landline handsets which had audio capabilities only. The message would be converted into audio, and played down the phone line once the receiver picked up the phone, or the telephone system accepted it as voicemail.


A similar application is described by Ghini et al. (2000), however, this is based more on the higher level software architecture than the underlying implementation. Furthermore, Farrujia concentrates more on the accessibility features of the system than the data retrieval possibilities discussed by Ghini.


This could handle text messages being sent accidentally to landline numbers. In other cases, it could be useful where the sender cannot send a verbal message due to a noisy environment or speech disability.


2.2    SMS for the visually impaired


Furthering the application of SMS to audio conversion, it would also be useful for people with eyesight problems, whereby they could press a button to have any message they received spoken out loud.


It would also be useful for people not familiar with the idiosyncrasies of text message abbreviations. So a message such as “ru ok 2nite? Wld u like 2 c me?” would be expanded to “Are you ok tonight? Would you like to see me?”
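The expansion of the example message above can be imitated with a simple lookup table, as in the sketch below. This is only a toy: the table contains just the abbreviations from the example, and a real normalizer would have to deal with ambiguity (for instance “2” meaning “to”, “too” or “two”) and with sentence capitalization.

```python
import re

# Tiny illustrative lookup table covering only the example message.
ABBREVIATIONS = {
    "ru": "are you", "2nite": "tonight", "wld": "would",
    "u": "you", "2": "to", "c": "see",
}


def expand(message, table=ABBREVIATIONS):
    """Replace known abbreviations token by token, keeping punctuation."""
    def replace(match):
        word = match.group(0)
        full = table.get(word.lower())
        if full is None:
            return word  # unknown tokens are left untouched
        # Preserve a leading capital letter from the original token.
        return full.capitalize() if word[0].isupper() else full
    return re.sub(r"[A-Za-z0-9]+", replace, message)
```

Note that `expand` does not capitalize the start of the first sentence, so the output for the example begins “are you ok tonight?”.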

2.3    SMS Text Compression


SMS text messages are currently limited to 160 characters. However, text can be compressed very effectively with lossless compression techniques such as Lempel-Ziv and Huffman compression. This would expand the transmittable message to potentially 320 characters within one message.


In order to avail of techniques such as dictionary compression, the majority of words within the message must be within the English dictionary. To explain: the English language has approximately 20,000 words in common usage. Two 8-bit characters could represent 65,536 distinct words, which could also cover common proper names and place names, and potentially an extra language other than English.


Using dictionary compression, the message could be compressed to half its size or more, by replacing whole words with 2-character representations. However this is not possible if a large percentage of words in the message are outside the English dictionary. SMS normalization could correct this.
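A minimal sketch of the two-character-per-word idea is shown below. It assumes that sender and receiver share the same word list, and it deliberately fails on any word outside that list, which is exactly the situation normalization is meant to prevent. The function names are made up for this example.

```python
import struct


def build_codebook(words):
    """Assign each distinct word a 16-bit code (at most 65,536 entries)."""
    unique = sorted(set(words))
    if len(unique) > 0x10000:
        raise ValueError("a two-byte code can address at most 65,536 words")
    return {word: code for code, word in enumerate(unique)}


def compress(message, codebook):
    """Encode a space-separated message as big-endian two-byte word codes.

    Raises KeyError if a word is not in the codebook.
    """
    return b"".join(struct.pack(">H", codebook[w]) for w in message.split())


def decompress(data, codebook):
    """Reverse compress(); words are rejoined with single spaces."""
    reverse = {code: word for word, code in codebook.items()}
    codes = struct.unpack(">%dH" % (len(data) // 2), data)
    return " ".join(reverse[c] for c in codes)
```

With this scheme the 24-character message “would you like to see me” takes 12 bytes, whereas the unnormalized “wld u like 2 c me” cannot be encoded at all unless its abbreviations are in the word list.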

2.4    SMS based natural language interfaces


SMS based interfaces are very simplistic at present. They usually consist of messages such as “WIN A” or “WIN B”, instead of natural phrases such as “I want contestant A to win”. The lack of decent language processing in SMS based interfaces means that whenever an SMS based service is advertised, the advertisement must contain explicit information on exactly how to send the message.


With SMS normalization, it should be possible to tokenize the message being sent, and to extract key phrases or expressions to determine the sender’s intentions. Tokenization is dealt with in more detail in the next section.
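
A very small sketch of key phrase extraction for the voting example follows. The function name and the matching rule are hypothetical, and the approach is deliberately naive: it simply looks for a single letter directly after “win” or “contestant”.

```python
import re

def extract_vote(message):
    """Return the contestant letter a voting message refers to, or None."""
    tokens = re.findall(r"[a-z]+", message.lower())
    if "win" not in tokens:
        return None
    # Accept a single letter that directly follows "win" or "contestant".
    for previous, current in zip(tokens, tokens[1:]):
        if previous in ("win", "contestant") and len(current) == 1:
            return current.upper()
    return None

print(extract_vote("WIN A"))                        # prints: A
print(extract_vote("I want contestant B to win"))   # prints: B
```

A rule this simple is easily fooled (“I want to win a car” would be read as a vote for A), so a real interface would need proper language processing on top of the tokenized message.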

3.0         Tokenization


Tokenization is the process of dividing a block of text into sub-units, such as sentences. These sub-units can be further tokenized into smaller sub-units such as words. In English, sentences are delimited by a period or question mark, but in more informal text, a sentence end can be marked by a block of white space, such as a new line character, or by a hyphen. Words in English are delimited within sentences by the space character, or by a hyphen. Tokenization is required for message normalization, so that each word can be analyzed independently.


Where message space is limited, as in the case of SMS messages, tokenization becomes more difficult, because people will intentionally or unintentionally omit spaces and full stops for the sake of brevity. In some other languages, such as Chinese, the notion of word delimiters is much vaguer, as words and phrases can be grouped together more freely. German causes similar problems; for instance, the word for SMS message is often written as a composite of two nouns, “SMSSprüche”, which would need to be tokenized into separate words. A technique referred to here as Viterbi encoding, based on the Viterbi algorithm, can assist with this problem.


As discussed in Clark (2003), Viterbi encoding analyzes each letter in a word or word composite with a view to the probability of the following letter completing the word. For instance, where a space has been omitted in the string “whatif”, no letter can extend the string “whati” into a valid word, but “what” on its own is a complete word, so a word boundary is placed after it. Viterbi encoding allows for misspellings within word composites, which makes it particularly powerful for this application.
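
The following is a simplified sketch of the idea, not Clark’s method: it uses Viterbi-style dynamic programming over whole-word probabilities rather than per-letter probabilities, and it does not handle misspellings. The WORD_PROB table and its values are invented for the example.

```python
import math

# Hypothetical unigram probabilities; a real system would estimate these from a corpus.
WORD_PROB = {
    "what": 0.004, "if": 0.006, "i": 0.01, "hat": 0.0005,
    "see": 0.003, "you": 0.01, "soon": 0.002,
}

def segment(text, max_word_len=20):
    """Split text into the most probable sequence of known words.

    best[i] holds (log probability, words) for the best split of text[:i].
    Returns None when text cannot be split into known words.
    """
    best = [(0.0, [])] + [None] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if best[start] is None or word not in WORD_PROB:
                continue
            score = best[start][0] + math.log(WORD_PROB[word])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] else None

print(segment("whatif"))      # prints: ['what', 'if']
print(segment("seeyousoon"))  # prints: ['see', 'you', 'soon']
```

Working in log probabilities avoids multiplying many small numbers together, and keeping only the best split for each prefix is what keeps the search from exploring every possible combination of boundaries.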


Tokenization is carried out mainly with regular expressions (regex). A regex uses a symbolic pattern to describe how a string should be parsed. As a simple example, the pattern “\s+” splits a string on white space, while “\w+” matches the individual words.
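
A minimal sketch of regex-based tokenization is shown below, using Python’s standard re module. The two function names are illustrative, and the patterns are one reasonable choice among many; they assume plain ASCII apostrophes and treat hyphens as word delimiters, as described in the text.

```python
import re

def split_sentences(text):
    """Split on runs of . ? ! (with trailing white space) or on line breaks."""
    parts = re.split(r"[.?!]+\s*|\n+", text)
    return [part.strip() for part in parts if part.strip()]

def split_words(sentence):
    """Return the words in a sentence.

    \\w+ matches a run of letters, digits or underscores; the optional group
    keeps an apostrophe inside a word such as "don't".
    """
    return re.findall(r"\w+(?:'\w+)*", sentence)

print(split_sentences("Are you ok? See you soon.\nBye"))
# prints: ['Are you ok', 'See you soon', 'Bye']
print(split_words("don't split well-known words"))
# prints: ["don't", 'split', 'well', 'known', 'words']
```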

4.0         Spelling correction


Spelling correction is, quite simply, the conversion of words which are not in a larger set of allowable words (a dictionary) into similar words which are in the dictionary. Prudence must be used when converting words, so as not to lose the original meaning of the sentence. For instance, the word “John” is not in the dictionary, and it could be a misspelling of “Join”. However, the capital “J” would point to a proper name, assuming it was not the first word in the sentence. Furthermore, if the word “John” is followed by a verb, it is likely to be a proper name, but if it is followed by a determiner or numeral (“a”, “two”, “three”) it is most likely a misspelling of “Join”. This contextual spelling correction is described towards the end of this section.


Spelling errors occur in one of two ways: one is from mistyping a word, and the other is from the author genuinely not knowing the correct spelling of a word. When a word is mistyped, the most common mistake is for two letters to be exchanged, as in “teh” rather than “the”. Letters can also be omitted, or substituted with letters which are close to each other on the keyboard or, in the case of a mobile phone, on the same key.


Where a message author misspells a word because of a lack of knowledge of its correct spelling, certain patterns emerge. The most common spelling mistake is for phonetically similar letter groups such as “ant” and “ent” to be interchanged; for instance, “apparent” is often misspelled as “apparant”. However, where the word would be pronounced differently, the interchange is less likely, as with “applicant” and “applicent”. It is a common trait for message authors to consistently make the same error in subsequent messages, so a case could be made for some user tracking and error recording to help solve future errors faster.


A line must be drawn between what is a misspelling and what is quite obviously not a word. For instance, in a message containing “My password is ujikol” it is futile to try to convert “ujikol” to an English word, since it is too far removed from anything similar. Two measures can be used to judge similarity between words. One is known as “Soundex”, where the phonetic representations of the words are compared, rather than the letters. The other, known as the Levenshtein distance, measures the number of substitutions, deletions and insertions required to turn one word into another. In this way, the Levenshtein distance between “Camel” and “Hippopotamus” is so high that it is obvious that a user who typed “Camel” really is talking about camels, and did not mean to type “Hippopotamus”.
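
The Levenshtein distance can be computed with the standard dynamic programming recurrence, sketched below. The function name is illustrative; the row-by-row layout is just one common way of writing it.

```python
def levenshtein(source, target):
    """Return the number of single-letter insertions, deletions and
    substitutions needed to turn source into target."""
    # previous[j] is the distance between the letters of source processed
    # so far and the first j letters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("apparant", "apparent"))  # prints: 1
print(levenshtein("teh", "the"))            # prints: 2
```

Note that the plain Levenshtein distance counts two exchanged letters, as in “teh”, as two edits; variants that treat a transposition as a single edit exist but are not shown here.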


Brill and Moore (2000), whilst working for Microsoft Research, developed a probabilistic mechanism for measuring the chance of misspellings within a word, and thus provided a more accurate way of estimating correct spellings than Levenshtein distance. This was based on computing the probability of letter groups becoming interchanged and then extrapolating this upwards to measure the similarity between words. For instance, “F” and “Ph” are often interchanged due to their phonetic similarity, as are “ant” and “ent”. Therefore, while a Levenshtein based system would rate “elefent” and “elephant” as relatively distant (three edits apart) and might conclude that they could not be the same word, Brill and Moore’s system correctly identifies the similarity. The drawback of their system, as pointed out by Clark in his paper, is that it does not attempt to correct words that are mistyped to become other words which, by chance, appear in the dictionary.


An issue that has been overlooked up to this point is the speed at which these dictionary searches are made. In principle, every misspelled word would be compared with every word in the English dictionary to compute its Levenshtein distance. However, scanning 20,000 words for every misspelling is excessive. According to Clark, and popular opinion, the fastest way to search through this volume is to use what is known as a prefix tree acceptor. This means that when looking for the word “Xylophone”, you don’t start reading from “Aardvark”; instead, you start at “X”, cutting the search to roughly one 26th, then move on to “XY”, and so forth.
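
A prefix tree (often called a trie) can be sketched as follows. The class and method names are illustrative and are not taken from Clark’s paper; the example only shows exact lookup and prefix search, not the approximate matching a spelling corrector would layer on top.

```python
class Trie:
    """Prefix tree in which each node holds one dictionary of child letters."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for letter in word:
            node = node.children.setdefault(letter, Trie())
        node.is_word = True

    def _find(self, prefix):
        """Follow prefix letter by letter; return the final node or None."""
        node = self
        for letter in prefix:
            node = node.children.get(letter)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def words_with_prefix(self, prefix):
        """Return every stored word starting with prefix, in sorted order."""
        node = self._find(prefix)
        if node is None:
            return []
        results, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for letter, child in current.children.items():
                stack.append((child, text + letter))
        return sorted(results)

trie = Trie()
for entry in ["aardvark", "xenon", "xylem", "xylophone"]:
    trie.insert(entry)
print(trie.words_with_prefix("xy"))  # prints: ['xylem', 'xylophone']
```

Each lookup touches at most one node per letter of the query, so the cost depends on the length of the word rather than on the size of the dictionary.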


The more advanced form of spelling correction, often known as grammar correction, deals with words which have been misspelled to become other valid words, such as “From” and “Form”. In this case, grammatical rules can be used to check the correct order of nouns, verbs, adjectives etc.



5.0         Punctuation correction


Informal communications such as SMS text messages, personal email and instant message conversations are prone to the use of excessive punctuation. In a study by Clark, text extracted from news feeds and other formal, post-edited sources was compared with text from a corpus of USENET postings, which are much less formal. He found huge numbers of occurrences of repeated exclamation marks and periods within the USENET text, which are used to express either excitement or suspense. Both features were non-existent within the news feeds.


Although excessive punctuation does not cause a problem for the applications proposed in this paper, it could be used to set the tone of the text-to-speech engine. So if a “= )” symbol is found, instead of reading “equals, close parenthesis”, the speech engine should say the preceding sentence in a lilting, upbeat fashion.
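
One way this might look in practice is sketched below: the emoticon and repeated punctuation are stripped from the text, and a tone hint is returned alongside it. The tone labels, the function name and the very small emoticon pattern are all assumptions for illustration; how a given speech engine would consume such a hint is outside the scope of the sketch.

```python
import re

# Matches "=)", "= )", ":)", ":(" and similar; deliberately minimal.
EMOTICON = re.compile(r"[=:]\s?[()]")

def tone_and_text(message):
    """Return (tone hint, cleaned text) for a message."""
    tone = "neutral"
    match = EMOTICON.search(message)
    if match:
        tone = "upbeat" if match.group(0).endswith(")") else "downbeat"
    elif re.search(r"!{2,}", message):
        tone = "excited"
    elif re.search(r"\.{3,}", message):
        tone = "suspense"
    text = EMOTICON.sub("", message)
    # Collapse runs of the same punctuation mark to a single mark.
    text = re.sub(r"([!?.])\1+", r"\1", text)
    return tone, " ".join(text.split())

print(tone_and_text("See you tonight = )"))  # prints: ('upbeat', 'See you tonight')
print(tone_and_text("We won!!!"))            # prints: ('excited', 'We won!')
```

The pattern will also fire on ordinary text such as “time: (approx)”, so a production system would need a stricter emoticon list.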

6.0   Text to speech


Text to speech processing (TTS) is a process by which textual data is converted into a synthesized human voice. This process involves extracting the phonemes within words, and looking up a database of audio equivalents.


Care must be taken when converting letter groups to audio, since their pronunciation sometimes depends on the containing word. For instance, “ight” is generally said as “yte”, as in “Bright”, “Fight”, “Height”, “Light”, “Might”, “Night” etc., except in the case of “eight”, where it becomes “ate”.


Other problems occur where a word changes pronunciation based on the context of the containing sentence. For example, “polish” is pronounced differently in “We use polish to clean our floors” and “There are many polish people in Germany”.


Word expansion is a technique discussed by Farrujia (2003), whereby acronyms and shortened words such as “cu soon” would be expanded to “see you soon”. Also, numerical information such as “12:30” would need to be expanded to “twelve thirty”.
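
The numerical part of word expansion could be sketched as follows for clock times. This is an illustration only, not Farrujia’s implementation: the function names and the spoken forms chosen for whole hours (“o’clock”) and single-digit minutes (“oh five”) are assumptions.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

def number_to_words(n):
    """Spell out an integer between 0 and 59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]

def expand_times(text):
    """Replace times written as H:MM or HH:MM with a spoken form."""
    def replace(match):
        hours, minutes = int(match.group(1)), int(match.group(2))
        if hours > 23 or minutes > 59:
            return match.group(0)  # not a valid time, leave untouched
        if minutes == 0:
            return number_to_words(hours) + " o'clock"
        if minutes < 10:
            return number_to_words(hours) + " oh " + number_to_words(minutes)
        return number_to_words(hours) + " " + number_to_words(minutes)
    return re.sub(r"\b(\d{1,2}):(\d{2})\b", replace, text)

print(expand_times("Meet at 12:30"))  # prints: Meet at twelve thirty
```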


A criticism of the work by Farrujia is that no reference is made to existing TTS systems which would permit most of this functionality to be implemented without reinventing the wheel. Systems such as the Microsoft Speech Application Programming Interface are outlined in Reid (2004).



It is expected that the 160-character limit of SMS messages will shortly be supplanted by the ability to send email from mobile phones; emails are not only unlimited in length, but also more compatible with existing IT infrastructure. With more space to write messages, the use of abbreviations will decline; however, people will still use them to speed up the process of writing text messages on a limited keypad. The reduction in the cost of sending SMS messages will also contribute to this decline, although it may actually lead people to send shorter messages more often.


An emerging technology in mobile phones is T9 predictive text. This was initially developed for people with physical disabilities who did not have the dexterity to use the standard 100-key keyboard, but needed to form words with four keys. It was found that the English dictionary could be covered effectively with a set of 9 keys, and thus T9 was born. T9 reduces the level of misspellings sent in text messages, but can lead to unusual mistypings. For instance, “Stale” and “Quake” have the same key combination, but their Levenshtein distance is so high that standard spell checkers would fail to correct this error.
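
The “Stale” and “Quake” collision can be checked with a short sketch that maps letters to the standard phone keypad digits. The function names and the three-word dictionary are hypothetical; this shows only the key mapping, not how a commercial predictive text engine ranks its candidates.

```python
# Letter groups printed on the digit keys of a standard phone keypad.
T9_KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
KEY_OF = {letter: key for key, letters in T9_KEYS.items() for letter in letters}

def t9_sequence(word):
    """Return the digit keys pressed to enter word (letters a-z only)."""
    return "".join(KEY_OF[letter] for letter in word.lower())

def t9_candidates(sequence, dictionary):
    """Return every dictionary word that shares the given key sequence."""
    return sorted(word for word in dictionary if t9_sequence(word) == sequence)

print(t9_sequence("Stale"), t9_sequence("Quake"))  # prints: 78253 78253
print(t9_candidates("78253", ["stale", "quake", "the"]))
# prints: ['quake', 'stale']
```

A spell checker aware of this mapping could treat words with identical key sequences as likely confusions, even when their letter-level edit distance is large.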


Mobile devices are also coming equipped with new input devices, such as retractable keyboards and handwriting recognition, which should allow people to write longer and more correct SMS messages without taking too much time to compose them. Handwriting recognition may, however, introduce its own variety of errors, with similar letters such as “j” and “i” becoming interchanged.


The upshot of all of these advances is more SMS-enabled services and products for the consumer. Eventually, you will be able to text “Please renew my car insurance” to your bank, and have an automated agent settle your bills for you.




Clark, A. (2003). Pre-processing very noisy text. Geneva, Switzerland.


Brill, E. and Moore, R.C. (2000). An improved error model for noisy channel spelling correction. Proceedings of ACL 2000.


Farrujia, P.J. (2003). Text to Speech Technologies for Mobile Technology Services. University of Malta.


Ghini, V., Pau, G. and Salomoni, P. (2000). Integrating notification services in computer network and mobile telephony. Bologna, Italy.


Reid, F. (2004). Network programming in .NET, with C# and Visual Basic .NET. Oxford, UK.



Filtering a DataSet

I know that .NET offers a number of ways to show different filtered views of a DataSet. However, if you want to actually change the DataSet, you run into nasty little errors such as "This row already belongs to another table" or "These columns don't currently have unique values".
So here’s a method I made for filtering datasets…

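using System.Collections;
using System.Data;

// Returns a copy of ds in which TableName keeps only the rows whose
// column value equals filter. The original DataSet is left untouched.
public static DataSet FilterDataSet(DataSet ds, string TableName, string column, string filter)
{
    DataSet dsCopy = ds.Copy();
    ArrayList alRowsToRemove = new ArrayList();

    // Collect the non-matching rows first: rows cannot be removed
    // from the collection while it is being enumerated.
    foreach (DataRow dr in dsCopy.Tables[TableName].Rows)
    {
        if (dr[column].ToString() != filter)
        {
            alRowsToRemove.Add(dr);
        }
    }

    foreach (DataRow dr in alRowsToRemove)
    {
        dsCopy.Tables[TableName].Rows.Remove(dr);
    }

    return dsCopy;
}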

