An Essay on Peer to Peer data storage

March 15, 2005 Infinite Loop Development Ltd

Report on data storage
Part 2. Distributed storage.

Once a data set reaches the order of thousands of terabytes, it becomes increasingly difficult to store it on a single device. Distributed storage solves this problem by spreading the data across multiple devices managed by different computers. Individual computers in a distributed storage system may fail without rendering the entire system inoperable, which improves the durability and redundancy of the storage system.
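The redundancy described above can be illustrated with a minimal sketch. It assumes a simple placement rule (hash the key, then copy the data to the next few nodes in order); the names `Node`, `ReplicatedStore` and the replica count of 2 are hypothetical and are not taken from any particular storage product.

```python
import hashlib


class Node:
    """One storage device managed by one computer (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}


class ReplicatedStore:
    """Stores every value on `replicas` different nodes."""

    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas

    def placement(self, key):
        # Hash the key to choose a starting node, then take the following
        # nodes in order so that each copy lands on a different node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for node in self.placement(key):
            node.blocks[key] = value

    def get(self, key):
        # Try each replica in turn and skip nodes that have failed.
        for node in self.placement(key):
            if node.online and key in node.blocks:
                return node.blocks[key]
        raise KeyError(key)


nodes = [Node(f"node{i}") for i in range(4)]
store = ReplicatedStore(nodes, replicas=2)
store.put("report.txt", b"part 2")

# Simulate the failure of the first node that holds the data.
store.placement("report.txt")[0].online = False
print(store.get("report.txt"))  # b'part 2', served by the second replica
```

With two replicas the data survives the loss of any one node; if both nodes holding a key fail, `get` raises `KeyError`.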

When a client requests data from a distributed storage system, the system must be able to locate the distributed device (node) that holds the requested data. Three techniques can be used to achieve this: Ad-Hoc Peer to Peer (P2P), Pure P2P, and Indexed P2P.

Ad-hoc P2P is where the client already knows which node within a distributed storage system contains the data that it requires. A good example of ad-hoc P2P is the WHOIS system.

The WHOIS system consists of roughly 100 computers which are managed independently. A client can query the WHOIS system with a request such as “WHOIS google.com” and the system will return the name and address of the company which registered the domain name “google.com”. Each country (top level domain) manages its own WHOIS server, and some redundancy is provided where larger WHOIS servers contain duplicate data of country-specific WHOIS servers. For example, RIPE.NET contains WHOIS information for the whole of Europe, but so too do regional WHOIS servers in France, Germany, Holland etc.
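In an ad-hoc system the knowledge of "which node holds what" lives in the client. A minimal sketch of that client-side table follows; the host names are hypothetical placeholders rather than real WHOIS servers, and the function only selects servers, it does not open a network connection.

```python
# Hypothetical host names, used for illustration only.
COUNTRY_SERVERS = {
    "com": "whois-com.example",
    "fr": "whois-fr.example",
    "de": "whois-de.example",
}

# A larger (hypothetical) server that duplicates data for several countries.
REGIONAL_SERVERS = {
    "fr": "whois-europe.example",
    "de": "whois-europe.example",
}


def servers_for(domain):
    """Return the servers a client should try, in order, for `domain`."""
    tld = domain.rsplit(".", 1)[-1].lower()
    candidates = []
    if tld in COUNTRY_SERVERS:
        candidates.append(COUNTRY_SERVERS[tld])
    if tld in REGIONAL_SERVERS:
        candidates.append(REGIONAL_SERVERS[tld])
    if not candidates:
        # The client has no way to discover a server it was never told about.
        raise LookupError(f"no known WHOIS server for .{tld}")
    return candidates


print(servers_for("google.com"))  # ['whois-com.example']
print(servers_for("example.fr"))  # ['whois-fr.example', 'whois-europe.example']
```

The table has to be shipped with, and updated in, every client, which is the practical cost of the ad-hoc approach.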

The disadvantage of the WHOIS system, and of ad-hoc P2P systems in general, is that regional differences can arise: the US format for WHOIS is different from the European format, and so forth. Also, if one server goes down, WHOIS information for that country may be lost. Furthermore, the client must know which WHOIS server to connect to. One advantage of the system is that it is quick and easy to make changes to data stored on the WHOIS system. This is not the case with pure P2P or unmanaged indexed P2P.

Pure P2P is used where a client knows a node within a distributed storage system, but not necessarily the node that holds the data that it is requesting.

A good example of Pure P2P is the DNS system. The DNS system consists of millions of interlinked DNS servers worldwide. A client can query the DNS system with a request such as “resolve google.com” and the system will return the IP address for the “google.com” website. Each ISP manages its own DNS server, and many layers of redundancy are provided, as upstream DNS servers routinely exchange “routing advertisements”. Each DNS server knows the location of at least one other DNS server, and thus a request to any DNS server can be referred along the chain to the DNS server which holds the data the client requested.
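The referral behaviour described above can be sketched as a chain of peers, each of which knows exactly one other peer. This is a simplified model of the essay's description rather than of the real DNS protocol; the `Peer` class, the `max_hops` limit and the address `192.0.2.10` (a documentation-only address) are assumptions made for the example.

```python
class Peer:
    """A node that holds some records and knows one other node."""

    def __init__(self, name, records=None):
        self.name = name
        self.records = dict(records or {})
        self.neighbour = None


def resolve(entry, key, max_hops=30):
    """Follow referrals from `entry` until a peer holding `key` is found.

    Returns (value, number_of_referrals_followed).
    """
    peer = entry
    for hops in range(max_hops + 1):
        if key in peer.records:
            return peer.records[key], hops
        if peer.neighbour is None:
            break  # end of the chain, nobody left to ask
        peer = peer.neighbour
    raise LookupError(f"{key!r} not found within {max_hops} referrals")


# Build a chain a -> b -> c where only c holds the record.
a = Peer("a")
b = Peer("b")
c = Peer("c", {"google.com": "192.0.2.10"})
a.neighbour = b
b.neighbour = c

print(resolve(a, "google.com"))  # ('192.0.2.10', 2)
print(resolve(c, "google.com"))  # ('192.0.2.10', 0)
```

The client only needs to know one peer, but the cost of a lookup grows with the length of the chain between that peer and the one holding the data.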

The disadvantage of the DNS system, and of Pure P2P systems in general, is that the topology of the system is not optimized to be ‘flat’, and each node in the system may have to query as many as 30 other nodes in order to find an authoritative response to a client request. Furthermore, due to the ‘stringy’ topology of the DNS system and the periodicity of the routing advertisements, a change to a DNS record may take up to 48 hours to propagate through the internet. The main advantage of the pure P2P system is that there is no single point of failure, and even if multiple DNS servers go down, the system can continue to operate correctly.

Indexed P2P is a more recent invention, and is used effectively by file sharing networks such as WinMx, Kazaa and Napster. It is also used for large load-balanced multi-server websites, such as Google. It differs from pure P2P in that there is a single set of index servers which contain an index of the location of all other nodes in the network. Indexed P2P comes in two different forms, managed and unmanaged.

Unmanaged Indexed P2P is where the nodes within the distributed storage network are managed by private individuals. Such a system is used by music sharing networks such as WinMx. When a client connects to the WinMx network and looks for a file such as “U2-vertigo.mp3”, the index servers return a list of IP addresses of peer servers which hold this file. The client may then download the file from one of the peer servers.
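The role of the index server can be sketched as a mapping from file names to the peers that hold them. This is a minimal in-memory model, not the protocol of any real file sharing network; the `IndexServer` class and the peer addresses (taken from the documentation-only ranges) are assumptions for the example.

```python
from collections import defaultdict


class IndexServer:
    """Keeps track of which peers hold which files (hypothetical model)."""

    def __init__(self):
        self._holders = defaultdict(set)

    def register(self, peer_addr, filenames):
        # Called by a peer when it joins and announces what it shares.
        for name in filenames:
            self._holders[name.lower()].add(peer_addr)

    def unregister(self, peer_addr):
        # Called when a peer leaves the network.
        for holders in self._holders.values():
            holders.discard(peer_addr)

    def search(self, filename):
        # The index returns only addresses; the file itself is
        # downloaded directly from one of the peers.
        return sorted(self._holders.get(filename.lower(), ()))


index = IndexServer()
index.register("198.51.100.7", ["U2-vertigo.mp3"])
index.register("203.0.113.21", ["U2-vertigo.mp3", "other-song.mp3"])

print(index.search("U2-vertigo.mp3"))  # ['198.51.100.7', '203.0.113.21']

index.unregister("198.51.100.7")
print(index.search("U2-vertigo.mp3"))  # ['203.0.113.21']
```

Every lookup is a single request to the index, which is what makes the topology "flat", but it also means that no search is possible while the index is unavailable.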

The disadvantage of unmanaged indexed P2P is that the content is not held by a single company, and thus the network is at the mercy of whatever data each individual wishes to host. This may lead to users hosting copyrighted material, pornography, or viruses. Furthermore, since the index servers form the basis of the network, if the index servers fail, then the network is useless. Unmanaged servers run by private individuals typically have low-bandwidth, non-dedicated connections, making the network slower than one built on managed servers.

Managed indexed P2P is where the nodes within the distributed storage network are managed by a single company. Such a system is used by large multi-server networks such as Google. Managed indexed P2P works in a similar way to unmanaged indexed P2P. When a client makes a request to Google looking for “University of Ulster”, Google passes this request on to an array of index servers, which pass the request on to database servers which contain information on the “University of Ulster”. The database servers (nodes) return the pages found, a snippet of text from within each page containing the searched text, and a weighting that determines the order in which the page should appear in the list.
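The query flow described above can be sketched as a fan-out to several database nodes followed by a merge on weight. This is a toy model under stated assumptions: the index layer is reduced to a function that asks every node, matching is a plain substring search, and the page data, URLs and weights are invented for the example. It is not a description of how Google's systems are actually built.

```python
class DatabaseNode:
    """Holds a share of the pages as (url, text, weight) tuples."""

    def __init__(self, pages):
        self.pages = pages

    def search(self, query):
        needle = query.lower()
        hits = []
        for url, text, weight in self.pages:
            pos = text.lower().find(needle)
            if pos >= 0:
                # Snippet: the matched text plus up to 20 characters either side.
                snippet = text[max(0, pos - 20):pos + len(query) + 20]
                hits.append({"url": url, "snippet": snippet, "weight": weight})
        return hits


def search_all(nodes, query, limit=10):
    """Ask every node, then order the combined hits by descending weight."""
    hits = [hit for node in nodes for hit in node.search(query)]
    return sorted(hits, key=lambda h: h["weight"], reverse=True)[:limit]


# Invented sample data on hypothetical URLs.
node1 = DatabaseNode([
    ("http://a.example/courses", "Courses at the University of Ulster", 0.4),
    ("http://a.example/news", "Local news and weather", 0.8),
])
node2 = DatabaseNode([
    ("http://b.example/home", "Welcome to the University of Ulster home page", 0.9),
])

for hit in search_all([node1, node2], "University of Ulster"):
    print(hit["weight"], hit["url"])
# 0.9 http://b.example/home
# 0.4 http://a.example/courses
```

Because each node searches only its own share of the pages, adding nodes spreads the work, while the final ordering depends only on the weights the nodes return.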

The disadvantage of managed P2P is that the system is extremely expensive to implement, as it requires several clustered servers. Google, for example, uses over 10,000 networked desktop-grade servers. The advantage of managed P2P is that the company can control the content contained in its index. Also, since the bandwidth within and between its data-centres is very high, the result of any query can be returned in seconds.
