Distributed Information System

Christophe Meessen - Fri, Oct 16, 2015

The Distributed Information System (DIS) is a worldwide distributed store for any type of information. Its architecture is inspired by the Domain Name System (DNS), which has proven to be robust, resilient and highly scalable. DIS extends that design by using binary keys instead of string names, by storing any type of information with a large size limit, and by supporting remote addition, modification and deletion of information with access control.

DIS Overview

When designing a distributed information system, several fundamental choices regarding references have to be made:

  1. Is a reference independent of the information value?
  2. Is a piece of information scattered across multiple locations?
  3. Is a reference independent of the information location?

For the DIS, it was clear from the start that references must be independent of the information value, so that information can be modified without invalidating its reference. It was also decided very early to consider only unscattered information, because a system supporting scattered information can be built on top of it.

Ideally, a reference should also be independent of the storage location, so that information can be moved without invalidating its reference. This independence is usually achieved by means of an index that maps the reference to the storage address (location) of the information.

For a large-scale distributed information system, such an index must cope with highly concurrent access and continuous modifications, and must enforce access control so that only the owner of a piece of information can modify its address. At the same time, the index should be simple to create, extend and maintain.

Unfortunately, an index with all these properties is not trivial to design unless some requirement is given up. For the DIS, the choice was made to give up the location independence of references. It then becomes possible to design a simple, scalable and maintainable index. In addition, the system can exploit locality to optimize index access.

The index architecture is that of the Domain Name System (DNS), which has proven to be robust, resilient and highly scalable. In this architecture the index is split into Nodes organized in a tree-like structure (see Fig. 1). Each Node is the root of a subtree and has one parent Node. The Root Node has no parent Node, and a leaf Node has no branches (sub Nodes). A reference defines a path in this tree, just as a fully qualified domain name does in the DNS tree.

           ___
          |___| <-------------- Root Node
      ___/ _|_ \___
     |___||___||___| <--------- Nodes
           _|_  _|_ \___
          |___||___||___|
         /  |    |    |  \ <--- Branches
       ... ...  ...  ... ...

    Fig. 1: Tree-like organization of the DIS Nodes
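
The tree of Fig. 1 can be illustrated with a minimal in-memory sketch. It is not the DIS implementation: the class name Node, the label attribute and the helper methods are hypothetical names chosen for this example only, and real Nodes would be separate servers rather than Python objects.

```python
from __future__ import annotations


class Node:
    """One index Node in the tree; `label` exists only for this illustration."""

    def __init__(self, label: str, parent: Node | None = None) -> None:
        self.label = label
        self.parent = parent              # None only for the Root Node
        self.branches: list[Node] = []    # sub Nodes; empty for a leaf Node

    def add_branch(self, label: str) -> Node:
        # A new sub Node records this Node as its single parent.
        child = Node(label, parent=self)
        self.branches.append(child)
        return child

    def is_root(self) -> bool:
        return self.parent is None

    def is_leaf(self) -> bool:
        return not self.branches

    def depth(self) -> int:
        # Number of branches followed from the Root Node to reach this Node.
        return 0 if self.parent is None else self.parent.depth() + 1


root = Node("root")
institution = root.add_branch("institution")
team = institution.add_branch("team")
```

Here team is a leaf at depth 2, and institution is the root of the subtree that contains it.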

This index organization allows the management of Nodes to be fully delegated to companies, institutions or tech-savvy users. They add index entries to their Nodes for the information they create and manage. Access to this part of the index is faster for their own users because the Nodes are close to them.

DIS generalizes this architecture by allowing information to be stored directly in the index. This optimization is frequently used in database indexes to save one level of indirection. The only constraint is that the stored information must be static or change slowly, although some specialized Nodes could also generate information dynamically, as web servers do. The stored information can in theory be of any kind (images, text, MP3 files, URLs, certificates, etc.), and its size is limited only by the storage capacity of the Node.

DIS Reference

Each Node contains information, or references to other Nodes, each identified by a Local Information Reference (LIR). A LIR is a sequence of bytes that will usually encode a non-negative integer incremented for each new insertion. A LIR is thus a very compact and opaque reference to a piece of information inside a Node.

              _______________
Parent       |      Node     |
Node    <----|* 0         0 *|----> information on this node
SubNode <----|* 1         1 *|----> image
SubNode <----|* 2         2 *|----> text
SubNode <----|* 3         3 *|----> music
             |  :         4 -|
             |  :         5 *|----> DIS certificate
             |  :         :  |
             |  :  Local  :  |
             |  Information  |
             |   Reference   |
             |_______________|
   Node references        Information

 Fig. 2: A Node containing information or Node references
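
The allocation of LIRs from an incrementing counter can be sketched as follows. The byte encoding is an assumption made for this example (shortest big-endian form, with zero as a single byte); the text above only says that a LIR is a sequence of bytes usually corresponding to an incremented integer. The names encode_lir and InformationTable are hypothetical.

```python
from __future__ import annotations


def encode_lir(counter: int) -> bytes:
    # Assumed encoding: shortest big-endian byte string, one byte for zero.
    length = max(1, (counter.bit_length() + 7) // 8)
    return counter.to_bytes(length, "big")


class InformationTable:
    """The information column of one Node, keyed by LIR."""

    def __init__(self) -> None:
        self._next_counter = 0
        self._entries: dict[bytes, bytes] = {}

    def insert(self, value: bytes) -> bytes:
        # Each insertion receives the next counter value as its LIR.
        lir = encode_lir(self._next_counter)
        self._next_counter += 1
        self._entries[lir] = value
        return lir

    def get(self, lir: bytes) -> bytes | None:
        return self._entries.get(lir)

    def delete(self, lir: bytes) -> None:
        # The counter is not rewound, so a deleted LIR leaves a gap,
        # like the empty entry 4 in Fig. 2.
        self._entries.pop(lir, None)
```

Not reusing deleted LIRs is a design choice of this sketch: it prevents an old reference from silently pointing at unrelated new information.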

A DIS reference, called a Distributed Information Reference (DIR), is a concatenation of LIRs, from the root Node down to the Node containing the referenced information. This is exactly like the DNS, where a fully qualified domain name is the concatenation of the names defining a path in the domain name tree (e.g. www.ditp.org). The differences are that in a DIR the LIR of the root Node comes first (leftmost), and that a DIR is a compact sequence of at most 32 bytes.

A DIR is by definition binary data, but it also has an ASCII representation as a URI. The URI has the scheme dis and a path component in which each segment is a LIR encoded in a base64 variant. A DIS URI looks like this: “dis:/Az-0c/1iPg/Zl4-E”.
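
The mapping between a list of LIRs and a dis URI can be sketched as below. The exact base64 variant is not specified here, so this example assumes the standard URL-safe alphabet with the padding removed; the real DIS encoding may differ, and the sketch is not claimed to decode the example URI above. The function names are hypothetical, and the DIR is modeled as a list of LIRs because the binary framing of the concatenation is not described.

```python
from __future__ import annotations

import base64

SCHEME_PREFIX = "dis:/"


def lir_to_segment(lir: bytes) -> str:
    # Assumed variant: URL-safe base64 with the '=' padding stripped.
    return base64.urlsafe_b64encode(lir).rstrip(b"=").decode("ascii")


def segment_to_lir(segment: str) -> bytes:
    # Restore the padding that lir_to_segment removed before decoding.
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def dir_to_uri(lirs: list[bytes]) -> str:
    # The LIR of the root Node comes first (leftmost).
    return SCHEME_PREFIX + "/".join(lir_to_segment(lir) for lir in lirs)


def uri_to_dir(uri: str) -> list[bytes]:
    if not uri.startswith(SCHEME_PREFIX):
        raise ValueError("not a dis URI: " + uri)
    return [segment_to_lir(s) for s in uri[len(SCHEME_PREFIX):].split("/")]
```

A round trip through dir_to_uri and uri_to_dir returns the original list of LIRs.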

DIS Agent

To locate the information associated with a given reference, one traverses the index tree from the root Node down to the Node containing that information. Each LIR of the reference is used to obtain the address of the next Node, until the last Node is reached, where the final LIR identifies the referenced information.
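
The traversal can be sketched over an in-memory tree. This is an illustration under simplifying assumptions: each lookup in sub_nodes stands in for a network query to the Node at that address, the reference is passed as a list of LIRs, and the names Node, sub_nodes, information and resolve are hypothetical.

```python
from __future__ import annotations


class Node:
    def __init__(self, address: str) -> None:
        self.address = address                    # hypothetical network address
        self.sub_nodes: dict[bytes, Node] = {}    # LIR -> next Node
        self.information: dict[bytes, bytes] = {} # LIR -> stored information


def resolve(root: Node, lirs: list[bytes]) -> tuple[Node, bytes]:
    """Follow the LIRs from the root Node; return the last Node and the value."""
    if not lirs:
        raise ValueError("empty reference")
    node = root
    # Every LIR except the last selects the next Node on the path.
    for lir in lirs[:-1]:
        if lir not in node.sub_nodes:
            raise LookupError("no sub Node for LIR %r at %s" % (lir, node.address))
        node = node.sub_nodes[lir]
    # The last LIR identifies the information inside the final Node.
    last = lirs[-1]
    if last not in node.information:
        raise LookupError("no information for LIR %r at %s" % (last, node.address))
    return node, node.information[last]
```

Returning the final Node along with the value lets a caller connect to that Node directly when needed.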

The root Node, and some Nodes below it, can be replicated around the world to spread the load and minimize round-trip time for users. This optimization is feasible because the content of the upper Nodes is expected to change only slowly.

Another optimization is caching. Information requests are delegated to DIS agents, which perform the index traversal. They can cache the result so that other users benefit from a shorter response time. If the information is too big to be cached, the agent returns the address of the Node containing it, and the user must then establish a direct connection to that Node to access the information.
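
The caching behavior of an agent can be sketched as follows. Everything here is an assumption made for illustration: the names Agent and Answer, the traverse callback standing in for the index traversal, the size threshold, and the example addresses. Cache expiry and access control are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Answer:
    node_address: str
    value: bytes | None   # None means: fetch the information from the Node


class Agent:
    def __init__(self, traverse: Callable[[str], tuple[str, bytes]],
                 max_cached_size: int = 1024) -> None:
        self._traverse = traverse            # performs the index traversal
        self._max_cached_size = max_cached_size
        self._cache: dict[str, Answer] = {}
        self.traversals = 0                  # counts cache misses

    def request(self, reference: str) -> Answer:
        cached = self._cache.get(reference)
        if cached is not None:
            return cached                    # served without any traversal
        self.traversals += 1
        address, value = self._traverse(reference)
        if len(value) > self._max_cached_size:
            # Too big to cache: hand back only the address of the Node.
            return Answer(address, None)
        answer = Answer(address, value)
        self._cache[reference] = answer
        return answer


# Hypothetical index content: reference -> (Node address, information).
store = {
    "dis:/AQ/Ag": ("node-a.example", b"small"),
    "dis:/AQ/Aw": ("node-b.example", b"x" * 5000),
}
agent = Agent(lambda reference: store[reference])
```

Requesting the small entry twice triggers a single traversal, while the large entry yields only the Node address.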

The DIS agent of a company or institution may hold credentials allowing it to act on behalf of the company's users when accessing restricted information. The agent then checks that users requesting this information have the appropriate credentials. This allows the cache to be used even for information with restricted access.

If a user wants to create, modify or delete information, they request only the Node address from the DIS agent and establish a direct connection to the Node. After presenting the appropriate credentials, they can perform the desired actions.