In this dissertation we present an approach for the implementation of a scalable content management system that is based on the virtualization of the repository. It allows to dynamically scale-out the repository on multiple machines in order to adjust the system to the current load. This is an important precondition when offering a content management system as a service in a cloud to customers of different and changing sizes. As an example we will look at e-mail archiving.
In personal and especially in business life one is daily confronted with a huge amount of information, and E-mails have become a dominant part. The appropriate management of the information is very important. It has to support the typical use cases as well as huge and changing volumes of information, and it has to be cost-efficient. Moreover, in a globalized world the information has to be accessible from anywhere on the world at any time.
In this dissertation we present an approach to manage the information that allows growing and shrinking the system in a wide range. Therefore we first introduce a repository virtualization layer that decouples the applications from the direct usage of the repository. For each request it determines the responsible repository and forwards the request to it. Transparent for the applications and their users it is possible to add new machines to the system or to remove superfluous ones.
First we describe the data an e-mail archive has to deal with. These are first and foremost the e-mails, but there are some other metadata, too. For the implementation of the data model we are using a combination of a search engine and a relational database. The focus lies on the search engine as it is better suited for semi-structured and unstructured information. With the typical fuzzy information need of the user the search engine provides more relevant results than the rather strict and precise databases. When ingesting a new e-mail into the system the full-text is extracted and is added to a full-text index. A cluster of search engines makes the full-text indexes accessible by the users. The database is mostly used for consistency information and access control information. A normalization of the information in the search engine is also investigated.
The repository virtualization layer allows distributing the data over multiple machines. The starting point for this distribution is the partitioning schema. In order to achieve a good performance a good scalability is important to keep the communication between the machines at a minimum. Especially very frequent operations and time-critical interactive operations should be executed locally on one machine and join operations should be avoided. For the distribution of the data to the machines we are using a distributed hash table from the peer-to-peer system OpenChord. By using consistent hashing the identifiers of the documents, which we compute as a hash over their content, are uniquely mapped to the currently responsible machine.
To store the original documents we evaluated different approaches. In the simplest approach a regular file system was chosen. It has the lowest overhead, but also provides the fewest optional features. Thus e.g. to rearrange the data, one has to implement a different mechanism which typically by transmits them over the network. Shared file systems like network file systems and cluster file systems have an advantage here. With an appropriate organization of the file system it is relatively easy and fast to assign even large fractions of the documents to another machine. When the documents are stored in a content management system further functionalities provided by these systems like replication and hierarchical storage management can be exploited. We are also looking at peer-to-peer systems which provide a wide spectrum of different functionalities.
Besides forwarding the operations on individual documents to the responsible machine, the repository virtualization layer also has to rearrange the data in case a machine joins or leaves the system, or in order to balance the load. While moving data from one machine to another this fraction of the data will not be available. In order to keep this time low the movement is done in two phases: In the first phase only the data that is absolutely necessary for the new node to start working is transferred. In the second phase, where the larger fraction of the data is moved, no locking is necessary. This is implemented as an extension to OpenChord that deals with some additional operations and the movement of the large data sets.