The Monoliths

INTRODUCTION 

The aim of the project is to create a cheap storage solution for digital humanities researchers (DHers), made up of individual monoliths. These would ensure local data reliability and data redundancy across the network of monoliths, and should be viewable as a single database of information constituting The Great Common of digitized human cultural content.

We aim to describe a possible system implementing all of the above.

The traditional way to deal with the issue of data storage is to use a NAS (Network-Attached Storage) built around a RAID (redundant array of independent disks) to ensure protection against disk failure.

However, we are instead considering a decentralized aggregation of the data of all the monoliths, allowing anyone to access The Great Common of cultural data as it is digitized, through some kind of database explorer software.

STORAGE SOLUTIONS
At first glance, hardware RAID (redundant array of independent disks) seemed to be a perfect way to solve our problem: it is cheap and fast, and ensures data reliability after the loss of one or several disks. However, there is a downside: RAID is very sensitive to data corruption (recall that roughly 1 bit error occurs for every 10^15 bits written to a hard disk), and the probability of corruption increases further for old, rarely rewritten data (spontaneous bit flips). Data integrity is jeopardized because RAID is not able to detect corruption in the information, and therefore propagates it. Since classical RAID does not perform error checking, it may not be suitable for our purpose, considering we want our data to be stored for a long time.
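To make the order of magnitude concrete, here is a short back-of-the-envelope calculation (a sketch with illustrative numbers: the 1-in-10^15 rate quoted above and a hypothetical 16 TB array) of how likely it is to hit at least one unrecoverable bit error while reading the whole array back, for instance during a rebuild.

    # Back-of-the-envelope estimate of silent bit errors during a full read
    # of the array. Illustrative numbers only: the 1e-15 rate is the figure
    # quoted above, the 16 TB array size is a hypothetical example.
    import math

    BIT_ERROR_RATE = 1e-15            # ~1 bit error per 10^15 bits
    ARRAY_SIZE_TB = 16                # e.g. four 4 TB disks read on rebuild
    bits_read = ARRAY_SIZE_TB * 1e12 * 8

    expected_errors = bits_read * BIT_ERROR_RATE            # ~0.13
    p_at_least_one = -math.expm1(-expected_errors)          # Poisson approx.

    print(f"expected bit errors over a full read: {expected_errors:.3f}")
    print(f"probability of at least one error:    {p_at_least_one:.1%}")

With a plain RAID controller, none of these errors would even be noticed, which is exactly the weakness described above.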
This led us to the conclusion that, to take care of this kind of problem (especially for “older” data), we might have to use ZFS (a filesystem, i.e. a software solution) to store our data. ZFS would allow us to store our data for a long time by embedding data-corruption checking (through checksums), restoring lost or corrupted data, and emulating RAID capabilities at the same time.
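The principle behind this self-healing can be illustrated by a minimal toy model (this is not ZFS code: we simply assume two mirrored copies of each block and SHA-256 checksums kept on the side).

    # Toy illustration of checksum-based self-healing, in the spirit of what
    # ZFS does on a mirrored pool. NOT ZFS code: blocks, mirroring and
    # checksum bookkeeping are all simplified assumptions.
    import hashlib

    def checksum(block: bytes) -> str:
        return hashlib.sha256(block).hexdigest()

    def read_with_self_heal(mirrors: list, index: int, expected: str) -> bytes:
        """Return a block whose checksum matches; repair corrupted copies."""
        good = None
        for copy in mirrors:
            if checksum(copy[index]) == expected:
                good = copy[index]
                break
        if good is None:
            raise IOError("all copies corrupted: data lost")
        for copy in mirrors:                      # self-heal the bad copies
            if checksum(copy[index]) != expected:
                copy[index] = good
        return good

    # Example: two mirrored "disks", each a list of blocks.
    disk_a = [b"page scan 0001", b"page scan 0002"]
    disk_b = [b"page scan 0001", b"page scan 0002"]
    checksums = [checksum(b) for b in disk_a]

    disk_a[1] = b"page scan 0#02"                 # simulate a silent bit flip
    data = read_with_self_heal([disk_a, disk_b], 1, checksums[1])
    assert data == disk_b[1] and disk_a[1] == disk_b[1]   # detected and fixed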
Moreover, to guarantee even more reliability, a computer equipped with ECC RAM (Error-Correcting Code Random Access Memory) can be preferred. While reading from or writing to the disks, bit flips may happen in the computer's memory, mainly due to natural environmental radiation or physical defects in the memory chips. This kind of error cannot be detected by the filesystem (not even ZFS), and only ECC RAM is capable of correcting it.
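For intuition, the kind of single-bit correction performed by ECC modules can be shown with a classic Hamming(7,4) code; the sketch below is purely didactic (real ECC RAM uses wider codes over full memory words, implemented in hardware rather than software).

    # Didactic Hamming(7,4) example: 4 data bits protected by 3 parity bits,
    # able to correct any single flipped bit. Real ECC RAM works on wider
    # words and in hardware; this only shows the principle.

    def encode(d1, d2, d3, d4):
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        # Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def decode(codeword):
        p1, p2, d1, p3, d2, d3, d4 = codeword
        s1 = p1 ^ d1 ^ d2 ^ d4                  # recomputed parity checks
        s2 = p2 ^ d1 ^ d3 ^ d4
        s3 = p3 ^ d2 ^ d3 ^ d4
        error_pos = s1 * 1 + s2 * 2 + s3 * 4    # 0 means no error detected
        if error_pos:
            codeword[error_pos - 1] ^= 1        # correct the flipped bit
        return [codeword[2], codeword[4], codeword[5], codeword[6]]

    word = encode(1, 0, 1, 1)
    word[4] ^= 1                                # simulate a cosmic-ray bit flip
    assert decode(word) == [1, 0, 1, 1]         # the original data is recovered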
On top of the aforementioned solutions, we could add a level of reliability through redundancy between the monoliths, via a decentralized peer-to-peer (P2P) network. This would serve two purposes. It could protect against the complete destruction of a monolith: after buying a replacement, the lost data could be downloaded from the backups stored on other monoliths around the world. It could also, in the case where Internet availability of the data is a desired feature, remove the need for all the monoliths to be connected at once: users could access the data of a specific monolith through any of the available backups stored on other monoliths.
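One simple way to organize this redundancy is to give every archived block a fixed replication factor and to choose its host monoliths deterministically from a hash, so that any peer can recompute where the copies live. The sketch below only shows this placement idea (the peer names are hypothetical, and the actual P2P transport, membership and repair logic are left out).

    # Placement sketch for network-wide redundancy: each block is replicated
    # on REPLICAS monoliths ranked by hash(block_id + peer), so placement is
    # deterministic and recomputable by every peer. Peer names are made up;
    # real networking, membership and repair are out of scope here.
    import hashlib

    REPLICAS = 3
    monoliths = ["mono-geneva", "mono-lausanne", "mono-zurich",
                 "mono-lyon", "mono-berlin"]            # example peer list

    def hosts_for(block_id: str, peers: list, replicas: int = REPLICAS) -> list:
        """Rank peers by a per-block hash and keep the first `replicas`."""
        ranked = sorted(
            peers,
            key=lambda p: hashlib.sha256(f"{block_id}:{p}".encode()).hexdigest(),
        )
        return ranked[:replicas]

    # A destroyed monolith can be rebuilt: every block it hosted still has
    # copies on the other peers returned by hosts_for().
    print(hosts_for("manuscript-0042/page-007", monoliths))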
METHODOLOGY
To obtain monoliths that are cheap and “easy” to use, we settled on two possible hardware solutions:
  • In the case where we use hardware RAID, the idea is to use a Raspberry Pi (around $35), because it is powerful enough while remaining affordable.
  • However, the Raspberry Pi is not powerful enough for ZFS. The alternative lies in small-footprint, low-power computer motherboards (mini-ITX, for example, around 60-80€ with an integrated CPU), slightly less affordable, but which would allow us to use ZFS and therefore have a more “solid” system. In addition, the headroom in computing power would also allow for the use of heavier software that could be useful in the field for DHers.
That being said, we now have two different sets of hardware that can be used, depending on the needs of the user.
This brings us to the second part of our research: deciding which solution the Monoliths should be built on. We need to know the real needs of the researchers and scholars most likely to use the Monoliths.
Therefore, at our scale, we believe a survey among the members of the Digital Humanities Lab is necessary. Historians and scholars will be asked several questions that will allow us to grasp their needs in terms of digital storage.
As seen above, many solutions are available: some offer virtually eternal storage with continuous integrity checking and automatic repair, while others only use simple data redundancy against disk failures; some hardware is affordable but not always powerful enough.
Consequently, we must submit a survey to the targeted users so that afterwards we have a clear picture of their expectations. The survey could be structured as follows:
  • Who needs it?
  • Why, and for what purpose?
  • How would storage and access to the data be most convenient for you (local network only? Internet, but private? broad access across all the monoliths?)
  • What level of reliability? (local RAID solution? network-distributed redundancy?)
  • If there are network capabilities between the monoliths, should they form a specific work group, or be global (in the spirit of the Great Common)?
  • How much storage is required/desired?
    
After building the monoliths, the structure has to be tested.
The system will rely on a software distribution that best meets the specifications defined by the surveys described above. Therefore, after a first estimation of the hardware resources needed (the main limitations being memory and computing power), the specifically engineered Linux/FreeBSD distribution would be tested in resource-constrained virtual machines (see the sketch after the list below). This would help test the systems, without having to invest in potentially unsuitable hardware, for:
  •  The stability of the software distribution and the specifically crafted scripts, programs, and settings.
  •  The computing power needed for the software to run seamlessly.
  • The distributed network capabilities of many monoliths together, on just one or a few computers.
  • The stability of the systems against power outages (sudden shutdown of the virtual machine), hard disk failures (total failure or data corruption), etc.
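
As an illustration of this testing step, the sketch below boots a memory- and CPU-constrained virtual machine and then kills it abruptly to emulate a power outage. It assumes QEMU is installed and that a hypothetical disk image monolith.qcow2 holds the candidate distribution; the exact flags and image name would be adapted to the real setup.

    # Sketch of a constrained-VM test run. Assumes QEMU is available and that
    # "monolith.qcow2" is a (hypothetical) image of the candidate Linux/FreeBSD
    # distribution; flags would be adapted to the real test setup.
    import signal
    import subprocess
    import time

    def launch_constrained_vm(memory_mb=512, cpus=1, image="monolith.qcow2"):
        """Boot the image with deliberately limited RAM and CPU."""
        return subprocess.Popen([
            "qemu-system-x86_64",
            "-m", str(memory_mb),                 # cap RAM to mimic cheap boards
            "-smp", str(cpus),                    # cap CPU count
            "-drive", f"file={image},format=qcow2",
            "-nographic",
        ])

    if __name__ == "__main__":
        vm = launch_constrained_vm()
        time.sleep(600)                           # let the test workload run
        vm.send_signal(signal.SIGKILL)            # abrupt kill ~ power outage
        vm.wait()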

Then, after sorting out the bugs and pinning down the computing resources needed, a small number of systems would be built and tested, with their network capabilities exercised in conjunction with virtual machines to emulate a network of a great number of monoliths.

MILESTONES
Preliminary Phase (January 2016): 10 hours 
Initiation of the crowdfunding campaign – possibly an EPFL grant. First contact with the users (DHers) and submission of the survey. Establishment of the specifications.
Hardware and Software Development Phase (February – May 2016): 50 hours
Build-up of the monoliths. Programming. Integration of the software. Possible optimizations. In parallel with virtual machine testing.
Purchasing Phase (May 2016): 10 hours 
Purchase of the needed tools/hardware.
System Testing Phase (May 2016): 15 hours
Test of the software distribution functionality on the monoliths.
Launch: June 2016
Eventually, at the end of this project, the monoliths will be a structure that can be integrated into the daily work of DHers. Our deliverable will aim to fulfill and cover all the specifications and expectations gathered through the survey performed in January 2016. As an early estimate of the required budget, around 1000€ may be enough: indeed, four 4 TB storage-grade hard disks cost 700–800€ (a legitimate price, as these disks are better suited to storage in general, made to run continuously for long periods, and consume less energy); the rest would cover the remaining hardware/software purchases.