DIASER beta-2 - Technical Manual v 1.0.3
Damian L Brasher - 01/06/2010

Index

1 Introduction
2 Explanation of the overall design
  2.1 Design philosophy
  2.2 The storage architecture
  2.3 Integrated approach
  2.4 Limitations
  2.5 Why Linux?
3 The package and contents
  3.1 Downloading and unpacking
  3.2 Main source file
  3.3 Configuration files
  3.4 Example backup software configuration
  3.5 Licence
  3.6 Documentation
4 Requirements
  4.1 Hardware
  4.2 Software
  4.3 Skills
5 Primary scripts
  5.1 diaser
  5.2 tab_$.pl
  5.3 hvautoc_$.pl
  5.4 fill_diaser.pl
6 Explanation of features
  6.1 Geographical distribution
  6.2 Security
  6.3 SE Linux and AppArmor
  6.4 Upgrade and modify
  6.5 Filling or loading
  6.6 Non distinct binary volumes
  6.7 Logging
  6.8 Archive retrieval
  6.9 Data and node migration
  6.10 Reporting and monitoring
  6.11 Multiple instances
  6.12 Extending operation
  6.13 Pruning old volumes
  6.14 Time zone compensation and leap years
  6.15 Digital volume check-sum or stamp
  6.16 Complete removal
7 Configuration
  7.1 diaser.conf
  7.2 Number of years of expected operation
  7.3 First year of operation
  7.4 Start time of phases
  7.5 Node IP addresses
  7.6 OpenSSH ports
  7.7 Dry run mode
  7.8 Lowest maximum bandwidth (LMB)
  7.9 Time zone compensation
  7.10 Working diaser account name
  7.11 Time out
  7.12 Home directories
  7.13 Fill start time
  7.14 Volume directory
  7.15 Differential or constant name prefix
  7.16 Collect Full volume or not
  7.17 Collect Full volume on which day
  7.18 Full volume prefix
  7.19 More than one configuration file
8 Installation
9 Command Line Options
  9.1 --help
  9.2 --bandwidth
  9.3 --configure
  9.4 --extend
  9.5 --install
  9.6 --list
  9.7 --lock
  9.8 --logs
  9.9 --migrate
  9.10 --modify
  9.11 --pause
  9.12 --recreate
  9.13 --remove
  9.14 --resume
  9.15 --retrieve
  9.16 --stats
  9.17 --stop
  9.18 --upgrade
  9.19 --version
10 Operation
  10.1 Stop
  10.2 Pause
  10.3 Resume
  10.4 Hard Lock
  10.5 Migrate node
11 The Code
  11.1 Why Perl?
  11.2 Style
  11.3 Modules
  11.4 Error handling
  11.5 Contribute
12 On-line resources
  12.1 Website
  12.2 SourceForge
  12.3 Mailing list
  12.4 DIAP/LTASP and early project memory

APPENDIX
A Tables and calculations
B Glossary of terms

1 Introduction

DIASER is for long term digital archive storage. It securely:

1) Accumulates
2) Replicates
3) Manages

[Figure: fill diaser flow chart]

DIASER has been created to meet the mid-range and below, long term archiving requirements of the SME: a data vault application. Where tape has been deployed in the past, DIASER now offers an alternative solution designed to be more robust and manageable in the long term than simple NAS devices or disk based storage alone.

This manual is designed to assist the systems administrator by providing a detailed technical overview of the system and its component parts, guidance on planning deployment, installation and storage space calculations, an overview of the code base and other available resources.

Cloud based computing has taken off over the last few years. DIASER is an ideal application for cloud computing deployment as well as an archiving framework solution. Once implemented the system is invisible to users but allows them to do more. Cloud computing is a popular term, a useful way of communicating a complex collection of technologies. The use of virtual machines in a distributed environment has many advantages. The problems many people foresee with cloud computing are lock-in, loss of control of data and increased cost of services. DIASER allows an organisation to build private storage clouds using existing resources, as you will see in this technical manual. The result is control over your long term cloud based storage, in terms of administration and resources, as soon as the system is deployed and beyond. This means that data can be migrated when you want to, without penalties from a 3rd party provider.

With security in mind at all times, DIASER is based on a carefully designed, robust storage architecture called LTASP, Long Term Archive Storage Protocol. This means consistency is ensured now and in the future. The design phase involved four years of careful evaluation and testing. DIASER is open source software using the GPL v3 licence model, so users can enjoy the benefits of the open development methodology. Simplicity of design and reuse of code and readily available resources are key to the power of this system. A strong design philosophy has been cultivated and adhered to for the benefit of all users.

DIASER is written by a systems administrator for systems administrators, but the potential benefits to an SME, its IT manager, CEO and committee have been the highest priority throughout all stages of the design process. The DIASER implementation is targeted primarily at education, hence the name Distributed Internet Archive System for Educational Repositories; however the system can be downloaded and deployed by any SME.

DIASER is designed to be extremely future proof. As an Open Source product it minimises the risks associated with vendor lock-in and data retrieval. More features are planned for the future and the most current development road-map can be viewed here: http://diaser.svn.sourceforge.net/viewvc/diaser/ROADMAP_DEV.

2 Explanation of the overall design

2.1 Design philosophy

Archiving and backup is art and science. For me a philosophy has evolved over the years I have been a systems administrator, and I have applied it to the design of LTASP and DIASER:

Maximise: Storage capacity, availability of data, data restoration and recovery speed, scalability, modularity, cross-platform deployment, resilience and robustness.
Minimise: Operating bandwidth overhead, impact of network outages, management overheads, support costs.
Simplify: Development cycle, deployment, data recovery, operation, integration with existing systems.

2.2 The storage architecture

Maintaining archives over a number of years requires organisation. For this reason DIASER builds a set of slots/directories on each node in advance which correspond to dates. The slots are created in advance, rather than generated as required, for a number of reasons.

The system operates across networks, and network connections can have variable performance or be down completely. Creating a year or more of slots (a slot roughly equates to a single tape) upon installation ensures that the directories are named, and therefore dated, correctly. This means that if data is not copied correctly we can identify the failure even without log data. Log data may or may not be created on a node, but empty slots are indicative of copy or network failure.

Computers are not the best time keepers left to their own devices. If the storage structure is created when all nodes are known to be synchronised then the accuracy of the storage structure is ensured. If slots were created on the fly and node time was not synchronised for any reason (BIOS changes, other software changing the time inadvertently and so on), inaccuracies could occur. The structure is human readable too and, simply put, empty slots are easier to read and parse than missing slots.

Storing old archives in a well defined data storage structure is very important. DIASER can therefore be deployed in the past, e.g. 2007 onwards. The system can then be manually filled with old data, like a filing cabinet, and default automatic operation simply continues.

The system is optimised to store a combination of Full and differential volumes: Fulls created at the beginning of the month and Diffs during the month. However this does not preclude storage of constant volume sizes, e.g. the storage of CCTV video footage, but calculations must reflect this kind of storage mode.

The recommended data vault operation will make use of certain directory structures in each month: Full01 and Full02.
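
Because every slot exists in advance, a slot that is still empty after its date has passed points to a failed copy. The sketch below lists the empty slots for one month. It is a minimal illustration only: list_empty_slots is a hypothetical helper, not part of DIASER, and the ROOT/YEAR/mthN/SLOT layout is an assumption drawn from the paths quoted elsewhere in this manual.

```shell
# Hypothetical helper, not part of DIASER.
# Assumed layout: ROOT/YEAR/mthN/SLOT (for example 2009/mth7/Full01).
# The body runs in a subshell "( ... )" so its variables stay local.
list_empty_slots() (
    root=$1 year=$2 month=$3
    # -empty matches directories with no entries (GNU and BSD find).
    find "$root/$year/mth$month" -mindepth 1 -maxdepth 1 -type d -empty | sort
)

# Demo on a throwaway tree so that no real archive is touched.
demo=$(mktemp -d)
mkdir -p "$demo/2009/mth7/d0" "$demo/2009/mth7/d1" "$demo/2009/mth7/Full01"
touch "$demo/2009/mth7/Full01/fullbackup7"   # Full01 now holds a volume
list_empty_slots "$demo" 2009 7              # lists d0 and d1 only
rm -rf "$demo"
```

Slots for dates that have not yet arrived are empty too, so in practice the output would be compared against the current day of the month.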
Full01 will store a Full volume at the beginning of the month and skip d1. Full02 is there for additional redundancy and to cope with the scenario where the current month is the last (this is not default behaviour).

There are two parts to the architecture: that described above and the data transfer mechanism. The data transfers are initiated by an internal structure called the hyper virtual auto-changer, a virtual concept drawn from the mechanical tape changer. The well used tool rsync is a key component of this mechanism and its features are utilised fully.

DIASER installs onto three Linux nodes for optimal data storage resilience. No parity is used; this means complete data can be stored and retrieved even if a single node is isolated from the others. DIASER can be managed from any network enabled workstation with Perl 5.8.8, or from a node if preferred.

The rest of this section can be skipped and is here for the very technically minded. It takes a deeper look at the architecture; also see section 6.5, Filling or loading.

Nodes A and B both contain d0's. This structure allows copy phases to simply and accurately span different days. If data was set to be copied directly from node A d5, then midnight passing (+-1 day) would have to be factored in depending on the point of reference, node$. Filling of DIASER can then occur well in advance, which keeps the copy phases operationally contained and therefore gives greater control over operation, implementation and readability. The filling occurs outside the LMB (lowest maximum bandwidth, see section 7.8) calculations and can be at a much slower rate. This means LMB calculations remain applicable to both phases.

d0 also acts as a buffer: original copies exist if an internal copy fails, and it allows simultaneous copies, i.e. A; d0->d5 and B; d0->d0. Otherwise the second copy would have to wait and begin safely only after completion of the first. d0 can be tested for a successful fill before phases begin.

The concept of node role assists towards an optimised architecture, which differs depending on the node role.
To allow the roles to be changed in practice, and to simplify fail-over implementation, the directory structure is identical on each node, whether it is node A, B or C. The difference between roles is subtle but important:

Role A: Uses the d0 contained in each month, designed to be closest to the original backup volume source. Utilised in phase1 only.
Role B: Utilised in phase1 and phase2. Only accepts data during phases.
Role C: Utilised in phase2 only. Only accepts data during phases.

2.3 Integrated approach

DIASER makes use of existing resources where possible. This results in streamlined software tightly integrated with the POSIX, Linux computing environment. Using Perl for this task ensures GNU tools are used for tasks instead of re-writing functionality unnecessarily. DIASER uses the common Linux home directory environment, cron, OpenSSH and rsync. Perl is installed on most Linux operating systems by default and only the core is required on the storage nodes. This allows for very simple installation and management. By using user space the system is contained and a layer away from its host root environment, which has many positive implications, not least better security and deployment modularity.

DIASER will store backup volumes generated by most backup software products, at least all those that can write volumes to disk, lessening operation, integration and installation overheads. Volumes are defined as resembling a single tape entity.

2.4 Limitations

Storage space is limited by bandwidth. At my reference installation site I spent half an hour with the IT manager to decide the relative importance of the organisation's data. To this end we managed to select about 30% of all data generated on a regular basis and pipe this into DIASER. This practical approach, coupled with compression and, where available, data de-duplication, means that the organisation's critical data is stored using DIASER.

Node A is a single point of failure.
This is the node in network terms closest to the backup server and if it fails, data will cease to transfer. However plans exist to allow node A bypass. Even if node A did prevent data transfer, it is expected the systems administrator has the skills and access to resolve any issues.

2.5 Why Linux?

Linux should not be underestimated for its appropriateness as a storage platform, for many reasons. The cost of obtaining Linux is very low: it is essentially free, both as in libre and to obtain and use, and supported versions can be very good value too. Linux is widely available and has lightweight resource requirements. Licence issues are avoided. Organisations that need flexibility of deployment with low initial purchase costs can have it when they deploy Linux. Linux is extremely robust under most circumstances, e.g. the ext3 file system under normal circumstances does not require regular de-fragmentation, which makes it ideally suited to storage environments. Many of the tools required to enable DIASER are included in standard distributions, even small installations without a GUI or a windowing system. This means DIASER is streamlined, lightweight and does not attempt to needlessly duplicate existing code, e.g. rsync.

3 The package and contents

3.1 Downloading and unpacking

DIASER is currently supplied by anonymous download from SourceForge as a diaser-1.0.$.tar.gz, rpm, dist-tarball or deb installation. rpm dependencies will be automatically installed with yum. Running the Makefile as root will allow installation: make, make install. The deb package still requires extra dependencies. See INSTALL and section 8 of this manual.

3.2 Main source file

diaser - this file unpacks more embedded scripts which are sent to the nodes upon installation, modification and upgrade.

3.3 Configuration files

diaser.conf - this is the main configuration file. See section 7 for configuration guidance. A second configuration file can be created manually for development or second deployments.
Keep your configuration files in separate directories or rename them. If no configuration file is present then the default values set in diaser will be used; this will not lead to successful deployment. Also see section 7.19 for use of more than one configuration file.

3.4 Example backup software configuration

helper_scripts/bacula-dir.conf.extract

To fill DIASER with backup volumes created by backup software you need to name volumes in a certain way. This example configuration comes from the Open Source backup software called Bacula. If you use Bacula you can implement volume creation in an identical fashion. If not, then use this file as a guide. The script generated by the installer and residing on node A is called fill_diaser.pl. As the name suggests it collects volumes generated by your backup software, perhaps stored on a share mounted by node A or directly backed up to node A, and fills DIASER with pre-defined named volumes.

3.5 Licence

This software is licenced under GPL V3 - gpl.txt and fdl-1.2.txt. The website is licenced under fdl-1-2.txt. The manual, DiaserSystem.png and DiaserDocsv1.1.pdf are licenced under the Creative Commons Attribution-Share Alike 2.0 UK: England & Wales Licence.

3.6 Documentation

Located in the directory docs. This includes this technical manual, docs/manual.txt, .html or .pdf, and a diagrammatic overview, docs/overview.png. Importantly, INSTALL contains a quick start guide. More theoretical documentation is available from http://www.diap.org.uk and don't forget to check http://www.diaser.org.uk for up to date project news and other information. A man page is also installed.

4 Requirements

4.1 Hardware

Workstation, 1GHz CPU or above, 500MB RAM and network connection. You can also use a node as the installation platform but you need to ensure all the Perl modules listed below for the workstation are available.
3 x Linux storage nodes (can use VMs) with root access for initial setup. Anything above 1GHz, 32bit or 64bit, with 500MB RAM.
Enough disk space. I'll make all this much simpler to calculate when I have finished the subroutine calculate_lmb; see appendix A, tables and calculations.
LAN or WAN connection between each server and workstation; the 3 machines must be able to, at least notionally, ping one another. Nodes can be connected across a Virtual Private Network if necessary.

4.2 Software

Workstation: minimum Perl v5.8.8 (Perl v5.10.0 is recommended for best performance) with the Perl modules Net::SSH::Perl, Net::SFTP, Getopt::Long, AppConfig, Term::ReadKey and Data::Password. Optional, for the --bandwidth tool: gnuplot v4.2. Install modules as root, e.g.

    ]#yum -y install perl-Net-SSH-Perl

or

    cpan>install Net::SSH::Perl

Automatic module installation occurs when installing using the rpm release.

Nodes: Perl Core (v5.8.8 or above) and File::Find (installed as default with most distributions). SSH server on each node, not necessarily on port 22. Each node must run the services: sshd, crontab, iptables with the ssh port open, ntpd, rsync (non daemon).

4.3 Skills

It is recommended the administrator has at least these skills:

Bash command line - ability to move around directories, create files and directories, set permissions and add and remove user accounts.
Knowledge of SSH logins, a text editor and adding and removing software.
Basic knowledge of rsync and the ability to effectively use scp.
Use of the commands less and cat.
Ability to install Perl modules and check versions.

Less important are some Perl scripting abilities. Basic bash scripting skills may also help.

5 Primary scripts

5.1 diaser

The primary script containing most of the DIASER code. Code embedded within diaser is unpacked and copied to nodes with variables set by the user. For upgrades and configuration changes code is again unpacked and copied over to nodes as required.

5.2 tab_$.pl

There is one for each node and it contains the crontab definitions which trigger the internal diaser data copies managed by the scripts hvautoc_$.pl. The cron job runs every hour, e.g.
    0 * * * * ~/hvautoc_a.pl

and the script reads the local system time, compares it to the user set copy phase and, if there is a match, will initiate data transfer. The script logs to the node, log_$, as does rsync.

5.3 hvautoc_$.pl

Each node has a single hvautoc_$.pl script. This script is triggered every hour and, depending on the times set by the user variables HOUR1 and HOUR2, it initiates the rsync data transfers. If the user modifies variables then these updates can be copied to the nodes by replacing the hvautoc_$.pl scripts.

5.4 fill_diaser.pl

This script resides only on node A. It is responsible for filling the correct slot with data fed into DIASER by the user. The script is called by a cron job set when configuring or modifying DIASER. The script copies the most recently created Full, Differential or constant volume to the DIASER directory, into either Full01 or d0. Aside from the cron job time there are a number of variables that can be user configured, including the volume directory, that is where your backup software stores volumes, and the volume prefix, e.g. fullbackup... for Full volumes. Filling is designed to be as simple as possible. Volumes on your file store are assumed to be read/write by user id: $your_diaser_uid. This flow chart provides a detailed overview of the fill process; everything apart from the node A->B copy check has been implemented:

[Figure: fill diaser flow chart]

fill_diaser.pl automatically clears out the drop off directory ad0 after its contents would normally have been transferred to other slots as specified by the architecture.

6 Explanation of features

6.1 Geographical distribution

Tapes can be moved from site to site and often are. To emulate this ability, distributing data provides geographical redundancy. A simple mirror of a NAS device is one way to achieve this, but spreading data over three nodes can be difficult to manage. DIASER is a self contained wrapper around long term archiving across three nodes.
We believe the extra resilience provided by storing in three geographical locations gives your archives the protection needed for long term planning and data retrieval. Ensuring your archives are safe means a better chance of recovering data when you need it. Being a disk based solution will help render your data more accessible in many scenarios. Planning your installation is important and, as the system may run for years, spending time before deployment will pay off. DIASER is ready for trial and evaluation. Your chosen storage nodes may also be equipped with RAID. This is highly recommended.

6.2 Security

These security precautions have been implemented:

The primary script, diaser, does not store any passwords on file. Passwords are stored in memory temporarily while the script runs.
When a password is requested the entry view is hidden.
New DIASER account passwords are quality checked and a warning is given if they are not secure.
Root passwords are only requested when the system is installed and removed.
DIASER exists and runs in user space.
All network communication is handled by OpenSSH. A unique RSA certificate is generated so the nodes can use password-less logins to transfer data and communicate during normal operation.
Password-less login certificates can be regenerated using the --upgrade switch.
A kind of emergency account lock can be initiated with the switch --lock.
The Perl modules Net::SSH::Perl and Net::SFTP are used for all SSH communications and file transfers initiated by the system.
Rsync uses SSH to transfer data.
It is possible to use a different port to the standard SSH port 22 and to set this individually for each node.
An sha256sum checksum and date stamp file is created as every volume enters DIASER, in a format similar to:

    4865c5bdf3cf64709acd797688db5b337e7c8643 2009/mth7/Full01/fullbackup7 Tue Jul 21 07:10:28 BST 2009

For extra security DIASER can run within a Virtual Private Network.
It is recommended that encrypted partitions are used for DIASER, e.g. when deploying an external USB hard drive. In the commands below /dev/sdb is an externally attached USB2 hard disk drive; replace it with the disk chosen on your system.

    # Create a new partition on the disk
    fdisk /dev/sdb

    # Generate a mapping and LUKS partition
    cryptsetup --verbose --verify-passphrase luksFormat /dev/sdb1
    cryptsetup luksOpen /dev/sdb1 sdb1

    # Format the partition
    mkfs.ext3 -j /dev/mapper/sdb1

    # Mount the partition for the first time
    mount /dev/mapper/sdb1 /mnt/crypt/
    df -h

    # Open and mount the device after reboot or disk removal
    cryptsetup luksOpen /dev/sdb1 sdb1
    mount /dev/mapper/sdb1 /mnt/crypt/

    # Unmount and close
    umount /mnt/crypt/
    cryptsetup luksClose sdb1

6.3 SE Linux and AppArmor

No problems observed during either installation or operation.

6.4 Upgrade and modify

Currently the modify switch, see below, is still under review. For now the upgrade switch sends modifications and upgrades to the nodes. This does not and will not modify the archive storage directory structure. Changes to settings and development improvements can be sent using this option. If you move to a newer version than your previous one then follow these steps:

1) rename your current diaser_rel
2) unpack the download, see section 3.1
3) copy your previous diaser.conf to the new diaser_rel
4) run ]$diaser --upgrade to update your DIASER installation

6.5 Filling or loading

See section 5.4. The initial entry point for data, d0 (node A, directory 0), resides in each monthly segment; there is not a single d0 in the root directory. This lessens the risk of deleting or overwriting archive data that may not, for whatever reason, have been successfully transferred to the other nodes. If the connection to node B fails there will be at least two copies of the file, in d0 and in d30 or whatever the last day of the month happens to be, before another Full is generated and the next month's d0 is cleared and filled. This adds more resilience at little extra cost.
Also, if copies were only set to occur once a month, the copy failed as before, and this was not noticed until after the next copy, last month's data would have been deleted and only a single copy stored.

6.6 Non distinct binary volumes

The volumes which have been described are binary files, like those created by Bacula. Other backup software generates directories which need some processing before they can be collected by DIASER. There are a number of problems to avoid to ensure DIASER operates non-destructively, so instead of manipulating the directories in your data store I suggest you use a script to create tar volumes of the archives you want to be collected. Here is a pseudo code suggestion of how this might be achieved.

    # non distinct binary volume alternative collection
    # run as a cron job independently of DIASER
    sub non_full_binary {
        look for directories, if directories ls
        if($directories) {
            check for a previous tar Full -> if no Full this month then
                tar/shasum/date any directories collected for Full -> Full01 slot
                and name with the chosen Full volumes prefix.
            check for a previous tar Diff -> if Full this month then
                create a tar/shasum/date Diff against it for the day slot,
                name with the chosen Diff volumes prefix.
        }
    }

6.7 Logging

Log files are kept on all nodes and named log_$ where $ is the node: a, b or c. The scripts hvautoc_$.pl, fill_diaser.pl and all rsync transfers log to these files. The log files are created automatically as soon as the system begins operation. All entries contain [diaser_hvautoc_$] or [diaser_fill] where $ is the node: a, b or c.

6.8 Archive retrieval

Either use the simple tool provided using the --retrieve option, which also has additional command line options, or login to nodes directly and use scp. The retrieval tool will walk you through a set of questions then list files for you to pick and transfer. The file will retain its name and be located in the diaser_rel directory.

If using cp, scp, rsync or other native tools:
The directory structure is human readable and matching the required date to directories can be easily achieved, e.g. on node $ the archives stored on date June 25th 2009 can be found in ../diaser/2009/mth6/d26. Navigate to the directory and copy the contents to the required recovery destination. It is assumed you have the tools to extract your data provided by your backup software vendor. It is recommended you also archive any backup catalogues or tools generated and provided with your usual backup software.

6.9 Data and node migration

Node migration can be achieved using the --migrate tool.

6.10 Reporting and monitoring

Bandwidth throughput calculations can be made using the --bandwidth tool. See section 9.2 for more details. This is an example screenshot of the output:

[Figure: example screenshot of the --bandwidth output]

An important set of features, the standard set of which will be implemented for the beta-1 release.

6.11 Multiple instances

Share disk space with other organisations or groups by using a different account name and staggering or alternating the transfer times (phases), or by lowering the LMB, the lowest maximum bandwidth between nodes. See diaser.conf. diaser will allow the use of more than one configuration file. See section 7.19. Also, if more than one pair of phases is required, e.g. a morning session and a night session, then two instances on the same nodes will archive at alternative phase times. If one instance contains Full volumes then the second does not necessarily need to archive these as well, thus saving disk space.

6.12 Extending operation

Operation can be extended. The minimum recommended is two years. You can set DIASER to install to 10 or even 20 years, which means 10-20 years of archive directory structure will be created. Deployment can represent the past if required, then be manually filled with previously generated archive data.

6.13 Pruning old volumes

Not yet implemented. This will allow the user to remove old archives from DIASER, freeing up disk space.

6.14 Time zone compensation and leap years

Time zone compensation allows all the nodes to work together across time zones. The user is asked for the time zone in UTC+(integer): a UTC +/- integer value for each of node A, B and C. If node A is on BST = UTC+1, use 0, as daylight saving is usually automatic on most systems. For three servers in the same time zone use the same offset integer value for each node. The scripts hvautoc_$.pl all contain an algorithm that will ensure proper interpretation of leap year occurrences.

6.15 Digital volume check-sum or stamp

A unique check-sum or stamp and a date stamp are generated as a volume enters DIASER and stored alongside the volume.

6.16 Complete removal

This will completely remove all DIASER components and all archive data stored within the system. Data recovery is not possible after this operation has been performed.

7 Configuration

7.1 diaser.conf

The supplied configuration can be adjusted to suit your deployment requirements. Each parameter is in uppercase and its name must not change. Change the value to the right of each parameter, with a space in between. The default values are there to guide your choice, e.g. NODE_A 0.0.0.0 can be interpreted as NODE_A 192.168.2.1. Use the same case and value type for your chosen values as the defaults.

7.2 Number of years of expected operation

NUM_YEARS

The minimum recommended is 2; the default is 3.

7.3 First year of operation

START_YEAR

This is the year when DIASER begins operation. It would usually be the current year.

7.4 Start time of phases

HOUR1 HOUR2

DIASER operates in two phases: phase one identified by HOUR1 and phase two identified by HOUR2. The phases can be at any time over a 24 hour period. It is assumed that the start time is based on your local timezone, e.g. BST or UTC+1. It is recommended to set the phases to early in the morning to avoid using day time bandwidth resources.
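
The hourly comparison between the local time and HOUR1/HOUR2 can be sketched as follows. This is only an illustration of the idea, not the hvautoc_$.pl code: phase_for_hour is a hypothetical helper, the hours 2 and 4 are example values, and time zone compensation is left out.

```shell
# Hypothetical helper, not DIASER code: decide which phase, if any,
# starts in the given hour. Runs in a subshell so variables stay local.
phase_for_hour() (
    now=$1 hour1=$2 hour2=$3
    # Strip one leading zero so "08" is compared as 8, and map "" back to 0.
    now=${now#0};     now=${now:-0}
    hour1=${hour1#0}; hour1=${hour1:-0}
    hour2=${hour2#0}; hour2=${hour2:-0}
    if [ "$now" -eq "$hour1" ]; then
        echo phase1
    elif [ "$now" -eq "$hour2" ]; then
        echo phase2
    else
        echo idle
    fi
)

# Called once an hour, as the cron job would; date +%H is the local hour (00-23).
phase_for_hour "$(date +%H)" 2 4
```

With HOUR1 set to 2 and HOUR2 set to 4, a run at 02:00 reports phase1, a run at 04:00 reports phase2 and every other hour reports idle.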
Once set, the operation can be reset by sending a new configuration from diaser. Once set, the operation runs at the same time every day. Using two phases optimises the use of resources when transferring internally on a node and between nodes, prevents simultaneous transfers from interfering with each other, and simplifies the management and tracking of transfers.

7.5 Node IP addresses

NODE_A NODE_B NODE_C

7.6 OpenSSH ports

PORT_A PORT_B PORT_C

Change from the default port 22.

7.7 Dry run mode

DRY_RUN

Copies are initiated but no archive data is transferred. This can be used for testing, debugging and trials. It can be toggled at any time and the new setting transferred, as for all settings in this section.

7.8 Lowest maximum bandwidth (LMB)

LOW_MAX_BW

Bandwidth control. Enter the maximum speed in KBytes per second of your slowest network connection between either A->B or B->C or C->B. I recommend you run some test transfers between nodes using scp. Also, don't assume the bandwidth will remain constant throughout the cycle; you may need to run some long term viability tests. This feature will be implemented automatically with the subroutine calculate_lmb(). Adjust the value if you install more than one diaser instance on a single disk or machine. The default is 12500 KBytes per second / 100 Mbits per second.

7.9 Time zone compensation

TZONE_A TZONE_B TZONE_C

For deployments that span different time zones. A UTC +/- integer value for each of node A, B and C; if node A is on BST = UTC+1, use 1.

7.10 Working diaser account name

USER_ACC

Choose a name for your DIASER user accounts. The same name will be used on all three nodes. Limit this to between 5-10 lower case characters for simplicity. I use diasertest, for example.

7.11 Time out

TOUT

The copy timeout used by rsync for transfers. Set it lower than the phase periods.

7.12 Home directories

DIR_A DIR_B DIR_C

Home directory of the diaser account. You may need to adjust this if a large partition is not in the usual home directory place, i.e.
/mnt/big/ will evaluate as /mnt/big/diaser.

7.13 Fill start time

FILL_START_TIME

Time to initiate the daily filling script. This should be set in advance of the DIASER archive transfer phases to ensure DIASER is filled before the phases begin.

7.14 Volume directory

VOLUME_DIR

Location of the volume storage directory, which is where you store backup volumes created by your backup software.

7.15 Differential or constant name prefix

DIFF_CONST_PREFIX

Differential or constant volume name prefix.

7.16 Collect Full volume or not

COLLECT_FULL

Choose whether Full volumes are collected or not, for example if you want to simply collect constant sized volumes, like CCTV footage.

7.17 Collect Full volume on which day

COLLECT_FULL_DAY

Day of the month when Full volumes are collected.

7.18 Full volume prefix

FULL_PREFIX

Full volume name prefix.

7.19 More than one configuration file

It is possible to force diaser to read a particular configuration file by executing

    ]$diaser diaser.conf --opts

The configuration file can be named as the user chooses, e.g.

    ]$diaser my.config --opts

Currently, changes will always be written to diaser.conf in the directory diaser was executed from. The user is free to change the name of the configuration file and read it into diaser as described above. This feature is particularly useful when there is more than one installation being managed from a single user account.

8 Installation

    ]$./diaser --install

Use this as a normal user after you have configured diaser.conf. As each task is completed you will be informed. At the end of installation, one time only, you will need to login from the diaser account on each node to accept the certificates between nodes, like the 1st time you SSH into a box: A->B, A->C, B->A, B->C, C->A and C->B. Afterwards logins between nodes are password-less; this step will allow DIASER to begin work. This step may be automated depending on user feedback.

9 Command Line Options

Please note, not all of these operations have been implemented. Please view the most current development road-map: http://diaser.svn.sourceforge.net/viewvc/diaser/ROADMAP_DEV.
As such, some of these items may change or be removed altogether, and others may be added. Later in the development cycle I plan to extend the command line options so that configuration changes can be set using the diaser command. Run all commands from a prompt as a normal user, i.e.

]$diaser --install

9.1 --help

Display menu and command line options.

DIASER
Usage: diaser_setup.pl
--help              help|-?
--bandwidth         calculate real bandwidth throughput between nodeX-Y
--configure         question driven configuration tool
--extend            extend maximum storage structure date
--install           install
--list              list all volumes in storage
--lock              lock all DIASER node accounts
--logs              condensed log readings from nodes
--migrate           migrate node
--modify [opts]     send modified configuration to nodes either from conf file or command options or both
--pause             pause operation
--recreate          recreate a single node from scratch
--remove            remove from nodes, all data will be lost
--resume            resume operation
--retrieve [opts]   retrieve archive data
--stats             generate statistics
--stop              stop operation
--upgrade           apply upgrades
--version           show version

For more information please use man diaser or the more detailed online manual: http://diaser.org.uk/manual.html

Please send any FEEDBACK to dbrasher@interlinux.co.uk. I'm especially interested in how DIASER may be of use to you now or in the future. Thank you.

9.2 --bandwidth

This option allows you to view the real, not theoretical, data throughput between two of your chosen storage nodes. You will need to have the open source tool gnuplot installed on the system from which you are running this application. The option will attempt to download and compile the binary NPtcp from the NetPIPE utility suite: http://bitspjoule.org/netpipe/. The tool operates over port 5002 and statistics are collected from the sender.

9.3 --configure

Question driven configuration tool for new and existing diaser deployments, with input validation.

9.4 --extend

Extend the maximum storage structure beyond the currently installed year.
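As an illustration of the kind of input validation mentioned under --configure (9.3), the sketch below checks a few values in the formats this manual describes: an hour between 0 and 23, an IP address in the format 0.0.0.0, a boolean 1 or 0, and an account name of 5 to 10 lower case characters. It is a minimal sketch in Python, not DIASER's own code (which is Perl); all function names are hypothetical, and the account name rule is only the manual's recommendation, treated here as a check.

```python
import ipaddress
import re


def valid_hour(value):
    """True if value is a whole hour between 0 and 23 (phase or fill start time)."""
    try:
        return 0 <= int(value) <= 23
    except (TypeError, ValueError):
        return False


def valid_ipv4(value):
    """True if value is a dotted IPv4 address such as 192.168.0.10."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def valid_flag(value):
    """True if value is the boolean 1 (yes) or 0 (no), given as text."""
    return value in ("0", "1")


def valid_account(value):
    """True if value is 5 to 10 lower case letters (assumed rule, see lead-in)."""
    return re.fullmatch(r"[a-z]{5,10}", value) is not None


if __name__ == "__main__":
    # Hypothetical answers a question driven tool might receive.
    answers = {"phase hour": ("23", valid_hour),
               "node address": ("300.1.1.1", valid_ipv4),
               "dry run": ("1", valid_flag),
               "account": ("diasertest", valid_account)}
    for name, (value, check) in answers.items():
        print(name, value, "ok" if check(value) else "rejected")
```

Keeping one small check per setting makes it straightforward to ask the question again when an answer is rejected.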
9.5 --install

Install DIASER. See section 8 Installation above.

9.6 --list

This option lists all volumes stored in DIASER.

9.7 --lock

Lock all DIASER node accounts. The systems administrator will need to reset the passwords for each diaser user account manually.

9.8 --logs

Condensed log readings from nodes.

9.9 --migrate

Migrate a node to a different server.

9.10 --modify

Apply modified settings to the running DIASER on your designated nodes. Any changed settings will also be written to diaser.conf.

9.11 --pause

Pause any currently running data transfers on all nodes. Sends kill -STOP.

9.12 --recreate

Use this in case you need to rebuild a node. You should only need to rebuild a node in the event of a disk failure or other non-recoverable node loss. In all other cases please consider using the --migrate (node) option.

--numyear           years of operation required
--startyear         year to begin storing archives, this can be in the past
--phase1            hour between 0 and 23
--phase2            hour between 0 and 23
--nodea             ip address in format 0.0.0.0
--nodeb             ip address in format 0.0.0.0
--nodec             ip address in format 0.0.0.0
--dryrun            boolean 1(y) or 0(n)
--lmb               lowest maximum bandwidth, KBytes per second
--tzone             [not yet implemented]
--tout              copy time out in seconds
--fillstarttime     time to run DIASER fill operation, hour between 0 and 23
--volumedir         the directory where your backup volumes reside
--diffconstprefix   prefix given to your Differential or constant volumes
--collectfull       are Full volumes to be collected or not, boolean 1(y) or 0(n)
--fullprefix        prefix given to your Full volumes

9.13 --remove

Completely remove DIASER from your previously designated nodes. Please use with caution, as all archive data stored in DIASER will be permanently deleted.

9.14 --resume

Resume paused data transfers. Sends kill -CONT.

9.15 --retrieve

Fetch archived data volumes. A simple tool is provided, which also has additional command line options.
The retrieval tool will walk you through a set of questions, then list files for you to pick and transfer. The file will retain its name and be located in the diaser_rel directory.

--r_year     which year
--r_month    which month
--r_day      which day
--r_full     if not a day, name a full directory - leave as default
--nodea      ip address in format 0.0.0.0
--nodeb      ip address in format 0.0.0.0
--nodec      ip address in format 0.0.0.0
--porta      int
--portb      int
--portc      int
--user_acc   user account name, usually default set previously

9.16 --stats

Displays, for each node and in GiB: disk space, total daily volumes, total full volumes, total data stored and average differential volume size.

9.17 --stop

Discontinue data transfers. Sends kill -9.

9.18 --upgrade

Apply product upgrades to existing nodes with a DIASER installation. Your DIASER account password will be requested.

9.19 --version

Show the current DIASER version and the currently installed Perl version.

10 Operation

10.1 Stop

This option stops DIASER copies currently in operation, until the next set of transfer operations is initiated. This will kill any rsync processes.

10.2 Pause

This option pauses DIASER copies currently in operation, until the resume option is used.

10.3 Resume

This option resumes DIASER copies that have been paused.

10.4 Hard Lock

Lock all DIASER node accounts. This is a security feature that enables an operator with root access to lock all DIASER node accounts immediately. Access from node to node, and hence operation, will resume only after logging in to the nodes as root and re-enabling the DIASER account password.

10.5 Migrate node

Migrate will assist you in moving an existing node from the current machine, server or workstation, to a new one. The new machine may be located anywhere, as long as it satisfies the requirements for DIASER inter-node visibility. The procedure may take anywhere from minutes to hours, depending on the amount of data stored on the existing node and the network bandwidth available.
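The stop, pause and resume operations in 10.1 to 10.3 correspond to the signals named in section 9: kill -9, kill -STOP and kill -CONT. The following is a minimal sketch of that mapping in Python, assuming a POSIX system and that the process ID of the transfer (for example an rsync process) is already known. The names OPERATION_SIGNALS, signal_for and control_transfer are hypothetical and are not part of DIASER.

```python
import os
import signal

# Operation names from this manual mapped to the POSIX signals it mentions.
OPERATION_SIGNALS = {
    "stop": signal.SIGKILL,    # kill -9: terminate the transfer
    "pause": signal.SIGSTOP,   # kill -STOP: suspend the transfer
    "resume": signal.SIGCONT,  # kill -CONT: continue a suspended transfer
}


def signal_for(operation):
    """Return the signal for 'stop', 'pause' or 'resume'."""
    try:
        return OPERATION_SIGNALS[operation]
    except KeyError:
        raise ValueError("unknown operation: %s" % operation)


def control_transfer(pid, operation):
    """Send the signal for the given operation to the process with this pid."""
    os.kill(pid, signal_for(operation))


if __name__ == "__main__":
    for name in ("stop", "pause", "resume"):
        print(name, "->", signal_for(name).name)
```

A paused process keeps its open files and connections, so a resumed copy continues where it left off, provided the copy timeout (TOUT) has not expired in the meantime. A stopped process is terminated and the copy starts again at the next set of transfer operations.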
11 The Code

11.1 Why Perl?

The language is very well suited to Linux and POSIX environments. It is well supported and has good network programming capabilities. Perl is very flexible and allows a simple yet robust coding environment. Its cross platform properties are extremely valuable and ensure the code base is portable. Perl's inherent text parsing abilities are also valuable and set the language apart from many other contenders.

11.2 Style

Style is based as much as possible on the excellent O'Reilly book Perl Best Practices by Damian Conway. A modular approach is used to code DIASER. All subroutines take parameters derived from the configuration mechanisms. Only three global variables are used; the rest are passed directly to subroutines and return values read back.

11.3 Modules

Popular modules are used where possible, and only modules that are shipped with popular Linux distributions. The installer uses a number of modules; the code deployed on nodes only uses File::Find (shipped as default with most distributions) and the core Perl shipped as default by most Linux distributions.

11.4 Error handling

Under review.

11.5 Contribute

Please see http://www.diaser.org.uk/contribute.html. All contributions are received under MIT/X licence terms.

12 Online resources

12.1 Website

http://www.diaser.org.uk

12.2 SourceForge

http://sourceforge.net/projects/diaser

12.3 Mailing list

https://lists.sourceforge.net/lists/listinfo/diaser-devel

12.4 DIAP/LTASP and early project memory

http://www.diap.org.uk

APPENDIX

A Tables and calculations

Bandwidth and capacity lookup table
===================================

BW       Hours (capacity in GB, decimal)
Mbit/s      1      2      3      4      5      6
     1   0.45    0.9   1.35    1.8   2.25    2.7
    10    4.5      9   13.5     18   22.5     27
   100     45     90    135    180    225    270
  1000    450    900   1350   1800   2250   2700

Disk space lookup table
=======================

BW       Month     1xYr      2xYr
Mbit/s
     1   20GiB     240GiB    480GiB
    10   67GiB     804GiB    9.6TiB
   100   542GiB    6.5TiB    78TiB
  1000   5.2TiB    62.4TiB   748.8TiB

For more calculation information please use the --bandwidth tool.
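The figures in the bandwidth and capacity lookup table can be reproduced with a short calculation: Mbit/s multiplied by the number of seconds, divided by 8 bits per byte, expressed in decimal GB. The sketch below shows this in Python, together with the conversion from Mbit/s to the KBytes per second used by LOW_MAX_BW (100 Mbit/s = 12500 KBytes per second). It assumes a sustained transfer rate with no protocol overhead, and the function names are illustrative only, not part of DIASER.

```python
def capacity_gb(mbit_per_s, hours):
    """Decimal GB transferred in `hours` at a sustained `mbit_per_s`."""
    bits = mbit_per_s * 1_000_000 * hours * 3600  # total bits moved
    return bits / 8 / 1_000_000_000               # bits -> bytes -> GB


def mbit_to_kbytes(mbit_per_s):
    """Convert Mbit/s to KBytes per second (decimal), as used by LOW_MAX_BW."""
    return mbit_per_s * 1000 / 8


if __name__ == "__main__":
    # Rebuild the lookup table for 1 to 6 hours at each bandwidth.
    for bw in (1, 10, 100, 1000):
        row = [capacity_gb(bw, h) for h in range(1, 7)]
        print(bw, " ".join("%g" % value for value in row))
    print("LOW_MAX_BW for 100 Mbit/s:", mbit_to_kbytes(100))
```

For example, capacity_gb(1, 1) gives 0.45 and capacity_gb(100, 6) gives 270, matching the first and last columns of the table.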
Include more calculation examples.

B Glossary of terms

Under review