Research Data Management: Hydra at the University of Hull

 Visiting University of Hull Library

The University of Hull has refurbished its library.  This is a graphical representation of it.

The University of Hull has refurbished its library at a cost of £28m and very nice it is too.

In the library, there is a lot of space, light and high ceilings.  General computing includes iiyama monitors, which have a reputation (with geeks) as being a little less off-the-shelf than, for example, DELL.  Maybe the PCs are not an afterthought to the service either.  I did not see it but there is an art gallery too.

The library has chairs that you might call privacy booths, with high sides and a high back, where you can sit with a book or, thanks to the convenient power sockets, a tablet.  Eduroam wireless networking is available throughout.  There are student group meeting rooms that are not locked.  There are boardroom style meeting rooms with a mixture of Crestron based AV and old fashioned green leather clad configurable board tables.  The mixture works well, very “library”.

We, Alan Brine and I, were treated to lunch, which was nice and unexpected.  We spent the day with Chris Awre, Head of Information Management, and Richard Green, consultant: Hydra in Hull and related projects.

Hydra and the reason for the visit

Links: Hydra, About Hydra, Introduction to Hydra

We are looking at developing a service around the curation of digital research data.  We already have DORA, which is our open repository for research output: publications and doctoral theses, which is just over 100GB.  We are used to DSpace and could create an open access data repository using it, or even partition DORA somehow and store research data in that, but data has the distinction that it can be re-used, re-purposed and, better, parts of different collections could be mashed up by anyone who discovers the data.  This is what Hydra may deliver.

Before the visit we, in a rush, passed on these questions:

  • People:
    How many staff are involved in running the service?
    What do they do?
    What is the work flow leading up to creating a new collection?
    Has anyone received external training? In what?
  • Hardware:
    What is the size of the archive so far?
    What is the expected growth?
    What are the specifics of the hardware solution?
  • Technical:
    What are the steps related to the software/creating new functionality?
    Are there any yearly costs for support or licensing? PURLs etc.

And while the conversation was freer than the structure above I hope to capture some of the answers here.

The Conversation

The first point I took the time to note was about the company Data Curation Experts, who are important in two respects.  They are very much aware of Hydra but also could kick start any effort because of their knowledge around structured and linked data.  It is all very well to be a good custodian of a data repository but the value is in the factoring of data and its re-use.  That is something we need to implement and where we might have a gap in our knowledge and experience.  We need to understand and replicate as many patterns as possible so that others at the university and beyond can dip into our repository without too much of a learning curve.  Where possible we might make the effort to make the data machine-readable.

While DMU compared EPrints and DSpace and chose DSpace because of its worldwide community, Hull chose Hydra for similar reasons with the added essential that they always wanted to store data collections.  They started looking at Hydra in August 2011 with a developer working on it full time until 2013.  The developer learnt the programming language Ruby from scratch.  Today, Hull library has access to a developer in their equivalent of our ITMS.  We talked about training and discussed the active community.  In terms of training, Data Curation Experts can help but there is also Hydra Europe and Hydra Camp.  We talked about who has been involved with Hydra at Hull and how it works today.  Chris and Richard have been doing research curation at Hull for ten years plus.  The library has a team called Researcher Services which is partly comprised of their cataloguers.

The conversation moved here and there; some of it was about how Hull are doing library services.  I am a little envious of their facilities and use of Blacklight as the catalogue search frontend to their library management system.  Envious too, because they are being trusted to use, even embrace, open source software to provide university wide services.  Their previous exam paper collection is stored using Hydra as are undergraduate and graduate theses.  We do not currently store undergraduate theses but it seems like a bold initiative.  There are 11,000 objects in their repository and we are a smidge behind at 10,500.

We talked about workflow and the woes of the librarians herding academics; some are keen, some do not know about the service and some have the default position, often right, that their work can not go into an open repository.  We talked about advocacy and support in governance from our universities.  At Hull, Hydra is seen by the university as infrastructure rather than an end point for dumping research data.  We talked about tools:

  • The Avalon Media System: a next generation Hydra Head for audio and video delivery
  • Capistrano: a remote server automation and deployment tool written in Ruby
  • Archivematica: a web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content
  • Sufia: a Rails engine for creating a self-deposit institutional repository
  • Worthwhile: a simple institutional repository for Hydra
  • Hydra in a Box: the Digital Public Library of America (DPLA), Stanford University and DuraSpace are partnering to extend the existing Hydra project codebase and its vibrant and growing community to build, bundle, and promote a feature-rich, robust, flexible digital repository that is easy to install, configure, and maintain.

The most important function of the tools above is exemplified by Archivematica.  The tool deconstructs datasets, for example Zip files, and prepares data for long term access.  However, the function is fraught with problems and difficult design decisions.  Take the example of a Microsoft Excel spreadsheet.  The tools to access that spreadsheet may not always be available because of the proprietary nature of the product, but even today we can not reliably convert from Excel to an open format like the OpenOffice spreadsheet format because the conversion tools that exist can lose information held in the Excel spreadsheet.  There is a need for strong governance around what we store and the formats to which we convert non-open data formats.  While talking about Archivematica, Richard showed us a workflow that he is looking at:

A possible workflow for Research Data Management, Copyright Richard Green. Creative Commons : CC-BY-NC-SA

Long term preservation of research data is implemented through the use of Archivematica.  Other data that has been through the system before the introduction of the tool can be sent through again.  Data that is now seen as long-term rather than medium-term can also go through the tool.  The view on what format should be used for long term preservation may change and that is catered for as collections may be re-ingested.

We are starting from scratch, so we might like to see Archivematica or similar used for every collection.  This way researchers will access data collections in a similar way every time.


We could take an easier route than Hydra.  We could provide a repository where we make open access datasets discoverable and downloadable.  This approach would tick a box.  We would, however, have no idea what the data is being used for or how many times it has been used.  We would not know which aspects of the data are being used and what, therefore, is useful or has value.  The use of the data is separated from our systems and we can no longer track it.

We can do better than that for the research community.  Other organisations are looking for insight and value from data over and above its original purpose.  We can provide an infrastructure as a service (IaaS) based around Hydra or similar.  If data is organised in a way that researchers can access facets of it then we will have a better idea of what is being used and how often.  If we are talking about open access to research data as infrastructure then we must be talking about open source software.  If we are to provide an IaaS then anyone using it must be able to use it in a way that is easily reproducible and at as little cost as possible without any lock-in.  Good open source software and the communities around it support this behaviour by default.  Proprietary software is based on the model of selling and reselling.  The business model includes practices such as built-in obsolescence.  This practice and others like it do not fit with open access to long term curation of data.

Another reason to go the IaaS route is the concern around privacy, intellectual property and anonymity.  This may answer the cloud question too.  There will be data that needs to have restricted access and with a need to be archived but sometimes it will be stored with other interesting data without the same restriction.  With data sets broken up and stored in facets we can control who has access to what.  We can control access with a straightforward repository too, but data that could be released would be hidden as part of a restricted blob of data.
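The difference between a faceted store and a single restricted blob can be sketched in a few lines of Python.  This is only an illustration under assumptions: the `Facet` and `Dataset` classes, the facet names and the simple authorised/not authorised flag are all made up here and are not part of Hydra or Fedora Commons.  It shows two things: open facets stay readable when a sibling facet is restricted, and each read is counted per facet so we can see which parts of the data are being used.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Facet:
    """One separately addressable part of a dataset (hypothetical model)."""
    name: str
    restricted: bool
    rows: list


@dataclass
class Dataset:
    """A dataset split into facets, with a per-facet usage counter."""
    title: str
    facets: dict = field(default_factory=dict)
    usage: Counter = field(default_factory=Counter)

    def add(self, facet: Facet) -> None:
        self.facets[facet.name] = facet

    def read(self, facet_name: str, user_is_authorised: bool = False) -> list:
        """Return a facet's rows, refusing restricted facets to other users."""
        facet = self.facets[facet_name]
        if facet.restricted and not user_is_authorised:
            raise PermissionError(f"{facet_name} is restricted")
        self.usage[facet_name] += 1  # record which part of the data was used
        return facet.rows


# Invented example data: one open facet and one restricted facet.
survey = Dataset("Example survey")
survey.add(Facet("responses", restricted=False, rows=[{"q1": 4}, {"q1": 5}]))
survey.add(Facet("contact_details", restricted=True, rows=[{"email": "redacted"}]))
```

Reading `survey.read("responses")` succeeds and is counted, while `survey.read("contact_details")` raises `PermissionError` unless the caller is authorised.  In a single-blob repository the open responses would be locked away with the contact details.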

If we implement the storage of research data (even data from other systems) as IaaS then we can create good practices and governance at the university that researchers and others can use.  We can advocate the use of open formats in order to avoid the problem explained above.  Implementing an RDM IaaS will provide the opportunity to reduce the number of systems we need to understand when working with researchers from induction, through maintaining their work and preservation of it.

I have written about this service being IaaS instead of an end point dump for data.  I have talked about the benefits of open source and open standards in general but what of Hydra?  Is it the right tool?  Back in 2010, the risk would have been considerable.  Ruby on Rails at the time, while exciting, was considered limited.  The thinking was that at some point the developer would reach the wall of what is possible and would have to start writing code at a lower entry point in the Ruby stack.  Hydra has been in use for at least five years.  The community is thriving, Ruby and Rails are thriving and Hydra, as well as being developed as infrastructure, is also being developed as a turnkey solution, for example Hydra in a Box.  Now would certainly be a better time to enter the community than five years ago.


For many years, oblivious to Hydra, I have wanted to work with Fedora Commons.  Hydra uses Fedora Commons.  Fedora Commons allows for the storage of digital collections using Triples to describe relationships.  Triples are used in “Big Data”.  While some of the hoopla around Big Data has died down, an understanding about the need to reliably, with consistency, store data and data about data has come about.  Organisations are re-tooling to include Big Data tools in their infrastructure.  In the same period that Hydra has been maturing, I have learnt that DSpace would run on top of Fedora Commons but missed the opportunity to use it.  Since then I have become aware of Hadoop which promises horizontal scaling and secure storage of data with the added benefit of compute at the node the data sits on.  In looking at Hydra, I have come up with the idea that Hadoop could underpin Fedora Commons which in turn is a key component of Hydra.  These three layers are infrastructure most of the time.  They each expose web services which could be leveraged by programmers and existing tools.  We could sit DSpace on Fedora Commons and then have Hydra sit next to it or replace it.  Tools could sit on any layer to support research or the business functions of the university.  The framework would be very flexible.  DMU is looking at storage solutions at the moment and providers are telling us that they have Hadoop offerings.  Now could be the time.  Hadoop is ten years old and growing rapidly.  If we had Hadoop we might be able to use it for teaching too.  The world, it seems, needs data curation and data scientists.
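For anyone who has not met triples before, here is a minimal sketch of the idea in plain Python.  It is not the Fedora Commons API and the identifiers and predicate names are invented for the example; it only shows that a relationship is stored as (subject, predicate, object) and that questions are answered by matching a pattern against those statements.

```python
# Each triple is (subject, predicate, object).  All identifiers are made up.
triples = [
    ("dataset:42", "hasTitle", "River survey 2014"),
    ("dataset:42", "isMemberOf", "collection:geography"),
    ("file:42-a", "isPartOf", "dataset:42"),
    ("file:42-b", "isPartOf", "dataset:42"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Which files are part of dataset 42?
parts = [s for (s, _, _) in match(triples, predicate="isPartOf", obj="dataset:42")]
print(parts)
```

Because every statement has the same shape, new kinds of relationship can be added without redesigning a schema, which is part of the appeal for collections that will be re-used in ways we can not predict.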

WordPress xmlrpc attack work around

Find the naughty IPs with this code, then block them using the .htaccess file:

$ tail -10000 access_log |grep /xmlrpc.php|awk '{ips[$1]++}END{for (i in ips) print i " " ips[i]}'
68.x.x.52 1
173.x.x.17 1
185.x.x.249 15
117.x.x.46 1
192.x.x.80 1
80.x.x.104 1105
180.x.x.59 1
198.x.x.90 3
192.x.x.130 3
93.x.x.61 1003
192.x.x.244 1
91.x.x.69 1
85.x.x.26 16
198.x.x.192 1
192.x.x.146 1
192.x.x.250 10
80.x.x.229 1092
190.x.x.155 1

Check the http and https logs.  The offenders are obvious.  Add them to the .htaccess blacklist.
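The same count, plus the step of turning the worst offenders into blacklist lines, can be sketched in Python.  Some assumptions to flag: the log is in the usual Apache format with the client IP as the first field, the threshold of 100 hits is an arbitrary choice, the sample addresses are documentation addresses and not real offenders, and the `Deny from` lines are the Apache 2.2 style of access control (Apache 2.4 would normally use `Require not ip` instead).

```python
from collections import Counter


def xmlrpc_hits(log_lines):
    """Count requests for /xmlrpc.php per client IP (first field of each line)."""
    hits = Counter()
    for line in log_lines:
        if "/xmlrpc.php" in line:
            fields = line.split()
            if fields:
                hits[fields[0]] += 1
    return hits


def deny_lines(hits, threshold=100):
    """Build .htaccess 'Deny from' lines for IPs at or above the threshold."""
    return [f"Deny from {ip}" for ip, n in sorted(hits.items()) if n >= threshold]


# Invented sample lines standing in for the tail of access_log.
sample = [
    '203.0.113.5 - - [01/Jan/2014:00:00:01 +0000] "POST /xmlrpc.php HTTP/1.0" 200 370',
    '203.0.113.5 - - [01/Jan/2014:00:00:02 +0000] "POST /xmlrpc.php HTTP/1.0" 200 370',
    '198.51.100.7 - - [01/Jan/2014:00:00:03 +0000] "GET /index.php HTTP/1.1" 200 5120',
    '198.51.100.9 - - [01/Jan/2014:00:00:04 +0000] "POST /xmlrpc.php HTTP/1.0" 200 370',
]
hits = xmlrpc_hits(sample)
print(deny_lines(hits, threshold=2))
```

A threshold keeps the occasional legitimate pingback out of the blacklist; the right number depends on how much of the log is being examined.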

This is a stop gap while we investigate a plugin like Disable XML-RPC Pingback.

Migration of Off the Air Recordings

Current system

The current system consists of eleven servers.  Seven of these are in Gateway House and four are in Kimberlin.  The current solution has two lots of storage, one at each site.  At the time when the system was created, the local network was not reliable enough to assume copying across it would work.  We created two stores, one in each building.  This allowed us to make sure we had a copy of a programme and that a copy would survive a disaster in either building.  Having two copies enabled us to split the load for streaming video: we have a streaming server running from each block of storage.

There are five machines used for the ingest of TV programmes.  The TV Control is responsible for the electronic programme guide, telling servers to record a programme and for starting the copy of the MPEG-TS programme to the storage servers.

The TV Control tells a TV node to record a programme.  The programme is copied to the local storage, the local copy is copied across to the remote store, both copies are checked before the copy on the TV node is deleted.  Finally, the TV Control updates the library website to say a programme is ready for transcoding.
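The copy, check and delete sequence can be sketched as follows.  This is not the code running on the TV Control; the function names are invented, and comparing SHA-256 checksums is an assumption about how "both copies are checked" might be done.  Temporary directories stand in for the TV node and the two stores.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum a file in chunks so large recordings do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_recording(recording: Path, local_store: Path, remote_store: Path) -> bool:
    """Copy node -> local store -> remote store, verify both, then delete the original."""
    local_copy = Path(shutil.copy2(recording, local_store))
    remote_copy = Path(shutil.copy2(local_copy, remote_store))
    expected = sha256(recording)
    if sha256(local_copy) != expected or sha256(remote_copy) != expected:
        return False  # keep the copy on the TV node so the transfer can be retried
    recording.unlink()
    return True


# Demonstration with temporary directories standing in for the node and the stores.
with tempfile.TemporaryDirectory() as tmp:
    node, local, remote = (Path(tmp) / name for name in ("node", "local", "remote"))
    for directory in (node, local, remote):
        directory.mkdir()
    programme = node / "programme.ts"
    programme.write_bytes(b"fake MPEG-TS data")
    ok = archive_recording(programme, local, remote)
    result = (ok, programme.exists(),
              (local / "programme.ts").exists(), (remote / "programme.ts").exists())
print(result)
```

The important design point is the order: the copy on the TV node is only removed once both stored copies have been verified, so a failed transfer never leaves us with no good copy.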

The transcoders check for programmes to transcode, then run scripts which repair errors in the MPEG-TS video, convert it to MPEG-PS and then transcode the programme to two videos, one suitable for play out in lecture theatres/desktops and one suitable for desktops/mobile devices.

The initial proof of concept started recording programmes before 2008.  There are now 4,880 recordings occupying over 24TB using this solution.  This storage encompasses the original recording as well as the resulting transcoded files.

Problems in the current system include running out of disk space occasionally, having to re-tune when channels in Freeview move and not being able to tie the EPG internal to the TV Control to the library website.

Good bits of the current system include being written to suit internal work flows, great quality play out to lecture theatres and being able to record East Midlands programmes.

Box of Broadcasts

Box of Broadcasts has these extra features (we could have been a contender) as of January 2014:

  • the addition of all BBC TV and radio content dating from 2007 (800,000+ programmes)
  • over 10 foreign language channels, including French, German and Italian
  • an extended 30 day recording buffer – more time to record missed programmes
  • a new look website, improved navigation
  • Apple iOS compatibility – watch BoB on handheld devices
  • searchable transcripts
  • links to social media – share what you’re watching online
  • a one-click citation reference, allowing you to cite programmes in your work

All good stuff.

Migrating the university’s current archive and capture system

In some form the current service needs to be migrated to ITMS’s new infrastructure.  There are too many physical parts to it which need maintenance contracts to support them and they exist outside of what ITMS wants for its infrastructure.  The service has been recording since before 2008.  There are over 4,880 recordings with over 24TB of data.  Two thirds of this data are the original recordings.  They are useful to keep because of the shifting standards used in web browsers to display video.  We will soon need to look at converting video to h.265 and/or WebM/VP9.
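To put rough numbers on that split, here is the arithmetic as a small Python sketch.  The figures come straight from the paragraph above and are lower bounds ("over"), and decimal units (1TB = 1000GB) are assumed.

```python
total_tb = 24            # "over 24TB", treated as a lower bound
recordings = 4880        # "over 4,880 recordings"
original_share = 2 / 3   # "two thirds of this data are the original recordings"

originals_tb = total_tb * original_share
transcoded_tb = total_tb - originals_tb
average_gb = total_tb * 1000 / recordings  # decimal units: 1 TB = 1000 GB

print(f"originals: {originals_tb:.0f} TB, transcoded: {transcoded_tb:.0f} TB")
print(f"average per recording: {average_gb:.1f} GB")
```

So roughly 16TB is original recordings and 8TB is transcoded video, with an average of a little under 5GB per programme across the original and its two transcodes.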

Take a look at the diagram:

TV P2V and Project

Diagram showing the current solution and possible split into archive and project

Box of Broadcasts ticks most of the boxes but does not support the need to record East Midlands programmes and might not fit other needs of the library.  The original system can be turned into an archive or it can be turned into an archive that can still accept recordings made locally.  It could be that we separate the two systems.  We could keep the archive and have a separate system that satisfies the need to record local TV.

In the migration of most of the service to the new infrastructure several parts and functions need to be re-factored.  The current service reflected the need for an offsite backup and the nature of the network at the time, while the new infrastructure takes care of this for us.  I would like to fix the smaller video streams so that they play on modern mobile devices.  The original was created to work on iPad first.  At some point iOS was updated and playback became difficult or impossible.  We could save a lot of enterprise grade (think money) disk space if we can conveniently store the original recordings on tape.  The migration on the face of it should be straightforward.

If we implement the project, that is the TV Recordings Update project, we might come up against a lot of unknowns because the service has been working without an update for six years.  I am using the software outside of work and have satellite and terrestrial HD versions of the tuner card that accepts four inputs.  While the new version of the TV recording software will bring new features, including an electronic programme guide API, we don’t know what, in the other tools, will be broken by bringing the service up to date.

To recap then, we can:

  • migrate the archive only.  This should be straightforward
  • migrate the archive and have a separate, provisioned at the desktop, solution for East Midlands recordings
  • migrate the archive and attach an update of the recording system to the archive.

New cloud and dev, the new new.

Looking at Docker

While having a quick look at Docker I happened across a slide show presenting Docker starting with its rapid take up.

A list:

  • Jenkins.  An extendable open source continuous integration server
  • Travis. (From Wikipedia) In software development, Travis CI is a hosted, distributed continuous integration service used to build and test projects hosted at GitHub.
  • Chef. Chef models IT infrastructure and application delivery as code, giving you the power and flexibility to achieve awesomeness.
  • Puppet. Puppet Open Source is a flexible, customizable framework available under the Apache 2.0 license designed to help system administrators automate the many repetitive tasks they regularly perform.
  • Vagrant. Create and configure lightweight, reproducible, and portable development environments.
  • OpenStack. Open source software for building private and public clouds.

But what is this all about?  I’m thinking out loud about transitioning from classic LAMP in a box applications to elastic applications built admin interface first with functions as web services for responsive apps using the likes of Node.js and Create.js.

Updating WordPress to 3.7.1 and then some of its 80+ plugins

I need to update The Commons.  We have been at 3.5.2 for far too long.  With our CELT team, we have decided to update to 3.7.1, because it is a security update, but no further.  This may have the side effect of breaking some plugins.  We will see what we can live with and what fixes, replacements and compromises we have to make along the way.

Upgrade to 3.7.1

Ho hum, as we are not going up to 3.8.1, which would be as simple as clicking on update, I have to manually update using the distributed code.  I followed the instructions.  Before embarking on this journey I had a look for something that would tell me, for our WPMU install, which plugins are activated network wide and which ones are activated on individual sites within the network.  To do this I used ‘WPMU Plugin Stats‘.  I printed this to paper and to PDF so I can tick things off and make notes.

Before doing the update it is important to deactivate all plugins and to run wp-admin/update.php and update the network before enabling them again according to the record I have made.

Here goes, the re-activate…

  • External Group Blogs : bp-groups-externalblogs.php on line 308, bad prepare statement

Updating the plugins…went much better than usual.  This gave me the time to look at our missing LDAP Options page.  This was fixed by following the instructions for WPMU Ldap Authentication.  It also gave me time to tidy up some tables that were not created when we were having server problems.  To fix these I looked for errors in the error logs complaining about not being able to write to tables.  These errors would have the affected blog, a number, as a substring e.g. wp_133_visitor_maps_st.  This script, run with the blog number as its first argument, creates the missing tables:


# Run with the blog number as the first argument, e.g. 133.
# The backticks are escaped so the shell does not treat them as command substitution.
mysql -uroot -p ourblog <<HERE
CREATE TABLE \`wp_$1_visitor_maps_wo\` (
  \`session_id\` varchar(128) NOT NULL DEFAULT '',
  \`ip_address\` varchar(20) NOT NULL DEFAULT '',
  \`user_id\` bigint(20) unsigned NOT NULL DEFAULT '0',
  \`name\` varchar(64) NOT NULL DEFAULT '',
  \`nickname\` varchar(20) DEFAULT NULL,
  \`country_name\` varchar(50) DEFAULT NULL,
  \`country_code\` char(2) DEFAULT NULL,
  \`city_name\` varchar(50) DEFAULT NULL,
  \`state_name\` varchar(50) DEFAULT NULL,
  \`state_code\` char(2) DEFAULT NULL,
  \`latitude\` decimal(10,4) DEFAULT '0.0000',
  \`longitude\` decimal(10,4) DEFAULT '0.0000',
  \`last_page_url\` text NOT NULL,
  \`http_referer\` varchar(255) DEFAULT NULL,
  \`user_agent\` varchar(255) NOT NULL DEFAULT '',
  \`hostname\` varchar(255) DEFAULT NULL,
  \`provider\` varchar(255) DEFAULT NULL,
  \`time_entry\` int(10) unsigned NOT NULL DEFAULT '0',
  \`time_last_click\` int(10) unsigned NOT NULL DEFAULT '0',
  \`num_visits\` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (\`session_id\`),
  KEY \`nickname_time_last_click\` (\`nickname\`,\`time_last_click\`)
);

CREATE TABLE \`wp_$1_visitor_maps_st\` (
  \`type\` varchar(14) NOT NULL DEFAULT '',
  \`count\` mediumint(8) NOT NULL DEFAULT '0',
  \`time\` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (\`type\`)
);
HERE
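Finding which blog numbers need the script run for them can also be automated.  The following is a minimal Python sketch, assuming the table names appear in the error log in the form shown above (wp_133_visitor_maps_st); the sample log lines are invented and real error messages will be worded differently.

```python
import re

# Matches table names such as wp_133_visitor_maps_st and captures the blog number.
TABLE_PATTERN = re.compile(r"wp_(\d+)_visitor_maps_(?:wo|st)")


def affected_blogs(log_lines):
    """Return the sorted, de-duplicated blog numbers mentioned in the log lines."""
    found = set()
    for line in log_lines:
        for number in TABLE_PATTERN.findall(line):
            found.add(int(number))
    return sorted(found)


# Made-up log lines; only the table names follow the pattern from the real logs.
sample_log = [
    "WordPress database error Table 'ourblog.wp_133_visitor_maps_st' doesn't exist",
    "WordPress database error Table 'ourblog.wp_133_visitor_maps_wo' doesn't exist",
    "WordPress database error Table 'ourblog.wp_27_visitor_maps_wo' doesn't exist",
    "PHP Notice: something unrelated",
]
print(affected_blogs(sample_log))
```

Each number in the resulting list would then be passed to the table creation script in turn.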


This will represent a big improvement in the service.  Now to look at some blogs to see if the updates have worked…





Hardening Apache using OpenVAS and RedHat advisories

My institution uses a tool provided by Janet to scan for vulnerabilities in web/servers.  We fix problems as soon as we see them.  I have recently been looking at Apache on an up to date CentOS server.  In order to test my changes I installed the FREE OpenVAS tool.  The install is very straightforward and once I set up the firewall on a test server I could start scanning hosts.

The report was more verbose than the “complaint” report I was looking at.  I understand that tools like this can not always tell if the flaw actually exists but instead take clues emitted from the server e.g. openssh 2.2-v5.  That example gives out the version of the software for which a flaw may exist, but the tool does not know, in this case, that the server is already patched.  In the report, a Common Vulnerabilities and Exposures (CVE) code is given for each “flaw”.  I looked these up to assess the threat, taking RedHat at their word.

When RedHat explains that a CVE is already patched or that it does not apply because of the use of the machine I can override the test in the scan providing a cleaner report next time.

In this specific case, I was looking at the strength of SSL from one of our servers.  Due to OpenVAS, I was led to look at SSL compression, the tokens Apache emits and TRACE/TRACK methods too.
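Two of those findings can be checked from the response headers alone.  The sketch below is an illustration, not what OpenVAS does internally: the function names are made up, and it only inspects header strings that you would collect yourself (for example with a HEAD or OPTIONS request).  On the Apache side the relevant directives are, as far as I know, `ServerTokens Prod` for the version banner, `TraceEnable off` for TRACE and `SSLCompression off` for SSL compression; check the documentation for the Apache version in use.

```python
import re


def leaks_version(server_header: str) -> bool:
    """True if a Server header reveals a version number, e.g. 'Apache/2.2.15 (CentOS)'."""
    return re.search(r"/\d", server_header) is not None


def risky_methods(allow_header: str) -> list:
    """Return TRACE/TRACK if they appear in an Allow header's method list."""
    methods = {m.strip().upper() for m in allow_header.split(",")}
    return sorted(methods & {"TRACE", "TRACK"})


print(leaks_version("Apache/2.2.15 (CentOS)"))
print(leaks_version("Apache"))
print(risky_methods("GET,HEAD,POST,OPTIONS,TRACE"))
```

A bare `Apache` banner gives a scanner much less to guess from, which also reduces the number of false "flaws" to override in the next report.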

A big thumbs up for OpenVAS and RedHat’s CVE database.

CentOS virtual machine template to support LAMP and other applications. Part 1(Updated 2016)

Preparing an install of CentOS to become a VMWare template.

Since we were centralised, and with virtual machine infrastructure being somewhat new to ITMS (though familiar to some), we are homing in on single solutions for systems administration that should provide “wins” in terms of short-order provision of machines for services/applications, development, testing and research/student environments.

We have virtual machine environments in several forms and some colleagues including myself have been preparing for the great day when we converge our infrastructure into two very reliable data centres.  This week my team, of developers, were discussing the old days of installing Windows (not me…) using 20 floppy disks with the occasional bad sector.  A week of that would constitute work and was common practice.  In the GNU/Linux world, and probably Windows which is losing ground in the server room, dev ops are constantly struggling to get away from that and to move to instant provision of machines ready to run services.  It is a problem of scale.  At DMU, at last count there were around 800 servers.  Re-provisioning those by installing one operating system at a time will take a long time.  With some preparation, now, around agreed practices we can speed up the move to the new infrastructure.

Why am I looking at this now?  There is an instant need to create a server with a LAMP stack on it for DMU Global.  That ties in with a need for a WordPress (LAMP) install and a requirement for an LDAP server to support the Library’s OpenAthens LA service.  There are requirements in common and tonnes of choices about the best approach.  The requirements are not simply related to the common software components but relate to disk use/partitioning and security of the servers.  There is also the opportunity to create a template that can be used by other systems administrators in ITMS.  I hope to reduce the amount of work that needs to be done post install so that colleagues can get on with the meat of setting up applications.

Some decisions:

When I managed the web team in the library we insisted on secure keys, passphrases and encrypted sessions: command line and file transfer.  This, pretty much, is a unique practice in the university but passwords are being proven to be a weak form of authentication and I think that secure keys are the way to go.  There is some flexibility in how SSH can be set up.  It is possible to allow passwords to be used by a restricted set of IP addresses.  This is something that we should discuss.  Passwords need to be changed if someone leaves the business.  It might be easier to revoke a secure key.
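As a starting point for that discussion, this is roughly what "keys everywhere, passwords only from a few addresses" looks like in OpenSSH terms.  The sketch renders an sshd_config fragment from a list of networks; the function name and the example networks (documentation address ranges) are made up, while `PubkeyAuthentication`, `PasswordAuthentication` and `Match Address` are standard sshd_config keywords.  A `Match` block applies until the next `Match` or the end of the file, so the fragment belongs at the end of sshd_config.

```python
def sshd_password_policy(trusted_networks):
    """Render sshd_config lines: keys everywhere, passwords only from trusted networks."""
    lines = [
        "PubkeyAuthentication yes",
        "PasswordAuthentication no",
    ]
    if trusted_networks:
        # Match Address takes a comma-separated list of addresses or CIDR ranges.
        lines.append("Match Address " + ",".join(trusted_networks))
        lines.append("    PasswordAuthentication yes")
    return "\n".join(lines) + "\n"


# Hypothetical trusted networks, using documentation address ranges.
print(sshd_password_policy(["192.0.2.0/24", "198.51.100.10"]))
```

With an empty list the result is a keys-only configuration, which is the practice described above.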

For fifteen odd years I have separated out custom software, the application (web server) and data from the OS.  This has the advantage of being able to update the operating system or even change it without touching those local aspects.  A disadvantage is that on the data side those changes need to be reflected.  Another systems administrator would have to know about the changes or have my skills and experience to ‘divine’ how the machine is set up; additionally it is harder for those changes to survive an upgrade.  Of course, another advantage is that hackers and root kits can not rely on the usual assumptions of a default install.  Part 2 of this series will detail those changes but the blog entry will be internal to DMU only.

GNU/Linux is faster when it is paravirtualised.  This is when code is shared between the guest virtual machine and the host.  Simplistically, more of the CPU is used in the traditional sense giving an efficiency that virtualising via hardware can not achieve.  Every solution has its own software for doing this.  The software is installed in the guest.  In our preparation for the Great Convergence we have used KVM, Virtual Box and very recently oVirt.  VMware has its own solution too.  We had hoped to move our virtual machines from their original home to the new VMware infrastructure by exporting and importing but if this paravirtualisation software exists on the machine it may confuse things in the VMware environment.  A LAMP stack will work without the software so I am leaving it out of the install for the template.  It could be added later if, for example, access to a USB dongle is needed.  We are wedded to VMware for a time.  Hopefully, something like oVirt or OpenStack will be considered later on.  The guest tools will help bits spin faster but the administration overhead of moving between environments might be something we want to avoid.  The choice for Windows might be different.

Back to security.  Some favour the use of sudo and multiple accounts with passwords.  I have a feeling I will not win this one.  My small team was very used to being themselves locally but logging in as root to remote machines.  All machines had a small number of accounts and only root had a password.  This is unusual and might have added to our success in that it is not the natural assumption.  Hackers/root kits look for accounts with simple passwords.  We made it impossible to login via any of those accounts.  Of course, we only have to manage the root account on those machines.  Choices like this, local firewall configuration and secure key/passphrases have probably saved us a mountain of trouble as well as increasing the up time of our services.

In our requirements gathering, working with HP, it became obvious that one of our practices will change.  Before virtualisation we had virtual servers in the Apache web server.  Either one IP address and multiple CNAME aliases were used using HTTP 1.1 or multiple IP addresses were used in order to home several web applications with their own sub/domain names.  This feels like a package one administrator would be comfortable with, but multiple administrators or new colleagues would have to use their skills to understand the install, with any problems that might bring.  Virtual machine infrastructure technology makes for very light virtual machines and running more of them, with separated services, makes administration easier.  One meaningful sub/domain, one machine and one configuration.  Separating out services allows us to organise services more easily including housing within hosts and in backing them up.  If we think about hybrid cloud solutions and bursting it makes sense to simplify virtual machines.  We used to have a dedicated IP address for the machine and separate IP addresses for web servers on a machine.  We will assume, for now, that we will have one IP address for the virtual machine and the service running on it.

There is a choice to be made about disk usage or consumption.  PostgreSQL is a better database management system than MySQL.  MySQL is very popular and some software using LAMP only works with MySQL.  I could leave the choice to the next administrator or make both available and rely on the storage solution to take care of duplicated blocks of data across multiple virtual machines.  If the next administrator knows that the software is installed she can configure the machine to use the installed software and skip the download.  I have been using PostgreSQL and MySQL for many years.  The greatest advantage of installing both in the template is that there are some default tunings that can be made to both DBMSs which will be consistent across all virtual machines if I make the changes for the template.  An alternative to this is a wide tree of templates starting with the minimum install linked to many templates: a complete LAMP stack with MySQL, a complete LAMP stack with PostgreSQL, just PostgreSQL, just MySQL, Perl instead of PHP etc.

Backups… I am going to use Amanda for now because I know I can support it for disaster recovery.  I’m sure it will get swapped out later but I do not know how it would be provisioned in the meantime.  Amanda is free and does not need a licence; quicker and cheaper.

On file system encryption.  I am not encrypting anything now.  This setup will allow the data partition to be encrypted.  If we want to encrypt the operating system partition then we need to separate out /boot from /.

Time.  We are using NTP as a belt and braces approach to keeping machines in sync.  VMware could guarantee the time within its own infrastructure but if we teleport a machine to another infrastructure or burst it to a cloud we cannot guarantee the time is the same in the new environment.

We are using rsyslog to record interesting changes to files locally.  I have also set up the virtual machine template to share changes logged to a remote server.  Because this is done for the template every virtual machine created from the template will automatically report to the remote server.  That server will be used to generate reports and warnings and to help investigate any compromises should they occur.

I have created a script that can be run before the template virtual machine is shut down.  This cleans up:

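# Remove SSH host keys so that clones do not share the template's keys
/bin/rm -f /etc/ssh/*key*
# Remove rotated and compressed logs
/bin/rm -f /var/log/*-???????? /var/log/*.gz
/bin/rm -f ~www/p*/logs/*-????????
# Truncate the audit log and the login records
/bin/cat /dev/null > /var/log/audit/audit.log
/bin/cat /dev/null > /var/log/wtmp
# Forget the template's network hardware (persistent udev rules, MAC address, UUID)
/bin/rm -f /etc/udev/rules.d/70*
/bin/sed -i '/^\(HWADDR\|UUID\)=/d' /etc/sysconfig/network-scripts/ifcfg-eth0
# Empty the temporary directories
/bin/rm -rf /tmp/*
/bin/rm -rf /var/tmp/*
# Remove root's shell history
/bin/rm -f ~root/.bash_history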

and before the machine is shut down we should run ‘unset HISTFILE’ to prevent the current session’s history being saved.


  • Need email for the root user (machine and web server) to go somewhere
  • Who should own responsibility for backup of e.g. SQL and disk space?
  • We send syslog events to a remote syslog server
  • Need to adjust logrotate for non-default logs
  • Add webalizer for web statistics later on?

In doing the work I will list parts of the web that have influenced the design:

A year later (update for 2016)

We now have some experience of working with VMware at scale.  We have gained experience in how resources are used and some interesting things have come up.  Backups are interesting.  We have not yet implemented a solution (licensed or free) that will snapshot a MySQL database and the filesystem together so that we have a consistent backup.  We, therefore, spit the SQL out nightly while the application is in maintenance mode and have that backed up by our backup solution.  Some of our services are getting big!  DMU Commons has grown by 50% in the last term.  The previous size represented five years of the service.  Backing up the VM takes, relatively, a long time.  Most VMs are 50GB whereas The Commons is 200GB.  We want to move the users’ content to a central store and mount it by NFS.  This gives us quicker backups and the ability to easily tune the volume size.  But WordPress has the application and the data under the same directory.  We need to engineer the disk layout to support WordPress as best we can.  That is, it needs to make sense to the next tech who is asked to look at it.  We are looking at:

  • /, /tmp, /boot, swap on one volume group, disk, controller
  • /dbms on one volume group, disk, controller to support MySQL and PostgreSQL
  • /usr1 on one volume group, disk, controller for application/data
  • /usr2, possibly, in case the service grows to a size that user data should be moved.

We are currently looking to implement SAP, and SAP recommends separate disk controllers for performance reasons.
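Coming back to the nightly SQL dump mentioned above, the important detail is that the site must leave maintenance mode even when the dump fails.  Here is a minimal Python sketch of that idea, not our actual job.  It assumes WP-CLI is installed and provides the ‘wp maintenance-mode’ command, that mysqldump reads its credentials from an option file, and the database name and paths in the usage note are hypothetical.

```python
import subprocess


def nightly_dump(database, outfile, site_path, run=subprocess.check_call):
    """Dump `database` to `outfile` while the WordPress site is in
    maintenance mode, and always take the site out of it afterwards.

    `run` executes one command given as a list of arguments.  It defaults
    to subprocess.check_call, which raises if the command fails.
    """
    wp = ["wp", "--path=" + site_path, "maintenance-mode"]
    run(wp + ["activate"])
    try:
        # --single-transaction gives a consistent snapshot of InnoDB tables.
        run(["mysqldump", "--single-transaction",
             "--result-file=" + outfile, database])
    finally:
        # Runs whether or not the dump succeeded.
        run(wp + ["deactivate"])
```

A cron job could call nightly_dump("commons", "/dbms/dumps/commons.sql", "/usr1/www") and leave the resulting file for the backup solution to collect.  Passing a different `run` makes it easy to rehearse the sequence without touching the live site.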

The service is being heavily used both by creators and consumers.  We host a web analytics service on the same VM.  Running reports uses lots of RAM and CPU.  We see a need therefore to move the stats service away from the VM.  This is another reason why services should be split one per VM.


Posted in Linux SysAdmin

DORA Regulatory Work Part 2b : Embargo changes

Part 1 is over here. Part 2a is over here.

Embargo in DSpace

To us it seems that the embargo code in DSpace is still being thought about.  It has changed between its first introduction in 1.6.x and 3.1.  We think we understand it and are using it in ‘Simple’ mode.  In ‘Simple’ mode, during the submission, we can add an embargo to a bitstream.  This actually assigns the Anonymous/guest group the right to read a bitstream after a start date, with no end date.

Conversation on the dspace-tech mailing list discusses what embargo metadata should and should not be exposed to programmers because if exposed it would also be available to BadPeople.

Once we got the embargo functions working we discovered that a guest user must click to view a bitstream before they find out that it is restricted.  The message reads:

The file you are attempting to access is a restricted file and requires credentials to view. Please login below to access the file.

We wanted to tell the user that the file is embargoed and that the embargo will be lifted on a certain date before they click on view.  I thought that the outcome of that would be that the guest to the website might not come back, à la ‘this website is under construction’.  I wanted to add the possibility to create a calendar ICS file on the fly so that the guest can have a reminder of the embargo being lifted.

So, how to do that?  I discovered that the DRI /metadata/handle/xxxx/ZZZZ/mets.xml?rightsMDTypes=METSRIGHTS does not include the start date of the rights restriction.  Also, the code xsl:call-template name="display-rights" in item-view.xsl does not handle guest views in a way that would convey the embargo to them.  To fix this, for DMU, I added this code:

--- a/dspace-api/src/main/java/org/dspace/content/crosswalk/
+++ b/dspace-api/src/main/java/org/dspace/content/crosswalk/
@@ -248,7 +248,11 @@ public class METSRightsCrosswalk
            //Translate the DSpace ResourcePolicy into a <Permissions> element
            Element rightsPerm = translatePermissions(policy);
+           Element datesMD = new Element("Dates", METSRights_NS);
+           datesMD.setAttribute("START_DATE", String.valueOf(policy.getStartDate()));
+           datesMD.setAttribute("END_DATE", String.valueOf(policy.getEndDate()));
+           rightsContext.addContent(datesMD);
         }//end for each policy

thus exposing rights:Dates/@START_DATE.  I was able to then display an embargo message based on the start date and user group from the metadata above.  I only display the embargo for this situation.  For any other situation the original display-rights code is run.
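The decision described above is small enough to sketch.  This is a Python illustration of the rule, not the XSL in our theme.  The function name and the message wording are made up, and it assumes the start date has already been parsed into a date and that the guest group is called Anonymous.

```python
from datetime import date


def embargo_message(group, start_date, today):
    """Return an embargo notice for a guest, or None when none applies.

    group      -- name of the group the read policy belongs to
    start_date -- date the read policy starts, or None if there is none
    today      -- the date to compare against
    """
    if group != "Anonymous" or start_date is None:
        return None  # not the embargo case: use the original display-rights
    if start_date <= today:
        return None  # the embargo has already been lifted
    return "This file is embargoed until {}.".format(start_date.isoformat())


print(embargo_message("Anonymous", date(2030, 1, 1), date(2029, 6, 1)))
```

Returning None for every other situation mirrors the fallback to the original display-rights code.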

Now that I had the message and a start date I could work on creating a utility to generate an ICS file containing the handle of the item, the date and an alarm.  Working out how to do this was tricky.  This is how I did it…

Modify the sitemap.xmap for our theme:

--- a/dspace/modules/xmlui/src/main/webapp/themes/dmu2011/sitemap.xmap
+++ b/dspace/modules/xmlui/src/main/webapp/themes/dmu2011/sitemap.xmap
@@ -15,7 +15,6 @@
                        Define global theme variables that are used later in this
                        sitemap. Two variables are typically defined here, the theme's
@@ -29,14 +28,28 @@
-        </map:component-configurations>
+                </map:component-configurations>
+                <!-- Owen -->
+                <!-- ICS Calendar -->
+                <map:pipeline>
+                  <map:match pattern="utils/handle/*/*/calendar.ics">
+                    <map:generate src="xml/utils/ICSCalendar.xml">
+                    </map:generate>
+                    <map:transform src="lib/xsl/utils/ICSCalendar.xsl">
+                      <!-- map:parameter name="use-request-parameters" value="true"/ -->
+                      <map:parameter name="startDate" value="{request-param:startDate}"/>
+                      <map:parameter name="handle"    value="{1}/{2}"/>
+                    </map:transform>
+                    <map:serialize type="text" mime-type="text/calendar"/>
+                  </map:match>
+                </map:pipeline>
+                <!-- /Owen -->
                        <!-- Allow the browser to cache static content for an hour -->
                        <map:parameter name="expires" value="access plus 1 hours"/>
             <!-- handle static js and css -->
             <map:match pattern="themes/*/**.js">
                     <map:read type="ConcatenationReader" src="{2}.js">

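This creates a pipeline that generates a text file with the correct mime-type ‘text/calendar’ for URLs matching the pattern utils/handle/*/*/calendar.ics, with the embargo start date passed as the startDate request parameter.  The output follows the VCALENDAR v2.0 specification.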

The dmu2011/lib/xsl/utils/ICSCalendar.xsl code looks like this:

<xsl:stylesheet xmlns:i18n=""
        xmlns:xsl="" version="1.0"
        exclude-result-prefixes="util exdate i18n xsl confman jstring">

  <xsl:param name="startDate" select="$startDate"/>
  <xsl:param name="handle"    select="$handle"/>

  <!-- xsl:value-of select="exdate:date-time()"/ -->

  <xsl:variable name="isoDateStart" select="concat(jstring:replaceAll($startDate, '-', ''), 'T100000')"/>
  <xsl:variable name="isoDateEnd"   select="concat(jstring:replaceAll($startDate, '-', ''), 'T103000')"/>

  <xsl:template match="*">BEGIN:VCALENDAR
UID:<xsl:value-of select="$handle"/>
DTSTART;TZID=Europe/London:<xsl:value-of select="$isoDateStart"/>
DTEND;TZID=Europe/London:<xsl:value-of select="$isoDateEnd"/>
DESCRIPTION:<xsl:value-of select="$handle"/>
DTSTAMP:<xsl:value-of select="$isoDateStart"/>
SUMMARY:De Montfort University Research Archive
DESCRIPTION:Embargo has expired


There are two things to note about this.  The simple one is that it includes an alarm.  That is a design decision that may change.  The second is that I have deliberately left out what the item is.  The guest will have to visit the website to be reminded.  It is likely that this will change.  At the moment the code works.  Titles and descriptions can be long and might contain characters that might break the VCALENDAR format.
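On the worry about long titles and awkward characters: the iCalendar specification (RFC 5545) defines how to escape text values and how to fold long lines, so a title could be included safely.  The following is a minimal Python sketch of that, not the XSL above.  The function names, the PRODID, the 15 minute alarm and the sample handle are assumptions for illustration, and it uses floating local times for 10:00 to 10:30 rather than TZID=Europe/London so that no VTIMEZONE block is needed.

```python
from datetime import date, datetime, timezone


def escape_text(value):
    """Escape a string for an iCalendar TEXT value (RFC 5545, 3.3.11)."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\r\n", "\\n")
                 .replace("\n", "\\n"))


def fold(line, limit=75):
    """Fold a content line so no physical line exceeds `limit` octets."""
    chunks, current, size = [], "", 0
    for ch in line:
        width = len(ch.encode("utf-8"))
        if size + width > limit:
            chunks.append(current)
            current, size = " ", 1  # continuation lines start with a space
        current += ch
        size += width
    chunks.append(current)
    return "\r\n".join(chunks)


def embargo_ics(handle, start, stamp, title=None):
    """Build a reminder for the day an embargo is lifted.

    handle -- item handle, used as the UID and in the description
    start  -- date the embargo is lifted
    stamp  -- timezone-aware datetime at which the file is generated
    title  -- optional item title, escaped before use
    """
    day = start.strftime("%Y%m%d")
    description = "Embargo has expired for " + handle
    if title:
        description += ": " + title
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//embargo reminder//EN",  # hypothetical identifier
        "BEGIN:VEVENT",
        "UID:" + escape_text(handle),
        "DTSTAMP:" + stamp.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
        "DTSTART:" + day + "T100000",
        "DTEND:" + day + "T103000",
        "SUMMARY:De Montfort University Research Archive",
        "DESCRIPTION:" + escape_text(description),
        "BEGIN:VALARM",
        "ACTION:DISPLAY",
        "DESCRIPTION:Embargo has expired",
        "TRIGGER:-PT15M",
        "END:VALARM",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(fold(line) for line in lines) + "\r\n"


print(embargo_ics("2086/1234", date(2030, 1, 1),
                  datetime.now(timezone.utc), title="Fish; chips, and peas"))
```

Escaping and folding are the two places where a free-text title would otherwise break the file, which is why they are separate functions that could be reused for other fields.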

As per usual I need to tidy up the code, adding the correct internationalisation.  Messages will change too when we have settled on what we think is the correct language.

Posted in DORA, DSpace, Library

DORA Regulatory Work Part 2a : Mandatory fields

Part 1 is over here.

We added two new fields to our metadata:

  • dc.funder Body funding the work
  • dc.projectid Identification for the work

These are added to DSpace/dspace/config/input-forms.xml.

         <hint>Enter the name of the funder.</hint>

         <label>Project Identification</label>
         <hint>Enter project identification the box below.</hint>

on page 2 of the submission form.

I still need to add the drop down for usual funders.

Posted in DORA, DSpace, Library