Research Data Management: Hydra at the University of Hull

Visiting University of Hull Library

The University of Hull has refurbished its library.  This is a graphical representation of it.

The University of Hull has refurbished its library at a cost of £28m and very nice it is too.

In the library, there is a lot of space, light and high ceilings.  General computing includes iiyama monitors, which have a reputation (with geeks) as being a little less off-the-shelf than, for example, DELL.  Maybe the PCs are not an afterthought to the service either.  I did not see it but there is an art gallery too.

The library has chairs that you might call privacy booths, with high sides and a high back, where you can sit with a book or a tablet thanks to the convenient power sockets.  Eduroam wireless networking exists throughout.  There are student group meeting rooms that are not locked.  There are boardroom style meeting rooms with a mixture of Crestron based AV and old fashioned green leather clad configurable board tables.  The mixture works well, very “library”.

We, Alan Brine and I, were treated to lunch, which was nice and unexpected.  We spent the day with Chris Awre, Head of Information Management, and Richard Green, consultant: Hydra in Hull and related projects.

Hydra and the reason for the visit

Further reading: Hydra; About Hydra; Introduction to Hydra

We are looking at developing a service around the curation of digital research data.  We already have DORA, our open repository for research output (publications and doctoral theses), which holds just over 100GB.  We are used to DSpace and could create an open access data repository using it, or even partition DORA somehow and store research data in that, but data has the distinction that it can be re-used, re-purposed and, better, parts of different collections could be mashed together by anyone who discovers the data.  This is what Hydra may deliver.

Before the visit we, in a rush, passed on these questions:

  • People:
    How many staff are involved in running the service?
    What do they do?
    What is the work flow leading up to creating a new collection?
    Has anyone received external training? In what?
  • Hardware:
    What is the size of the archive so far?
    What is the expected growth?
    What are the specifics of the hardware solution?
  • Technical:
    What are the steps related to the software/creating new functionality?
    Are there any yearly costs for support or licensing? PURLs etc.

And while the conversation was freer than the structure above I hope to capture some of the answers here.

The Conversation

The first point I took the time to note was about the company Data Curation Experts, who are important in two respects.  They are very much aware of Hydra, and they could also kick start any effort because of their knowledge around structured and linked data.  It is all very well to be a good custodian of a data repository but the value is in the factoring of data and its re-use.  That is something we need to implement and where we might have a gap in our knowledge and experience.  We need to understand and replicate as many patterns as possible so that others at the university and beyond can dip into our repository without too much of a learning curve.  Where possible we might make the effort to make the data machine-readable.

While DMU compared EPrints and DSpace and chose DSpace because of its worldwide community, Hull chose Hydra for similar reasons with the added essential that they always wanted to store data collections.  They started looking at Hydra in August 2011 with a developer working on it full time until 2013.  The developer learnt the programming language Ruby from scratch.  Today, Hull library has access to a developer in their equivalent of our ITMS.  We talked about training and discussed the active community.  In terms of training, Data Curation Experts can help but there is also Hydra Europe and Hydra Camp.  We talked about who has been involved with Hydra at Hull and how it works today.  Chris and Richard have been doing research curation at Hull for ten years plus.  The library has a team called Researcher Services which is partly comprised of their cataloguers.

The conversation moved here and there; some of it was about how Hull are doing library services.  I am a little envious of their facilities and their use of Blacklight as the catalogue search frontend to their library management system.  Envious too, because they are being trusted to use, even embrace, open source software to provide university wide services.  Their previous exam paper collection is stored using Hydra, as are undergraduate and graduate theses.  We do not currently store undergraduate theses but it seems like a bold initiative.  There are 11,000 objects in their repository and we are a smidge behind at 10,500.

We talked about workflow and the woes of the librarians herding academics; some are keen, some do not know about the service and some have the default position, often right, that their work cannot go into an open repository.  We talked about advocacy and support in governance from our universities.  At Hull, Hydra is seen by the university as infrastructure rather than an end point for dumping research data.  We talked about tools:

  • The Avalon Media System: a next generation Hydra Head for audio and video delivery
  • Capistrano: a remote server automation and deployment tool written in Ruby
  • Archivematica: a web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content
  • Sufia: a Rails engine for creating a self-deposit institutional repository
  • Worthwhile: a simple institutional repository for Hydra
  • Hydra in a Box: the Digital Public Library of America (DPLA), Stanford University and DuraSpace are partnering to extend the existing Hydra project codebase and its vibrant and growing community to build, bundle, and promote a feature-rich, robust, flexible digital repository that is easy to install, configure, and maintain

The most important function of the tools above is exemplified by Archivematica.  The tool deconstructs datasets, for example Zip files, and prepares the data for long term access.  However, the function is fraught with problems and difficult design decisions.  Take the example of a Microsoft Excel spreadsheet.  The tools to access that spreadsheet may not always be available because of the proprietary nature of the product, but even today we cannot safely convert from Excel to an open format like the OpenOffice spreadsheet format, because the conversion tools that exist can lose information held in the Excel spreadsheet.  There is a need for strong governance around what we store and which formats we convert non-open data formats to.  While talking about Archivematica, Richard showed us a workflow that he is looking at:

A possible workflow for Research Data Management, Copyright Richard Green. Creative Commons : CC-BY-NC-SA

Long term preservation of research data is implemented through the use of Archivematica.  Other data that has been through the system before the introduction of the tool can be sent through again.  Data that is now seen as long-term rather than medium-term can also go through the tool.  The view on what format should be used for a long term preservation may change and that is catered for as collections may be re-ingested.

As we are starting from scratch, we might like to see Archivematica or similar used for every collection.  This way researchers will access data collections in a similar way every time.
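To make the "deconstruct a dataset" step a little more concrete, here is a minimal sketch in Python.  It is not Archivematica and does not use its API; it only shows the idea of opening a Zip deposit and flagging which files are already in an open format and which would need a governance decision before conversion.  The two format lists and the function names (classify_member, inventory, build_sample_zip) are assumptions made up for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List

# Hypothetical policy tables: a real service would agree these through governance.
OPEN_FORMATS = {".csv", ".txt", ".ods", ".xml", ".json"}
PROPRIETARY_FORMATS = {".xls", ".xlsx", ".doc", ".docx"}


def classify_member(name: str) -> str:
    """Label one file name as 'open', 'proprietary' or 'unknown' by its extension."""
    suffix = PurePosixPath(name).suffix.lower()
    if suffix in OPEN_FORMATS:
        return "open"
    if suffix in PROPRIETARY_FORMATS:
        return "proprietary"
    return "unknown"


def inventory(zip_bytes: bytes) -> Dict[str, List[str]]:
    """List the files inside a Zip deposit, grouped by format category."""
    report: Dict[str, List[str]] = {"open": [], "proprietary": [], "unknown": []}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no data of their own
            report[classify_member(info.filename)].append(info.filename)
    return report


def build_sample_zip() -> bytes:
    """Create a small in-memory Zip so the sketch runs without any real deposit."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("readme.txt", "Survey data")
        archive.writestr("results/survey.xlsx", b"placeholder bytes")
        archive.writestr("results/survey.csv", "id,score\n1,42\n")
        archive.writestr("notes.bin", b"\x00\x01")
    return buffer.getvalue()


if __name__ == "__main__":
    for category, names in inventory(build_sample_zip()).items():
        print(category, names)
```

Files that land in the "proprietary" or "unknown" groups are the ones where someone has to decide whether to convert, keep the original, or both.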

Conclusions

We could take an easier route than Hydra.  We could provide a repository where we make open access datasets discoverable and downloadable.  This approach would tick a box.  We would, however, have no idea what the data is being used for or how many times it has been used.  We would not know which aspects of the data are being used and what, therefore, is useful or has value.  The use of the data is separated from our systems and we can no longer track it.

We can do better than that for the research community.  Other organisations are looking for insight and value from data over and above its original purpose.  We can provide an infrastructure as a service (IaaS) based around Hydra or similar.  If data is organised in a way that researchers can access facets of it then we will have a better idea of what is being used and how often.  If we are talking about open access to research data as infrastructure then we must be talking about open source software.  If we are to provide an IaaS then anyone using it must be able to use it in a way that is easily reproducible and at as little cost as possible without any lock-in.  Good open source software and the communities around it support this behaviour by default.  Proprietary software is based on the model of selling and reselling.  The business model includes practices such as built-in obsolescence.  This practice and others like it do not fit with open access to long term curation of data.
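The claim that facet-level access tells us what is used, and how often, can be shown with a few lines of Python.  This is only a sketch and not Hydra code: the class FacetUsageLog and the collection and facet names are invented for illustration, and a real service would record this in its own logs or database.

```python
from collections import Counter
from typing import List, Tuple

FacetKey = Tuple[str, str]  # (collection name, facet name)


class FacetUsageLog:
    """Counts how often each facet of each collection is requested."""

    def __init__(self) -> None:
        self._hits: Counter = Counter()

    def record(self, collection: str, facet: str) -> None:
        """Note one request for a facet."""
        self._hits[(collection, facet)] += 1

    def most_used(self, n: int = 3) -> List[Tuple[FacetKey, int]]:
        """Return the n most requested facets with their counts."""
        return self._hits.most_common(n)

    def unused(self, known_facets: List[FacetKey]) -> List[FacetKey]:
        """Return the known facets that nobody has requested yet."""
        return [key for key in known_facets if self._hits[key] == 0]


if __name__ == "__main__":
    log = FacetUsageLog()
    log.record("survey-2015", "responses")
    log.record("survey-2015", "responses")
    log.record("survey-2015", "codebook")
    print(log.most_used(1))
    print(log.unused([("survey-2015", "responses"), ("survey-2015", "raw-audio")]))
```

With a download-only repository the only thing we could count is the whole file; here the count is per facet, which is the extra insight argued for above.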

Another reason to go the IaaS route is the concern around privacy, intellectual property and anonymity.  This may answer the cloud question too.  There will be data that needs to have restricted access and needs to be archived, but sometimes it will be stored with other interesting data without the same restriction.  With data sets broken up and stored in facets we can control who has access to what.  We can control access with a straightforward repository too, but data that could be released would be hidden as part of a restricted blob of data.
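The difference between facet-level control and a single restricted blob can be sketched as follows.  This is an illustration under stated assumptions, not how Hydra or Fedora Commons implement access control: the Facet and Dataset classes, the restricted flag and the single yes/no authorisation check are all simplifications invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Facet:
    name: str
    restricted: bool  # True if only authorised users may see this facet


@dataclass(frozen=True)
class Dataset:
    title: str
    facets: Tuple[Facet, ...]


def visible_facets(dataset: Dataset, user_is_authorised: bool) -> List[str]:
    """Facet-level control: open facets are always visible."""
    return [
        facet.name
        for facet in dataset.facets
        if user_is_authorised or not facet.restricted
    ]


def visible_as_blob(dataset: Dataset, user_is_authorised: bool) -> List[str]:
    """Blob-level control: one restricted facet hides the whole dataset."""
    has_restricted = any(facet.restricted for facet in dataset.facets)
    if user_is_authorised or not has_restricted:
        return [facet.name for facet in dataset.facets]
    return []


if __name__ == "__main__":
    interviews = Dataset(
        title="Interview study",
        facets=(
            Facet("anonymised-transcripts", restricted=False),
            Facet("participant-contact-details", restricted=True),
            Facet("coding-scheme", restricted=False),
        ),
    )
    print(visible_facets(interviews, user_is_authorised=False))
    print(visible_as_blob(interviews, user_is_authorised=False))
```

For an unauthorised user the first call still returns the two open facets, while the second returns nothing, which is the "hidden as part of a restricted blob" problem.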

If we implement the storage of research data (even data from other systems) as IaaS then we can create good practices and governance at the university that researchers and others can use.  We can advocate the use of open formats in order to avoid the problem explained above.  Implementing an RDM IaaS will provide the opportunity to reduce the number of systems we need to understand when working with researchers, from induction, through maintaining their work, to preserving it.

I have written about this service being IaaS instead of an end point dump for data, and I have talked about the benefits of open source and open standards in general, but what of Hydra?  Is it the right tool?  Back in 2010, the risk would have been considerable.  Ruby on Rails at the time, while exciting, was considered limited.  The thinking was that at some point the developer would reach the wall of what is possible and would have to start writing code at a lower entry point in the Ruby stack.  Hydra has been in use for at least five years.  The community is thriving, Ruby and Rails are thriving, and Hydra, as well as being developed as infrastructure, is also being developed as a turnkey solution, for example Hydra in a Box.  Now would certainly be a better time to enter the community than five years ago.

Postscript

For many years, oblivious to Hydra, I have wanted to work with Fedora Commons.  Hydra uses Fedora Commons.  Fedora Commons allows for the storage of digital collections using triples to describe relationships.  Triples are used in “Big Data”.  While some of the hoopla around Big Data has died down, an understanding of the need to store data, and data about data, reliably and consistently has come about.  Organisations are re-tooling to include Big Data tools in their infrastructure.  In the same period that Hydra has been maturing, I have learnt that DSpace would run on top of Fedora Commons but missed the opportunity to use it.  Since then I have become aware of Hadoop, which promises horizontal scaling and secure storage of data with the added benefit of compute at the node the data sits on.  In looking at Hydra, I have come up with the idea that Hadoop could underpin Fedora Commons, which in turn is a key component of Hydra.  These three layers are infrastructure most of the time.  They each expose web services which could be leveraged by programmers and existing tools.  We could sit DSpace on Fedora Commons and then have Hydra sit next to it or replace it.  Tools could sit on any layer to support research or the business functions of the university.  The framework would be very flexible.  DMU is looking at storage solutions at the moment and providers are telling us that they have Hadoop offerings.  Now could be the time.  Hadoop is ten years old and growing rapidly.  If we had Hadoop we might be able to use it for teaching too.  The world, it seems, needs data curation and data scientists.
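For anyone who has not met triples before, the idea is simply a statement made of a subject, a predicate and an object.  The sketch below uses plain Python and no triple store; the identifiers ("thesis:42", "isMemberOf" and so on) are invented for illustration and are not taken from Fedora Commons.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every part that was given."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical statements about objects in a repository.
STATEMENTS = [
    Triple("thesis:42", "isMemberOf", "collection:theses"),
    Triple("dataset:7", "isMemberOf", "collection:research-data"),
    Triple("dataset:7", "supports", "thesis:42"),
    Triple("dataset:7", "hasFormat", "text/csv"),
]

if __name__ == "__main__":
    # Everything we know about dataset:7
    for statement in match(STATEMENTS, subject="dataset:7"):
        print(statement)
    # Which objects support thesis:42?
    print([t.subject for t in match(STATEMENTS, predicate="supports", obj="thesis:42")])
```

Because every relationship is stored in the same three-part shape, new kinds of relationship can be added without redesigning the store, which is part of the appeal for data about data.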

About c3iq

Opensource, Linux, Unix, Fish, Family