Convert Spreadsheet (Excel or Calc) Data Into DSpace Simple Archive Format Package

Dr. Saiful Amin
Submitted by Saiful on Sat, 12/09/2017 - 16:34

On a freshly set up DSpace instance, one typically wants to upload existing publication records, which are often maintained in spreadsheets such as MS Excel or OpenOffice Calc. I always recommend uploading such records in a batch process. Doing it manually is not only error-prone but also time-consuming and cumbersome.

The recommended way to use the batch process is to convert the spreadsheet records into Simple Archive Format (SAF). DSpace documentation points to two applications:

  1. SAF Builder
  2. saf-archiver

Both of the tools above take CSV files as input, and both can also handle full-text documents. Once you run them as per their instructions, you get a neat SAF package that you can import into DSpace in a single batch.
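For readers who have not seen a SAF package before, the layout can be sketched in a few lines. This is a minimal Python illustration, not the code of either tool mentioned above: each item gets its own directory containing a `dublin_core.xml` file and a `contents` file that lists the item's files one per line. The function name `write_saf_item`, the `item_NNN` directory naming, and the shape of the `metadata` argument are assumptions made for this example.

```python
from pathlib import Path
from xml.etree import ElementTree as ET


def write_saf_item(package_dir, index, metadata, files=()):
    """Write one SAF item directory (hypothetical helper for illustration).

    metadata maps (element, qualifier) pairs to a list of values;
    use "none" as the qualifier for unqualified elements.
    files is a list of paths to copy into the item directory.
    """
    item_dir = Path(package_dir) / f"item_{index:03d}"
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml: one <dcvalue> per metadata value.
    root = ET.Element("dublin_core")
    for (element, qualifier), values in metadata.items():
        for value in values:
            node = ET.SubElement(
                root, "dcvalue", {"element": element, "qualifier": qualifier}
            )
            node.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the full-text files and list their names in the contents file.
    names = []
    for src in files:
        src = Path(src)
        (item_dir / src.name).write_bytes(src.read_bytes())
        names.append(src.name)
    (item_dir / "contents").write_text(
        "".join(name + "\n" for name in names), encoding="utf-8"
    )
    return item_dir
```

Calling `write_saf_item("saf", 1, {("title", "none"): ["Some title"]}, ["paper.pdf"])` would create `saf/item_001/` with the metadata file, a copy of `paper.pdf`, and a one-line `contents` file.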

One valid question to ask at this point would be: why do we need another tool? Well, the short answer is TIMTOWTDI (there's more than one way to do it)! But there is more to it than being quick and dirty.

If your data is cleanly formatted as CSV, you are in luck. However, in a sufficiently large data set I have come across a few situations where some fields contain line breaks. When the file is saved as CSV, such a record ends up spread over multiple lines, and a tool that reads CSV line by line treats each of those lines as a separate record. Getting rid of those pesky line breaks is not only time-consuming but may also be unwanted; the author may want to keep the line breaks in some fields, for example around a math formula in the abstract.
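The line-break problem is easy to reproduce. The sketch below uses Python's standard `csv` module and made-up sample records (not data from any real repository) to show how line-oriented processing splits one record in two. It also shows that a quote-aware parser can keep the record intact, which only helps when every tool in the chain parses the file that way.

```python
import csv
import io

# Hypothetical sample data: the abstract of "Paper A" contains a line break.
rows = [
    ["title", "abstract"],
    ["Paper A", "First line of the abstract\nE = mc^2"],
    ["Paper B", "Single-line abstract"],
]

# Save the rows as CSV text, as a spreadsheet export would.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()

# Line-oriented processing: every physical line becomes a "record".
naive_records = csv_text.splitlines()

# Quote-aware parsing: the embedded line break stays inside its field.
parsed_records = list(csv.reader(io.StringIO(csv_text, newline="")))

print(len(naive_records))   # 4 physical lines for 3 logical rows
print(len(parsed_records))  # 3
```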

So here is my Perl script approach: create SAF packages directly from the spreadsheet, without converting it to CSV. Existing CPAN packages keep the script down to very few lines of code. The script generates not only the SAF packages but also the bash script for loading them into DSpace.
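To give an idea of what such a generated loader script might contain, here is a rough Python sketch (not the Perl script itself). It assumes the standard DSpace item importer options `--add`, `--eperson`, `--collection`, `--source` and `--mapfile`, so check them against the documentation for your DSpace version. The function names, the DSpace path, the e-mail address, the collection handle and the directory names are all hypothetical.

```python
import shlex


def build_import_command(dspace_bin, eperson, collection, source_dir, mapfile):
    """Return one shell command that imports a SAF directory (illustrative only)."""
    args = [
        dspace_bin,
        "import",
        "--add",
        "--eperson=" + eperson,
        "--collection=" + collection,
        "--source=" + source_dir,
        "--mapfile=" + mapfile,
    ]
    # Quote each argument so paths with spaces survive the shell.
    return " ".join(shlex.quote(arg) for arg in args)


def build_import_script(dspace_bin, eperson, batches):
    """Return the text of a bash script with one import command per batch.

    batches is a list of (collection, source_dir, mapfile) tuples.
    """
    lines = ["#!/bin/bash", "set -e"]
    for collection, source_dir, mapfile in batches:
        lines.append(
            build_import_command(dspace_bin, eperson, collection, source_dir, mapfile)
        )
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
script_text = build_import_script(
    "/dspace/bin/dspace",
    "admin@example.org",
    [("123456789/2", "saf/batch1", "saf/batch1.map")],
)
print(script_text)
```

Keeping one map file per batch makes it possible to trace which items an import created if a run has to be undone later.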