IMDI batch converter guide

Introduction

The metadata-converter (tab2imdi.py) is a script that takes a spreadsheet saved as a tab-separated text file (the metadata) together with a template (an IMDI-file used to determine what goes where in the IMDI XML-structure) and creates one IMDI-file from each row in the spreadsheet.

It is run in a terminal window (there is no graphical interface) and should work on any operating system on which Python 3 can be installed (see Requirements below).

Some previous knowledge of the IMDI-standard and tools such as Arbil is assumed.

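Requirements

  • Python 3
  • tab2imdi.py
  • A spreadsheet editor that can export UTF-8 encoded, tab-separated text
  • Arbil (for creating the IMDI-template)

Validation (optional) also requires:

  • lxml
  • IMDI_3.0.xsd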

Quick Workflow Overview

  • Put your metadata in a spreadsheet, with headers in row 1
  • Export as tab-separated values, UTF-8 encoded
  • Create an IMDI-template in Arbil
  • Save and export the IMDI-template
  • Collect spreadsheet, template and IMDI_3.0.xsd in a folder
  • Create an output folder (default name: “imdi”)
  • Run the converter with the necessary flags

Preparing your spreadsheet

Requirements: Spreadsheet editor

Use a Unicode font that covers the characters you need, e.g. IPA symbols. Also, make sure that your spreadsheet editor is capable of exporting to a UTF-8 encoded plain text file with tab-separated values.

NOTE: By design, any information included in an IMDI-file can be searched and viewed by anyone browsing the corpus tree online. This means you might want to include the location, but not necessarily your note-to-self comments or sensitive information.

With this in mind, try not to mix sensitive information and non-sensitive information in the same cell of the spreadsheet. Instead, create a separate column for the sensitive information. You will then have the option to exclude the sensitive column when you create a template in Arbil. In this way you can keep sensitive information in the spreadsheet for reference while excluding it from the IMDI-files you upload to the server.

Each row in the spreadsheet corresponds to one metadata/IMDI-file. Whether an IMDI-file describes a set of data files or a single resource is up to you (e.g. both sides of a digitized cassette could sometimes be covered well enough by the information in one IMDI-file). There are no requirements as to what kinds of information need to be included in the metadata, but in terms of formatting your spreadsheet there are a few things to keep in mind:

  • The first row should only contain *headers*. These label the type of metadata contained in each column, e.g., date of recording, language, see Fig. 1. These headers will also be passed on to the converter during IMDI-creation. Therefore, do NOT use spaces in the headers. If needed, use underscores instead. To make the headers easy to spot, preferably use upper case, e.g. FILE_ID in spreadsheet (FIGURE REF).
  • You also need to create a column with unique IDs that the converter will use as file names for the IMDI-files, e.g. the FILE_ID column in (FIGURE REF). If you already have unique IDs for your data files, you could put those into this column for easier pairing of metadata and data. Make sure that these are unique, as your file system will not accept two or more files with identical names in the same directory.
  • Rows 2 and on contain your metadata. For some types of information the IMDI-standard requires ISO-formats, e.g. date is YYYY-MM-DD (FIGURE REF). Otherwise, type as you normally would for continuous text, such as for longer content descriptions. Refer to the IMDI-specification to see which fields are restricted. You can also test values in Arbil, where fields that require ISO-formats or follow other criteria (e.g. boolean true/false) turn red if a value is entered incorrectly.

FIGURE

Make sure that every column that contains information has a header in row 1. When all the metadata has been entered, export the spreadsheet to a UTF-8 encoded tab-separated plain-text file.

NOTE: When you use “Save as…” in order to save a tab-separated file in LibreOffice, make sure that “Character set:” is set to “Unicode (UTF-8)” and that “Field delimiter:” is set to “{Tab}” in the export window that follows after selecting the .csv-format and pressing OK. Otherwise the converter will fail.
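The formatting rules above can be checked before running the converter. The following is a minimal sketch in Python, not part of tab2imdi.py: the function names check_metadata and check_file are made up for this illustration, and it only looks for missing headers, spaces in headers and duplicate IDs.

```python
import csv
import io
from collections import Counter


def check_metadata(text, id_column):
    """Return a list of problems found in tab-separated metadata text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    problems = []
    headers = rows[0]
    for number, header in enumerate(headers, start=1):
        if not header.strip():
            problems.append(f"column {number} has no header")
        elif " " in header:
            problems.append(f"header {header!r} contains a space")
    if id_column not in headers:
        problems.append(f"ID column {id_column!r} not found")
        return problems
    index = headers.index(id_column)
    ids = [row[index] for row in rows[1:] if len(row) > index]
    # Duplicate IDs would give two IMDI-files the same file name
    for value, count in Counter(ids).items():
        if count > 1:
            problems.append(f"ID {value!r} occurs {count} times")
    return problems


def check_file(path, id_column):
    """Read an exported file and check it; path and id_column are up to you."""
    # Decoding raises UnicodeDecodeError if the export is not UTF-8
    with open(path, encoding="utf-8", newline="") as handle:
        return check_metadata(handle.read(), id_column)
```

Calling check_file("kjg_narrative.csv", "FILE_ID") would return an empty list if none of these problems are found. It does not check the content of the cells.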

Creating the IMDI-template

The template is an actual IMDI-file that you prepare in Arbil, just as if you were creating all your metadata files this way. The difference is that instead of entering metadata, you enter the headers from your spreadsheet in the appropriate fields in Arbil. These will act as variables for the converter.

Each row in your spreadsheet (from row 2 onward) corresponds to an IMDI-file. You can choose to include as little or as much of your spreadsheet as you want. Only the columns whose headers you enter in the template will be transferred to the IMDI-files.

The headers in your spreadsheet will have to be enclosed in “%%” in order for the converter to include them, i.e. “LOCATION” in your spreadsheet becomes “%%LOCATION%%” in Arbil (see ARBIL FIGURE REF).

When you are finished, export the template as an IMDI-file, e.g. template.imdi (right-click session/branch and click “Export”).

Refer to Arbil’s documentation and the IMDI-specification for detailed information.

FIGURE: headings become variables

NOTE: For languages, the IMDI-standard expects the user to specify which ISO standard is being used by prefixing it to the language code, e.g., if using ISO639-3, “eng” for “English” becomes “ISO639-3:eng”.

TIP: If the column containing the three-letter ISO code “eng” in your spreadsheet has the header “LANGUAGE_ISO”, the corresponding field in your template in Arbil could be entered as “ISO639-3:%%LANGUAGE_ISO%%”. After running the converter, the generated IMDI-files should all have the language code entered correctly as “ISO639-3:eng”.
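To illustrate how the %%-variables work, here is a small Python sketch of the substitution idea. It is not the converter's actual code: fill_template is a hypothetical helper, the XML fragment is simplified, and leaving unknown variables in place is an assumption made for this example.

```python
import re
from xml.sax.saxutils import escape

# A variable is a header name enclosed in %%, e.g. %%LOCATION%%
PLACEHOLDER = re.compile(r"%%([^%\s]+)%%")


def fill_template(template_text, row):
    """Replace each %%HEADER%% with the value from one spreadsheet row (a dict)."""
    def substitute(match):
        header = match.group(1)
        if header not in row:
            # Assumption: unknown headers are left as they are
            return match.group(0)
        # Escape characters such as & and < so the result stays valid XML
        return escape(row[header])
    return PLACEHOLDER.sub(substitute, template_text)


template = "<Id>ISO639-3:%%LANGUAGE_ISO%%</Id><Name>%%LANGUAGE_NAME%%</Name>"
row = {"LANGUAGE_ISO": "eng", "LANGUAGE_NAME": "English"}
print(fill_template(template, row))
# prints <Id>ISO639-3:eng</Id><Name>English</Name>
```

Text around a variable, such as the “ISO639-3:” prefix, is kept unchanged, which is why the tip above works.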

Running the converter

Requirements: Python 3, tab2imdi.py, metadata file, lxml (optional), IMDI_3.0.xsd (optional)

Note that quotation marks around the commands should be excluded when typed.

Python 3 and lxml

Python 3 can be installed in a few different ways. If you are unsure which to choose, download the latest installation package for your platform from http://www.python.org (on Windows, make sure to check “Add to PATH” during the installation). This package includes the utility pip, which can be used to install the lxml module.

After the installation has finished, try the following:

  • If you want validation, type pip3 install lxml and wait for the installation to finish. Depending on platform and previous Python-installations, the command could also be pip install lxml. Linux users might opt for a pre-compiled package via their distribution's package manager instead. See http://lxml.de for details.
  • Type python3 (depending on platform and previous Python-installations the command could also be python), followed by enter. This should return some version information and a >>> prompt, indicating that you are in the Python interpreter environment.
  • Still in the Python interpreter, type import lxml, followed by enter. If lxml has been correctly installed, the >>> prompt will return on the next line with no further messages (i.e. the package could be imported with no issues).
  • Exit the Python interpreter by pressing ctrl+d, alternatively type quit() and press enter.
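The same check can be done without typing into the interpreter. The sketch below uses only the Python standard library; module_available is a name chosen for this example.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this Python installation can find the named module."""
    return importlib.util.find_spec(name) is not None


print("Python version:", sys.version.split()[0])
print("lxml installed:", module_available("lxml"))
```

Save it under any file name (for instance check_env.py, a made-up name) and run it with python3 check_env.py. If the last line reports False, validation will not be available until lxml is installed.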

Why validate?

Validating the generated IMDI-files at the conversion stage can save a great deal of time and troubleshooting. LAMUS will do this after upload anyway, but will not accept problematic IMDI-files (e.g. a date written as 95/09/23 in the “Date” field means the IMDI-file will be rejected). It is usually easier to find the exact problem when using the converter for validation, since it will return the problematic line within the IMDI-file in an error message. Having passed validation pre-upload ensures LAMUS will accept the IMDI-files.
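Date problems such as the one above can also be caught in the spreadsheet data before conversion. The following Python sketch checks only the full YYYY-MM-DD form mentioned in this guide. The IMDI schema may accept other values as well, so treat this as a quick pre-check and not as a replacement for validation. The name is_iso_date is made up for this example.

```python
from datetime import datetime


def is_iso_date(value):
    """Return True if value is a valid calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts unpadded forms such as 1995-9-23,
    # so compare against the zero-padded round trip
    return parsed.strftime("%Y-%m-%d") == value


for candidate in ["1995-09-23", "95/09/23", "1995-9-23"]:
    print(candidate, is_iso_date(candidate))
```

Only the first of the three candidates passes.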

Workflow

  • Collect all the necessary files and folders in a working folder:
    • tab2imdi.py
    • Spreadsheet saved as plain text file with tab-separated values (note that LibreOffice will use the extension .csv regardless of separation method. The converter does not care, as long as the file contains tab-separated, UTF-8 encoded text.)
    • IMDI-template (e.g. template.imdi)
    • IMDI_3.0.xsd (optional, required for validation)
  • In your working folder, create a folder named “imdi” for the generated IMDI-files.
  • Run the converter (see the section on the converter syntax below)

Your IMDI-files are now ready for upload.

NOTE: For those who feel a bit in the dark when using a command line interface, everything can be prepared outside of a terminal window (e.g. collect the files and create the required folders in the graphical interface before running the converter). If you use the same settings (‘flags’) every time you run the converter, you can also copy the command the first time you run it and save it in a text file, ready to paste into the command line later. Just make sure you run the converter from the correct folder.

Converter syntax

The syntax breaks down as follows:

python3 tab2imdi.py -o NAME -d DIR --validate --xml-schema-file IMDI_3.0.xsd TEMPLATE.imdi METADATA.csv

  • -o NAME: The spreadsheet column used for the IMDI-file names
  • -d DIR: The output directory
  • --validate --xml-schema-file IMDI_3.0.xsd: Validation using the XML-schema IMDI_3.0.xsd (spelled as in the examples below)
  • TEMPLATE.imdi: IMDI template
  • METADATA.csv: Metadata file

Example without validation:

python3 tab2imdi.py -o IMDI_ID -d kjg_narrative -s SKIP_EXPORT TEMPLATE.imdi kjg_narrative.csv

Example with validation:

python3 tab2imdi.py -o IMDI_ID -d kjg_narrative -s SKIP_EXPORT --validate --xml-schema-file IMDI_3.0.xsd TEMPLATE.imdi kjg_narrative.csv

Breakdown of the examples:

  • -o IMDI_ID: Instructs the converter to use a column titled “IMDI_ID” in the original spreadsheet for the IMDI-file name.
  • -d kjg_narrative: Instructs the converter to use an output folder called “kjg_narrative” for the IMDI-files. It can be called anything, but it has to be created before running the converter.
  • -s SKIP_EXPORT (optional) This is for skipping rows if needed for any reason. In the example, the converter checks the cells in a column called “SKIP_EXPORT”. If a cell in this column contains any text at all, the corresponding row will not generate an IMDI-file.
  • --validate --xml-schema-file IMDI_3.0.xsd (optional) Validates the generated IMDI-files using the XML-schema “IMDI_3.0.xsd”. This ensures that they conform to the IMDI standard so that LAMUS will accept them.
  • TEMPLATE.imdi Instructs the converter to use an imdi-file called “TEMPLATE.imdi” as an xml-template. This is the file you prepared in Arbil earlier.
  • kjg_narrative.csv This is your metadata, saved in UTF-8 as a tab-separated plain-text file.
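The effect of the -s flag can be pictured with the following Python sketch. It is an illustration of the behaviour described above, not the converter's own code. The function name rows_to_export is made up, and treating a cell that contains only spaces as empty is an assumption.

```python
import csv
import io


def rows_to_export(text, skip_column=None):
    """Yield each metadata row as a dict, leaving out rows flagged in skip_column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        # Any text in the skip column means the row is skipped
        # (assumption: a cell with only whitespace counts as empty)
        if skip_column and (row.get(skip_column) or "").strip():
            continue
        yield row


sample = (
    "IMDI_ID\tDATE\tSKIP_EXPORT\n"
    "kjg_001\t1995-09-23\t\n"
    "kjg_002\t1995-09-24\tnot ready\n"
    "kjg_003\t1995-09-25\t\n"
)
print([row["IMDI_ID"] for row in rows_to_export(sample, "SKIP_EXPORT")])
# prints ['kjg_001', 'kjg_003']
```

The row for kjg_002 is left out because its SKIP_EXPORT cell contains text. What the text says does not matter.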

Troubleshooting

If the converter fails and reports an error message, check what it says. There are a few common mistakes that are easily fixed.

In your spreadsheet check for:

  • Mistyped column headers (check for spaces, typos)
  • Columns that contain information but have no headers in the first row.
  • Information, such as dates, that does not use the expected ISO format.

In your template (using Arbil), check for:

  • Column headers that are not properly enclosed in %%, e.g. %DATE%, rather than the correct %%DATE%%.
  • Mistyped headers that consequently do not correspond to any header in the spreadsheet.
  • ISO codes that are not entered correctly. ISO for language needs to be specified, e.g. prefixed by “ISO639-3:” (see the language note, last in the IMDI-template section).
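The first two template checks can be automated by opening the exported template as plain text. This Python sketch is a rough aid only: check_template is a hypothetical helper, and it will also flag any ordinary percent sign in the template (e.g. “50%”) as suspicious.

```python
import re

# A well-formed variable: a header name enclosed in %% on both sides
WELL_FORMED = re.compile(r"%%([^%\s]+)%%")


def check_template(template_text, headers):
    """Return (unknown, leftover).

    unknown:  variables in the template with no matching spreadsheet header
    leftover: remaining text containing %, likely a badly enclosed variable
    """
    names = set(WELL_FORMED.findall(template_text))
    unknown = sorted(names - set(headers))
    # Remove the well-formed variables, then look for any % that is left
    remainder = WELL_FORMED.sub("", template_text)
    leftover = re.findall(r"%+[^%\s<]*%*", remainder)
    return unknown, leftover


template = "<Date>%DATE%</Date><Name>%%LOCATON%%</Name><Title>%%TITLE%%</Title>"
headers = ["DATE", "LOCATION", "TITLE"]
print(check_template(template, headers))
# prints (['LOCATON'], ['%DATE%'])
```

Here “LOCATON” is reported because it matches no header, and “%DATE%” because it is enclosed in single percent signs.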