Why collect metadata?

Metadata is descriptive information that helps people to understand, use and manage records. In digitisation projects, metadata can be used to:

  • find and use digital images
  • link images to the business process they document
  • demonstrate that images are accurate and reliable renditions of the original paper records
  • document the digitisation process itself
  • document formats and dependencies to help manage images over time.

Images without appropriate metadata will quickly become useless. They will be impossible to find, view or migrate to new technology when this inevitably becomes necessary. [1]

Failing to identify and collect suitable metadata may prevent your organisation from reaping the business benefits of a digitisation project.

Where will metadata come from?

It is likely that your organisation will already have good metadata that can be automatically applied to all of your digital images.

Optical Character Recognition (OCR) technology offers greater possibilities for automatic metadata capture. Automatic capture of key fields might be possible by writing scripts, especially when the original paper records use a standard format or template. [2]

For example:

The Department of Education and Communities used document definition forms to automate a large amount of their data collection. See Case Study: Department of Education and Communities pilot digitisation of HR records.

Housing NSW did not capture image level metadata, but still managed to automate much of their metadata capture. See Case Study: Housing NSW – Outsourcing the digitisation of client files.

Note: OCR may not be suitable for some types of back-capture digitisation projects, e.g. if the records to be digitised contain handwriting.
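
As a minimal sketch of scripted capture from a templated form, assume OCR has already produced plain text and that the form carries labelled lines. The labels, field names and sample text below are hypothetical and are not taken from either case study.

```python
import re

# Hypothetical labels on a standard paper template, mapped to metadata fields.
FIELD_PATTERNS = {
    "employee_number": re.compile(r"^Employee number:\s*(\S+)\s*$", re.MULTILINE),
    "date_of_creation": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE),
}

def extract_key_fields(ocr_text):
    """Return the fields found in OCR text; fields not found map to None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields

# Hypothetical OCR output from one scanned form.
sample = "Leave application\nEmployee number: 004512\nDate: 2009-03-17\n"
print(extract_key_fields(sample))
```

Fields that come back as None would still need manual entry or review, which is one reason templated records suit this approach better than free-form or handwritten ones.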

If your organisation has an electronic document and records management system (EDRMS), it may be possible to integrate digitisation software with it so that most of the metadata needed for access and management purposes is captured automatically.

Digital images may also be able to inherit some metadata from business systems they are linked to.

For example:

You may use OCR technology to extract metadata from the digital images and import it (usually through an XML schema) for use in a business system. If you do this you will need to map the fields in the current system to the metadata to be collected. [3]
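
The field mapping described above might look something like the following sketch. The element names, the mapping and the sample values are all assumptions for illustration; a real import would follow the XML schema defined by the receiving business system.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: field names produced by OCR extraction mapped to
# the element names expected by the receiving business system.
FIELD_MAP = {
    "employee_number": "StaffId",
    "date_of_creation": "DateCreated",
}

def to_import_xml(extracted):
    """Build an XML record containing only the mapped, non-empty fields."""
    record = ET.Element("Record")
    for source_name, target_name in FIELD_MAP.items():
        value = extracted.get(source_name)
        if value is not None:
            ET.SubElement(record, target_name).text = value
    return ET.tostring(record, encoding="unicode")

xml_text = to_import_xml({"employee_number": "004512", "date_of_creation": "2009-03-17"})
print(xml_text)
```

Keeping the mapping in one table makes it easier to review against the business system's fields before any import is attempted.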

For further information and different models for managing records and metadata created by business systems, see the Guidelines and functional requirements for records in business systems.

Your organisation should try to automate the capture of metadata wherever possible. Manual collection of metadata should be a last resort as it is costly and can lead to a lack of attention to detail and poor quality collection.

With back-capture digitisation projects some manual data entry may be unavoidable. This is costly in terms of time and resources and should not be underestimated. If records are required as State archives, State Records may require specific and more detailed metadata to be captured as part of the digitisation and transfer process. Contact State Records and be very careful in defining exactly what metadata is essential.

Consider metadata early in your project

Your organisation should determine all of the individual pieces of metadata (properties and values) that need to be captured as early as possible.

It helps to know your metadata needs prior to liaising with vendors over digitisation equipment purchases. Then you can determine whether the equipment can facilitate automatic metadata capture and get specific technical advice on how to achieve this.

The metadata generated during digitisation will also usually need to be imported into your corporate EDRMS or a specific business system, along with the digital images. An early understanding of your digitisation metadata needs will help with this import. It will also help you to define what metadata can be inherited from or automatically generated by your EDRMS or business systems, and what will need to be applied during the digitisation process.

If you are intending to transfer original paper records to State Records as State archives after digitisation, it is essential that you contact State Records to discuss what metadata they will require.

For example:

An organisation conducted a back-capture digitisation project with the intention of transferring the original paper records to State Records. They created a database where metadata was recorded. However, they did not discuss their metadata requirements with State Records first. When the time came to transfer, they found that they could not extract the required metadata from the database to generate a consignment list. In addition, there was some metadata, e.g. end date, which was not collected as part of the digitisation project, but was necessary for transfer.

Determine what metadata you need

Each back-capture digitisation project has different aims and may require different metadata. Consider the aims and drivers for your project to determine what you need.

Good metadata is a requirement of digitisation, as of all other recordkeeping projects, because it is essential to the ongoing use and management of digital data.

A range of metadata is automatically generated by digitisation software. This usually consists of automatically generated numerical title strings (such as ‘doc20101115155012.pdf’), often based on digitisation sequencing and date data. In determining what metadata you need, you should look at any auto-generated metadata provided by your system and assess whether it actually meets your business needs.
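
To assess what an auto-generated title actually contains, a short script can help. The sketch below assumes that the 14 digits in a name like the one above encode year, month, day, hour, minute and second. This is a guess about the scanner's convention and should be checked against your own software.

```python
import re
from datetime import datetime

def parse_auto_title(name):
    """Read a timestamp out of an auto-generated name, or return None."""
    # Assumed pattern: 'doc' + 14 digits (YYYYMMDDHHMMSS) + '.pdf'.
    match = re.fullmatch(r"doc(\d{14})\.pdf", name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        # The digits did not form a valid date and time.
        return None

scanned_at = parse_auto_title("doc20101115155012.pdf")
print(scanned_at)
```

Even when the parse succeeds, the result is at most a scan time. It says nothing about the title, creator or creation date of the original record, so it is unlikely to meet business needs on its own.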

Unique identifier

A unique identifier helps to distinguish a record from other records. This identifier can be at various levels of aggregation or all levels, depending on what suits your organisation.

Your organisation may decide to have a unique identifier for every digital image within a file, and also a unique identifier for a file.

For example:

A digital image may have the identifier ‘D10/2009’ while the file it is attached to has the identifier ‘10/0252’.

If the digital image is saved into a recordkeeping system, the system will usually automatically assign a unique identifier at the image level. The image will also inherit a file identifier when it is attached to a file.

Some business systems may also be able to automatically generate identifiers.

Title

Title is one of the most significant metadata elements for retrieval, so you should consider carefully what metadata is required here.

Again, titles can be applied at various levels of aggregation.

You can have file titles and also titles for digital images within files.

For example:

Your file could be called ‘Occupational Health and Safety - Committees’ and the image name may be ‘Minutes 2008-02-24’.

Naming conventions

You should consider the use of naming conventions at either or both the file and the image level. These should work together to facilitate retrieval rather than contain duplicate information.

Standard, well devised and rigorously applied naming conventions can facilitate sharing of information. Conversely, inconsistent naming of files and images can make locating files and images problematic, leading to frustrating searches and wasted time, and may result in information being unavailable when it is needed.

File and image names should be meaningful, because a descriptive name acts as metadata in its own right. They may reflect the existing names of the equivalent original paper files and documents, or you can design other conventions to meet your needs.

If non-descriptive file or image names are to be used, e.g. a sequential numbering string, the files and images must be associated with metadata stored elsewhere which will identify the file or image.[4]

Large scale digitisation projects may be able to use machine-generated names and rely on a database for sophisticated searching and retrieval of associated metadata. [5]

Note: This approach relies on very robust connections between the imaged records and the controlling database and depends on these connections being maintained over time. This can be costly and complex.

As part of your metadata design process, you should determine whether it is more cost-effective, and better for the business, to apply meaningful title metadata to an image when it is created rather than rely on separately stored data.

Existing classification tools and metadata automation tools may assist in automatically generating titles or components of titles.

The following recommendations for file and image names may be considered to help to promote searching and interoperability: [6]

In general, file and image names should:

  • be unique
  • be consistently structured
  • include the use of leading zeros to facilitate sorting in numerical order (applies when a numerical scheme is used)
  • avoid special characters (e.g. tabs or symbols), including spaces (use underscores as an alternative) as they can cause problems across operating platforms.

Metadata embedded in file names (such as scan date, page number etc.) should be recorded in another location in addition to the image name. This provides a safety net for moving images across systems in the future, in the event they have to be renamed. In particular, sequencing information and major structural divisions of multi-part objects should be explicitly recorded in the structural metadata and not only embedded in image names. [7]
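
The recommendations above can be sketched as a small naming helper. The prefix, the four-digit width and the extra metadata fields are illustrative assumptions rather than a prescribed convention.

```python
import re

def build_image_name(prefix, sequence, extension, width=4):
    """Compose a name such as 'ohs_committee_minutes_0007.tif'."""
    # Replace runs of anything other than letters and digits with an underscore.
    safe_prefix = re.sub(r"[^A-Za-z0-9]+", "_", prefix).strip("_").lower()
    # Leading zeros keep names in numerical order when sorted as text.
    return f"{safe_prefix}_{sequence:0{width}d}.{extension}"

names = [build_image_name("OHS Committee - Minutes", n, "tif") for n in (10, 2, 1)]
print(sorted(names))

# Record the same details outside the name so they survive a rename.
# The scan date here is a made-up value for illustration.
separate_record = {"name": names[0], "sequence": 10, "scan_date": "2010-11-15"}
```

Without the leading zeros, a plain text sort would place image 10 before image 2. The separately stored record is the safety net described above.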

Multiple pages within an image

In some cases, multiple pages may be captured within one digital image. Metadata added to the image title (or in another metadata field) can help to identify where this has happened.

For example:

An 80 page document may be captured in four digital images, each with 20 pages. In this case your organisation will need to add metadata about what pages are included within each image and how the four images relate to each other.
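
The page ranges in the example above can be worked out mechanically. This sketch assumes pages are numbered from 1 and split evenly except for a possibly shorter final image; the function name is hypothetical.

```python
def page_ranges(total_pages, pages_per_image):
    """Return (image_number, first_page, last_page) for each image."""
    if pages_per_image < 1:
        raise ValueError("pages_per_image must be at least 1")
    ranges = []
    first = 1
    image_number = 1
    while first <= total_pages:
        # The final image may hold fewer pages than the others.
        last = min(first + pages_per_image - 1, total_pages)
        ranges.append((image_number, first, last))
        first = last + 1
        image_number += 1
    return ranges

ranges = page_ranges(80, 20)
for image_number, first, last in ranges:
    print(f"image {image_number} of {len(ranges)}: pages {first}-{last}")
```

Recording both the page range and the "image n of m" relationship lets a user reassemble the document in order even if the images are later renamed.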

Versions and derivatives

You may create multiple versions of a digital image before arriving at a satisfactory output. These may need to be temporarily distinguished from each other with metadata (e.g. with version numbers). Once a final version has been reached it should be saved as the official record and its version number need not be retained. The other versions can be deleted using Normal Administrative Practice.

If you create derivatives of a digital image (versions at lesser quality for different uses) and intend to keep these for future use, you will need to consider how to distinguish these through metadata.

For example:

You may choose to retain the same title, but add a qualifier at the end of the title to show its intended use. A typical example is adding 'p' for published version or 't' for thumbnail after the image title to clarify how the images differ. Qualifiers have an advantage over entirely new names as they keep all associated versions linked. [8]
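
A qualifier scheme of this kind might be sketched as follows. The underscore separator and the function names are assumptions; only the 'p' and 't' qualifiers come from the example above.

```python
# Qualifiers from the example above; any others would be organisation-specific.
QUALIFIERS = {"published": "p", "thumbnail": "t"}

def derivative_title(master_title, use):
    """Append the qualifier for the intended use to the master title."""
    return f"{master_title}_{QUALIFIERS[use]}"

def master_title_of(title):
    """Strip a known qualifier so derivatives can be grouped with their master."""
    base, separator, suffix = title.rpartition("_")
    if separator and suffix in QUALIFIERS.values():
        return base
    return title

thumbnail = derivative_title("minutes_2008-02-24", "thumbnail")
print(thumbnail, "->", master_title_of(thumbnail))
```

This only works if no master title itself ends in '_p' or '_t', which is worth stating in the naming convention.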

Specific needs of back-capture digitisation projects

Some back-capture digitisation projects may include a range of records where it is not as straightforward to set standard naming conventions for titles. You will need to consider how to manage these.

Some older photographs may require more descriptive information included in the title.

Where records are required as State archives, State Records may stipulate the capture of specific metadata in a specific way. Therefore it is vital that you contact State Records to discuss your digitisation project prior to setting standard titling for digital images.

Date of creation

Date of creation refers to the date that an original paper record was created, not the date that a record was digitised.

It is important to capture the date of creation of an original record as this provides key accountability, use and management information.

If you are creating digital images of incoming correspondence and capturing these straight into the organisation’s EDRMS, then the date of registration may well be the same date as the date of creation.

This metadata can be applied at an aggregate or file level and/or image level as well. Therefore this metadata could record the date an original paper file was created or the date the individual record was created.

Who/what created the record

This metadata refers to the person who created an original paper record, not the person who created the digital image. For the purpose of doing business, it is important to document an original record's creator so that this data can be searched for or reported on as required.

This metadata can be applied at an aggregate or file level and/or at the image level as well. Therefore this metadata could record who created the original paper file or who created the individual record.

Business function/process it relates to

This metadata records the business an original paper record relates to. It is important to connect a record to the business it documents. This is usually done by linking it to a file.

Creating application

It is important to record this metadata for each individual digital image. This requirement refers to the software application, including its version, that created the image in its specific data format.

For example:

For digital images created using PDF, this could be 'PDFCreator Version 2.5' or 'Adobe Professional Version 3' etc.

It is important that the data format is captured (see Technical metadata) and, where appropriate, that the creating application and version are captured for all digital images. Capture of the creating application can often be automated.

Technical metadata

With digital images your organisation will also need to capture some technical metadata about each image and the imaging process. This type of metadata helps to support image quality assessment, ensures an image can be rendered accurately, and demonstrates the provenance of the production of an image. [9]

Technical metadata can include elements like the following:

  • extent (file size in bytes)
  • scan resolution
  • file bit depth
  • format
  • colour
  • compression
  • image manipulation (if relevant), i.e. any information about manipulation of the image including de-speckling, de-skewing and enhancement
  • manipulation package (if relevant).

Ideally this technical metadata should be linked to each digital image.

Note: As part of digitisation projects your organisation should retain documentation about all these technical metadata elements as part of planning, reporting and procedures. These may be called on to verify standard procedures used for digitisation if digital images are ever questioned in court.

If the technical metadata cannot be linked to each image, your organisation should be able to determine what technical specifications were used by referring to this documentation. For example, you should be able to know what technical specifications were used for particular digital images created on particular dates.

Note: The most important technical element for the ongoing use and accessibility of digital images is file format. Without knowledge of this metadata an image may not be able to be read in the future.
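
Two of these elements, extent and format, can be read from a file without special tools. The sketch below uses only the Python standard library and recognises a handful of well-known file signatures. The function name is hypothetical, and elements such as resolution, bit depth and compression would need imaging software and are not shown.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of some common image-bearing formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),           # little-endian TIFF
    (b"MM\x00*", "TIFF"),           # big-endian TIFF
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
]

def basic_technical_metadata(path):
    """Return extent in bytes and a format guessed from the file signature."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    detected = "unknown"
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            detected = name
            break
    return {"extent_bytes": os.path.getsize(path), "format": detected}

# Demonstration on a tiny stand-in file rather than a real scan.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.pdf")
    with open(sample_path, "wb") as handle:
        handle.write(b"%PDF-1.4\n")
    print(basic_technical_metadata(sample_path))
```

Checking the signature rather than trusting the file extension guards against images that have been renamed or mislabelled.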

Process metadata

Process metadata (also known as event metadata) captures information about specific processes that are performed on records.

The key process in relation to digitisation is registration. Metadata can be used to document the date a digital image was registered in a system and who registered it.

Other process metadata, such as disposal or migration metadata, will usually be applied and maintained at the file level.

Other metadata

Your organisation will also need to consider if any other metadata is required to meet your business needs and the aims of your digitisation project.

With back-capture digitisation projects where access is the primary concern, it may be relevant to put greater emphasis on indexing and item level metadata collection. Your organisation may also consider using resource discovery metadata for images that are to be published on the web. See http://www.agls.gov.au

If you are transferring original paper records to State Records as State archives, you will need to capture the metadata required for transfer.

Capturing metadata

Policies and procedures for metadata capture and management

You will need to develop internal policies and procedures for metadata capture and management. These may form part of general digitisation procedures and should address:

  • capturing metadata (including the elements to capture; conventions for recording names, places and dates; controlled vocabularies for manually entered metadata; encoding schemes; who captures what elements and when; and what tools are used) [10]
  • accommodating images with incomplete metadata
  • checking the relevance and accuracy of metadata
  • checking grammar, spelling and punctuation, especially for manually entered metadata
  • ensuring consistency in the creation of and interpretation of metadata
  • ensuring the completeness of metadata
  • the metadata required when registering digital images into recordkeeping systems
  • the documentation required to be kept regarding metadata capture.

Full quality checking of metadata must be completed before any original paper records are destroyed, and the results of checks must be documented. See Benchmarks and quality assurance for more information.
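
Part of a completeness check can be automated. The following is a minimal sketch in which the list of required elements and the record layout are hypothetical; the real list should come from your documented procedures.

```python
# Hypothetical required elements; take the real list from your procedures.
REQUIRED_ELEMENTS = ("identifier", "title", "date_of_creation", "creator", "format")

def missing_elements(record):
    """List required elements that are absent or blank in one metadata record."""
    return [
        name for name in REQUIRED_ELEMENTS
        if not str(record.get(name) or "").strip()
    ]

def check_batch(records):
    """Map each failing record's identifier to its missing elements."""
    report = {}
    for position, record in enumerate(records, start=1):
        missing = missing_elements(record)
        if missing:
            # Fall back to the record's position if it has no identifier.
            report[record.get("identifier") or f"record {position}"] = missing
    return report

batch = [
    {"identifier": "D10/2009", "title": "Minutes 2008-02-24",
     "date_of_creation": "2008-02-24", "creator": "OHS Committee", "format": "PDF"},
    {"identifier": "D10/2010", "title": " ", "format": "PDF"},
]
print(check_batch(batch))
```

A check like this finds gaps only. It cannot judge whether a value is accurate or relevant, so manual review is still needed, and the report itself can be kept as part of the documented results.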

If outsourcing digitisation, you will need to communicate the documented requirements for metadata to service providers.

Training for staff

Training for staff involved in the creation and maintenance of metadata is critical to its successful collection. Procedures for metadata creation and maintenance should be easy to follow and appropriate support should be provided. Any tools such as templates and data entry forms which facilitate the entry of metadata in a user friendly manner may prove to be beneficial for staff.

See Staffing digitisation projects for information on skill sets for digitisation.

Encoding schemes

For fields that cannot be automated, you may also benefit from developing encoding schemes where relevant for your project. Encoding schemes or ‘pick lists’ enable you to provide users with choices from which to select values when populating metadata fields.

For example:

In an organisation ‘electoral district’ was a changeable field that users needed to specify. For this field, users were provided with a drop-down list of all the electoral districts in the State. They simply needed to choose the relevant one.

The advantages of encoding schemes are that you can determine consistent ways of displaying information and are not reliant on idiosyncratic data entry by staff. Fields can remain consistent with no spelling errors.
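
Where data is entered outside a drop-down list, for example in a bulk import, the same pick list can be enforced in a script. This sketch uses a deliberately short, illustrative list and a hypothetical function name.

```python
# Illustrative pick list only; a real one would hold every electoral district.
ELECTORAL_DISTRICTS = ("Albury", "Ballina", "Bathurst")

def validate_choice(value, allowed=ELECTORAL_DISTRICTS):
    """Return the canonical spelling of value, or raise if it is not listed."""
    lookup = {option.casefold(): option for option in allowed}
    try:
        # Ignore case and surrounding spaces, but nothing else.
        return lookup[value.strip().casefold()]
    except KeyError:
        raise ValueError(f"{value!r} is not in the pick list") from None

print(validate_choice(" albury "))
```

Rejecting unknown values outright, rather than guessing the nearest match, keeps the field consistent and leaves misspellings for a person to resolve.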

Maintaining metadata over time

Metadata needs to remain persistently linked with digital images and aggregations of digital records, including through migrations.

It is also important to remember that metadata itself is a record. Metadata needs to be retained in accordance with the State Records Act 1998 and the relevant disposal classes within approved retention and disposal authorities.

Primary control records (including metadata) for most records need to be kept for a minimum of 20 years after the records to which they relate are destroyed or finally disposed of. Some are kept longer. See General retention and disposal authority: administrative records for more information.
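
As a worked illustration of the 20-year minimum, the earliest date on which control metadata could be considered for disposal can be calculated from the date the related records were disposed of. This is arithmetic only: the function name is hypothetical, and the authoritative period is the one set by the relevant retention and disposal authority.

```python
from datetime import date

MINIMUM_YEARS = 20  # the general minimum stated above; some classes are longer

def earliest_metadata_disposal(records_disposed_on, years=MINIMUM_YEARS):
    """Return the first date on which the minimum period has elapsed."""
    target_year = records_disposed_on.year + years
    try:
        return records_disposed_on.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: move to 1 March.
        return date(target_year, 3, 1)

print(earliest_metadata_disposal(date(2015, 2, 10)))
```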

Checklist

Metadata requirements (answer Yes or No for each):

  • Has the organisation identified and documented technical and other metadata requirements for the project?
  • Is metadata captured automatically, e.g. inherited from existing systems, where possible?
  • Are procedures for metadata capture and quality control documented, e.g. as part of digitisation procedures?
  • Are relevant staff trained in metadata capture and management?
  • Is metadata managed as a record and retained for as long as it is required?

Footnotes

[1] Howard Besser, Introduction to Imaging: Metadata, Revised edition, Getty Research Institute, n.d.

[2] National Archives of Australia, Digitising accumulated physical records, 2011, p.22.

[3] Loc. cit.

[4] Archives New Zealand, Digitisation standard, 2007, p.38.

[5] Loc. cit.

[6] Loc. cit.

[7] Loc. cit.

[8] Ibid., p.39.

[9] Ibid., p.18.

[10] Public Records Office of Victoria, Guide to digitisation requirements, 2010, p.13.

Published 2014 / Revised February 2015
