metadata conventions

Metadata formats

The metadata-crawler can process metadata in five formats:

  • [required] A fileList.json file generated by the repo-filechecker
  • [optional] A named entities metadata spreadsheet with a structure as created by the arche-create-metadata-template targetDirectory NamedEntities
  • [optional] A horizontal metadata spreadsheet with a structure as created by the arche-create-metadata-template targetDirectory Collection or arche-create-metadata-template targetDirectory TopCollection
  • [optional] A vertical metadata spreadsheet.
  • [optional] An RDF metadata file.

All of the metadata files should be gathered in one directory (a metadata directory). Filenames do not matter as the metadata-crawler recognizes the file format by its content.

An example metadata directory can be found here.

fileList.json

This is a file produced by the repo-filechecker at the end of its checks.

It is required because it serves as the list of files and directories (the metadata-crawler does not inspect the disk content by itself).

It also provides information on file/directory RDF class and values for acdh:hasFilename, acdh:hasRawBinarySize, acdh:hasFormat and (in the future) acdh:hasCategory metadata properties.

You need to know the absolute path of the collection in the repo-filechecker output so that the metadata-crawler can generate correct acdh:hasIdentifier values from file paths. If you run the repo-filechecker on the repo-ingestion@hephaistos, then it is most probably /ARCHE/staging/{collectionName}/data.
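
The role of the collection path can be illustrated with a minimal Python sketch. The function name, the identifier prefix and the mapping rule itself (strip the collection root, append the remainder to the prefix) are assumptions made for illustration; the metadata-crawler's actual logic may differ.

```python
from pathlib import PurePosixPath


def path_to_identifier(path, collection_root, id_prefix):
    """Map an absolute path from fileList.json to an identifier (illustrative only)."""
    # relative_to() raises ValueError when the path lies outside the collection root.
    relative = PurePosixPath(path).relative_to(collection_root)
    if str(relative) == ".":
        return id_prefix  # the collection root itself
    return f"{id_prefix}/{relative.as_posix()}"


# Hypothetical values: collection "myCollection" checked on repo-ingestion@hephaistos.
root = "/ARCHE/staging/myCollection/data"
prefix = "https://id.acdh.oeaw.ac.at/myCollection"
print(path_to_identifier(f"{root}/subdir/file2", root, prefix))
```

With a wrong collection root the call fails with a `ValueError` instead of silently producing identifiers that would not match the other metadata files.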

named entities file

This is a file we send to depositors so they provide us information about named entities (persons, places, organizations, projects, etc.) referenced in collection metadata.

Each time we send these files to depositors, they should be generated with the arche-create-metadata-template script. This ensures they are in line with the current ontology version (as the arche-create-metadata-template reads the ontology from the ARCHE production instance).

For persons, organizations and places it is generally enough if we get acdh:hasTitle and acdh:hasIdentifier values where at least one identifier would resolve against an authority file we honor (GND, geonames, etc.). For projects and publications we depend solely on the information provided by the depositor.

The value of the acdh:hasTitle column has to be unique within a given entity type. This is because this column is used to reference entities between worksheets of the spreadsheet (e.g. to choose a project's acdh:hasContact from persons listed in the Person worksheet) and a lack of uniqueness would make the choice ambiguous. A file containing non-unique (within a given entity type) titles cannot be processed (a corresponding error message is displayed).

To mark a value with a language tag, the cell should end with @{langTag}, e.g. description of the project, blah, blah@en. There is (currently) no way to set a default language tag for a whole file or for a given property in it.
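
A minimal Python sketch of how such a suffix could be split off a cell value. The function name and the accepted tag shape (2 to 8 letters plus optional `-subtag` parts) are assumptions of this sketch, not the crawler's documented behaviour.

```python
import re

# Assumed tag shape: 2-8 letters plus optional "-subtag" parts, e.g. "en" or "de-AT".
LANG_SUFFIX = re.compile(
    r"(?P<value>.*)@(?P<lang>[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*)", re.DOTALL
)


def split_lang_tag(cell):
    """Return (value, lang); lang is None when the cell carries no tag."""
    match = LANG_SUFFIX.fullmatch(cell)
    if match:
        return match.group("value"), match.group("lang")
    return cell, None


print(split_lang_tag("description of the project, blah, blah@en"))
print(split_lang_tag("contact: john@example.com"))
```

Anchoring the tag at the very end of the cell keeps values such as e-mail addresses intact, because `example.com` does not have the assumed tag shape.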

If the metadata directory contains more than one named entities file and a named entity is described in more than one of them, the metadata from the file which is processed last is used. A corresponding warning message is displayed.

Providing multiple property values for a single entity

  • Leave other cells empty, e.g.
    | hasTitle | hasIdentifier | hasContact |
    |----------|---------------|------------|
    | foo      | fooId1        | John       |
    |          | fooId2        | Alice      |
    |          |               | Andy       |
    | bar      | barId         | Clara      |
    
  • Repeat the hasTitle in each row describing a given entity
    | hasTitle | hasIdentifier | hasContact |
    |----------|---------------|------------|
    | foo      | fooId1        | John       |
    | foo      | fooId2        | Alice      |
    | foo      |               | Andy       |
    | bar      | barId         | Clara      |
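
Both conventions lead to the same set of values per entity. The Python sketch below shows one way to read them; rows are represented as plain dicts, and the function name as well as the removal of duplicate values are assumptions made for illustration.

```python
def group_rows(rows, key="hasTitle"):
    """Collect the values of every column per entity, honouring both conventions."""
    entities = {}
    current = None
    for row in rows:
        if row.get(key):  # a non-empty key cell starts (or continues) an entity
            current = row[key]
        if current is None:
            raise ValueError("first data row has an empty key cell")
        entity = entities.setdefault(current, {})
        for column, value in row.items():
            if column == key or not value:
                continue
            values = entity.setdefault(column, [])
            if value not in values:  # assumption: repeated values are kept once
                values.append(value)
    return entities


left_empty = [
    {"hasTitle": "foo", "hasIdentifier": "fooId1", "hasContact": "John"},
    {"hasTitle": "", "hasIdentifier": "fooId2", "hasContact": "Alice"},
    {"hasTitle": "", "hasIdentifier": "", "hasContact": "Andy"},
    {"hasTitle": "bar", "hasIdentifier": "barId", "hasContact": "Clara"},
]
# Same data with the title repeated in every row.
repeated = [dict(row, hasTitle=row["hasTitle"] or "foo") for row in left_empty]
print(group_rows(left_empty) == group_rows(repeated))
```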
    

If you need to provide hasTitle in multiple languages, you should repeat the hasIdentifier column value, e.g.:

| hasTitle | hasIdentifier | hasContact |
|----------|---------------|------------|
| foo@en   | fooBarId1     | John       |
| bar@de   | fooBarId1     | Alice      |
|          | fooBarId2     |            |
| bar      | barId         | Clara      |

Referencing named entities from other files

In other metadata files (horizontal, vertical and RDF ones) the named entities can be referred to using either their acdh:hasTitle or any of their acdh:hasIdentifier values stated in the named entities file.

E.g. if there is an entry like that in the named entities file:

| hasTitle | hasIdentifier | hasContact |
|----------|---------------|------------|
| foo      | http://id1    | John       |
|          | http://id2    | Alice      |

then horizontal and vertical metadata files can refer to this named entity as any of foo, http://id1 and http://id2.

Similarly, in an RDF file all the triples below are valid:

<someResource> <someProperty>
   <foo> ,
   <http://id1> ,
   <http://id2> .

There are two corner cases though:

  • For referencing by a title to work, the title has to be globally unique. Titles which do not fulfill this condition are reported as warnings during the named entities metadata file parsing.
  • If the title contains characters not allowed in a URI (most importantly a space), and you want to refer to it in the RDF metadata file, you must write it as a literal, e.g.
    | hasTitle | hasIdentifier   |
    |----------|-----------------|
    | John Doe | http://john/doe |
    
    requires
    <someResource> acdh:hasAuthor "John Doe" .
    
    and not
    <someResource> acdh:hasAuthor <John Doe> .
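
The lookup described in this section can be pictured as a single index keyed by both titles and identifiers. The Python sketch below is only an illustration: the input structure, the function name and the choice to drop non-unique titles from the index after warning are assumptions, not the crawler's documented behaviour.

```python
import warnings


def build_reference_index(entities):
    """Map every title and identifier to the entity's first identifier.

    `entities` is a list of dicts with "titles" and "identifiers" lists
    (a hypothetical in-memory form of the named entities file).
    """
    index = {}
    ambiguous = set()
    for entity in entities:
        target = entity["identifiers"][0]
        for identifier in entity["identifiers"]:
            index[identifier] = target
        for title in entity["titles"]:
            if title in index and index[title] != target:
                ambiguous.add(title)  # the same title names two different entities
            index[title] = target
    for title in ambiguous:
        warnings.warn(f"title {title!r} is not globally unique")
        del index[title]  # assumption: such titles cannot be used as references
    return index


entities = [{"titles": ["foo"], "identifiers": ["http://id1", "http://id2"]}]
index = build_reference_index(entities)
print(index["foo"], index["http://id1"], index["http://id2"])
```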
    

horizontal metadata file

These are files we send to depositors so they provide us information on the top collection (and, when needed, on collections).

They are called horizontal here as multiple values of a single metadata property come in adjacent columns (horizontally). This is just a naming convention though with no further implications.

Each time we send these files to depositors, they should be generated with the arche-create-metadata-template script. This ensures they are in line with the current ontology version (as the arche-create-metadata-template reads the ontology from the ARCHE production instance).

The metadata from a horizontal file is merged with other metadata based on the values of the acdh:hasIdentifier property provided in the file. The file name of the horizontal metadata file does not matter at all.

To mark a value with a language tag, the cell should end with @{langTag}, e.g. description of the project, blah, blah@en. There is (currently) no way to set a default language tag for a whole horizontal metadata file or for a given property in it.

Remarks:

  • Generated templates have no examples because we lack this information in the ontology.
  • If there is a need to provide metadata on multiple collections in the horizontal format, just name files differently. As the matching is done based on acdh:hasIdentifier and not based on the file name, file names do not really matter.

vertical metadata file

This kind of metadata file is useful for providing information on a limited number of metadata properties for a large number of files (and directories). We are sometimes provided with such files by depositors, although a file from a depositor most probably requires a little tuning.

The format is pretty flexible. Just a few conditions must be fulfilled:

  • The file must contain a header line (it does not need to be the first line, just like in our horizontal files, but the data is read only starting from the header line)
  • The header line must contain:
    • either directory and filename columns or the path column
    • at least one column being a property name (either a full property URI or the part after the https://vocabs.acdh.oeaw.ac.at/schema# prefix), e.g. https://vocabs.acdh.oeaw.ac.at/schema#hasTitle or hasTitle
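
The conditions above can be expressed as a small header search. This is a Python sketch (3.9+ for `str.removeprefix`) under stated assumptions: rows are lists of cell values, `known_properties` is a caller-supplied set of short property names (the real tool would presumably take them from the ontology), and the function name is made up for illustration.

```python
SCHEMA_PREFIX = "https://vocabs.acdh.oeaw.ac.at/schema#"


def find_header(rows, known_properties):
    """Return (index, row) of the first row that qualifies as a header line."""
    for index, row in enumerate(rows):
        names = [str(cell).strip() for cell in row if cell not in (None, "")]
        has_path = "path" in names or {"directory", "filename"} <= set(names)
        # Strip the schema prefix and an optional @lang suffix before the lookup.
        short = {name.removeprefix(SCHEMA_PREFIX).split("@")[0] for name in names}
        if has_path and short & known_properties:
            return index, row
    raise ValueError("no header line found")


rows = [
    ["Some notes from the depositor", "", ""],
    ["path", "hasTitle@de", "comment"],
    ["foo", "Titel", "not mapped to any property"],
]
print(find_header(rows, {"hasTitle", "hasDescription"}))
```

Rows before the returned index would be skipped, and columns such as `comment` that match no property would be ignored.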

Remarks:

  • Supported file formats are XLSX, ODS and CSV
  • To mark a value with a language tag, the cell should end with @{langTag}, e.g. resource title@en.
  • To mark a default lang for the whole column, the column name in the header should end with @{langTag}, e.g. hasTitle@de
  • The file can contain any number of columns which are not mapped to metadata properties. Such columns are simply ignored.
  • There can be any number of vertical metadata files in the metadata directory. Information from multiple files is combined.
  • To provide multiple values of a given property three conventions can be used:
    • Leave the column(s) indicating the path (path or directory and filename) empty for all rows describing the same file/directory:
      | path  | hasTitle  | hasDescription |
      |-------|-----------|----------------|
      | foo   | title1@en | description1   |
      |       | title2@de | description2   |
      |       | title3@fr |                |
      | bar   | title@en  | description    |
      
    • Repeat the value of the column(s) indicating the path (path or directory and filename) in all rows describing the same file/directory:
      | path  | hasTitle  | hasDescription |
      |-------|-----------|----------------|
      | foo   | title1@en | description1   |
      | foo   | title2@de | description2   |
      | foo   | title3@fr |                |
      | bar   | title@en  | description    |
      
    • Put multiple values in multiple columns:
      | path  | hasTitle@en | hasTitle@de | hasTitle@fr | hasDescription | hasDescription |
      |-------|-------------|-------------|-------------|----------------|----------------|
      | foo   | title1      | title2      | title3      | description1   | description2   |
      | bar   | title       |             |             | description    |                |
    
    • Mixing conventions also works, e.g.
      | path  | hasTitle@en | hasTitle  | hasDescription |
      |-------|-------------|-----------|----------------|
      | foo   | title1      | title2@de | description1   |
      | foo   |             | title2@fr | description2   |
      | bar   | title       |           | description    |
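
How the three conventions combine can be seen in a short Python sketch. It is a simplification under stated assumptions: only the path column variant is handled (not directory and filename), any text after the last `@` in a cell is treated as a language tag, duplicate values are kept once, and the function name is made up for illustration.

```python
def collect_values(header, rows):
    """Return {path: {property: [(value, lang), ...]}} for a vertical table."""
    path_column = header.index("path")
    result, current = {}, None
    for row in rows:
        if row[path_column]:
            current = row[path_column]  # empty path cell: same file as the row above
        if current is None:
            raise ValueError("first data row has no path")
        for name, cell in zip(header, row):  # zip keeps repeated column names apart
            if name == "path" or not cell:
                continue
            prop, _, column_lang = name.partition("@")
            value, sep, cell_lang = cell.rpartition("@")
            if not sep:  # untagged cell: use the column default, if any
                value, cell_lang = cell, column_lang
            entry = (value, cell_lang or None)
            values = result.setdefault(current, {}).setdefault(prop, [])
            if entry not in values:
                values.append(entry)
    return result


header = ["path", "hasTitle@en", "hasTitle", "hasDescription"]
rows = [
    ["foo", "title1", "title2@de", "description1"],
    ["foo", "", "title2@fr", "description2"],
    ["bar", "title", "", "description"],
]
print(collect_values(header, rows)["foo"]["hasTitle"])
```

For the mixed-convention table, `foo` ends up with three titles (en, de, fr) and two untagged descriptions.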
    

RDF metadata file

Metadata can also be provided as RDF. Supported formats include Turtle, TriG (Turtle with graphs), N-Triples, N-Quads and RDF/XML.

The name of an RDF metadata file does not matter. If there are multiple RDF files in the metadata directory, information from all of them is combined (as a union).

An RDF metadata file allows applying metadata both to single files/directories and to groups of them. This is driven by the combination of a triple's/quad's subject and graph.

All examples below assume the following collection structure:

.            - a top collection with
               acdh:hasIdentifier of acdhi:myCollection
               acdh:TopCollection class
file1        - a file in the collection root with 
               acdh:hasIdentifier of acdhi:myCollection/file1
               acdh:Resource class
subdir       - a directory in the collection root with 
               acdh:hasIdentifier of acdhi:myCollection/subdir
               acdh:Collection class
subdir/file2 - a file in the subdirectory with 
               acdh:hasIdentifier of acdhi:myCollection/subdir/file2
               acdh:Resource class
  • If the subject matches an acdh:hasIdentifier of a single resource, the metadata is applied only to that resource. E.g.
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .
    
    adds the information on the metadata creator only to the resource with the id acdhi:myCollection, so the resulting metadata will be just:
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .
    
  • If the subject is owl:Thing (http://www.w3.org/2002/07/owl#Thing), the metadata is applied to all resources whose acdh:hasIdentifier starts with the quad's graph (and if the graph is not specified, then simply to all resources). E.g.
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    @prefix owl:   <http://www.w3.org/2002/07/owl#> .
    owl:Thing acdh:hasMetadataCreator acdhi:sstuhec .
    
    will result in acdh:hasMetadataCreator acdhi:sstuhec being added to all resources:
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .
    acdhi:myCollection/file1 acdh:hasMetadataCreator acdhi:sstuhec .
    acdhi:myCollection/subdir acdh:hasMetadataCreator acdhi:sstuhec .
    acdhi:myCollection/subdir/file2 acdh:hasMetadataCreator acdhi:sstuhec .
    
    and (the TriG syntax is used here to denote the graph):
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    @prefix owl:   <http://www.w3.org/2002/07/owl#> .
    {
      owl:Thing acdh:hasMetadataCreator acdhi:sstuhec .
    }
    acdhi:subdir {
      owl:Thing acdh:hasMetadataCreator acdhi:mzoltak .
    }
    
    will attach acdh:hasMetadataCreator acdhi:sstuhec to all resources but the ones inside the acdhi:subdir directory where acdh:hasMetadataCreator acdhi:mzoltak will be used instead:
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .
    acdhi:myCollection/file1 acdh:hasMetadataCreator acdhi:sstuhec .
    acdhi:myCollection/subdir acdh:hasMetadataCreator acdhi:mzoltak .
    acdhi:myCollection/subdir/file2 acdh:hasMetadataCreator acdhi:mzoltak .
    
  • If the subject is an RDF class, the metadata is applied to all resources of the given class whose acdh:hasIdentifier starts with the quad's graph (and if the graph is not specified, then simply to all resources of the given class). E.g.
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    @prefix owl:   <http://www.w3.org/2002/07/owl#> .
    {
      acdh:Resource acdh:hasMetadataCreator acdhi:sstuhec .
    }
    acdhi:subdir {
      acdh:Collection acdh:hasMetadataCreator acdhi:mzoltak .
    }
    
    will attach acdh:hasMetadataCreator acdhi:sstuhec to all resources of class acdh:Resource and acdh:hasMetadataCreator acdhi:mzoltak to all resources of class acdh:Collection within the acdhi:subdir directory:
    @prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
    @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
    acdhi:myCollection/file1 acdh:hasMetadataCreator acdhi:sstuhec .
    acdhi:myCollection/subdir acdh:hasMetadataCreator acdhi:mzoltak .
    acdhi:myCollection/subdir/file2 acdh:hasMetadataCreator acdhi:sstuhec .
    

If there are multiple sources of information on a given property for a given resource, the most precise source is used. E.g. for acdhi:subdir/file2 all the possible ways of assigning it an acdh:hasMetadataCreator value have the following priorities (a higher number wins):

@prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
@prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
{
    owl:Thing          acdh:hasMetadataCreator <priority1> .
    acdh:Resource      acdh:hasMetadataCreator <priority2> .
    acdhi:subdir/file2 acdh:hasMetadataCreator <priority3> .
}
acdhi:subdir {
    owl:Thing          acdh:hasMetadataCreator <priority4> .
    acdh:Resource      acdh:hasMetadataCreator <priority5> .
    acdhi:subdir/file2 acdh:hasMetadataCreator <priority6> .
}
acdhi:subdir/file2 {
    owl:Thing          acdh:hasMetadataCreator <priority7> .
    acdh:Resource      acdh:hasMetadataCreator <priority8> .
    acdhi:subdir/file2 acdh:hasMetadataCreator <priority9> .
}

Here the resulting metadata for the whole collection would be:

@prefix acdh:  <https://vocabs.acdh.oeaw.ac.at/schema#> .
@prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
acdhi:myCollection acdh:hasMetadataCreator <priority1> .
acdhi:myCollection/file1 acdh:hasMetadataCreator <priority2> .
acdhi:myCollection/subdir acdh:hasMetadataCreator <priority4> .
acdhi:myCollection/subdir/file2 acdh:hasMetadataCreator <priority9> .
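
The priority rule can be sketched in Python as a comparison of (graph, subject) pairs, where a more specific graph always beats a more specific subject, as the numbering above implies. Everything in the sketch is an illustrative assumption: the function and variable names, the use of plain string prefix matching, and writing the graphs with the full identifiers (acdhi:myCollection/subdir rather than acdhi:subdir) so that prefix matching works.

```python
OWL_THING = "owl:Thing"


def resolve(resources, statements):
    """Pick, per resource, the value from the most precise matching statement.

    resources:  {identifier: rdf_class}
    statements: [(graph_or_None, subject, value), ...]
    """
    result = {}
    for identifier, rdf_class in resources.items():
        best = None
        for graph, subject, value in statements:
            if graph is not None and not identifier.startswith(graph):
                continue  # the resource lies outside the graph
            if subject == identifier:
                subject_rank = 2  # exact identifier
            elif subject == rdf_class:
                subject_rank = 1  # class of the resource
            elif subject == OWL_THING:
                subject_rank = 0  # applies to everything
            else:
                continue
            # A longer (more specific) graph wins first, then the subject rank.
            score = (len(graph or ""), subject_rank)
            if best is None or score > best[0]:
                best = (score, value)
        if best is not None:
            result[identifier] = best[1]
    return result


subdir = "acdhi:myCollection/subdir"
file2 = "acdhi:myCollection/subdir/file2"
resources = {
    "acdhi:myCollection": "acdh:TopCollection",
    "acdhi:myCollection/file1": "acdh:Resource",
    subdir: "acdh:Collection",
    file2: "acdh:Resource",
}
graphs = [None, subdir, file2]
subjects = [OWL_THING, "acdh:Resource", file2]
# Reproduces the nine statements above: priority1 ... priority9.
statements = [
    (graph, subject, f"priority{3 * i + j + 1}")
    for i, graph in enumerate(graphs)
    for j, subject in enumerate(subjects)
]
for identifier, value in resolve(resources, statements).items():
    print(identifier, value)
```

Run on the nine statements, the sketch assigns priority1, priority2, priority4 and priority9 to the four resources, matching the result listed above.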