The metadata-crawler can process metadata in five formats:
- [required] A `fileList.json` file generated by the repo-filechecker
- [optional] A named entities metadata spreadsheet with a structure as created by `arche-create-metadata-template targetDirectory NamedEntities`
- [optional] A horizontal metadata spreadsheet with a structure as created by `arche-create-metadata-template targetDirectory Collection` or `arche-create-metadata-template targetDirectory TopCollection`
- [optional] A vertical metadata spreadsheet.
- [optional] An RDF metadata file.
All of the metadata files should be gathered in one directory (a metadata directory). Filenames do not matter as the metadata-crawler recognizes the file format by its content.
An example metadata directory can be found here.
The `fileList.json` file is produced by the repo-filechecker at the end of its checks.
It is required because it serves as the list of files and directories (the metadata-crawler does not inspect the disk content by itself).
It also provides information on file/directory RDF class and values for acdh:hasFilename, acdh:hasRawBinarySize, acdh:hasFormat and (in the future) acdh:hasCategory metadata properties.
You need to know the absolute path of the collection in the repo-filechecker output so that the metadata-crawler can generate correct `acdh:hasIdentifier` values from file paths.
If you run the repo-filechecker on repo-ingestion@hephaistos, it is most probably `/ARCHE/staging/{collectionName}/data`.
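The role of the collection's absolute path can be illustrated with a minimal sketch. It assumes that an identifier is built by replacing the collection root with the collection's identifier in the `https://id.acdh.oeaw.ac.at/` namespace; the function name and the exact mapping are hypothetical and not taken from the metadata-crawler code.

```python
from pathlib import PurePosixPath

# Namespace behind the acdhi: prefix used in the examples of this document.
ID_NAMESPACE = "https://id.acdh.oeaw.ac.at/"


def path_to_identifier(file_path, collection_root, collection_id):
    """Map an absolute path from fileList.json to an identifier (assumed scheme)."""
    # relative_to() raises ValueError if file_path is not inside collection_root,
    # which is exactly the situation a wrong collection root would cause.
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(collection_root))
    if str(relative) == ".":
        # The collection root itself maps to the bare collection identifier.
        return ID_NAMESPACE + collection_id
    return ID_NAMESPACE + collection_id + "/" + relative.as_posix()


root = "/ARCHE/staging/myCollection/data"
print(path_to_identifier(root + "/subdir/file2", root, "myCollection"))
# prints https://id.acdh.oeaw.ac.at/myCollection/subdir/file2
```

With a wrong collection root the relative part cannot be computed at all, which is why the path has to be known up front.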
The named entities metadata spreadsheet is a file we send to depositors so that they can provide information about named entities (persons, places, organizations, projects, etc.) referenced in the collection metadata.
Each time we send these files to depositors, they should be generated with the `arche-create-metadata-template` script.
This ensures they are in line with the current ontology version (as `arche-create-metadata-template` reads the ontology from the ARCHE production instance).
For persons, organizations and places it is generally enough if we get `acdh:hasTitle` and `acdh:hasIdentifier` values where at least one identifier would resolve against an authority file we honor (GND, geonames, etc.).
For projects and publications we depend solely on the information provided by the depositor.
The value of the `acdh:hasTitle` column has to be unique within a given entity type.
This is because this column is used to reference entities between worksheets of the spreadsheet (e.g. to choose a project's `acdh:hasContact` from the persons listed in the Person worksheet), and a lack of uniqueness would make the choice ambiguous.
A file containing non-unique (within a given entity type) titles cannot be processed (a corresponding error message is displayed).
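The uniqueness requirement can be checked before sending a file to the crawler. The sketch below is only an illustration: it assumes the titles have already been read from the spreadsheet into one list per entity type, with one title per entity, and the function name is hypothetical.

```python
from collections import Counter


def find_duplicate_titles(titles_by_type):
    """Return, per entity type, the titles used by more than one entity."""
    duplicates = {}
    for entity_type, titles in titles_by_type.items():
        counts = Counter(title for title in titles if title)  # skip empty cells
        repeated = sorted(title for title, count in counts.items() if count > 1)
        if repeated:
            duplicates[entity_type] = repeated
    return duplicates


titles = {"Person": ["John", "Alice", "John"], "Place": ["Vienna"]}
print(find_duplicate_titles(titles))
# prints {'Person': ['John']}
```

The same title may appear in two different entity types without being reported here, which matches the "within a given entity type" condition.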
To mark a value with a language tag, the cell should end with `@{langTag}`, e.g. `description of the project, blah, blah@en`.
There is (currently) no way to set a default language tag for a whole horizontal metadata file or for a single property in it.
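As an illustration of the `@{langTag}` convention, here is a minimal sketch of splitting a cell into its value and language tag. The regular expression is an assumption about what counts as a language tag (two or three letters, optionally followed by subtags); the crawler's own rule may differ.

```python
import re

# Assumed shape of a language tag: "en", "de", "de-AT", ...
LANG_SUFFIX = re.compile(r"@([A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)$")


def split_lang_tag(cell):
    """Return (value, language tag or None) for a spreadsheet cell."""
    match = LANG_SUFFIX.search(cell)
    if match:
        return cell[: match.start()], match.group(1)
    return cell, None


print(split_lang_tag("description of the project, blah, blah@en"))
# prints ('description of the project, blah, blah', 'en')
```

Only the end of the cell is inspected, so an `@` in the middle of a value (for example in an e-mail address followed by more text) is left untouched.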
If the metadata directory contains more than one named entities file and a named entity is described in more than one of them, the metadata from the file which is processed last is used. A corresponding warning message is displayed.
To provide multiple values of a property for a single entity, either:
- Leave other cells empty, e.g.

  | hasTitle | hasIdentifier | hasContact |
  |----------|---------------|------------|
  | foo      | fooId1        | John       |
  |          | fooId2        | Alice      |
  |          |               | Andy       |
  | bar      | barId         | Clara      |

- Repeat the `hasTitle` for each row describing a given entity

  | hasTitle | hasIdentifier | hasContact |
  |----------|---------------|------------|
  | foo      | fooId1        | John       |
  | foo      | fooId2        | Alice      |
  | foo      |               | Andy       |
  | bar      | barId         | Clara      |

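Both layouts above carry the same information. The following sketch shows one way to read them into the same structure by carrying the last seen `hasTitle` forward; it is an illustration under that assumption, not the crawler's actual parsing code, and the function name is hypothetical.

```python
def group_rows_by_title(rows):
    """Collect the values of every column per entity, keyed by hasTitle."""
    entities = {}
    current_title = None
    for row in rows:
        title = row.get("hasTitle", "").strip()
        if title:
            current_title = title  # a non-empty title starts or continues an entity
        if current_title is None:
            continue  # values before the first title cannot be attributed
        entity = entities.setdefault(current_title, {})
        for column, value in row.items():
            value = value.strip()
            if column != "hasTitle" and value:
                entity.setdefault(column, []).append(value)
    return entities


sparse = [
    {"hasTitle": "foo", "hasIdentifier": "fooId1", "hasContact": "John"},
    {"hasTitle": "", "hasIdentifier": "fooId2", "hasContact": "Alice"},
    {"hasTitle": "", "hasIdentifier": "", "hasContact": "Andy"},
    {"hasTitle": "bar", "hasIdentifier": "barId", "hasContact": "Clara"},
]
print(group_rows_by_title(sparse)["foo"]["hasContact"])
# prints ['John', 'Alice', 'Andy']
```

Feeding the second table (with `foo` repeated in every row) into the same function gives an identical result.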
If you need to provide `hasTitle` in multiple languages, you should repeat the `hasIdentifier` column value, e.g.:
| hasTitle | hasIdentifier | hasContact |
|----------|---------------|------------|
| foo@en | fooBarId1 | John |
| bar@de | fooBarId1 | Alice |
| | fooBarId2 | |
| bar | barId | Clara |
In other metadata files (horizontal, vertical and RDF ones) the named entities can be referred to using either their `acdh:hasTitle` or any of their `acdh:hasIdentifier` values stated in the named entities file.
E.g. if there is an entry like this in the named entities file:
| hasTitle | hasIdentifier | hasContact |
|----------|---------------|------------|
| foo | http://id1 | John |
| | http://id2 | Alice |
then horizontal and vertical metadata files can mention this named entity as any of `foo`, `http://id1` and `http://id2`.
Similarly for the RDF file all triples below are valid:
<someResource> <someProperty>
<foo> ,
<http://id1> ,
<http://id2> .
There are two corner cases though:
- For referencing by a title to work, the title has to be globally unique. Titles which do not fulfill this condition are reported as warnings during the named entities metadata file parsing.
- If the title contains characters not allowed in a URI (most importantly a space), and you want to refer to it in the RDF metadata file, you must write it as a literal, e.g.

  | hasTitle | hasIdentifier   |
  |----------|-----------------|
  | John Doe | http://john/doe |

  requires

      <someResource> acdh:hasAuthor "John Doe" .

  and not

      <someResource> acdh:hasAuthor <John Doe> .

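The referencing rules described above amount to a lookup table keyed by both titles and identifiers. The sketch below illustrates that idea; the data layout and the decision to drop ambiguous keys from the table are assumptions made for the example (the text only states that non-unique titles are reported as warnings).

```python
def build_lookup(entities):
    """Index entities by title and by every identifier.

    Returns (lookup, ambiguous) where ambiguous holds the keys that point
    to more than one entity and therefore cannot be used as a reference.
    """
    lookup = {}
    ambiguous = set()
    for entity in entities:
        for key in [entity["hasTitle"], *entity["hasIdentifier"]]:
            if key in lookup and lookup[key] is not entity:
                ambiguous.add(key)
            else:
                lookup[key] = entity
    for key in ambiguous:
        lookup.pop(key, None)  # assumed handling: an ambiguous key resolves to nothing
    return lookup, ambiguous


entities = [
    {"hasTitle": "foo", "hasIdentifier": ["http://id1", "http://id2"]},
    {"hasTitle": "John Doe", "hasIdentifier": ["http://john/doe"]},
]
lookup, ambiguous = build_lookup(entities)
print(lookup["foo"] is lookup["http://id1"] is lookup["http://id2"])
# prints True
```

A title such as `John Doe` is an ordinary dictionary key here; the literal-versus-URI distinction only matters for how the reference is written in the RDF file.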
Horizontal metadata files are files we send to depositors so that they can provide information on the top collection (and, when needed, on collections).
They are called horizontal here as multiple values of a single metadata property come in adjacent columns (horizontally). This is just a naming convention though with no further implications.
Each time we send these files to depositors, they should be generated with the `arche-create-metadata-template` script.
This ensures they are in line with the current ontology version (as `arche-create-metadata-template` reads the ontology from the ARCHE production instance).
The metadata from a horizontal file is merged with other metadata based on the values of the `acdh:hasIdentifier` property provided in the file. The file name of the horizontal metadata file does not matter at all.
To mark a value with a language tag, the cell should end with `@{langTag}`, e.g. `description of the project, blah, blah@en`.
There is (currently) no way to set a default language tag for a whole horizontal metadata file or for a single property in it.
Remarks:
- Generated templates have no examples because we lack this information in the ontology.
- If there is a need to provide metadata on multiple collections in the horizontal format, just name the files differently. As the matching is done based on `acdh:hasIdentifier` and not on the file name, file names do not really matter.
Vertical metadata files are useful for providing a limited number of metadata properties for a large number of files (and directories). Depositors sometimes provide us with such files, although a file from a depositor most probably requires a little tuning.
The format is pretty flexible. Just a few conditions must be fulfilled:
- The file must contain a header line (it does not need to be the first line, just like in our horizontal files, but data is read only from the header line onwards)
- The header line must contain:
  - either `directory` and `filename` columns or the `path` column
  - at least one column being a property name (either a full property URI or the part after the `https://vocabs.acdh.oeaw.ac.at/schema#` prefix), e.g. `https://vocabs.acdh.oeaw.ac.at/schema#hasTitle` or `hasTitle`

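The header conditions above can be expressed as a small check. This is a sketch only: it assumes the rows are already loaded as lists of strings and that the set of valid property URIs (`known_properties`, a hypothetical parameter) is available from the ontology.

```python
SCHEMA_PREFIX = "https://vocabs.acdh.oeaw.ac.at/schema#"


def normalize_property(name):
    """Turn a short property name into a full URI; leave full URIs untouched."""
    name = name.strip()
    return name if name.startswith(SCHEMA_PREFIX) else SCHEMA_PREFIX + name


def find_header_row(rows, known_properties):
    """Return the index of the first row fulfilling the header conditions, or None."""
    for index, row in enumerate(rows):
        cells = [cell.strip() for cell in row]
        has_path = "path" in cells or {"directory", "filename"} <= set(cells)
        has_property = any(
            cell and normalize_property(cell) in known_properties for cell in cells
        )
        if has_path and has_property:
            return index
    return None


known = {SCHEMA_PREFIX + "hasTitle"}
rows = [
    ["Notes from the depositor", "", ""],
    ["path", "hasTitle", "comment"],
    ["foo", "some title", "to be checked"],
]
print(find_header_row(rows, known))
# prints 1
```

Rows before the returned index would simply be skipped, and the `comment` column is not a property, so it does not count towards the header conditions.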
Remarks:
- Supported file formats are XLSX, ODS and CSV
- To mark a value with a language tag, the cell should end with `@{langTag}`, e.g. `resource title@en`.
- To mark a default lang for the whole column, the column name in the header should end with `@{langTag}`, e.g. `hasTitle@de`
- The file can contain any number of columns which are not mapped to metadata properties. Such columns are just ignored.
- There can be any number of vertical metadata files in the metadata directory. Information from multiple files is combined.
- To provide multiple values of a given property three conventions can be used:
  - Leave the column(s) indicating the path (`path` or `directory` and `filename`) empty in all but the first row describing the same file/directory:

    | path  | hasTitle  | hasDescription |
    |-------|-----------|----------------|
    | foo   | title1@en | description1   |
    |       | title2@de | description2   |
    |       | title3@fr |                |
    | bar   | title@en  | description    |

  - Repeat the column(s) indicating the path (`path` or `directory` and `filename`) in all rows describing the same file/directory:

    | path  | hasTitle  | hasDescription |
    |-------|-----------|----------------|
    | foo   | title1@en | description1   |
    | foo   | title2@de | description2   |
    | foo   | title3@fr |                |
    | bar   | title@en  | description    |

  - Put multiple values in multiple columns

    | path  | hasTitle@en | hasTitle@de | hasTitle@fr | hasDescription | hasDescription |
    |-------|-------------|-------------|-------------|----------------|----------------|
    | foo   | title1      | title2      | title3      | description1   | description2   |
    | bar   | title       |             |             | description    |                |

  - Mixing conventions also works, e.g.

    | path  | hasTitle@en | hasTitle  | hasDescription |
    |-------|-------------|-----------|----------------|
    | foo   | title1      | title2@de | description1   |
    | foo   |             | title2@fr | description2   |
    | bar   | title       |           | description    |

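To show how the conventions above can coexist, here is a minimal sketch that reads a CSV vertical file into a per-path structure. It assumes the header is the first line and that a `path` column is used; the language tag pattern and all function names are illustrative assumptions rather than the crawler's implementation.

```python
import csv
import io
import re

# Assumed shape of a language tag at the end of a cell or column name.
LANG_SUFFIX = re.compile(r"@([A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)$")


def split_lang(text, default=None):
    match = LANG_SUFFIX.search(text)
    if match:
        return text[: match.start()], match.group(1)
    return text, default


def read_vertical(csv_text):
    """Return {path: {property: [(value, lang), ...]}} for a CSV vertical file."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [cell.strip() for cell in rows[0]]
    path_index = header.index("path")
    # Each column becomes (property name, default language or None).
    columns = [split_lang(name) for name in header]
    result = {}
    current_path = None
    for row in rows[1:]:
        cells = [cell.strip() for cell in row]
        if not cells:
            continue  # blank line
        if cells[path_index]:
            current_path = cells[path_index]  # empty path: keep the previous one
        if current_path is None:
            continue
        for index, cell in enumerate(cells):
            if index == path_index or not cell:
                continue
            prop, default_lang = columns[index]
            values = result.setdefault(current_path, {}).setdefault(prop, [])
            values.append(split_lang(cell, default_lang))
    return result


mixed = """path,hasTitle@en,hasTitle,hasDescription
foo,title1,title2@de,description1
foo,,title2@fr,description2
bar,title,,description
"""
print(read_vertical(mixed)["foo"]["hasTitle"])
# prints [('title1', 'en'), ('title2', 'de'), ('title2', 'fr')]
```

A tag on the cell wins over the column default, and repeated column names simply append further values to the same property.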
Metadata can also be provided as RDF. Supported formats include Turtle, TriG (Turtle with graphs), N-Triples, N-Quads and RDF/XML.
The name of an RDF metadata file does not matter. If there are multiple RDF files in the metadata directory, information from all of them is combined (just as a union).
An RDF metadata file allows applying metadata both to single files/directories and to groups of them. This is driven by a combination of a triple/quad subject and graph.
All examples below assume the following collection structure:

    .            - a top collection with
                   acdh:hasIdentifier of acdhi:myCollection
                   acdh:TopCollection class
    file1        - a file in the collection root with
                   acdh:hasIdentifier of acdhi:myCollection/file1
                   acdh:Resource class
    subdir       - a directory in the collection root with
                   acdh:hasIdentifier of acdhi:myCollection/subdir
                   acdh:Collection class
    subdir/file2 - a file in the subdirectory with
                   acdh:hasIdentifier of acdhi:myCollection/subdir/file2
                   acdh:Resource class

- If the subject matches an `acdh:hasIdentifier` of a single resource, the metadata is applied only to that resource. E.g.

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .

  adds the information on the metadata creator only to the resource with the id of `acdhi:myCollection`, so the resulting metadata will be just:

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .

- If the subject is `owl:Thing` (`http://www.w3.org/2002/07/owl#Thing`), the metadata is applied to all resources whose `acdh:hasIdentifier` starts with the quad's graph (and if the graph is not specified, then just to all resources). E.g.

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      @prefix owl: <http://www.w3.org/2002/07/owl#> .
      owl:Thing acdh:hasMetadataCreator acdhi:sstuhec .

  will result in `acdh:hasMetadataCreator acdhi:sstuhec` being added to all resources:

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .
      acdhi:myCollection/file1 acdh:hasMetadataCreator acdhi:sstuhec .
      acdhi:myCollection/subdir acdh:hasMetadataCreator acdhi:sstuhec .
      acdhi:myCollection/subdir/file2 acdh:hasMetadataCreator acdhi:sstuhec .

  and (the TriG syntax is used here to denote the graph):

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      @prefix owl: <http://www.w3.org/2002/07/owl#> .
      {
        owl:Thing acdh:hasMetadataCreator acdhi:sstuhec .
      }
      acdhi:subdir {
        owl:Thing acdh:hasMetadataCreator acdhi:mzoltak .
      }

  will attach `acdh:hasMetadataCreator acdhi:sstuhec` to all resources but the ones inside the `acdhi:subdir` directory, where `acdh:hasMetadataCreator acdhi:mzoltak` will be used instead:

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      acdhi:myCollection acdh:hasMetadataCreator acdhi:sstuhec .
      acdhi:myCollection/file1 acdh:hasMetadataCreator acdhi:sstuhec .
      acdhi:myCollection/subdir acdh:hasMetadataCreator acdhi:mzoltak .
      acdhi:myCollection/subdir/file2 acdh:hasMetadataCreator acdhi:mzoltak .

- If the subject is an RDF class, the metadata is applied to all resources of the given class whose `acdh:hasIdentifier` starts with the quad's graph (and if the graph is not specified, then just to all resources of the given class). E.g.

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      @prefix owl: <http://www.w3.org/2002/07/owl#> .
      {
        acdh:Resource acdh:hasMetadataCreator acdhi:sstuhec .
      }
      acdhi:subdir {
        acdh:Collection acdh:hasMetadataCreator acdhi:mzoltak .
      }

  will attach `acdh:hasMetadataCreator acdhi:sstuhec` to all resources of class `acdh:Resource` and `acdh:hasMetadataCreator acdhi:mzoltak` to all resources of class `acdh:Collection` within the `acdhi:subdir` directory:

      @prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
      @prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
      acdhi:myCollection/file1 acdh:hasMetadataCreator acdhi:sstuhec .
      acdhi:myCollection/subdir acdh:hasMetadataCreator acdhi:mzoltak .

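The three subject rules above can be summarized as a single predicate. The sketch below is an interpretation, not the crawler's code: identifiers and classes are treated as plain strings, the graph is written as the full identifier of the directory (an assumption about how the prefix match works), and the graph condition is also applied to subjects naming a single resource. It only answers which resources a statement targets and does not resolve conflicts between statements.

```python
OWL_THING = "owl:Thing"

# The example collection used in this document: identifier -> RDF class.
RESOURCES = {
    "acdhi:myCollection": "acdh:TopCollection",
    "acdhi:myCollection/file1": "acdh:Resource",
    "acdhi:myCollection/subdir": "acdh:Collection",
    "acdhi:myCollection/subdir/file2": "acdh:Resource",
}


def rule_applies(subject, graph, resource_id, resource_class):
    """Check whether a statement with the given subject and graph targets a resource."""
    # No graph means no restriction; otherwise the identifier must start with it.
    in_graph = graph is None or resource_id.startswith(graph)
    # The subject may name the resource itself, its class, or owl:Thing.
    return in_graph and subject in (resource_id, resource_class, OWL_THING)


def targets(subject, graph, resources):
    """List the identifiers of all resources a statement applies to."""
    return sorted(
        rid for rid, rclass in resources.items()
        if rule_applies(subject, graph, rid, rclass)
    )


print(targets("acdh:Collection", "acdhi:myCollection/subdir", RESOURCES))
# prints ['acdhi:myCollection/subdir']
```

Calling `targets(OWL_THING, None, RESOURCES)` returns all four identifiers, which corresponds to the first `owl:Thing` example.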
If there are multiple sources of information on a given property for a given resource, the most precise source is used.
E.g. for `acdhi:subdir/file2`, all the possible ways of assigning it an `acdh:hasMetadataCreator` value have the following priorities (a higher number wins):
@prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
@prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
{
owl:Thing acdh:hasMetadataCreator <priority1> .
acdh:Resource acdh:hasMetadataCreator <priority2> .
acdhi:subdir/file2 acdh:hasMetadataCreator <priority3> .
}
acdhi:subdir {
owl:Thing acdh:hasMetadataCreator <priority4> .
acdh:Resource acdh:hasMetadataCreator <priority5> .
acdhi:subdir/file2 acdh:hasMetadataCreator <priority6> .
}
acdhi:subdir/file2 {
owl:Thing acdh:hasMetadataCreator <priority7> .
acdh:Resource acdh:hasMetadataCreator <priority8> .
acdhi:subdir/file2 acdh:hasMetadataCreator <priority9> .
}
Here the resulting metadata for the whole collection would be:
@prefix acdh: <https://vocabs.acdh.oeaw.ac.at/schema#> .
@prefix acdhi: <https://id.acdh.oeaw.ac.at/> .
acdhi:myCollection acdh:hasMetadataCreator <priority1> .
acdhi:myCollection/file1 acdh:hasMetadataCreator <priority2> .
acdhi:myCollection/subdir acdh:hasMetadataCreator <priority4> .
acdhi:myCollection/subdir/file2 acdh:hasMetadataCreator <priority9> .
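The priority table above is consistent with ranking each statement first by how specific its graph is and then by how specific its subject is. The sketch below encodes that reading; it is an assumption derived from the example, with identifiers treated as plain strings and the graphs written as full resource identifiers, not a description of the crawler's internals.

```python
OWL_THING = "owl:Thing"


def best_value(rules, resource_id, resource_class):
    """Pick the value of the most precise applicable rule, or None.

    rules is a list of (subject, graph, value); graph None is the default graph.
    """
    candidates = []
    for subject, graph, value in rules:
        if graph is not None and not resource_id.startswith(graph):
            continue  # the graph does not cover this resource
        if subject == resource_id:
            subject_rank = 2  # exact resource
        elif subject == resource_class:
            subject_rank = 1  # RDF class
        elif subject == OWL_THING:
            subject_rank = 0  # anything
        else:
            continue
        # A longer matching graph is a more specific one.
        candidates.append(((len(graph or ""), subject_rank), value))
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate[0])[1]


SUBDIR = "acdhi:myCollection/subdir"
FILE2 = "acdhi:myCollection/subdir/file2"
rules = [
    (OWL_THING, None, "priority1"), ("acdh:Resource", None, "priority2"), (FILE2, None, "priority3"),
    (OWL_THING, SUBDIR, "priority4"), ("acdh:Resource", SUBDIR, "priority5"), (FILE2, SUBDIR, "priority6"),
    (OWL_THING, FILE2, "priority7"), ("acdh:Resource", FILE2, "priority8"), (FILE2, FILE2, "priority9"),
]
print(best_value(rules, FILE2, "acdh:Resource"))
# prints priority9
```

Applied to the other three resources of the example collection, the same function yields `priority1`, `priority2` and `priority4`, matching the result listed above.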