TREE Discovery and Context Information

Draft Community Group Report

This version:
https://w3id.org/tree/specification/discovery
Feedback:
public-treecg@w3.org with subject line “[TREEDiscovery] … message topic …” (archives)
Issue Tracking:
GitHub
Inline In Spec
Editor:
Pieter Colpaert

Abstract

This specification defines how a client selects a specific dataset and search tree, as well as extracts relevant context information.

Status of this document

1. Definitions

A tree:Collection is a subclass of dcat:Dataset ([VOCAB-DCAT-3]). The specialization is that this particular dataset is a collection of _members_.

A tree:SearchTree is a subclass of dcat:Distribution. The specialization is that it uses the main TREE specification to publish a search tree.

A node from which all other nodes can be found is a tree:RootNode.

Note: The tree:SearchTree and the tree:RootNode MAY be identified by the same IRI when no disambiguation is needed.

A TREE client MUST be provided with a URL to start from, which we call the _entrypoint_.

2. Initializing a client with a URL

The goal of the client is to understand which tree:Collection it is using, and to find a tree:RootNode to start the traversal phase from. This discovery specification extends the initialization step in the TREE specification, for the cases in which multiple options are possible.

The client MUST dereference the URL, which will result in a set of quads. The client MUST then first perform the initialization step from the main specification. If that step did not return any result, the client MUST check whether the URL before redirects (E) is used in one of the following discovery patterns, described in the subsections:

  1. E is a tree:Collection: then the client needs to select the right search tree

  2. E is a dcat:Dataset: then the client needs to select the right distribution or dataservice from a catalog

  3. E is an ldes:EventStream: then the client MAY take into account LDES specific properties

  4. E is a dcat:Distribution: then the client needs to process it accordingly

  5. E is a dcat:DataService: then the client needs to process it accordingly

  6. E is a catalog or is not explicitly mentioned: then the client needs to select a dataset based on shape information and DCAT Catalog information
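The case distinction above can be sketched as a lookup on the rdf:type values of E. This is a minimal, non-normative sketch: it assumes the quads have already been parsed into (subject, predicate, object) tuples of IRI strings, ignores the graph component, and assumes that the first matching type in list order wins, which this specification does not state. The function name and the pattern labels are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TREE = "https://w3id.org/tree#"
LDES = "https://w3id.org/ldes#"
DCAT = "http://www.w3.org/ns/dcat#"

# Same order as the list above. "First match wins" is an assumption of this sketch.
PATTERNS = [
    (TREE + "Collection", "select-search-tree"),
    (DCAT + "Dataset", "select-distribution-or-dataservice"),
    (LDES + "EventStream", "use-ldes-properties"),
    (DCAT + "Distribution", "process-distribution"),
    (DCAT + "DataService", "process-dataservice"),
]


def classify_entrypoint(entrypoint, triples):
    """Return a (hypothetical) label for the discovery pattern that applies to E."""
    # Collect every rdf:type asserted for the entrypoint.
    types = {o for (s, p, o) in triples if s == entrypoint and p == RDF_TYPE}
    for rdf_type, pattern in PATTERNS:
        if rdf_type in types:
            return pattern
    # Case 6: E is a catalog or is not explicitly mentioned.
    return "select-dataset-from-catalog"
```

A client would call classify_entrypoint only after the initialization step of the main specification returned no result.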

2.1. Selecting a collection via shapes

When multiple collections are found by a client, it can choose to prune the collections based on the tree:shape property. The tree:shape property will refer to a first sh:NodeShape. The collection MAY be pruned in case there is no overlap with the properties the client needs.
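As an illustration only, pruning could look as follows. The sketch assumes triples are (subject, predicate, object) tuples of strings, that the shape lists its properties through sh:property with a plain IRI as sh:path (complex paths are ignored), and that a collection without any shape information is kept rather than pruned. None of these choices is prescribed by this specification, and the function names are hypothetical.

```python
TREE_SHAPE = "https://w3id.org/tree#shape"
SH_PROPERTY = "http://www.w3.org/ns/shacl#property"
SH_PATH = "http://www.w3.org/ns/shacl#path"


def objects(triples, subject, predicate):
    """All objects of the triples matching the given subject and predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def shape_paths(triples, collection):
    """Property IRIs mentioned by the shape(s) linked from the collection."""
    paths = set()
    for shape in objects(triples, collection, TREE_SHAPE):
        for prop in objects(triples, shape, SH_PROPERTY):
            paths |= objects(triples, prop, SH_PATH)
    return paths


def prune_collections(triples, collections, needed_properties):
    """Keep collections whose shape overlaps with the properties the client needs."""
    kept = []
    for collection in collections:
        paths = shape_paths(triples, collection)
        # Assumption: without shape information there is no ground for pruning.
        if not paths or paths & needed_properties:
            kept.append(collection)
    return kept
```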

Will we document the precise algorithm to use? Should we extend shapes with cardinality approximations as well?

2.2. Selecting a collection via a catalog

A DCAT Catalog is an overview of datasets, data services and distributions. As TREE clients first need to select a dataset, and then a search tree to use, this aligns with how DCAT-AP works. DCAT discovery builds upon the previous section, in which a collection or dataset can be selected based on the tree:shape property.

For now, we will assume the DCAT information is available in subject pages.

Do we need more text on how to handle different types of DCAT interfaces?

The dataset descriptions can be used to filter the datasets available in a catalog down to a list of datasets that can be useful for the client. Such properties may include the spatial extent, the time extent, or whether the dataset is part of another dcat:Dataset.

How precise do we need to be in this specification?

When the dcat:Dataset is a tree:Collection, the DCAT catalog will contain a dct:type property with https://w3id.org/tree#Collection or https://w3id.org/ldes#EventStream as the object.
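A client could use this dct:type hint to narrow a catalog down to the datasets that are TREE collections. The following is a non-normative sketch, assuming triples are (subject, predicate, object) tuples of IRI strings and that the catalog links to its datasets with dcat:dataset; the function name is hypothetical.

```python
DCAT_DATASET = "http://www.w3.org/ns/dcat#dataset"
DCT_TYPE = "http://purl.org/dc/terms/type"
TREE_TYPES = {
    "https://w3id.org/tree#Collection",
    "https://w3id.org/ldes#EventStream",
}


def tree_datasets(triples, catalog):
    """Datasets of the catalog whose dct:type marks them as a TREE collection."""
    # Datasets listed by the catalog, deduplicated and in a stable order.
    datasets = sorted({o for (s, p, o) in triples if s == catalog and p == DCAT_DATASET})
    # Datasets that carry one of the two dct:type values named above.
    typed = {s for (s, p, o) in triples if p == DCT_TYPE and o in TREE_TYPES}
    return [d for d in datasets if d in typed]
```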

2.3. Choosing from multiple SearchTrees with TREE

This is yet to be done

2.4. Selecting a search tree via a DCAT dataset

There are two ways in which you can find a search tree from a dataset: via the distributions and via the data services. Both need to be tested. Selecting a distribution or data service when multiple are available needs to be done based on the search tree description. If nothing is available, all need to be tested by processing them as exemplified in the next subsections.

2.4.1. Selecting a search tree via DCAT Distribution

If E dcat:distribution ?D . ?D dcat:downloadURL ?N . then ?N is a root node of E.
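The pattern above amounts to a two-step join. A minimal, non-normative sketch, assuming triples are (subject, predicate, object) tuples of IRI strings and using a hypothetical function name:

```python
DCAT_DISTRIBUTION = "http://www.w3.org/ns/dcat#distribution"
DCAT_DOWNLOAD_URL = "http://www.w3.org/ns/dcat#downloadURL"


def root_nodes_via_distributions(triples, dataset):
    """Candidate root nodes ?N for: E dcat:distribution ?D . ?D dcat:downloadURL ?N ."""
    # Step 1: every distribution ?D of the dataset E.
    distributions = {o for (s, p, o) in triples if s == dataset and p == DCAT_DISTRIBUTION}
    # Step 2: every download URL ?N of those distributions, deduplicated.
    return sorted({o for (s, p, o) in triples if s in distributions and p == DCAT_DOWNLOAD_URL})
```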

This is yet to be done

2.4.2. Selecting a search tree from a DCAT data service

This is yet to be done

2.5. Linked Data Event Streams

When the client is not made for query answering, but only for setting up a replication and synchronization system, a special type can be used to indicate that the search tree is made for this purpose: the ldes:EventSource. Clients that want to prioritize taking a _full_ copy MAY give full priority to this server hint.

E a ldes:EventSource ;
  tree:rootNode|dcat:downloadURL </node1> .
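One way a replication client could act on this hint is to try the search trees typed ldes:EventSource before any others. The sketch below is non-normative and rests on assumptions: triples are (subject, predicate, object) tuples of IRI strings, the candidate search trees are already known, and the function names are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
LDES_EVENT_SOURCE = "https://w3id.org/ldes#EventSource"
TREE_ROOT_NODE = "https://w3id.org/tree#rootNode"
DCAT_DOWNLOAD_URL = "http://www.w3.org/ns/dcat#downloadURL"


def root_nodes(triples, search_tree):
    """Root nodes reachable via tree:rootNode or dcat:downloadURL, as in the pattern above."""
    predicates = (TREE_ROOT_NODE, DCAT_DOWNLOAD_URL)
    return sorted({o for (s, p, o) in triples if s == search_tree and p in predicates})


def prioritize(triples, search_trees, wants_full_copy):
    """Order search trees so that ldes:EventSource ones come first for full-copy clients."""
    if not wants_full_copy:
        return list(search_trees)
    event_sources = {s for (s, p, o) in triples if p == RDF_TYPE and o == LDES_EVENT_SOURCE}
    # sorted() is stable, so the original order is kept within each group.
    return sorted(search_trees, key=lambda tree: tree not in event_sources)
```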

3. Extracting context information

This is yet to be done

Context information enables a client to understand who the creator of a certain dataset is, when it was last changed, what other datasets it was derived from, etc.

3.1. DCAT and dcterms

This is yet to be done

3.2. Provenance

This is yet to be done

3.3. Linked Data Event Streams

This is yet to be done

LDES (https://w3id.org/ldes/specification) is a way to evolve search trees in a consistent way. It defines every member as immutable, and a collection as append-only. Therefore, one can make sure to only process each member once. Extra terms are added, such as the concept of an EventStream, retention policies and a timestampPath.

Conformance

Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[VOCAB-DCAT-3]
Simon Cox; et al. Data Catalog Vocabulary (DCAT) - Version 3. URL: https://w3c.github.io/dxwg/dcat/
