SystematicMetadata

State	Idea
Date	Mon 08 Oct 2012 04:39:16 CEST
Proposed by	Ichthyostega <prg@ichthyostega.de>

Abstract


Lumiera is a metadata processing application: Data is media data, and everything else is metadata. Since our basic decision is to rely on existing libraries for handling data, the “metadata part” is what we are building anew.

This RfC describes a fundamental approach towards metadata handling.

Description

Metadata is conceived as a huge uniform tree. This tree is conceptual: it is never represented as a whole. In the implemented system, we only ever see parts of this virtual tree being cast into concrete data representations. These parts are like islands of explicitly defined and typed structure, yet they never need to span the whole virtual model, and thus there never needs to be a universal model data structure definition. Data structure becomes an implementation detail.

Parts of the system talk to each other by describing some subtree of metadata. This description is transferred in the form of a tree diff: the receiver pulls a sequence of diff verbs from a diff iterator, and interpreting these verbs walks the receiver down the tree in question and expands it. Sub-scopes are “opened” and populated, similar to populating a filesystem. It is up to the receiver to assemble this information into a suitable representation. One receiver might invoke an object factory, while another serialises the data into an external textual or binary representation.
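To make the pull-based exchange more tangible, here is a minimal sketch in Python. The RfC does not define a concrete verb set, so the three verbs used here (`open`, `set`, `close`) and all function names are assumptions chosen for illustration only. One sender emits verbs lazily; two receivers interpret the same verbs differently, one building a nested structure and one rendering text.

```python
from typing import Any, Iterable, Iterator, Tuple

# Hypothetical diff verbs; the RfC does not define a concrete verb set.
OPEN, SET, CLOSE = "open", "set", "close"
Verb = Tuple[Any, ...]


def describe(tree: dict) -> Iterator[Verb]:
    """Sender side: walk a nested dict and emit verbs lazily."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield (OPEN, key)
            yield from describe(value)
            yield (CLOSE,)
        else:
            yield (SET, key, value)


def build_dict(verbs: Iterable[Verb]) -> dict:
    """Receiver side: pull verbs and assemble a nested dict."""
    root: dict = {}
    stack = [root]                      # innermost open scope is stack[-1]
    for verb in verbs:
        if verb[0] == OPEN:
            scope: dict = {}
            stack[-1][verb[1]] = scope
            stack.append(scope)
        elif verb[0] == SET:
            stack[-1][verb[1]] = verb[2]
        elif verb[0] == CLOSE:
            if len(stack) == 1:
                raise ValueError("close without matching open")
            stack.pop()
        else:
            raise ValueError(f"unknown verb: {verb[0]!r}")
    if len(stack) != 1:
        raise ValueError("unclosed scope at end of diff")
    return root


def to_text(verbs: Iterable[Verb]) -> str:
    """Another receiver: render the same verbs as indented text."""
    lines, depth = [], 0
    for verb in verbs:
        if verb[0] == OPEN:
            lines.append("  " * depth + f"{verb[1]}:")
            depth += 1
        elif verb[0] == SET:
            lines.append("  " * depth + f"{verb[1]} = {verb[2]!r}")
        elif verb[0] == CLOSE:
            depth -= 1
    return "\n".join(lines)
```

Neither receiver needs to know how the sender stores its data; both only depend on the verb sequence. In this sketch the sender walks an in-memory dict, but any generator producing the same verbs would do.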

Abstract Metadata Model

The conceptual model for metadata is close to what the JSON format uses:
There are primitive values such as null, string, number and boolean. Compound values can be arrays or records, the latter being a sub-scope populated with key-value pairs.

We might consider some extensions:

additional value types similar to MongoDB's BSON: integers, floats, timestamps
two special magic keys for records: "type" and "id"
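The abstract model can be written down as a small validity check. This is only a sketch under assumptions: the mapping onto Python types (for example `datetime` for timestamps) and the helper names `is_metadata`, `record_type` and `record_id` are hypothetical and not part of the proposal.

```python
from datetime import datetime
from typing import Any, Optional

# Assumed primitive types: the JSON primitives plus the BSON-like
# timestamp mentioned as a possible extension.
PRIMITIVES = (type(None), str, bool, int, float, datetime)


def is_metadata(value: Any) -> bool:
    """Check that a value fits the JSON-like model described above."""
    if isinstance(value, PRIMITIVES):
        return True
    if isinstance(value, list):                      # array
        return all(is_metadata(item) for item in value)
    if isinstance(value, dict):                      # record (sub-scope)
        return all(isinstance(key, str) and is_metadata(item)
                   for key, item in value.items())
    return False


def record_type(record: dict) -> Optional[str]:
    """Return the value of the magic "type" key, if present."""
    return record.get("type")


def record_id(record: dict) -> Optional[str]:
    """Return the value of the magic "id" key, if present."""
    return record.get("id")
```

A receiver could use the "type" key to pick a suitable representation and the "id" key to recognise an element it has already seen; how these keys are actually interpreted is left open by the proposal.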

Sources and Overlays

Metadata is delivered from sources, which can be layered. Similarly, on the receiving side, there can be multiple writeable layers, with a routing strategy to decide which writeable layer receives a given metadata element. This routing is implemented within a pipeline connecting sender and receiver; if the default routing strategy isn’t sufficient, we can control the routing by introducing a meta-tree in some separate branch, which makes the metadata self-referential.
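The layering and routing described here might be sketched as follows. Everything in this example is an assumption made for illustration: elements are addressed by a path tuple, layers are flat dictionaries, the layer names "session" and "prefs" are invented, and the meta-tree is reduced to a plain mapping from top-level branch to layer name. Layered lookup uses `collections.ChainMap` from the Python standard library.

```python
from collections import ChainMap
from typing import Callable, Dict, Mapping, Tuple

Path = Tuple[str, ...]           # position of an element in the virtual tree
Layer = Dict[Path, object]       # flat stand-in for one storage layer
Router = Callable[[Path], str]   # maps a path to the name of a writeable layer


def layered_source(*layers: Layer) -> ChainMap:
    """Combine sources; earlier layers shadow later ones."""
    return ChainMap(*layers)


def default_router(path: Path) -> str:
    """Hypothetical default strategy: everything goes to the session layer."""
    return "session"


def router_from_meta(rules: Dict[str, str], fallback: Router) -> Router:
    """Build a router from a meta-record mapping top-level branch to layer."""
    def route(path: Path) -> str:
        if path and path[0] in rules:
            return rules[path[0]]
        return fallback(path)
    return route


def deliver(source: Mapping, sinks: Dict[str, Layer], route: Router) -> None:
    """Pipeline: pull each element from the source, push it to one sink."""
    for path, value in source.items():
        sinks[route(path)][path] = value
```

In this sketch the routing rules are ordinary metadata themselves, so they could be delivered through the same mechanism as any other subtree, which is the self-referential aspect mentioned above.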

Some points to note

this concept doesn’t say anything about the actual meaning of the metadata elements, since that is always determined by the receiver, based on the current context.
likewise, this concept doesn’t state anything about the actual interactions, the parts involved, or how an interaction is initiated and configured; this is considered an external topic, to be solved within the applicable context (e.g. the session has a specific protocol for retrieving a persisted session snapshot)
there is no separate system configuration — configuration appears just as a local record of key-value pairs, which is interpreted according to the context.
in a similar vein, this concept deliberately doesn’t state anything regarding the handling of defaults, since these are so highly dependent on the actual context.

Tasks

define the interaction API (WIP)
scrutinise this concept to find the pitfalls (WIP)
build a demonstration prototype, where the receiver fabricates an object (TBD)
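As a rough idea of what the demonstration prototype in the last task could look like, the following Python sketch lets a receiver fabricate objects while pulling verbs. The verb vocabulary (`open`, `set`, `close`), the `Clip` and `Source` classes and the `FACTORIES` registry are all hypothetical; the actual interaction API is still to be defined.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Tuple

# Hypothetical verbs: ("open", key), ("set", key, value), ("close",)
Verb = Tuple[Any, ...]


@dataclass
class Source:
    path: str
    fps: int


@dataclass
class Clip:
    name: str
    source: Source


# Hypothetical registry: the receiver decides which factory handles
# which scope; "" stands for the root scope.
FACTORIES: Dict[str, Callable[..., Any]] = {"": Clip, "source": Source}


def fabricate(verbs: Iterable[Verb]) -> Any:
    """Pull verbs and build objects; a scope becomes an object on close."""
    stack = [("", {})]                       # (scope key, collected fields)
    for verb in verbs:
        if verb[0] == "open":
            stack.append((verb[1], {}))
        elif verb[0] == "set":
            stack[-1][1][verb[1]] = verb[2]
        elif verb[0] == "close":
            if len(stack) == 1:
                raise ValueError("close without matching open")
            key, fields = stack.pop()
            stack[-1][1][key] = FACTORIES[key](**fields)
        else:
            raise ValueError(f"unknown verb: {verb[0]!r}")
    if len(stack) != 1:
        raise ValueError("unclosed scope at end of diff")
    key, fields = stack.pop()
    return FACTORIES[key](**fields)
```

Here the receiver never sees a complete tree description up front; it only keeps the fields of the scopes that are currently open and hands them to a factory as soon as a scope closes.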

Discussion

Pros

the basic implementation is strikingly simple, much simpler than building a huge data structure or any kind of serialisation/deserialisation scheme
parts can be combined in an open fashion; we don’t need a final concept up-front
even complex routing and overlaying strategies become manageable, since they can be treated in isolation, local for a given scope and apart from the storage representation
library implementations for textual representations can be integrated.

Cons

the theoretical view is challenging and rather uncommon
a naive implementation holds the whole data tree in memory twice
how the coherent “islands” are combined is only a matter of invocation order and thus dangerously flexible

Alternatives

The classical alternative is to define a common core data structure, which needs to be finalised quickly. Isolated functional modules will then be written to work on that common data set, which leads to a high degree of coupling. Since this approach doesn’t scale well, in practice several independent storage and exchange systems come to exist in parallel, e.g. system configuration, persisted object model, plug-in parameters, presentation state.

Rationale

Basically, common (meta)data can take many shapes between two extremes:

the precisely typed structure, which is also a contract
the open dynamic structure, which leaves the contract implicit

The concept detailed in this RfC tries to reconcile those extremes by avoiding a global concrete representation;
this way the actual interaction — with the necessity of defining a contract — is turned into a local problem.

Comments
