Semantic tags

State	Idea
Date	Do 30 Aug 2012 21:06:54 CEST
Proposed by	Christian Thaeter <ct@pipapo.org>

Abstract

We have a lot documentation which needs to be cross referenced. Adding well the known tags concept and extend it slightly with some semantics will aid future automatic processing.

Description

Every document (including sourcecode) could extended with some metadata, aka tags which are then used to build automatic crossreferences.

Commonly tags are just words which are picked up and crossreferenceds. I propose to extend this scheme slightly.

Overall this scheme must be very natual and easy to use. A user should not need to know about the underlying machinery and a tag as in a single lowercase word should be sufficient in almost all cases. Moreover Tags should be optional.

Ontology

To give tags some sematics we introduce a simple ontology:

Tags can have namespaces, delimited by a dot foo.bar.baz. Tags are looked up from right to left baz would suffice as long it is unique. Non unique cases will be handled in context (sometimes non uniqunes is desired)
We introduce simple "Is a" and "Has a" relationships. These are defined by the casing of the tag: ALL_UPPERCASE means "Is a" and anything else (including mixed case) means "Has a". Note that for most cases the "Is a" relation will be defined implicitly, ín normal cases one doesnt need to care.
define some tag algebra for lookups (group tags by comma and semicolons, where comma means and and semicolon means or). Used to query the tag database. regex/globbing might become handy too.

Implicit Tags

Tags can be implicit by generating them from the document:

Derrive tags from the type and location of the Document. RFC’s are RFC, source files are SOURCE.C and so on.
Derrive Tags from the content of the document. Asciidoc titles will be used here. A simple preprocessor generates a tag from a title (make it CamelCase, simplify etc.) The resulting tag is only used iff it is unique

Use this tags

Tags are collected/discovered by some script which creates a tag-database (possibly plaintext asciidoc files) as big project index linking back to the content, details need to be worked out.

We create special asciidoc macros for crossreferencing tags for example: RFC:foobar SOURCE:builder, details need to be worked out later.

Note: this Proposal is about including tags in the first place, processing them is only suggested and left out for later.

Tasks

We need to define how to integrate tags in different documents syntactically. For RFC’s these will likely become a part of the initial table. in other Asciidoc documents they could be a special comment or header. For Source files special comments will be used.

Tags themself will be added lazily on demand (unless we find someone with the patience to go over all documents and tag them properly).

Creating the infrastructure handling this tags (cross indexing etc) is not part of this proposal, nevertheless we planning this since some time and it will be defined in other RFC’s.

Discussion

Pros

Gives a simple graspable way to build a cross reference over the whole project

Cons

adding tags and developing the tools manging them will take some time

Alternatives

We have the ht/dig search function over the Website which give a much simpler way to find documents.

Rationale

It is very urgent and important that we make our content much easier accessible.

Comments

You may recall this proposal created some heated debate at the last developer meeting. After thinking it over some time, I can see now more clearly what irritated me.

for me, the proposal seems somewhat to lack focus. Right now we have some shortcomings at rather basic operations when authoring content at the website. This proposal tends to be more interested in some kind of automated content discovery.
the term “tag” in this proposal is overlayed with different meanings. For one it means an attached textual property of some document, but also it denotes to some kind of inferred categorisation. I’d rather propose to stick to the former meaning (which is common place) and treat the latter as one source for data within an categorisation algorithm. This way, such categorisation sources can remain an implementation detail and don’t need to be fixed in an universal way.
I have serious concerns against the ontology part of the proposal. Not only is the syntax unintuitive, but more importantly, this ontology is not well aligned with real world usage.

To underpin the last diagnosis, just look at the existing tags in our Wiki:
- automation (3)
- Builder (20)
- classes (6)
- Concepts (9)
- decision (19)
- def (90)
- design (43)
- discuss (19)
- draft (55)
- example (3)
- excludeMissing (6)
- GuiIntegration (6)
- img (40)
- impl (36)
- Model (22)
- operational (19)
- overview (20)
- Player (12)
- plugin (2)
- Rendering (24)
- rewrite (3)
- Rules (8)
- SessionLogic (30)
- spec (76)
- systemConfig (9)
- Types (4)

The absolute majority of these are neither is-a nor has-a. The great thing with tags, why everyone seems to love them, is exactly that they are not formalized. You can just throw in some tags and keywords and use them for a plethora of unrelated and unstructured purposes and generally just assume that your reader will somehow “get it”.

Ichthyostega: Mi 10 Okt 2012 05:36:35 CEST _{<prg@ichthyostega.de>}

Back to Lumiera Design Process overview

git://git.lumiera.org/LUMIERA →Gitweb	TRAC · timeline · roadmap
master · gui · proc · back · dok · web	recent · stalled · core-work · non-code
Builddrone · log	API Documentation (Doxygen)	Impressum