The new emerging NLE for GNU/Linux
State Parked
Date 2008-09-21
Proposed by nasa

Delectus Shot Evaluator

This is a brain dump about the shot evaluator subproject.


Brainstorm on Delectus

Some (many) of the ideas presented herein come from the various parties involved in the Lumiera discussion list and IRC channel #lumiera. — the main discussion thread

Additionally, a lot of great concepts for how to streamline the interface are derived in part from KPhotoAlbum.

I use tags, keywords, and metadata almost interchangeably, with the exception that metadata includes computer generated metadata as well. These are not tags in the conventional sense — they don’t have to be text. In fact the planned support (please add more!) is:

  • Text — both simple strings (tags) and blocks

  • Audio — on the fly (recorded from the application) or pregenerated

  • Video — same as audio

  • Link — back to a Celtx or other document resource, forward to a final cut, URL, etc

  • Still image — inspiration image, on set details, etc

  • ID — such as the serial number of a camera used, the ISBN of a book to be cited, etc

As such, the tags themselves can have metadata. You can see where this is going…

Also, the tags are applied to "clips" — which I use interchangeably between source material imported into the application and slice of that material that tags are applied to. Any section of a video or audio source can have tags applied to it.

Two key functions: assign metadata and filter by metadata.

clips are one thing; but in reality most clips are much longer than their interesting parts. Especially for raw footage, the interesting sections of a clip can be very slim compared to the total footage. Here is a typical workflow for selecting footage:

  1. Import footage.

  2. Remove all footage that is technically too flawed to be useful.

  3. Mark interesting sections of existing clips, possibly grouped into different sections.

  4. Mark all other footage as uninteresting.

  5. Repeat 3-4 as many times as desired.

Some key points:

  • Import and export should be as painless and fast as possible.

  • Technically flawed footage can be both manual and computer classified.

  • In some cases (e.g. documentaries, dialog) audio and video clips/footage can follow different section processes. It is possible to use video from footage with useless audio or use audio from footage with useless video.

  • "Interesting" is designed to be broad and is explained below.

  • steps 2-5 can be performed in parallel by numerous people and can span many different individual clips.

In simple editors like Kino or iMovie, the fundamental unit used to edit video is the clip. This is great for a large number of uses, such as home videos or quick Youtube postings, but it quickly limits the expressive power of more experienced engineers in large scale productions (which are defined for the purposes of this document to include more than 2 post-production crew members). The clip in those editors is trimmed down to include only the desired footage, and these segments are coalesced together into some sort of coherent mess.

The key to adequate expressive power is as follows:

  • Well designed, fast metadata entry. Any data that can be included should by default, and ideally the metadata entry process should run no less than about 75% as fast as simple raw footage viewing. Powerful group commands that act on sections of clips and also grouping commands that recognize the differences between takes and angles (or individual mics) enhance and speed up the process.

  • Good tools to classify the metadata into categories that are actually useful. Much of the metadata associated with a clip is not actively used in any part of the footage generation.

  • Merging and splicing capabilities. The application should be smart enough to fill in audio if the existing source is missing. For example, in a recent project I was working on a camera op accidently set the shotgun mike to test mode, ruining about 10% of the audio for the gig. I was running sound, and luckily I had a backup copy of the main audio being recorded. This application should, when told that these two are of the same event at the same time, seamlessly overlay the backup audio over the section of the old audio that has been marked bad and not even play the bad audio. This is just background noise, and streamlining the immense task of sorting through footage needs to be simplified as much as possible.

  • Connection to on site documentation and pre-production documentation. When making decisions about what material to use and how to classify it, it is essential to use any tools and resources available. The two most useful are onsite documentation (what worked/didn’t work, how the weather was, pictures of the setup, etc all at the shoot) and pre-production (what the ideal scene would be, what is intended, etc). Anything else that would be useful should be supported as well.

  • Be easily accessible when making the final cut. Lumiera is, if the application gets up to speed, going to serve primarily to render effects, finalize the cut, and fine tune what material best fits together. Any metadata, and certainly any clipping decisions, should be very visible in Lumiera.

  • Notes, notes, notes! The application should support full multimedia notes. These differ from (4) in that they are generated during the CLASSIFICATION process, not before. This fits in with (5) as well — Lumiera should display these notes prominently on clip previews. The main way for multiple parties to communicate and even for a single person to stay organized is to add in notes about tough decisions made and rationale, questionable sections, etc. These notes can be video, audio, text, etc from one of the clips, from the machine used to edit (such as using a webcam or microphone), or over the network (other people’s input).

Too technically flawed

A clip is said to be too technically flawed if it has no chance of making it to the final product whatsoever. This does not, however, preclude its use throughout the post-production process; for example, part of a clip in which the director describes his vision of the talent’s facial expression in a particular scene is never going to make it into the final product, but is invaluable in classifying the scene. In this case, the most reasonable place to put the clip would be as a multimedia note referenced by all takes/angles of the scene it refers to.

As mentioned above, flawed video doesn’t necessarily mean flawed audio or vice-versa.


An "interesting" clip is one that has potential — either as a metadata piece (multimedia note, talent briefing, etc) or footage (for the final product OR intermediary step). The main goal of the application is to find and classify interesting clips of various types as quickly as possible.

Parallel Processing

Many people, accustomed to different interfaces and work styles, should be able to work on the same project and add interactive metadata at the same time.

Classification interface

The classification interface is divided into two categories: technical and effective. Technical classification is simply facts about a clip or part of a clip: what weather there is, who is on set, how many frames are present, the average audio level, etc. Effective classification allows the artist to express their feelings of the subjective merits (or failures) of a clip.


The project is organized around a distributed content management system which allows access to all existing materials at all times. Content narrowing allows for a more digestible amount of information to process, but everything is non-destructive; every change to the clip structure and layout is recorded, preferably with a reason as to why it was necessary or desired.

Content narrowing

With all of the information of an entire production available from a single application, information overload is easy. Content narrowing is designed to fix that by having parts of individual clips, metadata, or other files be specific to one aspect of the overall design. This allows for much more successful use of the related information and a cleaner, streamlined layout. As an example, metadata involving file size has no effect whatsoever on the vast majority of most major decisions — the answer is almost always "whatever it takes." Thus, it would not appear most of the time. Content narrowing means that it is easy to add back footage — "widen the view" one step, add it back, and "narrow the view" again.

Multiple cuts

There is no need to export a final cut from this application; it merely is the first step in the post-production chain. It is the missing link between receiving raw footage from the camera and adding the well executed scenes to the timeline. What should come out of the application is a classification of

Situational, take, and instance tagging

This is VERY powerful. The first step to using the application is to mark which scenes are the same in all source clips — where same means that they contain sections which would both not run. This can include multiple takes, different microphones or camera angles, etc. The key to fast editing is that the application can edit metadata for the situation (what is actually going on IN THE SCENE), take (what is actually going on IN THIS SPECIFIC RUN), and instance (what is actually going on IN THIS CLIP). If editing a situation, the other referenced clips AUTOMATICALLY add metadata and relevant sections. This can be as precise and nested as desired, though rough cuts for level one editing (first watchthrough after technically well executed clips have been selected) and more accurate ones for higher levels is the recommended method.


This came up on the discussion list for Lumiera, and it will be supported, probably as a special tag.

nasa’s Laws of Tagging

  1. There is always more variety in data than tags. There are always more situations present in the data than can be adequately expressed with any (reasonable) number of tags. This is OK. All that is needed is the minimum set of unique tags to progress to the next cycle without losing editing intent or the ability to rapidly evaluate many situations.

  2. Many tags are used many times. "Outdoors" will be a very, very common tag; so will "redub." If conventional names are decided upon and stuck to, it is significantly easier to map the complex interactions between different content situations.

  3. Avoid compound tags. Do not have "conversation_jill_joe" as a tag; use "conversation," "jill," and "joe" instead. It is very easy to search for multiple tags and very hard to link data that doesn’t use overlapping tags.

The interface — random idea

This is not meant to be a final interface design, just something I wrote up to get ideas out there.

key commands mutt/vim-style — much faster than using a mouse, though GUI supported. Easy to map to joystick, midi control surface, etc. Space stop/start and tag enter Tab (auto pause) adds metadata special Tracks have letters within scenes — Audio[a-z], Video[a-z], Other[a-z] (these are not limits) — or names. Caps lock adds notes. This is really, really fast. It works anywhere. This means that up to 26 different overlapping metadata sections are allowed.

Prompting Prompting for metadata is a laborious, time-consuming process. There is no truly efficient way to do it. This application uses a method similar to KPhotoAlbum. When the space key is held and a letter is pressed, the tag that corresponds to that letter is assigned to the track for the duration of the press. (If the space is pressed and no other key is pressed at the same time, it stops the track.) For example, suppose that the following mapping is present: o = outside x = extra p = protagonist c = closeup

Then holding SPACE over a section and pressing one of these keys would assign the tag to the audio AND video of the section over which the space was held. If instead just the key is pressed (without space being held), that tag is assigned to the section over which it is held. This is very fast and maps well to e.g. PS3 controller or MIDI control.

If LALT is held down instead of SPACE, the audio is effected instead. If RALT is held, just the video is effected.

In order to support scenario/take/clip tagging: The default is situation. If the keybinding to x is: x = t:extra ; effect only take x = ts:extra ; effect take and scenario x = c:extra ; extra only visible in this clip! x = tc:extra ; this take and clip show the extra etc

Other keyargs (the part in front of the colon) can be added to account for other uses (e.g. l = all taken on the same location).

Tab is pressed to add metadata mappings. Tab is pressed to enter metadata edit mode; this pauses video. Then press any key to map; and type the tag to associate (with space, multiple tags can be added.). The following specials are defined: [:keyarg:]:TAG is special tagging for scenario/take/clip. !TAG removes TAG if it is present. This is useful because it allows huge sections of the clip to be defined as a certain tag, then have parts removed later. a:TAG applies TAG only to the audio. v:TAG applies TAG only to the video. p:PATH adds a link to PATH as a special tag.

(This will have a nice GUI as well, I just will always use the keyboard method so I am describing it first. Mapping configurations can be stored in a separate file, as a user config, or in the specific project.)

If ESC is pressed, all currently ranged tags are ended.

Finally, if single_quote is pressed without SPACE or {L,R}ALT down, it marks an "interesting location." Pressing SHIFT+single_quote goes to the next "interesting location" and pressing CNTRL+' goes to the previous "interesting location." This allows for very quick review of footage.


Rating - Quantitative Rating as well as Qualitative Tagging

The importance/value of the video for various factors uses, can vary through the video. It would be helpful to have the ability to create continuous ratings over the entire track. Ratings would be numerical. Automatic clip selection/suggestion could be generated by using algorithms to compute the usefulness of video based on these ratings (aswell as "boolean operations"/"binary decisions" done with tags). The ratings could be viewed just like levels are - color coded and ovelayed on track thumbnails.

  • Tree 2008-10-25

MultiView - useful for concurrent ratings input

It would be convenient to have an ability to view the different tracks (of the same scene/time sequence) at once, so the viewer can input their ratings of the video "on the fly", including a priority parameter that helps decide which video is better than what other video.See the GUI brainstorming for a viewer widget, and key combinations that allow both right and left hand input, that could be used for raising/lowing ratings for up to six tracks at once.

  • Tree 2008-10-25

I like the idea of rating clips (or rather, takes) a lot. It would be cool to include both "hard," "relative," and "fuzzy" rating. Hard is an exactly defined value (scaled 0-1) that puts the clip in an exact location in the queue. Relative means that one is higher or lower rated than another. Fuzzy is a slider which is approximate value, and there is some randomness. The best part is that these can be assigned to hardware sliders/faders. Pressure sensitive buttons + fuzzy ratings = really easy entry interface. Just hit as hard as needed! Multiple tracks at once also an astounding idea. I could image some sort of heap (think binary heap, at least for the data structure) which determines the priorities and decides which clips are played. Then the highest rated clips are played first, down to the worst.

Possible Collaboration with the people from Ardour?

I guess if the thing can do all the things we talked about here, it would be perfectly suitable for sound classification too, and maybe could fill another gap in FOSS: Audio Archival Software, like this: (which is very expensive)… maybe the Ardour people would be interested in a collaboration on this?

I like the suggestion of sound classification with a similar (or, even better, identical) evaluator. SoundMiner looks interesting, but like you say very expensive. I’m a sound guy, so I feel your pain…


Decided on Developer meeting, until someone wants to investigate this further.

Do 14 Apr 2011 02:52:30 CEST Christian Thaeter