02 May, 2016

A concise guide to image metadata (EXIF, IPTC, XMP, etc.) for the technically minded

Almost everyone realizes the importance of media metadata once they have a non-trivial number of files to manage. However, several standards come into play as soon as one tries to dig deeper and understand how all these things fit together.

While migrating my iPhoto library to Lightroom, I found this topic interesting yet messy. The image metadata schemas and file formats are all well maintained and published as standards of some kind, but there seems to be no easy way to make sense of them as a whole. Since my goal is just to migrate and manage the metadata in a rational and sensible way, it does not make sense to spend days going through all of the specifications.

Hence, I try to summarize the information in a concise manner (I hope it can be digested within an hour by people with a technical background). If you are a technical person who is new to image metadata and you want a sound foundation, you should benefit from this article (at least that is my goal).

This article first goes through the various kinds of metadata, followed by the technical details of how this metadata is embedded in the image file. Finally, it illustrates how to use exiftool to better understand the data.


Types of Metadata

EXIF is probably the earliest of these metadata types. Its primary focus is technical information (technical in the photographic sense). Sample information includes capture time, capture location (GPS), exposure settings (aperture, ISO, shutter speed), and camera/lens maker and model. The standard defines the tag schema (the meaning of each tag) as well as how EXIF metadata is represented at the binary level.

IPTC is a metadata schema that originated with news and media agencies. The focus of IPTC is to support annotation and organization (for example, Keywords). There are two generations of IPTC: IPTC IIM (Information Interchange Model) and IPTC4XMP. Usually, when people refer to IPTC tags, they mean IPTC IIM. (We will come back to IPTC4XMP later when we discuss XMP.) IPTC IIM also defines both the logical schema and the binary representation.

XMP defines a logical schema for representing media metadata and is very general in nature. The serialization format is XML/RDF. It defines how the standard can be extended to cover different vocabularies (via namespaces). The standard consists of 3 parts. Part 1 covers the data model and serialization details (for example, how to represent scalar values and tuples/lists). Part 3 defines how XMP data is stored in binary files, and also how XMP can exist as a sidecar file.

IPTC4XMP brings the IPTC vocabulary under the XMP standard. It consists of IPTC Core and IPTC Extension. In addition to IPTC4XMP, XMP also includes Dublin Core (the XMP-DC namespace). Dublin Core is a general metadata schema that originated in the library world.

There is some overlap between IPTC and Dublin Core (given that both try to solve the problem of organizing information). A few fields are defined as interchangeable between the two schemas. The most important one is probably IPTC:Keywords vs. DC:Subject.
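To make the XML/RDF serialization more concrete, here is a minimal sketch that reads the dc:subject list out of an XMP packet with Python's standard library. The sample packet and the function name read_subjects are made up for illustration; only the rdf and dc namespace URIs are the real ones.

```python
import xml.etree.ElementTree as ET

# Real namespace URIs for RDF and Dublin Core.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample packet (an assumption, not taken from a real file).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:subject>
        <rdf:Bag>
          <rdf:li>OuterSpace</rdf:li>
          <rdf:li>Travel</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_subjects(xmp_xml: str) -> list[str]:
    """Return the keywords stored in dc:subject (an unordered rdf:Bag)."""
    root = ET.fromstring(xmp_xml)
    items = root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)
    return [li.text for li in items if li.text]


print(read_subjects(SAMPLE_XMP))
```

The list form (rdf:Bag with rdf:li items) is how XMP represents a multi-valued field such as keywords.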


Image File Format for Embedding Metadata

A JPEG file consists of 'segments'; each segment starts with 2 bytes that denote the segment type. For example, 0xFF 0xD8 is Start of Image, 0xFF 0xE1 is the APP1 segment, and 0xFF 0xED is the APP13 segment. The 'APP' segments are a generic mechanism for application extensions. Usually the APP marker is followed by an additional application identifier. An APP1 segment (0xFF 0xE1) whose payload starts with "Exif" is an EXIF block, and the block content is then encoded according to the EXIF standard.

Similarly, 0xFF 0xED is APP13, which is used by IPTC IIM, while 0xFF 0xE1 with the segment header "http://ns.adobe.com/xap/1.0/\x00" denotes an XMP segment.
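The segment layout described above can be sketched in a few lines of Python. This is only a rough illustration, not a full JPEG parser: the function name list_metadata_segments is hypothetical, and it assumes the IPTC IIM data sits in an APP13 segment whose payload starts with "Photoshop 3.0\x00", which is the usual case but is not stated in the text above.

```python
import struct

APP1 = 0xE1
APP13 = 0xED
EXIF_ID = b"Exif\x00\x00"
XMP_ID = b"http://ns.adobe.com/xap/1.0/\x00"
PHOTOSHOP_ID = b"Photoshop 3.0\x00"  # assumption: IPTC IIM wrapped in a Photoshop block


def list_metadata_segments(data: bytes) -> list[tuple[str, int]]:
    """Return (label, payload_length) for each APPn segment before the scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing Start of Image marker")
    found = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync; stop rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:  # Start of Scan: compressed image data follows
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == APP1 and payload.startswith(EXIF_ID):
            found.append(("EXIF", len(payload)))
        elif marker == APP1 and payload.startswith(XMP_ID):
            found.append(("XMP", len(payload)))
        elif marker == APP13 and payload.startswith(PHOTOSHOP_ID):
            found.append(("IPTC-IIM (Photoshop block)", len(payload)))
        elif 0xE0 <= marker <= 0xEF:
            found.append((f"APP{marker - 0xE0}", len(payload)))
        pos += 2 + length
    return found
```

Calling it with the bytes of a real photo, e.g. list_metadata_segments(open("myPhoto.jpg", "rb").read()), should list which of the three metadata blocks are present.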

A typical JPEG picture has an EXIF segment, an IPTC-IIM segment and an XMP segment embedded at the same time. The XMP segment includes IPTC Core elements, Dublin Core elements, and application-specific extensions; for example, the XMP-LR namespace contains Lightroom-specific extensions in XMP format. XMP is serialized as XML, hence you can see XML text embedded in the JPEG file if you open it with a hex or text editor.

Since the different metadata schemas overlap, applications usually synchronize the overlapping fields when saving the information. In Lightroom, when an external tool updates the Subject tag (part of XMP-DC), LR can be instructed to read it; when LR saves the metadata, it populates the value into IPTC-IIM:Keywords and XMP-DC:Subject, as well as XMP-LR:HierarchicalSubject.


TIFF has a similar mechanism, and almost all RAW formats are derived from TIFF (including CR2, ARW, and DNG). Although RAW files have an extension mechanism for embedding metadata, Lightroom allows the metadata to be written to a sidecar file instead (all metadata, including post-processing settings, is serialized in XMP format as a separate file sitting beside the original file).
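As a small illustration of the sidecar convention, the sidecar normally shares the RAW file's base name and carries the .xmp extension. The helper name sidecar_path and the file name below are made up for the example.

```python
from pathlib import Path


def sidecar_path(image_path: str) -> Path:
    """Return the path of the XMP sidecar that sits beside a RAW file."""
    # Same directory and base name, with the extension swapped for .xmp.
    return Path(image_path).with_suffix(".xmp")


print(sidecar_path("Photos/IMG_4882.CR2"))
```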


Playing with Metadata

ExifTool is a friendly CLI tool for manipulating image metadata. You can use it to inspect a file in detail or to update the metadata (hence you can write a simple script to batch update a particular tag).

To read the tags, simply dump them:
exiftool myPhoto.jpg

To prefix each tag with its group (i.e. whether it is EXIF, IPTC, XMP, etc.):
exiftool -G myPhoto.jpg

To print all metadata in RDF/XML format (the serialization used by XMP):
exiftool -X myPhoto.jpg

To update a particular tag:
exiftool -Subject='OuterSpace' Photos/IMG_4882.jpg

To make Lightroom recognize that the metadata has been updated, the IPTC Digest tag needs to be updated as well:
exiftool -Subject='OuterSpace' -IPTCDigest=new Photos/IMG_4882.jpg
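For batch work, the same commands can be driven from a script. The sketch below assumes exiftool is installed and on the PATH; the helper names are hypothetical. It uses exiftool's -j option, which prints the tags as a JSON array with one object per file.

```python
import json
import subprocess


def build_read_command(path: str) -> list[str]:
    # -j asks exiftool for JSON output; -G prefixes each tag with its group.
    return ["exiftool", "-j", "-G", path]


def build_subject_update_command(path: str, subject: str) -> list[str]:
    # Same update as above: set Subject and refresh the IPTC digest.
    return ["exiftool", f"-Subject={subject}", "-IPTCDigest=new", path]


def read_tags(path: str) -> dict:
    """Run exiftool and return the tags of one file as a dict."""
    result = subprocess.run(
        build_read_command(path), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)[0]


def update_subject(path: str, subject: str) -> None:
    """Run the update command; raises CalledProcessError if exiftool fails."""
    subprocess.run(build_subject_update_command(path, subject), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a keyword contains spaces.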

Further References

In addition to the references provided in the text above, some other sites provide very in-depth information:

11 May, 2014

Linux goes mainstream on Desktop

Linux goes mainstream on Desktop!

Well, this is a statement I read on the Internet almost every year. But I never really thought it was going to happen (even though I'm paid to work on it). Yet my thinking changed recently. Not because Linux is getting significantly better than its rivals; rather, it is because of the dynamics at play in the industry.

Think about what has happened over the last few years. The web has advanced so much. More and more devices have come to market with great success. People are now reading books, surfing, listening to music, chatting and watching movies on a tablet / iPad, or even on just a smartphone. At home, new TVs come with an OS that lets you do a lot of things. Anticipating the advance of web-based platforms, IoT and wearable computing, I imagine that in the near future general users won't need a PC anymore.

When general users don't need a PC, who does? What kind of users will still need one? Users who need powerful computing, such as developers, video editors and scientists, will be the key users of the PC. In this situation, which platform best suits their needs? I think it is OS X and Linux, and obviously Linux will be the choice for developers, IT specialists and scientists.

No one knows what the future will look like; this is just my wild guess. Let's put this article aside, then come back and revisit it in 5 years!

20 March, 2011

Tips for improving Eclipse Performance

Eclipse is a great development platform. However, it may seem to get slower and slower over time. I believe the real solution to a performance problem only comes after correctly identifying the actual root cause; doing an objective measurement, however, takes some time. Before actually spending time on serious profiling, I gathered some "tips" through googling. The tips are reorganized and presented below.

18 December, 2010

The best CLI tools of all time

Every December, every geeky website prepares a "Best Windows/Mac Software of 20xx" list. Why is there no "Best CLI Tools of All Time" article?
Why not "Best CLI Tools of 20xx"? Sorry, it would not be very useful to have such a list every year, as good CLI tools all survive through the years. Ever since I learned to use ssh, it has stayed at the top of my list. What other software allows you to forward whatever port you like and execute commands on different hosts?
While "Best CLI Tools of All Time" was on my mind, I googled and found an interesting site: http://www.commandlinefu.com/

11 December, 2010

Are Dynamic Programming Languages More Productive for Web Development?

Dynamic programming languages like Python and Groovy have been hot topics in recent years. Surprisingly, there is no clear definition of "dynamic programming language". So, what does "dynamic programming language" refer to, and does it really help to speed up your next web application?

StAX - Streaming API for XML

StAX is a Java API for processing XML that allows software to process an XML stream in a pull-based streaming style. In certain scenarios, StAX is the most efficient approach for processing an XML stream. This article aims to provide an overview of this API as well as a comparison with two other commonly used APIs: DOM and SAX.

07 December, 2010

YAML - YAML Ain't Markup Language

YAML, which stands for "YAML Ain't Markup Language", is a data serialization format that can be used to represent data or messages.
This article provides some background on YAML as well as a comparison between YAML and XML, using the Maven POM as an example.

05 December, 2010

Using SyntaxHighlighter at Blogspot

Blogspot doesn't support syntax highlighting by default. What a shame, considering that all decent wiki engines already provide some level of support for it. Blogspot isn't a wiki, but without syntax highlighting I just keep feeling that Blogspot is not a place for technical posts. Anyway, SyntaxHighlighter is a solution for this.
The official installation instructions require you to upload the scripts and link to them from your Blogspot template. Castillo has posted a workaround showing how to use this lovely tool on Blogspot without having your own hosting space. That post seems outdated, but it isn't too hard to figure out the way; below is the up-to-date procedure.
1. Update the Blogspot template and add the following in the <head>
  <script language='javascript'   
    src='http://bitbucket.org/alexg/syntaxhighlighter/raw/b7578b438a69/scripts/XRegExp.js'/>
  <script language='javascript' 
    src='http://bitbucket.org/alexg/syntaxhighlighter/raw/b7578b438a69/scripts/shCore.js'/>
  <script language='javascript' 
    src='http://bitbucket.org/alexg/syntaxhighlighter/raw/b7578b438a69/scripts/shAutoloader.js'/>

  <link rel='stylesheet' type='text/css'
    href='http://bitbucket.org/alexg/syntaxhighlighter/raw/b7578b438a69/styles/shCoreDefault.css' />

2. Add the following snippet near the end of <body>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.autoloader(
  'xml  http://bitbucket.org/alexg/syntaxhighlighter/raw/b7578b438a69/scripts/shBrushXml.js',
  'css  http://bitbucket.org/alexg/syntaxhighlighter/raw/b7578b438a69/scripts/shBrushCss.js'
);
SyntaxHighlighter.all();
</script>
3. In the post, use <pre> with the class attribute to define the corresponding syntax
<pre class="brush: xml">
  <echo>Hi SyntaxHighlighter!</echo>
</pre>
SyntaxHighlighter supports many kinds of syntax; all you need to do is declare each one in the "autoloader" call. The full list of supported syntaxes can be found here. SyntaxHighlighter also provides a few different themes, which can be configured by importing a different stylesheet; the full list can be found here.

25 September, 2010

SLAX - Beautiful Live CD and Live USB

SLAX is a simple Live CD / Live USB distribution. Its website provides a web interface for easily creating a tailor-made build.

In addition to the modules available on the official website, it is extremely easy to create your own modules. To illustrate how easy it is, here is a step-by-step walkthrough:

 1. Download a vanilla SLAX ISO
 2. Boot-up the image with a spare machine or Virtual Machine
 3. After image boots up, login as root
 4. Create a directory
    $ mkdir /tmp/rootcopy
 5. Put some files under this directory
 6. Create the module file
    $ cd /tmp
    $ dir2lzm rootcopy mymodule.lzm
 7. Put this module file under /slax/modules

Fixing the Cygwin ACL Problem

Cygwin has always been one of the best tools for me. It provides an emulated *nix-like environment, including a shell environment, POSIX paths, etc. Of all the emulation provided, however, the POSIX-like permission handling is the one thing I do not appreciate.

The purpose of permissions and ACLs is to control access to files in a multi-user environment. Whilst that is useful for running network services, it is pointless for most development use cases. The files are mostly accessed by only one user, that is me!

Because Cygwin emulates POSIX permissions using Windows ACLs, it can easily cause trouble when one accesses the same files from both the Cygwin environment and ordinary Windows-based tools. To get rid of this, Cygwin provides a means of disabling the emulation. Unfortunately, the mechanism differs between 1.5.x and 1.7.x, which can be even more frustrating when you are upgrading from 1.5.x to 1.7.x.

All the details below can be found in the Cygwin manual, but it may take some time to dig them out even with the help of Google. So, here we go!

For versions prior to 1.7, the method is to set the following environment variable:

    CYGWIN=nontsec

Since 1.7, the method is to set 'noacl' as part of the mount options in fstab. You should have a line like this:

    none /cygdrive cygdrive noacl,binary,posix=0,user 0 0

Note that no extra entries should be present for mount points that can be mounted automatically.
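If you want to verify the setting from a script, a rough check could look like the sketch below. The function name cygdrive_has_noacl is made up, and the code assumes the usual whitespace-separated fstab layout (device, mount point, filesystem type, options, then two numeric fields).

```python
def cygdrive_has_noacl(fstab_text: str) -> bool:
    """Return True if the cygdrive entry lists 'noacl' among its mount options."""
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "cygdrive":
            return "noacl" in fields[3].split(",")
    return False


print(cygdrive_has_noacl("none /cygdrive cygdrive noacl,binary,posix=0,user 0 0"))
```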

When upgrading from 1.5 to 1.7, you should also clean up the mount points defined in the registry. From Cygwin 1.7 on, the registry is only used for recording the installation directory of Cygwin itself.
When in doubt, use cygcheck to have a look; it can show the mount point configuration inherited from the registry.