07 December, 2010

YAML - YAML Ain't Markup Language

YAML, which stands for "YAML Ain't Markup Language", is a data serialization format that can be used to represent data or messages.
This article provides some background on YAML, as well as a comparison between YAML and XML using the Maven POM as an example.

What is YAML?

While there are many data serialization formats, the strength of YAML is its focus on human readability while still providing rich constructs for representing different data structures.
In YAML, each data stream can consist of multiple YAML documents, separated from one another by the '---' marker. The '...' marker can be used to signal the end of a document or of the stream. The hash sign '#' starts a comment, which runs to the end of the line. Data in YAML can be represented in three different ways: scalar, associative array and sequence.
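The stream markers described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not a YAML parser: the function name split_yaml_stream is made up for this example, and it only recognizes markers that stand alone on a line, so content on the same line as '---' or marker-like text inside a block scalar would be handled incorrectly.

```python
def split_yaml_stream(stream: str) -> list[str]:
    """Naively split a YAML stream into the raw text of each document."""
    documents: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Store the lines collected so far as one document, skipping empty ones.
        text = "\n".join(current).strip()
        if text:
            documents.append(text)
        current.clear()

    for line in stream.splitlines():
        if line.rstrip() in ("---", "..."):
            # '---' starts a new document, '...' closes the current one.
            flush()
        else:
            current.append(line)
    flush()
    return documents


stream = """\
---
# first document
name: Xyz Conference
---
# second document
- Java
- Python
...
"""

for number, document in enumerate(split_yaml_stream(stream), start=1):
    print(f"document {number}:\n{document}\n")
```

Comments are kept as part of each document here; a real parser would discard them while loading the data.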
An associative array (also called a map) can be represented in this way:
  name: Xyz Conference
  maxParticipants: 20
  startTime: 2010-10-14 08:30:00.00
  breakfastProvided: yes
Alternatively, an associative array can also be written in a JSON-like style (as a short form), using curly braces:
  { name: Xyz Conference, maxParticipants: 20, startTime: 2010-10-14 08:30:00.00, breakfastProvided: yes }
A sequence can be represented in this way:
  - Java
  - Python
  - Ruby
Alternatively, a sequence can also be written in a JSON-like style (as a short form):
  [ Java, Python, Ruby ]
The short form syntax in YAML also implies that a JSON document is, in nearly all cases, a valid YAML document as well.
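As a small illustration of the JSON connection, the sketch below uses Python's standard json module to serialize a conference record. The languages key is made up for the example. The printed text uses only braces, brackets and quoted strings, so it should also be readable by a YAML parser that supports the short form, but that part is not checked here because the Python standard library ships no YAML parser.

```python
import json

# A record similar to the conference example, with a hypothetical
# "languages" key added to show a nested sequence.
conference = {
    "name": "Xyz Conference",
    "maxParticipants": 20,
    "languages": ["Java", "Python", "Ruby"],
}

# json.dumps produces flow-style text: {...} for maps, [...] for sequences.
text = json.dumps(conference)
print(text)

# Reading the text back with the JSON parser gives the original data.
assert json.loads(text) == conference
```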
To get a taste of YAML, below is a personal organizer modeled using YAML:
---
# Document representing my personal organizer

# Tasks
tasks:
 - 
   summary:     Prepare project proposal
   tags:        [School, Urgent]
   update time: 2010-10-14 18:59:23.17
   completed:   true
 - 
   summary:     Arrange a gathering
   tags:        [Friend, Someday]
   update time: 2010-10-14 18:59:23.17
   completed:   false

# Memos
memos:
 - 
   content:     >
                Precedence of UrlMappings for status code is defined by lexical order.
                While regex based URL mapping is defined by the precedence rules.
   update time: 2010-10-14 18:59:23.17
   archived:    false
 - 
   content:     >
                Pattern-Oriented Software Architecture Volume 2
                http://www.cs.wustl.edu/~schmidt/POSA/POSA2/
   update time: 2010-10-10 18:59:23.17
   archived:    true

...
From the example above, it is easy to comprehend the content. It consists of two tasks and two memos. The structure of each task and memo is also readily understandable. This example demonstrates several strengths of YAML.
Richer information model used in YAML.
In XML, every document is a tree of element nodes. A YAML document is built from three kinds of node (scalar, sequence and mapping), plus aliases that refer back to a node defined earlier in the document. While you can certainly model a sequence or a key-value pair in XML, the result is clumsier than the same data modeled in YAML.
In YAML, indentation is used to represent the hierarchical structure of the data. Child elements are indented further than their parent node. However, only the space character (ASCII 0x20) may be used for indentation; the tab character (ASCII 0x09) is not allowed.
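Because a stray tab is hard to spot by eye, a quick check can help before handing a file to a YAML parser. The following is a minimal sketch using only the Python standard library; the function name lines_indented_with_tabs is made up for this example, and it only looks at leading whitespace, not at tabs elsewhere in a line.

```python
def lines_indented_with_tabs(text: str) -> list[int]:
    """Return the 1-based numbers of lines whose indentation contains a tab."""
    offending = []
    for number, line in enumerate(text.splitlines(), start=1):
        # The indentation is everything before the first non-blank character.
        indentation = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indentation:
            offending.append(number)
    return offending


sample = (
    "tasks:\n"
    "  - summary: Prepare project proposal\n"
    "\t- summary: Arrange a gathering\n"
)
print(lines_indented_with_tabs(sample))  # [3]
```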
Readily defined type
In YAML, some commonly used data types are already defined, including String, Time, Boolean, Integer Number and Floating Point Number.
In XML, you can use a DTD or a Schema to define the data type of each element, but because YAML offers these types out of the box, any YAML document benefits from them without extra work.
A full reference of the data supported in YAML can be found here.
Flexibility on encoding String
In YAML, a string can be written without any quotes. In addition, YAML keeps its own markup to a minimum, which makes it well suited to embedding blocks of text in a document. For example, we can easily put an HTML or XML block, or a code fragment, into a YAML document without having to escape the characters in the text.
In XML, if you want to encode an HTML fragment, a math formula or a code fragment, you need to replace some reserved characters with escape sequences.
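To make the escaping difference concrete, here is a small sketch using escape and unescape from Python's standard xml.sax.saxutils module. The sample fragment and the snippet key are made up for the example. The YAML side is shown only as text, using a literal block scalar ('|'), since the Python standard library has no YAML parser to load it with.

```python
from xml.sax.saxutils import escape, unescape

# A fragment that contains all three characters XML text must escape.
fragment = "<p>if a < b && b > c</p>"

# In XML element content, &, < and > have to be replaced by entities.
escaped = escape(fragment)
print(escaped)  # &lt;p&gt;if a &lt; b &amp;&amp; b &gt; c&lt;/p&gt;

# Unescaping restores the original text.
assert unescape(escaped) == fragment

# In YAML, the same fragment can sit unchanged inside a literal block scalar.
yaml_text = "snippet: |\n  " + fragment + "\n"
print(yaml_text)
```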
Space efficiency
As YAML does not rely on balanced markup (as XML does) to represent the data structure, most of the document payload represents the data itself. In XML, each tag must have a matching closing tag, so the space needed to represent a piece of data in YAML is usually smaller than in XML. The direct comparison between XML and YAML later in this article demonstrates this point.
YAML has many strengths, but that doesn't mean it is a perfect choice for representing every kind of data or message. As always, no one tool can serve every purpose, and YAML is no exception. YAML does have some weaknesses; below are some key issues:
Lack of schema definition
YAML doesn't provide any facility to define the schema and the semantics of a document.
If different parties are going to exchange a YAML document, the exact specification of the document must be defined somewhere else.
Compared with XML, in which the schema can be defined clearly with XML Schema or Relax NG, YAML simply doesn't provide this kind of feature.
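In practice, this means the checks end up in application code. The sketch below shows one way that might look in Python, assuming the YAML has already been loaded into plain dictionaries by some parser. The field list and the function name validate_task are made up for this example, loosely following the tasks in the personal organizer above.

```python
# Hypothetical "schema" for one task: field name -> expected Python type.
EXPECTED_TASK_FIELDS = {
    "summary": str,
    "tags": list,
    "completed": bool,
}


def validate_task(task: dict) -> list[str]:
    """Return a list of problems found in one task; empty means it looks fine."""
    problems = []
    for field, expected_type in EXPECTED_TASK_FIELDS.items():
        if field not in task:
            problems.append(f"missing field: {field}")
        elif not isinstance(task[field], expected_type):
            actual = type(task[field]).__name__
            problems.append(f"{field} should be {expected_type.__name__}, got {actual}")
    return problems


good = {"summary": "Prepare project proposal", "tags": ["School", "Urgent"], "completed": True}
bad = {"summary": "Arrange a gathering", "completed": "false"}

print(validate_task(good))  # []
print(validate_task(bad))   # ['missing field: tags', 'completed should be bool, got str']
```

Every party exchanging the document would need to agree on, and reimplement, rules like these, which is exactly the work an XML Schema does once and for all.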
Lack of supporting technology
YAML doesn't provide any "query language" (like XQuery, XPath in XML) to support querying YAML.
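Without a query language, data is usually picked out by walking the loaded structure by hand. The sketch below shows a tiny path lookup in Python, assuming the YAML has already been loaded into nested dictionaries and lists. The function name lookup and the '/'-separated path syntax are made up for this example and are far less capable than XPath.

```python
def lookup(data, path: str):
    """Follow a '/'-separated path through nested dicts and lists."""
    node = data
    for step in path.split("/"):
        if isinstance(node, list):
            # List elements are addressed by their position.
            node = node[int(step)]
        else:
            node = node[step]
    return node


organizer = {
    "tasks": [
        {"summary": "Prepare project proposal", "tags": ["School", "Urgent"]},
        {"summary": "Arrange a gathering", "tags": ["Friend", "Someday"]},
    ]
}

print(lookup(organizer, "tasks/0/summary"))  # Prepare project proposal
print(lookup(organizer, "tasks/1/tags/0"))   # Friend
```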
Lack of namespace support
YAML does not provide any namespace mechanism. Without a namespace concept, it is not possible to combine different YAML documents unambiguously.
Comparing with XML directly
While XML is the most commonly used data format, it has some problems. One frequently heard complaint about XML is its space efficiency.
The following is a sample Maven POM file, which describes the build configuration of a project or module:
<project>
  <modelVersion>4.0.0</modelVersion>
  <name>Maven Default Project</name>

  <repositories>
    <repository>
      <id>central</id>
      <name>Maven Repository Switchboard</name>
      <layout>default</layout>
      <url>http://repo1.maven.org/maven2</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>central</id>
      <name>Maven Plugin Repository</name>
      <url>http://repo1.maven.org/maven2</url>
      <layout>default</layout>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <releases>
        <updatePolicy>never</updatePolicy>
      </releases>
    </pluginRepository>
  </pluginRepositories>

  <build>
    <directory>target</directory>
    <outputDirectory>target/classes</outputDirectory>
    <finalName>${artifactId}-${version}</finalName>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <sourceDirectory>src/main/java</sourceDirectory>
    <scriptSourceDirectory>src/main/scripts</scriptSourceDirectory>
    <testSourceDirectory>src/test/java</testSourceDirectory>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
      </resource>
    </resources>
    <testResources>
      <testResource>
        <directory>src/test/resources</directory>
      </testResource>
    </testResources>
  </build>

  <reporting>
    <outputDirectory>target/site</outputDirectory>
  </reporting>

  <profiles>
    <profile>
      <id>release-profile</id>

      <activation>
        <property>
          <name>performRelease</name>
        </property>
      </activation>

      <build>
        <plugins>
          <plugin>
            <inherited>true</inherited>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-source-plugin</artifactId>

            <executions>
              <execution>
                <id>attach-sources</id>
                <goals>
                  <goal>jar</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <inherited>true</inherited>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-javadoc-plugin</artifactId>

            <executions>
              <execution>
                <id>attach-javadocs</id>
                <goals>
                  <goal>jar</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <inherited>true</inherited>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-deploy-plugin</artifactId>

            <configuration>
              <updateReleaseInfo>true</updateReleaseInfo>
            </configuration>
          </plugin>
        </plugins>
      </build>
    </profile>
  </profiles>

</project>
If it is modeled with YAML, it will look something like this:
project:
  modelVersion: 4.0.0
  name: Maven Default Project

  repositories: 
    - id: central
      name: Maven Repository Switchboard
      layout: default
      url: http://repo1.maven.org/maven2
      snapshots: 
        enabled: false

  pluginRepositories: 
    - id: central
      name: Maven Plugin Repository
      url: http://repo1.maven.org/maven2
      layout: default
      snapshots: 
        enabled: false
      releases: 
        updatePolicy: never
      
  build: 
    directory: target
    outputDirectory: target/classes
    finalName: ${artifactId}-${version}
    testOutputDirectory: target/test-classes
    sourceDirectory: src/main/java
    scriptSourceDirectory: src/main/scripts
    testSourceDirectory: src/test/java
    resources: 
      - directory: src/main/resources
    testResources: 
      - directory: src/test/resources

  reporting: 
    outputDirectory: target/site

  profiles: 
    - id: release-profile

      activation: 
        property: 
          name: performRelease

      build: 
        plugins: 
          - inherited: true
            groupId: org.apache.maven.plugins
            artifactId: maven-source-plugin
            executions: 
              - id: attach-sources
                goals: 
                  goal: jar

          - inherited: true
            groupId: org.apache.maven.plugins
            artifactId: maven-javadoc-plugin
            executions: 
              - id: attach-javadocs
                goals: 
                  goal: jar
                
          - inherited: true
            groupId: org.apache.maven.plugins
            artifactId: maven-deploy-plugin
            configuration: 
              updateReleaseInfo: true
One notable difference between these two documents is that the YAML one is more readable. Looking into the details, the XML version is 2941 bytes while the YAML version is only 1714 bytes, which is effectively a 42% reduction in size. In addition, the XML version has 110 lines while the YAML version has only 68, which is a 38% reduction.
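The percentages quoted above can be re-derived with a few lines of Python. The helper name percent_reduction is made up for this example; the byte and line counts are the ones given in the text.

```python
def percent_reduction(original: int, reduced: int) -> float:
    """Return how much smaller `reduced` is than `original`, in percent."""
    return (original - reduced) / original * 100


# Size in bytes: XML version vs. YAML version.
print(f"{percent_reduction(2941, 1714):.1f}%")  # 41.7%

# Number of lines: XML version vs. YAML version.
print(f"{percent_reduction(110, 68):.1f}%")  # 38.2%
```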
The difference is mainly due to the fact that YAML doesn't require a balancing end tag. In addition, YAML natively provides constructs for sequences and maps. For example, to model the list of repositories in XML, we need two tags (repositories and repository), plus two extra lines for the end tags. In YAML, an end tag is never needed.

Conclusion

Considering both the pros and cons, YAML has a great advantage in human friendliness and clarity, while the lack of schema and namespace support will prevent it from being employed in large-scale or complex environments.
