23 July, 2005

ID3 tag data and encoding issue

I finally spent some time looking into the encoding issue with ID3 tags. The following is what I have learned so far!

ID3v1 and ID3v2 are two totally different things despite their common purpose. ID3v1 (and ID3v1.1) is very simple: 128 bytes appended to the end of the media file. ID3v2 is a more formal specification, although the tag is still an informal standard.

ID3v1 and ID3v1.1
The ID3v2 website has two very good articles on the background and the format.
ID3v2 (versions 2.2.0, 2.2.1, 2.3.0 and 2.4.0)
A much more mature specification for music tag data. The documents can be found here. (ID3v2.0)

What encoding is used to encode text data?

With ID3v1, there is no place to store the encoding / codepage of the text data (title, artist name, etc.). In practice, a locale-dependent encoding is used!
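
To make the point concrete, here is a minimal sketch of reading an ID3v1 / ID3v1.1 tag in Python. The function names (`parse_id3v1`, `read_id3v1`) are my own and not taken from any library. Because the tag does not record an encoding, the caller has to guess one and pass it in; the `latin-1` default is only an assumption.

```python
import os
import struct
from typing import Optional

ID3V1_SIZE = 128  # the whole tag is a fixed 128-byte block at the end of the file


def parse_id3v1(tail: bytes, encoding: str = "latin-1") -> Optional[dict]:
    """Parse a 128-byte ID3v1/ID3v1.1 block; return None if it is not a tag."""
    if len(tail) != ID3V1_SIZE or not tail.startswith(b"TAG"):
        return None
    # Layout after the 3-byte "TAG" marker:
    # title(30) artist(30) album(30) year(4) comment(30) genre(1)
    title, artist, album, year, comment, genre = struct.unpack(
        "<30s30s30s4s30sB", tail[3:]
    )
    track = None
    # ID3v1.1: a zero byte at comment[28] followed by a non-zero byte
    # means the last comment byte holds the track number.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]

    def text(raw: bytes) -> str:
        # Fields are NUL-padded (sometimes space-padded). Nothing in the tag
        # says which encoding these bytes are in, so `encoding` is a guess.
        return raw.split(b"\x00", 1)[0].decode(encoding, errors="replace").rstrip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "comment": text(comment),
        "track": track,
        "genre": genre,
    }


def read_id3v1(path: str, encoding: str = "latin-1") -> Optional[dict]:
    """Read the last 128 bytes of a file and try to parse them as ID3v1."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < ID3V1_SIZE:
            return None
        f.seek(-ID3V1_SIZE, os.SEEK_END)
        return parse_id3v1(f.read(ID3V1_SIZE), encoding)
```

For a file tagged on a Big5 system one would call something like `read_id3v1("song.mp3", "big5hkscs")`; with the wrong guess the same bytes decode to garbage.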

With ID3v2, text data is stored inside a 'text frame'. Each text frame starts with a frame header (frame ID, frame size and, from ID3v2.3.0 on, flags), which is followed by a "text encoding description byte" and then the text itself. However, according to the specification, if text data is not stored as Unicode, it must be stored as ISO8859-1.

For ID3v2.3.0, the "text encoding description byte" has two possible values:

0x00 : ISO8859-1 (latin1)
0x01 : Unicode (UCS-2) (MUST BEGIN with BOM and END with 0x0000)

For ID3v2.4.0, the "text encoding description byte" has four possible values:

0x00 : ISO8859-1 (latin1)
0x01 : UTF-16 (BEGIN with BOM and END with 0x0000)
0x02 : UTF-16BE (NO BOM; END with 0x0000)
0x03 : UTF-8 (END with 0x00)
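
The table above maps fairly directly onto Python codecs. The following is a rough sketch, not a full frame parser: `decode_text_frame_body` is a made-up name, it expects only the frame body (encoding byte plus text, with the frame header already stripped), and it simply drops trailing terminators instead of handling multiple NUL-separated values.

```python
# Encoding byte -> Python codec. 0x02 and 0x03 exist only in ID3v2.4.0.
_CODECS = {
    0x00: "latin-1",    # ISO8859-1
    0x01: "utf-16",     # BOM decides the byte order; covers the UCS-2 case of v2.3.0
    0x02: "utf-16-be",  # no BOM, always big endian
    0x03: "utf-8",
}


def decode_text_frame_body(body: bytes, allow_v24: bool = True) -> str:
    """Decode the body of a text frame: one encoding byte, then the text."""
    if not body:
        raise ValueError("empty frame body")
    enc = body[0]
    if enc not in _CODECS or (enc >= 0x02 and not allow_v24):
        raise ValueError("unknown text encoding byte: 0x%02x" % enc)
    text = body[1:].decode(_CODECS[enc])
    # The terminator (0x00 or 0x0000) decodes to U+0000; strip it if present.
    return text.rstrip("\x00")
```

Note that a compliant reader has no choice for 0x00: the bytes are taken as ISO8859-1, whatever the tagger actually wrote.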

Issues:

ID3v2 provides much more functionality and is far better than the old ID3v1. However, one problem plagues the usability of ID3v2. To comply with the specification, text data inside an ID3v2 tag MUST be encoded as Unicode if it cannot be encoded as ISO8859-1. Some software simply encodes the text data with the legacy encoding (the one used for ID3v1) and sets the "text encoding description byte" to 0x00 (ISO8859-1). When a user opens the music file with truly standard-compliant software, the tag is not rendered correctly, because that software treats the text data as ISO8859-1 even though it is actually in the legacy encoding. Clearly, the problem will not go away in the near future: most users have hundreds (if not thousands) of music files on their hard disks, and it is impractical to convince everyone to convert the old files just for standard conformance.
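
The mismatch is easy to reproduce. Below is a small illustration in Python, with invented function names, of a non-compliant writer and a compliant reader disagreeing about the same bytes. It assumes Big5-HKSCS as the legacy encoding, purely as an example.

```python
def write_noncompliant(title: str, legacy_encoding: str) -> bytes:
    """What a non-compliant tagger does: legacy bytes labelled 0x00."""
    return b"\x00" + title.encode(legacy_encoding)


def read_compliant(body: bytes) -> str:
    """What a compliant reader does with encoding byte 0x00."""
    if body[0] != 0x00:
        raise ValueError("this sketch only handles encoding byte 0x00")
    return body[1:].decode("latin-1")


body = write_noncompliant("中文", "big5hkscs")
shown = read_compliant(body)  # garbled: one Latin-1 character per Big5 byte

# Latin-1 maps every byte to exactly one code point, so nothing is lost.
# Once the real encoding is known, the text can be recovered.
recovered = shown.encode("latin-1").decode("big5hkscs")
print(repr(shown), "->", recovered)
```

The damage is therefore cosmetic rather than destructive, which is what makes a reader-side workaround possible at all.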

This problem has been discussed a few times on various mailing lists and bug trackers.

  1. Rhythmbox mailing lists
  2. id3lib mailing lists
  3. Gstreamer bugzilla

Workaround:

To overcome the problem, existing software has to provide a workaround. For example, when 'GST_ID3_TAG_ENCODING' is set, gst-plugins-mad will interpret non-Unicode text data as legacy-encoded strings instead of ISO8859-1 encoded strings.

$ export GST_ID3_TAG_ENCODING=big5hkscs
$ rhythmbox

For eyeD3, I have made a patch myself to do something similar; you can use '--legacy-encoding' to specify the legacy encoding to use.

$ eyeD3 --legacy-encoding big5hkscs [other options...]
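
The idea behind both workarounds can be sketched in a few lines of Python. This is not the actual code of gst-plugins-mad or of my eyeD3 patch; the function name and the `ID3_LEGACY_ENCODING` environment variable are hypothetical, chosen only to show the shape of the logic: try the user-supplied legacy encoding first, and fall back to what the specification says.

```python
import os
from typing import Optional


def decode_non_unicode(raw: bytes, legacy_encoding: Optional[str] = None) -> str:
    """Decode text labelled as ISO8859-1, honouring a legacy-encoding override."""
    # Hypothetical environment variable, used only for this illustration.
    encoding = legacy_encoding or os.environ.get("ID3_LEGACY_ENCODING")
    if encoding:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Bytes invalid in that encoding, or unknown codec name:
            # fall back to the specification's answer.
            pass
    return raw.decode("latin-1")
```

Falling back to ISO8859-1 keeps correctly tagged files readable even when the override is set, as long as their bytes happen to be invalid in the legacy encoding; files that are valid in both will still be misread, so the override remains a per-user choice.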

Related software

The following list is only a small subset of existing software, but these are the ones I have paid (or will pay) attention to :)