
Objection against the "widetext" media type


I just became aware of the proposal for a main MIME media type
called "widetext" [draft-hoffman-widetext-01.txt, December 12, 1998].
I don't think that this is a good idea. My objection is twofold:

(1) character set issues should not determine media types, and
    specifically not main media types.

(2) the widetext media type would not be necessary if the MIME
    specification were revised to remove an overspecification.

ad 1: while I cannot argue from some law cast in stone, I appeal to
common sense. MIME main media types should be very rough categories
of data. Text, being largely expressions of written human language,
is one of them. A UTF-16 encoded Unicode text is 100% semantically
equivalent to the same text in UTF-8 encoding, and it seems very
strange that two such closely related kinds of data should be
scattered over two major MIME types.
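
This equivalence is trivial to check; in Python (an illustration of my
own, not from the draft), the same character sequence round-trips
through either encoding:

```python
# Any Unicode text: the characters are identical regardless of
# whether they travel as UTF-8 or UTF-16 octets.
text = "Grüße, Привет, こんにちは"
assert text.encode("utf-8").decode("utf-8") == text
assert text.encode("utf-16-be").decode("utf-16-be") == text
```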

ad 2: the widetext media type would not be needed because UTF-16
encoded Unicode text could be sent using

  Content-type: text/plain; charset=ISO-10646-UCS-2
  Content-transfer-encoding: base64
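
As a sketch of what such a message would look like on the wire (a
Python illustration of my own; note that for text confined to the
Basic Multilingual Plane, the UTF-16BE octets coincide with UCS-2):

```python
import base64

# Plain text containing non-ASCII characters.
text = "Grüße aus Indianapolis\r\n"

# Encode the characters as UCS-2 / UTF-16BE octets, then base64 them
# so the body survives 7-bit transport.
body = base64.encodebytes(text.encode("utf-16-be")).decode("ascii")

message = (
    "MIME-Version: 1.0\r\n"
    "Content-Type: text/plain; charset=ISO-10646-UCS-2\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "\r\n"
    + body
)
print(message)
```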

The draft justifies the need for widetext in section 2 by an
uncritical adoption of RFC 2046 (MIME Media Types), section 4.1.1:

   The canonical form of any MIME "text" subtype MUST always represent a
   line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME
   "text" MUST represent a line break. Use of CR and LF outside of line
   break sequences is also forbidden.

While there is no doubt that MIME [RFC2045] defines CRLF as the
octet sequence (0x0D, 0x0A), the text quoted above seems to be an
unwise overspecification; it should be revised rather than serve as
the basis for an encoding-dependent media type.
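
To see why section 4.1.1, read at the octet level, conflicts with
UTF-16, consider how the CR and LF *characters* come out as octets (a
Python illustration of my own):

```python
# The CR LF characters encoded under three charsets; only US-ASCII
# (and supersets of it) yields the bare octet pair 0x0D 0x0A.
print("\r\n".encode("us-ascii"))    # octets 0x0D 0x0A
print("\r\n".encode("utf-16-be"))   # octets 0x00 0x0D 0x00 0x0A
print("\r\n".encode("utf-16-le"))   # octets 0x0D 0x00 0x0A 0x00
```

In UTF-16, of either byte order, the contiguous octet pair (0x0D, 0x0A)
simply never occurs as the encoding of a line break.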

The intention of RFC2046 obviously is to allow all text/* media types
to be displayed verbatim in the absence of special rendering software.
Section 4.1 notes:

   Beyond plain text, there are many formats for representing what might
   be known as "rich text".  An interesting characteristic of many such
   representations is that they are to some extent readable even without
   the software that interprets them.  It is useful, then, to
   distinguish them, at the highest level, from such unreadable data as
   images, audio, or text represented in an unreadable form. In the
   absence of appropriate interpretation software, it is reasonable to
   show subtypes of "text" to the user, while it is not reasonable to do
   so with most nontextual data. Such formatted textual data should be
   represented using subtypes of "text".

However, the mechanism for using character sets other than US-ASCII
for plain text includes character sets that are not US-ASCII
compatible (e.g., ebcdic-cp-us, or EUC-JP). For the note quoted above
to make sense, the character set parameter must be obeyed for
display. Given that, the strict adherence to the verbatim octet
sequence (0x0D, 0x0A) required by section 4.1.1 seems to be a useless
overspecification.
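
ebcdic-cp-us makes the point nicely (a Python illustration of my own,
using Python's cp037 codec for ebcdic-cp-us):

```python
# 'Hello' in ebcdic-cp-us (IBM code page 037): without honoring the
# charset parameter, these octets are gibberish on an ASCII-based
# display -- only charset-aware decoding recovers the text.
octets = "Hello".encode("cp037")
print(octets)                   # b'\xc8\x85\x93\x93\x96'
print(octets.decode("cp037"))   # Hello
```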

The intention of RFC2046, i.e. to allow text/* data to be displayed
verbatim provided that the selected character set is supported, would
be met just as well by stating that line termination be indicated by
the sequence of the carriage return and line feed *characters* (not
US-ASCII octets). The wording of RFC2046 would not need to be changed
if RFC2045 section 2.1 were changed from:

   The term CRLF, in this set of documents, refers to the sequence of
   *octets* corresponding to the two US-ASCII characters CR (decimal value
   13) and LF (decimal value 10) which, taken together, in this order,
   denote a line break in RFC 822 mail.

to:

   The term CRLF, in this set of documents, refers to the sequence of
   *characters* CR and LF. For the default character set US-ASCII
   that applies to all RFC822 headers (including MIME headers), CRLF
   is the octet sequence (decimal value) 13, 10 which, taken
   together, in this order, denote a line break in RFC 822 mail.
What sense does it make to force a certain octet sequence into
character sets in which those octets mean entirely different things?
If the standard encoding of line breaks is of such utmost importance,
why is the encoding of "the Latin letter A" not ruled with the same
rigor? What sense does it make to allow for different character sets
and then to imply that their bit representation is compatible with
US-ASCII?
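
Again with ebcdic-cp-us (Python's cp037 codec) as the concrete case:

```python
# In ebcdic-cp-us the *characters* CR and LF are the octets
# 0x0D and 0x25 -- not the pair 0x0D 0x0A that RFC 2046 demands.
assert "\r\n".encode("cp037") == b"\x0d\x25"

# Conversely, the octet pair 0x0D 0x0A does not decode to CR LF:
# 0x0A is a different control character in EBCDIC.
assert b"\x0d\x0a".decode("cp037") != "\r\n"
print("CRLF characters in ebcdic-cp-us:", "\r\n".encode("cp037"))
```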

Rather than introducing a new high level MIME type, a subsequent
revision of the MIME standard should step back from such
encoding-dependent overspecification.

Gunther Schadow ----------------------------------- http://aurora.rg.iupui.edu
Regenstrief Institute for Health Care
1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
schadow@aurora.rg.iupui.edu ---------------------- #include <usual/disclaimer>
