2010-10-13

Standard delimiters

After decades of programming experience, I have only now learned that the ASCII spec defines delimiters. As a superset of ASCII, Unicode adopted these too.

  • Character 31 is for separating fields of data.
    – ASCII "unit separator" (US for short)
    – Unicode "INFORMATION SEPARATOR ONE"
  • Character 30 is for separating records of data.
    – ASCII "record separator" (RS)
    – Unicode "INFORMATION SEPARATOR TWO" 
If you need higher levels of hierarchy or grouping:
  • Character 29 is for separating groups of records.
    – ASCII "group separator" (GS)
    – Unicode "INFORMATION SEPARATOR THREE"
  • Character 28 is for separating groups of groups, that is, a document or file.
    – ASCII "file separator" (FS)
    – Unicode "INFORMATION SEPARATOR FOUR"
This blog and this Wikipedia page explain. Correctly using these delimiters is properly called ASCII Delimited Text.
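
As a rough illustration, here is a minimal Python sketch of the two lowest levels (fields and records). It assumes the data itself never contains characters 30 or 31, and the names `encode_records` and `decode_records` are made up for this example, not part of any standard library.

```python
US = "\x1f"  # character 31, unit separator: goes between fields
RS = "\x1e"  # character 30, record separator: goes between records


def encode_records(records):
    """Join a list of records (each a list of field strings) into one string."""
    return RS.join(US.join(fields) for fields in records)


def decode_records(text):
    """Split a string produced by encode_records back into lists of fields."""
    return [record.split(US) for record in text.split(RS)]


# Commas, tabs, and newlines inside the fields survive the round trip,
# with no quoting or escaping needed.
rows = [["Smith, John", "42 Main St.\nApt 3"], ["Doe, Jane", "tab\there"]]
assert decode_records(encode_records(rows)) == rows
```

The same pattern would extend upward with characters 29 and 28 for groups and files.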

These delimiters were defined way back in the 1960s! So why on earth has common practice been the perversion of Tab, Linefeed, Carriage Return, Comma, etc. for delimiting data? Perhaps because text editors are built as stripped-down word-processors for visualizing text on screen, yet are often used for viewing and editing data streams. Tab, Linefeed, Carriage Return, Comma, etc. characters were meant to control presentation of text, but delimiters are meant to identify chunks of data in a stream. Since Tab, Linefeed, Carriage Return, Comma, etc. can very well be present inside a valid text stream, it is silly to use these as delimiters.
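
To make the problem concrete, here is a small Python sketch with made-up sample data. It compares a naive comma-delimited line (no quoting rules applied) against the same fields joined with character 31.

```python
record = ["Smith, John", "Portland, OR"]  # two fields, each containing a comma

# Naive comma-delimited: commas inside the data are mistaken for delimiters.
comma_joined = ",".join(record)
comma_fields = comma_joined.split(",")  # four pieces come back instead of two

# Unit-separator-delimited: the commas are just ordinary text.
us_joined = "\x1f".join(record)
us_fields = us_joined.split("\x1f")  # the original two fields come back

assert len(comma_fields) == 4
assert us_fields == record
```

Real CSV handling works around this with quoting and escaping rules, which is exactly the extra machinery a dedicated delimiter avoids.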

There are two distinct purposes being served by these code points:
  • Text white-space
    Presentation of text characters on screen or on paper for display to humans, controlling placement of text in column position and lines, spreadsheet cells, and so on.
  • Information separators
    Identify meaningful chunks of information inside a data stream, and show a hierarchical relationship of those chunks.

Mixing up the two (text white-space vs. information separators) is like serving the plastic green frilly stuff found between butchers' trays inside your hamburger.

As for white-space, carriage return, linefeed, tab, form feed, and such were originally intended to control presentation of data on a printer. Unfortunately, their purpose has been bent into indicating end-of-line and end-of-paragraph. Unicode had the bright idea of defining two new characters just for those purposes, to eliminate the ambiguities. Sadly, I've never seen them used.

• U+2028  (decimal 8,232) = LINE SEPARATOR
• U+2029  (decimal 8,233) = PARAGRAPH SEPARATOR
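
For what it's worth, here is a short Python sketch of how text using these two characters could be taken apart. The sample string and the variable names are invented for illustration.

```python
LINE_SEPARATOR = "\u2028"       # decimal 8,232
PARAGRAPH_SEPARATOR = "\u2029"  # decimal 8,233

text = "line one\u2028line two\u2029second paragraph"

# Split into paragraphs first, then each paragraph into its lines.
paragraphs = [p.split(LINE_SEPARATOR) for p in text.split(PARAGRAPH_SEPARATOR)]
assert paragraphs == [["line one", "line two"], ["second paragraph"]]

# Python's built-in str.splitlines() treats both characters as line
# boundaries, so it flattens the paragraph structure.
assert text.splitlines() == ["line one", "line two", "second paragraph"]
```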

1 comment:

  1. Interesting. It's a shame they are not widely supported. I've always found CSV annoying with its use of comma as the separator.

    I guess the problem was these control characters are not easy to type. Still, it would be nice if file formats (for spreadsheets and the like) used these characters as standard.
