33 Years of the Digest ... founded August 21, 1981
Copyright © 2014 E. William Horne. All Rights Reserved.

The Telecom Digest for Sep 20, 2014
Volume 33 : Issue 163 : "text" Format

Messages in this Issue:
Re: Is it time for a new charset in the Digest? (Gordon Burditt)
Re: Is it time for a new charset in the Digest? (Garrett Wollman)
Re: Is it time for a new charset in the Digest? (Michael Moroney)
Q.: at&t to sell SNET holdings to Frontier? (tlvp)

I like the dreams of the future better than the history of the past.  - Thomas Jefferson


See the bottom of this issue for subscription and archive details.

Date: Fri, 19 Sep 2014 00:52:37 -0500
From: gordonb.moz6t@burditt.org (Gordon Burditt)
To: telecomdigestsubmissions.remove-this@and-this-too.telecom-digest.org.
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <xcGdnSFVKt04WYbJnZ2dnUVZ_tmdnZ2d@posted.internetamerica>

> I've been using the ISO-8859-1 "Latin1" character set in the Digest
> for a few years now: we adopted it as the standard after a reader made
> me aware that there are no accented characters in ASCII, so I figured
> that I'd implement a way for him to spell his name properly, and also
> be able to add "Internationalization" to my résumé.
>
> I'm wondering if it's time for another change, either to one of the
> "transitional" Unicode formats, such as UTF-8, or perhaps to a
> permanent solution such as UCS-16.

There is no UCS-16. There are UCS-2 and UTF-16. The "TF" in "UTF" stands for "Transformation Format", not "Transitional Format". Another thing that uses the term "transitional" is HTML, which is not related to character sets.

I recommend that you go to UTF-8, or stick with ISO-8859-1 (or Windows-1252, which is a superset of ISO-8859-1). I don't think the other choices are reasonable. Trying to go with ISO-8859-*, where 15 different charsets with lots of overlap are distinguished by charset tags, is going to cause problems when someone using ISO-8859-X quotes someone using ISO-8859-Y, where X != Y, and characters outside the common subset are used.

I like UTF-8. I hope it becomes permanent for things like the web and email. It has the advantage that no byte sequence for any character is a subset of the byte sequence for any other character, so a pattern-search designed for ASCII still works. Actually, a lot of things "just work" with UTF-8 for programs expecting ASCII. That won't happen for UTF-16.

I hope UTF-16 and UCS-2 die out. They encourage a halfway solution in which characters with codes that won't fit in 16 bits aren't supported. They also have the byte-order abomination. They do NOT solve the issue of variable-width characters. Even UCS-4 or UTF-32 does not do that, due to the existence of "combining characters".

The byte order mark of UTF-16 is a problem for mail and news articles. Where do you put it? If it's before the headers, then most every mail and news server currently running will interpret it as part of the headers, mangling one of them, or worse, interpret it as a division between (no) headers and the body of the message, and you end up with a lot of rejected mail due to "missing" headers like From:, Subject: or Newsgroups:. If you put it at the start of the body, well, I can imagine the mess you end up with in replies to articles with quoting, even if everyone is using UTF-16. No BOM. Multiple conflicting BOMs. BOMs in the middle of text where they aren't looked at.

How often have you needed to translate something to be posted from whatever character set it was in to ISO-8859-1, and ended up with untranslatable characters? If the answer is "never", there's probably no pressing need to change. If your only concern is people's names, there may be no need to change, unless you get a lot of contributors with Japanese, Chinese, Korean, or Vietnamese names who still write in English. But if you are going to change, please choose UTF-8, not UTF-16.

One problem that often arises from using multiple charsets in a newsgroup or mailing list is that quoted text with charset A included in a post with charset B often results in a mess on the screens of readers. Using UTF-8 won't solve this, but it will reduce it. It's even worse when characters in charset A used in the quoted post have no equivalent in charset B (possible with, for example, ISO-8859-1 vs. ISO-8859-5). At least if charset B includes all the characters, translation is possible. Unless you try putting your foot down and claiming that all submissions must be in UTF-8, you'll probably still have to translate parts of some submissions.
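
One rough way to put a number on the "how often do you end up with untranslatable characters" question is to test each submission against the target charset. What follows is only a sketch in Python, not anything the Digest is known to run; the function name `untranslatable` and the sample strings are made up for the illustration.

```python
def untranslatable(text, charset="iso-8859-1"):
    """Return the distinct characters in text that charset cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            # This character has no equivalent in the target charset.
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    # "résumé" uses only characters that exist in ISO-8859-1.
    print(untranslatable("r\u00e9sum\u00e9"))
    # A Polish place name plus a Euro sign: some characters will not survive.
    sample = "\u0141\u00f3d\u017a costs 5 \u20ac"
    print(untranslatable(sample))
    # The same text is fully representable in UTF-8.
    print(untranslatable(sample, "utf-8"))
    # A Cyrillic letter fits ISO-8859-5 but not ISO-8859-1.
    print(untranslatable("\u0416", "iso-8859-5"), untranslatable("\u0416"))
```

An empty list means the text can be converted without loss. Anything else would have to be transliterated, replaced, or dropped by whoever does the translation.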
You should check out browser and mail reader support for various charsets. I believe the only required charsets for browsers are: ASCII, ISO-8859-1 ("Latin1"), Windows-1252 (a superset of ISO-8859-1), and UTF-8. I may be wrong about "required"; it may just mean "essential for the success of the program". In any case, a browser that does not support UTF-8 is going to miss out on a lot of the web.

In a survey of character sets used on the web in August, 2014 (http://w3techs.com/technologies/overview/character_encoding/all), these are some of the results (a web site may use more than one character set, so results may add to more than 100%, but not by much):
  Rank   Charset         Share
  #1     UTF-8           81.4%
  #2     ISO-8859-1       9.9%
  #3     Windows-1251     2.3%
  #4     GB2312           1.4%
  #5     Shift JIS        1.3%
  #6     Windows-1252     1.2%
  #7     GBK              0.4%
  ...
  #18    US-ASCII         0.1%
  ...
  ...    UTF-16          less than 0.1%
Web sites with unidentified character sets aren't counted. I presume that means that HTML with no charset tag is treated as "unidentified", not UTF-8, even if there's a rule that says untagged HTML should be treated as UTF-8.

However, even if a browser supports an encoding of Unicode (or several), it probably won't have all the fonts installed that are needed to render every character. That situation may also exist if a browser is trying to support all of the ISO-8859-* character sets. Characters covered by ISO-8859-* (accented letters, etc.) will probably be well-supported. Characters used by dead languages (e.g. Egyptian hieroglyphics and Linear B) will likely not be. You'll also have problems with unofficial additions to Unicode in the Private Use Areas (e.g. the Klingon language) due to lack of an official registrar. Somehow I doubt that you will have any submissions about Ancient Egyptian area codes or long-distance rates to the Klingon homeworld.

> I'd like to hear opinions from you, particularly if you have expertise
> in choosing character sets for online publications such as The Telecom
> Digest. TIA.

Well, I'm no expert, but from the survey you can see what webmasters have chosen for web pages. Given that lots of mail readers are web-based, this is probably significant.

Gordon L. Burditt
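
For anyone who wants to spot Private Use Area characters in a submission before worrying about fonts, a small check is possible with Python's standard `unicodedata` module. This is a sketch only: the helper name `private_use_chars` is invented here, and the use of U+F8D0 for Klingon is an unofficial convention (the ConScript registry, as far as I know), not part of Unicode itself.

```python
import unicodedata


def private_use_chars(text):
    """Return the characters in text whose general category is Co (private use)."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


if __name__ == "__main__":
    # U+F8D0 sits in the Basic Multilingual Plane's Private Use Area
    # (assumed here to be an unofficial Klingon letter).
    print(private_use_chars("tlhIngan \uf8d0"))
    # Egyptian hieroglyphs (U+13000) and Linear B (U+10000) are officially
    # assigned characters, so they are not flagged even if no font shows them.
    print(private_use_chars("\U00013000 \U00010000"))
```

Note that this only says whether a character has an official meaning. Whether a reader's system can actually draw it is a separate question about installed fonts.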

Date: Sat, 20 Sep 2014 04:47:01 +0000 (UTC)
From: wollman@bimajority.org (Garrett Wollman)
To: telecomdigestsubmissions.remove-this@and-this-too.telecom-digest.org.
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <lvj0s5$iq4$1@grapevine.csail.mit.edu>

In article <xcGdnSFVKt04WYbJnZ2dnUVZ_tmdnZ2d@posted.internetamerica>, Gordon Burditt <gordonb.moz6t@burditt.org> wrote:

>I like UTF-8. I hope it becomes permanent for things like the web
>and email. It has the advantage that no byte sequence for any
>character is a subset of the byte sequence for any other character,
>so a pattern-search designed for ASCII still works. Actually, a
>lot of things "just work" with UTF-8 for programs expecting ASCII.

Because it was designed specifically to have that property, of course. (UTF-8 is intellectually descended from FSS-UTF -- "file system safe" -- which was invented by the Plan 9 people at Bell Labs, at a time when the Unicode Consortium was dead set on 16-bit characters. The actual encoding is slightly different.)

Other than that, I agree with pretty much everything that Gordon says. (And I say that as someone whose universe is pretty much all still ISO 8859-1.) Because UTF-8 degrades gracefully to ASCII (erm, ISO 646), for most purposes, in English-language documents, there is no penalty to using it.

-GAWollman

--
Garrett A. Wollman
wollman@bimajority.org
Opinions not shared by
my employers.
What intellectual phenomenon can be older, or more oft
repeated, than the story of a large research program
that impaled itself upon a false central assumption
accepted by all practitioners? - S.J. Gould, 1993
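
The ASCII-compatibility property discussed in the message above can be tried out directly. The following is a minimal sketch in Python, assuming a tool that searches raw bytes for an ASCII pattern; the helper names `find_ascii` and `high_bytes_only` and the sample strings are invented for the illustration.

```python
def find_ascii(pattern, text, encoding):
    """Search the encoded bytes of text for an ASCII pattern, as a byte-oriented tool would."""
    return pattern.encode("ascii") in text.encode(encoding)


def high_bytes_only(text):
    """True if every byte of every non-ASCII character's UTF-8 encoding is 0x80 or above."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )


if __name__ == "__main__":
    plain = "area code 617"
    # Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
    print(plain.encode("ascii") == plain.encode("utf-8"))
    # Non-ASCII characters never contribute bytes in the ASCII range.
    print(high_bytes_only("r\u00e9sum\u00e9 \u20ac \u4142"))
    # U+4142 is the byte pair 0x41 0x42 ("AB") in big-endian UTF-16,
    # so a byte search for "AB" reports a match that is not really there.
    print(find_ascii("AB", "\u4142", "utf-16-be"))
    # The same character in UTF-8 contains no ASCII bytes at all.
    print(find_ascii("AB", "\u4142", "utf-8"))
    # A genuine "AB" is missed in UTF-16 because of the interleaved zero bytes.
    print(find_ascii("AB", "AB", "utf-16-le"))
```

The false match and the missed match in the UTF-16 cases are the kind of thing that does not happen with UTF-8, which is why ASCII-era tools tend to keep working on it.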

Date: Fri, 19 Sep 2014 23:46:25 +0000 (UTC)
From: moroney@world.std.spaamtrap.com (Michael Moroney)
To: telecomdigestsubmissions.remove-this@and-this-too.telecom-digest.org.
Subject: Re: Is it time for a new charset in the Digest?
Message-ID: <lvif8h$a0f$1@pcls7.std.com>

If you do decide to stick with ISO-8859-1, consider ISO-8859-15 instead. It is nearly the same as ISO-8859-1 except for a few minor differences that allow more languages. But the big difference is that ISO-8859-15 has the Euro character.
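
To see exactly which positions differ between the two charsets, one can decode every byte value both ways and compare. This is a small sketch using the codecs that ship with Python; the function name `charset_differences` is made up for the example.

```python
import unicodedata


def charset_differences(a="iso-8859-1", b="iso-8859-15"):
    """Map each byte value to (char_in_a, char_in_b) where the two charsets disagree."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        char_a, char_b = raw.decode(a), raw.decode(b)
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs


if __name__ == "__main__":
    # Print each differing byte with the character it means in each charset.
    for value, (old, new) in charset_differences().items():
        print(f"0x{value:02X}: {old!r} -> {new!r} ({unicodedata.name(new)})")
```

With these codecs the comparison turns up eight positions, including 0xA4, where the generic currency sign in ISO-8859-1 becomes the Euro sign in ISO-8859-15. Text that uses any of those eight bytes will display differently if it is tagged with the wrong one of the two charsets.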

Date: Fri, 19 Sep 2014 21:13:45 -0400
From: tlvp <mPiOsUcB.EtLlLvEp@att.net>
To: telecomdigestsubmissions.remove-this@and-this-too.telecom-digest.org.
Subject: Q.: at&t to sell SNET holdings to Frontier?
Message-ID: <lbi09usffthl.1jgji1ayb900d.dlg@40tude.net>

From a presorted First-Class U.S. Mail post card received today:

> Dear Valued Customer,
>
> Pending regulatory approval, Frontier Communications Corporation will
> assume ownership of the Southern New England Telephone Company (SNET)
> and SNET America, Inc. (SAI) as soon as late October 2014.

There's more, but it's just customer assuagement talk. And local loop at&t Customer Service reps seem not to have had much of a briefing yet on how to respond to customer inquiries about this, other than to reassure folks that Frontier's corporate headquarters is (grammar: are?) in Stamford, CT.

Thus far, only the marketing arm of Comcast seems to have had wind of this impending change, using it as an anti-carrot with which to lure internet clients away from at&t/Yahoo! HSI DSL services to Comcast cable.

What can the repercussions be of this change on local loop service, DSL (or other high-speed ISP) service, the continued existence of the email domains sbcglobal.net, att.net, snet.net, etc., pricing, and billing?

Cheers, & thanks in advance, -- tlvp (from deep in the heart of SNET-land)
--
Avant de repondre, jeter la poubelle, SVP.

TELECOM Digest is an electronic journal devoted mostly to telecommunications topics. It is circulated anywhere there is email, in addition to Usenet, where it appears as the moderated newsgroup 'comp.dcom.telecom'.

TELECOM Digest is a not-for-profit educational service offered to the Internet by Bill Horne.

The Telecom Digest is moderated by Bill Horne.
Contact information: Bill Horne
Telecom Digest
43 Deerfield Road
Sharon MA 02067-2301
339-364-8487
bill at horne dot net
Subscribe: telecom-request@telecom-digest.org?body=subscribe telecom
Unsubscribe: telecom-request@telecom-digest.org?body=unsubscribe telecom

This Digest is the oldest continuing e-journal about telecommunications on the Internet, having been founded in August, 1981 and published continuously since then. Our archives are available for your review/research. We believe we are the oldest e-zine/mailing list on the internet in any category!

URL information: http://telecom-digest.org
Copyright © 2014 E. William Horne. All rights reserved.


Finally, the Digest is funded by gifts from generous readers such as yourself. Thank you!

All opinions expressed herein are deemed to be those of the author. Any organizations listed are for identification purposes only and messages should not be considered any official expression by the organization.


End of The Telecom Digest (4 messages)
