Talk:ISO/IEC 8859-1

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by TiffaF (talk | contribs) at 16:27, 3 July 2012 (English (UK and US) - UK needs £, US does not.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

windows-1252

Why does windows-1252 redirect to this page? 1252 is NOT the same as iso-8859-1. Redirecting references to windows-1252 to this page (I think) reinforces the mistaken impression that 1252 and 8859-1 are the same (and they are most definitely not). Perhaps this deserves an entire page, but Microsoft's loose labeling of email with (close but not exact) MIME character sets is a BIG problem.

You're free to make the relevant section a (linked) article of its own. The section, however, tries to make it clear that although both CP1252 and Latin-1 are supersets of ISO/IEC 8859-1, they differ in the 8x and 9x area. The German version takes a slightly different approach in emphasizing the differences.
The advantage of keeping CP-1252 and MacRoman on the ISO-8859-1 page is that differences can be made more clear. And Windows (and IIS) misreports CP-1252 as ISO-8859-1 by default, so most people falsely assume CP-1252 *is* ISO-8859-1. Latin-1 of course is a valid implementation of ISO 8859-1, as it is nothing but an alias for ISO-8859-1. Jor 13:24, 12 Mar 2004 (UTC)
"Latin-1 of course is a valid implementation of ISO 8859-1, as it is nothing but an alias for ISO-8859-1." Wrong. ISO/IEC 8859-1 (the standard) only specifies the characters for the 20-7E and A0-FF byte ranges. The characters for bytes 00-1F and 7F-9F are left undefined. The ISO-8859-1 character map registered with the IANA fills the missing spots with the C0 and C1 control sets (defined elsewhere), thus covering 00-FF. This map's approved aliases are: ISO_8859-1:1987, iso-ir-100, ISO_8859-1, ISO-8859-1 (preferred MIME name), latin1, l1, IBM819, CP819, and csISOLatin1. - mjb 22:07, 12 Mar 2004 (UTC)
So remove the dash. Latin1 *is* a valid alias for ISO-8859-1, which is an encoding based on ISO 8859-1. Jor 22:17, 12 Mar 2004 (UTC)
Like it or not, in the internet world ISO-8859-1 is now always interpreted as windows-1252. Stuff from windows-1252 was being used by editors here all the time, with no reported problems, before the switch to UTF-8. Sometimes you just have to accept that formal standards and reality aren't the same thing. Plugwash 6 July 2005 10:53 (UTC)
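As a rough illustration of the alias list and the 00-FF coverage described above, here is a minimal Python sketch. It assumes that Python's built-in codec table is a fair stand-in for the IANA registration, which is an assumption of this example and not something the thread establishes; the variable names are made up for illustration.

```python
import codecs

# Aliases quoted earlier in this thread from the IANA registration of ISO-8859-1.
IANA_ALIASES = [
    "ISO_8859-1:1987", "iso-ir-100", "ISO_8859-1", "ISO-8859-1",
    "latin1", "l1", "IBM819", "CP819", "csISOLatin1",
]

# codecs.lookup() returns a CodecInfo; its .name is Python's canonical codec name.
canonical_names = {alias: codecs.lookup(alias).name for alias in IANA_ALIASES}

# Under this codec every byte decodes to the code point with the same number,
# so 00-1F and 7F-9F come out as C0/C1 control characters.
decoded = bytes(range(256)).decode("iso-8859-1")
identity = all(ord(ch) == value for value, ch in enumerate(decoded))

print(sorted(set(canonical_names.values())), identity)
```

If every alias resolves to one canonical name, Python treats them as the same character map, which matches the point that "latin1" (without the dash) names the registered map and not the bare standard.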

"However, the draft HTML 5 specification requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding." The citation does not make any indication of whether or not this statement is true. 194.232.128.102 (talk) 14:47, 11 September 2009 (UTC)

The current HTML5 draft does mention it in section 8.2.2.2. Of course these things are subject to change at any time. Plugwash (talk) 17:07, 30 January 2012 (UTC)
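For readers who want to see where the two encodings actually diverge, the following Python sketch compares the "latin-1" and "cp1252" codecs byte by byte. It is only a sketch: it relies on Python's codec tables, and `errors="replace"` is used because Python leaves a few cp1252 byte values unassigned.

```python
# Compare how the "latin-1" and "cp1252" codecs decode each byte value.
differing = []
for value in range(256):
    raw = bytes([value])
    as_latin1 = raw.decode("latin-1")
    # Unassigned cp1252 bytes become U+FFFD instead of raising an error.
    as_cp1252 = raw.decode("cp1252", errors="replace")
    if as_latin1 != as_cp1252:
        differing.append(value)

# The curved quotation marks live in the range where the two maps differ.
curly = b"\x93quoted\x94".decode("cp1252")

print(len(differing), f"{min(differing):02X}-{max(differing):02X}")
print([f"U+{ord(ch):04X}" for ch in (curly[0], curly[-1])])
```

All differences fall in the 8x and 9x rows, so a page that never uses those byte values reads the same under either label.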

correct quotation marks

In this respect ISO-8859-1 wasn't worse than a typewriter, or am I mistaken? So, only special characters for typesetting are missing from ISO-8859-1 (also ligatures, like ff, fi, ...). Pjacobi 23:02, 17 Sep 2004 (UTC)

You're not making yourself very clear, please elaborate. -- Ævar Arnfjörð Bjarmason 23:42, 2004 Sep 17 (UTC)

As far as I know, mechanical or electrical typewriters didn't provide symbols for typesetting either. They are or were lacking the different quotation marks, different length (typographic) hyphens, ligated versions of "ff" etc. Not that it would make much sense on a monospacing machine. Only the specialised input machines for typesetting had all those.

So, when defining coded character sets for use on computers, I don't think the lack of those signs can be viewed as failing to support the languages that use them in typesetting. Typesetting was done using specialised markup, shortcuts or automatic conversion (as done by troff).

Even today, Unicode considers some typesetting-related issues to be out of scope for coded character sets, a step back from the Adobe approach, which even put some "ff" ligatures in the "Expert Character Set".

In summary, I consider the remark "missing correct quotation marks" to be slightly misleading.

Pjacobi 12:51, 18 Sep 2004 (UTC)

In German (and I believe other languages as well) it is considered an orthographic error to use something like "speech" instead of „speech“, although »speech«, of which Latin1 is capable, can be acceptable, too. Width of dashes/hyphens and interword/intercharacter spaces is a different story completely. Crissov 18:40, 18 Sep 2004 (UTC)
No, it is considered bad typography. In handwriting or in a typescript, it is perfectly acceptable. -- Pjacobi 19:35, 18 Sep 2004 (UTC)
Especially in handwritten German it is not acceptable; ask any German teacher or the Duden. You could compare it to ")foo)" or "(foo(", which no-one would claim is correct. It can only be acceptable in technically limited environments. In English "foo" and “foo” are probably similar enough.
The Duden is no authority on glyph shapes. Using the " as glyph shape for both punctuation characters 'German beginning of direct speech' and 'German end of direct speech' is no orthographic error, at least not in Hamburg, Germany. Likewise the Duden doesn't regulate the glyph shapes for small "s" and "t", where there is much variation in German handwriting. Pjacobi 18:16, 19 Sep 2004 (UTC)
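Whatever one thinks of the orthography question, the encoding side can be checked directly. The Python sketch below tests which of the three quotation styles mentioned in this thread can be encoded at all; the helper name `can_encode` and the sample strings are made up for this example.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


samples = [
    ("low-high", "\u201espeech\u201c"),    # German low-9 opening, high closing mark
    ("guillemets", "\u00bbspeech\u00ab"),  # angle quotation marks
    ("straight", '"speech"'),              # typewriter-style quotes
]

results = {
    (label, codec): can_encode(sample, codec)
    for label, sample in samples
    for codec in ("ascii", "latin-1", "cp1252")
}

for key in sorted(results):
    print(key, results[key])
```

This only shows what each character map can store; it says nothing about which style is correct in German.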

Mac-Roman

Is it really best to remove the comparative chart between Mac-Roman and ISO-8859-1? Is it really true that Mac-Roman has no relation to ISO 8859-1 or ISO-8859-1? Also, there appear to be hyphenation differences throughout this page (e.g. MacRoman vs. Mac-Roman, CP1252 vs. CP-1252, etc.). GPHemsley 00:50, Mar 29, 2005 (UTC)

The Macintosh Roman character sets, Mac-Roman and MacRoman, both inherit the ASCII characters, but have nothing else in common with ISO-8859-1. Mac-Roman was introduced with the first Mac in 1984, so I don't think it could possibly be a descendant of ISO Latin. MacRoman changed one character from Mac-Roman (added the Euro). I'll update the confusing text in this article. Michael Z. 2005-03-29 01:30 Z
IIRC they do, however, cover much the same characters, which should probably be mentioned and possibly detailed somewhere.

lead section

The previous lead section was a one-liner, far shorter than Wikipedia:Guide_to_writing_better_articles#Lead_section recommends. Furthermore, it didn't even introduce two important variations (iso-8859-1 and windows-1252) which redirect here. I tried to expand it and was reverted by mjb (whose removals I have now reverted back). mjb claimed it was redundant, which is true, but Wikipedia:Guide_to_writing_better_articles#Lead_section clearly states "If the article is long (more than one page), the remainder of the opening paragraph should summarize it." A summary is by definition redundant with the more detailed information in the rest of the article. Plugwash 6 July 2005 10:15 (UTC)

But your "summary" was terrible. It introduced concepts and dove into technical details that are not required to achieve a basic understanding of the ISO/IEC 8859-1 standard. I'm not saying the article can't use another sentence in the intro, but if you re-read the article from the beginning, it sounds very sloppy when you immediately start talking about there being no control codes and certain code value ranges being reserved/unassigned — these topics were not even introduced yet and seem completely out of context at that point. I would also disagree with taking too strict an interpretation of the style guide; an intro paragraph does not need to summarize every topic that is mentioned in the article; if it can't introduce a topic without repeating or requiring one to read the whole article, then further simplification of the statements is advisable. — mjb 6 July 2005 11:12 (UTC)

ISO-8859-1 and windows-1252 redirect here and are not just misspellings, so they need to be introduced in the summary. If we don't do so then we are misleading users into thinking they are the same thing as ISO 8859-1. If you can think of a way of doing so without mentioning technical details then go for it. Plugwash 6 July 2005 11:14 (UTC)

Ah, see, that's the real issue; there are these redirects, and there is discussion in the article about these oft-confused character maps that are based on the standard. We can offer the reader this information without getting into any details that would require them to have already read the article. I've put one in, but perhaps it could be further improved. — mjb 6 July 2005 11:29 (UTC)

On another topic, do you have an opinion about "maintained by ISO and IEC"? I think it sounds awkward to say "ISO and IEC" rather than "the ISO and the IEC," but it seems equally awkward to put two "the"s in there. Is there a policy or style guide for using definite articles with organizations known by their initials? (The main point I was trying to make in the first sentence was that for a while, the standard was just "ISO 8859-1" and this is what everyone knows it as, but at some point the IEC became involved and any formal citation, especially an encyclopedia entry that has a responsibility not to perpetuate common errors, must say "ISO/IEC 8859-1".) — mjb 6 July 2005 11:34 (UTC)

Character table format

I've been seeing more and more 8-bit character table formats popping up in various articles. There are currently three different styles in use on this page alone. ASCII has two more, and Code page 437 has yet another. I think the template-based approach in the Code page 437 article is a good idea, but I'm not sure it's flexible enough to accommodate the kind of ad-hoc linking we have going on. Also, the auto-scaled 100% table widths are not ideal for all media. Other issues to consider are where to link each character to, and how much info to try to cram into each cell. We are discussing character linking over on Talk:Unicode. Questions to consider are below. — mjb 6 July 2005 12:02 (UTC)

  • Should we standardize the 8-bit character code chart formats?
  • What info should the charts contain?
  • What should the charts look like? Are column/row headings important?
  • Where should character entries link to? (see Talk:Unicode#Nifty_resource.)
  • What's the ideal representation of things like space, soft hyphen, and control codes?
  • What about difference highlighting? Keep?
  • Can we achieve these goals with a template?
My preferred way to handle character linking is to just let the character link to a page titled with itself and then redirect it to the most appropriate place. This allows all references to a character to be updated to point to the same place at once, as well as allowing users to enter those characters through the search box and be taken straight to the appropriate place. Plugwash 6 July 2005 22:54 (UTC)
Well, we could use templates, but I'm not sure the approach in Code page 437 is the best. There Template:chset-cell is used, which looks like this:
  • <span style="font-size: large; font-family: serif">&#x{{{1}}};</span><br /><small>{{{1}}}</small>
Those hexadecimal character references are in fact converted to UTF-8 by the Wikimedia software.
Furthermore it uses Template:chset-tableformat, Template:chset-left, Template:chset-ctrl and all are put into a table by hand. We could do the same and better with something like Template:8-bit charset, which would be used something like this:
  • {{8-bit charset|Name=ISO 8859-1|{{C0 control codes}}|{{ASCII character codes}}|7F|{{C1 control codes}}| A0|A1|A2|A3|A4|A5|A6|A7|A8|A9|AA|AB|AC|AD|AE|AF| B0|B1|B2|B3|B4|B5|B6|B7|B8|B9|BA|BB|BC|BD|BE|BF| C0|C1|C2|C3|C4|C5|C6|C7|C8|C9|CA|CB|CC|CD|CE|CF| D0|D1|D2|D3|D4|D5|D6|D7|D8|D9|DA|DB|DC|DD|DE|DF| E0|E1|E2|E3|E4|E5|E6|E7|E8|E9|EA|EB|EC|ED|EE|EF| F0|F1|F2|F3|F4|F5|F6|F7|F8|F9|FA|FB|FC|FD|FE|FF}}
where the templates contain just the hexcodes, e.g. Template:C0 control codes (Control character#Tables):
  • 00|01|02|03|04|05|06|07|08|09|0A|0B|0C|0D|0E|0F| 10|11|12|13|14|15|16|17|18|19|1A|1B|1C|1D|1E|1F
Oh, I just realized that we would have to take special care of control codes (and a few others), because they do not work with links and display, maybe:
  • NUL|SOH|STX|ETX|EOT|ENQ|ACK|BEL|BS|HT|LF|VT|FF|CR|SO|SI| DLE|DC1|DC2|DC3|DC4|NAK|SYN|ETB|CAN|EM|SUB|ESC|FS|GS|RS|US
I think giving alternatives with &#124; does not work (well) in templates. Anyhow, the 8-bit charset template would then build a 16×16 table out of the 256+1 arguments it received; {{{Name}}} would be put into the caption (|+). How that table should look (hex, dec, oct and/or bin headers, U+ codes [probably by reusing Template:chset-cell]) is open to discussion, but all those codepage and charset tables would look the same. The then unnecessary chset-* templates should be deleted. Christoph Päper 7 July 2005 15:46 (UTC)
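To make the proposal above concrete, here is a small Python sketch of the same idea: take a name plus 256 cell labels and arrange them into a 16×16 chart, with control codes shown by name because they cannot be displayed. This is not template code; the function names and the labels chosen for space, DEL, no-break space and soft hyphen are assumptions made for this example.

```python
# Names for the C0 control codes, in code order, as listed in the comment above.
C0_NAMES = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US"
).split()

# Characters that need a label because they do not display (hypothetical labels).
SPECIAL_LABELS = {0x20: "SP", 0x7F: "DEL", 0xA0: "NBSP", 0xAD: "SHY"}


def cell_label(code):
    """Return the text for one chart cell of the IANA ISO-8859-1 map."""
    if code < 0x20:
        return C0_NAMES[code]
    if code in SPECIAL_LABELS:
        return SPECIAL_LABELS[code]
    if 0x80 <= code <= 0x9F:
        return f"{code:02X}"  # C1 controls: show the hex code only
    return chr(code)


def build_chart(name, labels):
    """Arrange 256 cell labels into a caption plus sixteen rows of sixteen."""
    if len(labels) != 256:
        raise ValueError("expected exactly 256 cell labels")
    rows = [labels[start:start + 16] for start in range(0, 256, 16)]
    return {"caption": name, "rows": rows}


chart = build_chart("ISO 8859-1", [cell_label(code) for code in range(256)])
print(chart["caption"], len(chart["rows"]), chart["rows"][4][:4])
```

Keeping the layout in one function is the point of the suggestion: every code page chart would pass through the same `build_chart` step and so look the same, whatever headers are eventually agreed on.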

History of CP1252

I'm having a hard time finding what year Microsoft introduced code page 1252. I'm particularly interested in MS's support for the curved apostrophe and quotation marks ‘ ’ “ ”. The best I could find so far was "around 1986". — Hippietrail 01:41, 22 July 2005 (UTC)

My guess is it came in with the Windows concept of ANSI code pages. I don't know how far back that dates though (P.S. I notice that whatever font is used for standard Wikipedia text doesn't seem to differentiate between the opening and closing quotes, but the font I see in the edit box does). Plugwash 02:09, 22 July 2005 (UTC)
A minor point of interest is that the IANA did not accept Windows-1252 in its charset registry until early 2000, based on a proposal made in December 1999. The other Windows-125x code pages were accepted by the IANA in 1996 after being proposed by someone at Microsoft's Russian branch. — mjb 03:00, 23 December 2005 (UTC)
And windows-874 still isn't in the IANA's list despite being actively used by at least Outlook 2000. I pointed this out to the iana-charsets list but they didn't seem to care. Plugwash 08:46, 7 October 2006 (UTC)

Merge request

Someone tagged the article with a merge request. They apparently did not realize that this article forked off of the ISO/IEC 8859-1 article a while ago. Please present a case for the merge or the request will be removed. — mjb 03:00, 23 December 2005 (UTC)

It was part of a mass split done a while back by a fairly new user that created a LOT of small ugly stubs. I've linked all the merge tags to a proposal at the main ISO 8859 talk page. Please comment there if you don't want me to go ahead with the mass re-merging. Plugwash 17:16, 12 January 2006 (UTC)
These should NOT be remerged. They are (were) separate for a reason: ISO/IEC 8859-n is in no case identical to ISO-8859-n (when both exist, which is not always the case). These entries should be split again, with appropriate cross-references. Keka (who happens to have been involved with character set standardisation for many years), 2006-04-23.
True in a sense, but you could say the same about, say, jpeg and jfif. One is the formal standard left incomplete by standards body politics. The other is the equivalent real standard in use. Also in most cases the IANA defines ISO_8859-? the same as ISO-8859-? and an underscore is the standard substitute for a space where space can't be used. Plugwash 15:54, 26 April 2006 (UTC)

In the table in the section "Related character maps", at position (-4, 8-) in the table, the table links to the disambiguation page for index (it says "IND"). This shouldn't happen. However, I have absolutely no idea what kind of index it's referring to, so I didn't change it. Could somebody who knows more about this please change it to link to the specific type of index that it refers to? E946 04:59, 6 April 2006 (UTC)

The two pipe symbols

In the character chart, both the character | (value 7C) and the character ¦ (value A6) linked to the article about Pipe_(computing); but as far as I can see, that article only talks about the character | (7C).

I have changed both links to Vertical bar, which I believe gives more relevant information. --Oz1cz 14:57, 10 November 2006 (UTC)
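For anyone checking which character sits at each of the two positions, the short Python sketch below looks up the Unicode names for byte values 7C and A6 under the "latin-1" codec. It is only an aid for reading the chart, and the variable names are made up.

```python
import unicodedata

# Decode the two byte values discussed above and look up their Unicode names.
names = {}
for byte_value in (0x7C, 0xA6):
    ch = bytes([byte_value]).decode("latin-1")
    names[byte_value] = unicodedata.name(ch)
    print(f"{byte_value:02X} U+{ord(ch):04X} {names[byte_value]}")
```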

Line Feed / Newline

Why is there no encoding for "line feed / new line" in this standard? How does that work? 83.118.38.37 09:06, 9 February 2007 (UTC)

ISO 8859-1 vs. UTF-8

I think someone knowledgeable in this field should write a section with that name.

OK, the background: I installed a server software distro (Apache2Triad) and everything was working just fine, so I copied over a folder of webpages I had on another server (XAMPP). To my shock and dismay, this thing was displaying letters with diacritics like all those crappy sites from the 1990s that, I've noticed, are not even capable of displaying apostrophes on pages in English. I've also noticed a correlation between pages not being able to display apostrophes (or any diacritics) and the page having a charset like ISO 123456 or something (instead of UTF-8) in its HEAD section.

And yeah, when I changed the line specifying ISO 8859-1 to UTF-8 (in the httpd.conf file) I got all my diacritics (namely, Latvian) and nothing appears to have been broken.

So, basically, why would anyone need a charset like this when there's UTF-8, and what are the inclusion criteria for languages (lol)? It's not that I would have any issues with a charset not displaying Latvian diacritics, but there are carons (š) in Czech, for instance, and macrons (ā) in Japanese rōmaji script, and those are like legit languages (not to mention the apostrophes). So, basically, the article seriously lacks a rationale section explaining why anyone would use this encoding. 354d 22:05, 26 September 2007 (UTC)

I think the article ISO/IEC 8859 might answer your concerns! Theo 194.222.199.109 17:30, 30 September 2007 (UTC)
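The symptom described above (UTF-8 pages served with an ISO-8859-1 label) can be reproduced in a few lines. The Python sketch below is an illustration only, using the two letters mentioned in the question; it does not reproduce the Apache configuration itself.

```python
text = "\u0161\u0101"  # s with caron, a with macron

# What a UTF-8 page actually contains on disk.
utf8_bytes = text.encode("utf-8")

# A reader told the bytes are ISO-8859-1 decodes each byte as its own character.
misread = utf8_bytes.decode("latin-1")

# ISO-8859-1 itself has no code for either letter.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(utf8_bytes.hex(), len(text), len(misread), latin1_ok)
```

Two letters turn into four characters when misread, which is the garbling seen on the copied pages; changing the declared charset to UTF-8 makes the label match the bytes.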

I believe in the standard of no standardization —Preceding unsigned comment added by 24.121.199.103 (talk) 20:15, 7 April 2009 (UTC)

ISO-8859-1 table

This page is about ISO/IEC 8859-1. I think the single character table on this page should not represent the ISO-8859-1 printable character and control character set. 92.78.138.134 (talk) 20:27, 3 October 2010 (UTC)

Most of the other pages about character encodings show the control characters. I copied the comment off of one of them. However a few such as ISO-8859-3 show gray for the control characters. In any case there certainly should not be two tables, which is what was here before. Most casual readers would never figure out that the only difference is that the gray cells are switched to control characters and would be looking for differences in the letters! Spitzak (talk) 20:40, 4 October 2010 (UTC)
Yes, all pages should show the control characters of the encoding they are representing. My point is that ISO/IEC 8859-1:1998 has no control characters. So this surplus in information is wrong on the one hand and misleading on the other. It adds to the confusion around ISO/IEC 8859-1:1998, Windows Codepage 1252 and ISO-8859-1. There is no benefit in catering to "casual readers" who never look at those tables anyway. —Preceding unsigned comment added by 88.75.191.146 (talk) 17:25, 5 October 2010 (UTC)
It would seem that the correct approach is to show the actual characters of the character set being presented, and to leave all other code points blank (gray). So for this particular article, no control codes should be shown in the table. But in addition, there should be some mention of (and link to) the ISO control codes that the character set is generally used with. Interested readers can then click through to get more specific details about the "complete" character set code points as it is typically implemented. — Loadmaster (talk) 15:05, 6 October 2010 (UTC)
Link is at the top as C0 and C1 controls. Spitzak (talk) 19:53, 6 October 2010 (UTC)

I don't think it is a good thing to have an article about ISO-8859-1 and not have a table which would show all the characters in it. As I take it, right now I'm supposed to look at the table for ISO 8859-1, then read that ISO-8859-1 has more characters, then head to another article and figure out how those characters fit into this table. Very convenient. HotXRock (talk) 19:45, 12 October 2010 (UTC)

I did that originally but there seems to be a consensus that the description of the ISO character sets should not show the control characters. It is also true that in most use the majority of those values are not interpreted in any way by any definition of the assigned control characters, except for CR, LF, and perhaps TAB. All others tend to render representations or get interpreted as CP1252. Spitzak (talk) 00:15, 13 October 2010 (UTC)
Okay, if this article is not the place for ISO-8859-1, where should it be described then? Windows-1252 has a different set of control characters, and it is definitely less appropriate for describing ISO-8859-1. By the way, Windows-1252 lists differences from ISO-8859-1, but currently ISO-8859-1 doesn't have a corresponding table anywhere on Wikipedia to compare with. Should we create a separate article for ISO-8859-1 then? HotXRock (talk) 10:43, 13 October 2010 (UTC)
I suppose it's a choice between creating a section in this article to discuss ISO-8859-1 and its differences from ISO/IEC 8859-1, or creating an entirely separate article for it. I lean towards the former choice, since it's probably the most expected for users searching for either term and who probably don't know there is a difference between the two. — Loadmaster (talk) 19:46, 13 October 2010 (UTC)
This article "discusses the differences" several times. It says that ISO-8859-1 is this table with the addition of the C0 and C1 controls, and if you follow that link there are several tables of the control characters.
I did in fact try to make the table have the control characters but this idea was rejected. I do NOT want to see a return to the redundant two tables, which leads any novices to believe that some of the letters themselves are different and makes the whole subject look more complex than it is. Spitzak (talk) 23:14, 13 October 2010 (UTC)

Hexadecimal character codes

I removed the use of a "0x" prefix on hexadecimal character codes, because it appears that typically in these character set articles the decimal values are predominantly used, and when hexadecimal values are used they are labeled as such (e.g., "hex 80"). The use of the "0x" prefix is not universally recognized as implying hexadecimal, in spite of widespread use of C-like programming languages. In ISO and RFC documents for character sets, in fact, either decimal only is used (e.g., "128"), or a "row/column" notation (e.g., "8/0") is used. — Loadmaster (talk) 04:40, 7 October 2010 (UTC)
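To show how the notations mentioned above relate to each other, here is a small Python sketch that renders one code value in each style. The function name `notations` is hypothetical, and the slash form is built simply as "high four bits / low four bits" without zero padding, which is an assumption; the standards documents may format it differently.

```python
def notations(code):
    """Render one 8-bit code value in the notations discussed above."""
    if not 0 <= code <= 0xFF:
        raise ValueError("expected a value from 0 to 255")
    return {
        "decimal": str(code),
        "labeled_hex": f"hex {code:02X}",
        "c_style": f"0x{code:02X}",
        # High four bits, then low four bits, both written in decimal.
        "slash": f"{code >> 4}/{code & 0x0F}",
    }


example = notations(128)
print(example)
```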

Dutch language support

The explanation on Dutch language support is utter nonsense. The very existence of the mentioned IJ/ij as separate symbol is barely even known to Dutch-speaking people. The symbol doesn't even appear on Dutch or Belgian keyboards. As far as I can see, ISO/IEC 8859-1 supports the Dutch language completely. — Preceding unsigned comment added by Nyerguds (talk | contribs) 07:34, 1 July 2011 (UTC)

It was taught to us as a separate symbol in primary school. We called it de lange ij. We all type it as I + J but we also all know that this isn't perfect and that software's ability to convert it into the proper glyph can leave a lot to be desired. — Preceding unsigned comment added by 82.139.87.39 (talk) 23:42, 27 January 2012 (UTC)
You can't even see the difference between ij and ĳ. You have to select it to see the difference, and you barely save memory if you use it. Maybe there are some fonts in which ij would look better as a separate symbol, but otherwise it's useless. 80.101.107.21 (talk) 16:03, 23 April 2012 (UTC)
Whether it saves memory or looks better is completely beside the point. There is a world of difference between what we commonly type and what would be "correct" if computers only supported it. Example: Today most people use a single "-" to mean any of figure dash, en dash, em dash and so on, because all that ASCII / ISO-8859 supports is "-". That doesn't make it "right", and I seriously doubt a typographic system like TeX would drop "all those useless dashes" any time soon. -- DevSolar (talk) 10:03, 22 June 2012 (UTC)
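The distinction being argued over exists at the character level even when it is invisible on screen. The Python sketch below, offered only as an illustration, checks that the single-character ligature (U+0133) is a different string from the two letters, that "latin-1" cannot encode it, and that Unicode compatibility normalization (NFKC) folds it back to the two-letter form.

```python
import unicodedata

ligature = "\u0133"  # the single-character ij ligature
two_letters = "ij"   # plain i followed by j

# ISO-8859-1 has no code for the ligature, so encoding it fails.
try:
    ligature.encode("latin-1")
    ligature_in_latin1 = True
except UnicodeEncodeError:
    ligature_in_latin1 = False

# Compatibility normalization replaces the ligature with the two letters.
folded = unicodedata.normalize("NFKC", ligature)

print(unicodedata.name(ligature), ligature == two_letters,
      ligature_in_latin1, folded == two_letters)
```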

English (UK and US)

Why do we need to mention UK and US in parentheses in the list of supported languages? Even though some spellings do differ, there are no symbols unique to either of the two. IMHO it could be interesting to note that symbols adopted by smaller communities, such as the character é for say latté, or the character ö for say coöperation (as used by The New Yorker) are supported. elpincha (talk) 21:25, 16 November 2011 (UTC)

Yes there are differences, primarily the '£' symbol, which is not in 7-bit ASCII nor on a US keyboard. It is on a UK keyboard. This is why ASCII is not adequate for UK English usage; you need ISO8859-1, Windows 1252 etc.
There are also some imported words which have retained their accents, at least in British English, for example Café. There are also some other cases, such as the famous Brontë sisters. TiffaF (talk) 16:27, 3 July 2012 (UTC)
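The point about the pound sign can be checked against the codecs directly. The Python sketch below tries the three examples from this thread in ASCII, ISO-8859-1 and Windows-1252; the helper `encode_or_none` is a made-up name for this illustration.

```python
def encode_or_none(text, codec):
    """Return the encoded bytes, or None if the codec lacks a character."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


samples = ["\u00a3", "Caf\u00e9", "Bront\u00eb"]  # pound sign, Cafe with acute, Bronte with diaeresis
table = {
    sample: {codec: encode_or_none(sample, codec)
             for codec in ("ascii", "latin-1", "cp1252")}
    for sample in samples
}

for sample in samples:
    print(ascii(sample), table[sample])
```

ASCII fails on all three samples, while both 8-bit maps store each one with a single byte per character.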