Jump to content

Talk:Lempel–Ziv–Welch: Difference between revisions

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia
Content deleted Content added
Proposed merge: new section
rv 2013 nonsense
 
(25 intermediate revisions by 11 users not shown)
Line 1: Line 1:
{{WikiProject Computer science}}
{{Talk header}}
{{Talk header}}
{{WikiProject banner shell|class=C|
{{Technical|date=September 2010}}
{{WikiProject Computer science }}
}}


== counting mistake? ==
== counting mistake? ==
Line 10: Line 11:


: No. The stop code is not used as a symbol; the algorithm handles it differently than plaintext symbols. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 23:31, 16 December 2013 (UTC)
: No. The stop code is not used as a symbol; the algorithm handles it differently than plaintext symbols. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 23:31, 16 December 2013 (UTC)
شقرييؤبقيؤً تتغp,szxxrdcyvfrffdgyaxybggfgybedjbtgftgfddjbg <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/89.211.81.159|89.211.81.159]] ([[User talk:89.211.81.159|talk]]) 08:10, 18 December 2013 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->


== Incorrect derivation ==
== Incorrect derivation ==
Line 115: Line 115:
Even the above Python code misses the point somewhat. The idea is that the algorithm knows the word size as the dictionary grows, and the output is a bitstring. Using UTF-8 encoding of integer keys pads 8- to 14-bit keys to 2 bytes. The use of unicode strings is incredibly ugly anyway - unicode is synonymous with text, and shouldn't feature in an algorithm that compresses a sequence of bytes. [[User:DanPope|DanPope]] ([[User talk:DanPope|talk]]) 00:26, 15 February 2008 (UTC)
Even the above Python code misses the point somewhat. The idea is that the algorithm knows the word size as the dictionary grows, and the output is a bitstring. Using UTF-8 encoding of integer keys pads 8- to 14-bit keys to 2 bytes. The use of unicode strings is incredibly ugly anyway - unicode is synonymous with text, and shouldn't feature in an algorithm that compresses a sequence of bytes. [[User:DanPope|DanPope]] ([[User talk:DanPope|talk]]) 00:26, 15 February 2008 (UTC)


<source lang=python>
<syntaxhighlight lang=python>
def compress(uncompressed):
def compress(uncompressed):
"""Compresses a string to a list of output symbols"""
"""Compresses a string to a list of output symbols"""
Line 136: Line 136:
result += buffer.split()
result += buffer.split()
return result
return result
</syntaxhighlight>
</source>


That's my compressor... just needs a corresponding decompressor. [[User:DanPope|DanPope]] ([[User talk:DanPope|talk]]) 09:42, 15 February 2008 (UTC)
That's my compressor... just needs a corresponding decompressor. [[User:DanPope|DanPope]] ([[User talk:DanPope|talk]]) 09:42, 15 February 2008 (UTC)
Line 226: Line 226:


Answer: in the example, 6 codes were emitted with width 5 bits, and 11 codes were emitted at 6 bits. (Just count the codes under "Output -- bits".) -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 06:10, 30 April 2010 (UTC)
Answer: in the example, 6 codes were emitted with width 5 bits, and 11 codes were emitted at 6 bits. (Just count the codes under "Output -- bits".) -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 06:10, 30 April 2010 (UTC)

:Hmm. When I count the number of codes built in the extended dictionary, I see codes 27-31 added. These numbers are encoded in 5 bits. So wouldn't that be 5 5-bit codes? Likewise, the 6-bit codes are 32-41. So wouldn't that be 10 codes? <!-- Template:Unsigned --><small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Netskink|Netskink]] ([[User talk:Netskink#top|talk]] • [[Special:Contributions/Netskink|contribs]]) 12:23, 22 May 2017 (UTC)</small> <!--Autosigned by SineBot-->

:: No, the "encoded length" refers to the number of bits sent in the output, not to codes in the dictionary (many of which will not be transmitted). The point is the degree of compression of the encoded text, not the details of the compression machinery. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 13:04, 22 May 2017 (UTC)


Given this is the output in terms of decimal digits with spaces to separate the writes:
20 15 2 5 15 18 14 15 20 27 29 31 36 30 32 34 0

So there are two possible methods for the code count, here is what I came up with

(This table is wrong but what makes sense. ie. you know the code, so you emit the specified number of bits based upon encoding)
{| class="wikitable"
|-
! Emitted code as decimal !! emitted code as bit encoding !! bit width !! Notes
|-
| 20 || 10100 || 5 ||
|-
| 15 || 01111 || 5 ||
|-
| 2 || 00010 || 5 ||
|-
| 5 || 00101 || 5 ||
|-
| 15 || 01111 || 5 ||
|-
| 18 || 10010 || 5 ||
|-
| 14 || 01110 || 5 ||
|-
| 15 || 01111 || 5 ||
|-
| 20 || 10100 || 5 ||
|-
| 27 || 11011 || 5 ||
|-
| 29 || 11101 || 5 ||
|-
| 31 || 11111 || 5 ||
|-
| 36 || 100100 || 6 || from here on its six bit codes
|-
| 30 || 011110 || 6 || so this 5-bit code is now 5
|-
| 32 || 100000 || 6 ||
|-
| 34 || 100010 || 6 ||
|-
| 0 || 000000 || 6 ||
|}

(This table appears to be what is described in the text)
{| class="wikitable"
|-
! Emitted code as decimal !! emitted code as bit encoding !! bit width !! Notes
|-
| 20 || 10100 || 5 ||
|-
| 15 || 01111 || 5 ||
|-
| 2 || 00010 || 5 ||
|-
| 5 || 00101 || 5 ||
|-
| 15 || 01111 || 5 ||
|-
| 18 || 10010 || 5 || when this code is emitted, the dict has added 32 which requires 6 bits
|-
| 14 || 001110 || 6 || so the next codes are 6-bits
|-
| 15 || 001111 || 6 ||
|-
| 20 || 010100 || 6 ||
|-
| 27 || 011011 || 6 ||
|-
| 29 || 011101 || 6 ||
|-
| 31 || 011111 || 6 ||
|-
| 36 || 100100 || 6 ||
|-
| 30 || 011110 || 6 ||
|-
| 32 || 100000 || 6 ||
|-
| 34 || 100010 || 6 ||
|-
| 0 || 000000 || 6 ||
|}

Sorry if I am being pedantic. I'm just trying to figure out what is done before I convert to probability based coding. ie. more frequent occurring symbols are code
with shorter length codes. <!-- Template:Unsigned --><small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Netskink|Netskink]] ([[User talk:Netskink#top|talk]] • [[Special:Contributions/Netskink|contribs]]) 13:52, 22 May 2017 (UTC)</small> <!--Autosigned by SineBot-->

: This is perhaps the source of the misunderstanding: when the encoder adds code 32 to the dictionary, it issues all following codes with 6 bits rather than 5 bits even though the codes it transmits may not require the wider width. That is, it increases the code width immediately after generating a wider code in the dictionary, not later (when it first ''transmits'' a code requiring the wider width). This is required so that the decoder can keep track of the width of the codes: how is it to know when code width increases? Answer: when the decoder itself generates a wider code in its own dictionary, which it generates in parallel with the encoder. (''Added:'' There's a subtle difference: since the decoder is always one step behind the encoder, it increases the code width from p to p+1 when it generates the last code of width p, not the first code of length p+1, as the encoder does.) -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 14:15, 22 May 2017 (UTC)

Elphion, thank you for clearing that up. That makes sense. Perhaps the second table and your explanation should be added to the section. <!-- Template:Unsigned --><small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Netskink|Netskink]] ([[User talk:Netskink#top|talk]] • [[Special:Contributions/Netskink|contribs]]) 15:53, 22 May 2017 (UTC)</small> <!--Autosigned by SineBot-->


== Not Optimal ==
== Not Optimal ==
Line 265: Line 362:


Re: merging [[Reduced Offset Lempel Ziv]]: If it's got to go somewhere, [[LZ77 and LZ78]] is probably a better choice. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 21:29, 9 December 2015 (UTC)
Re: merging [[Reduced Offset Lempel Ziv]]: If it's got to go somewhere, [[LZ77 and LZ78]] is probably a better choice. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 21:29, 9 December 2015 (UTC)

== Not a good enough description of the encoding process. ==

It says:
Buffer input characters in a sequence ω until ω + next character is not in the dictionary. Emit the code for ω, and add ω + next character to the dictionary. Start buffering again with the next character.

But this is missing some key information.

After the first sentence (which is fine as it is), the second sentence is "Emit the code for ω, and add ω + next character to the dictionary". Now while this describes what to do if no match is found, it never describes what to do if a match IS found. If a match is found, what code should be emitted? Should no code be emitted if a match is found? Please explain. I'm trying to implement LZW in my software, and have turned to Wikipedia for the info that I will need to implement this in my software. Please provide a good enough description that a programmer can use that description to actually implement it in software. [[User:Benhut1|Benhut1]] ([[User talk:Benhut1|talk]]) 20:38, 22 August 2016 (UTC)
:Hi, {{Reply to|Benhut1|p=}}. Depending on what language you're using, you may already find it at the [http://rosettacode.org/wiki/LZW_compression Rosetta Code page about LZW], which has its implementation in a number of languages. While the article here will need updating, that should help you in the interim. Hope that helps! — [[User:Sasuke Sarutobi|Sasuke Sarutobi]] ([[User talk:Sasuke Sarutobi|talk]]) 13:05, 23 August 2016 (UTC)

::I disagree somewhat with OP. The high-level description of the encoding algorithm in section 1.1 makes it pretty clear that no output is generated while the buffered string matches entries in the dictionary. (The detailed example in section 2.1 shows this explicitly.) And the premise that a WP article should necessarily provide enough information to implement the algorithm in solid code goes way beyond what WP aims to do. This is an encyclopedia, not a text book. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 14:50, 25 August 2016 (UTC)

== Encoding -Rule 4 in conflict with Rule 5 (indirectly with rule 2) ==

Rule 4 says to include W and the following input into the dictionary
Rule 5 says to go to 2
Rule 2 says to find the longest string that matches current input
Wasn't the current input just added to the dictionary added to the dictionary? Rule 2 will always match what was just included in Rule 4.
I'm not exactly certain of the algorithm, so am not confident rewriting this section or I would. Sorry.
[[Special:Contributions/139.102.14.207|139.102.14.207]] ([[User talk:139.102.14.207|talk]]) 17:55, 25 February 2019 (UTC)

: Rule 3 removes W from the input, so when you loop back to rule 2, you will start scanning at the first character beyond W. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 21:46, 25 February 2019 (UTC)

== Comparison to other archive formats, in particular .tar.xz files ==

As a *nix user it would be interesting to see if wikipedia could show some comparisons with .tar.lz to other formats,
in particular compression rate, memory usage, time to create and extract such archives etc..

A simple table should suffice, but of course it has to be correct. [[Special:Contributions/2A02:8388:1641:8380:3AD5:47FF:FE18:CC7F|2A02:8388:1641:8380:3AD5:47FF:FE18:CC7F]] ([[User talk:2A02:8388:1641:8380:3AD5:47FF:FE18:CC7F|talk]]) 17:24, 10 May 2020 (UTC)

== Over-detailed algorithms ==

I have reverted edits to the description of the encryption and decryption algorithms, which have made the process more difficult to understand. The goal of WP is not to provide code for algorithms, but to describe them in enough detail that the reader understands how they work. -- [[User:Elphion|Elphion]] ([[User talk:Elphion|talk]]) 14:31, 2 May 2024 (UTC)
:Good job; [[WP:NOTHOWTO]]. [[User:TJRC|TJRC]] ([[User talk:TJRC|talk]]) 18:31, 2 May 2024 (UTC)

Latest revision as of 02:10, 26 November 2024

counting mistake?

[edit]

the article says: "There are thus 26 symbols in the plaintext alphabet (the 26 capital letters A through Z), plus the stop code #".

seems like it's suppose to be "There are thus 27 symbols..." , isn't it? — Preceding unsigned comment added by 109.66.119.64 (talk) 14:31, 16 December 2013 (UTC)[reply]

No. The stop code is not used as a symbol; the algorithm handles it differently than plaintext symbols. -- Elphion (talk) 23:31, 16 December 2013 (UTC)[reply]

Incorrect derivation

[edit]

The introduction to the article incorrectly states that LZW is derived from LZ77, when it's really derived from LZ78. I'll change this when I come up with a decent wording. --Kgaughan 22:12, 27 May 2005 (UTC)[reply]

  • There, I finally carried out my threat to fix that introductory paragraph. --Kgaughan 29 June 2005 21:11 (UTC)

IBM patent

[edit]

The IBM patent expires March 21, 2006. Calculation is later of 20 years from the filing date or 17 years from the issue date for patents filed before June of 1995.

PS relevant consideration is that 20 would be added to filing date of parent application, June 1, 1983 rather than filing date of the '746 patent which is August 11,1986. Believe that is where people are going awry; they are forgetting this is a continuation.

—Preceding unsigned comment added by 192.25.142.225 (talkcontribs) 23:09, 16 June 2005

redirects

[edit]

I've just added redirects to here from both Ziv-Lempel-Welch and ZLW. --Heavyphotons 07:55, 27 July 2005 (UTC)[reply]

Lack of detail

[edit]

It seems like this page is very, very lacking in detail about the workings of the LZW algorithm (Description of the algorithm). Any other thoughts? Stack 16:17, 28 July 2005 (UTC)[reply]

The algorithm is not written very clearly. I started to get confused when the author gave an example of 25 characters and told it had 27. i would appreciate it if the author explains it a bit more clearly.

Added info on the fact LZW is used in *.zip

[edit]

I added a bit of info on the fact that LZW is used in the *.zip fomate, but its not a great addition as-is, could someone help me clean it up? It also needs sourcing. I know that the OpenOffice.org help file has info on the fact that it uses .zip in its save formates (I don't know if the new 2.0 formate uses LZW or .zip), and the .zip infomation can be taken from the wikipedia page for .zip, but I don't have any idea where to get a source for .pdf. I know it contians LZW becuse Acrobate Reader 6 contained a note in its boot windown that "This product contians an implementation of the LZW algorithem" or something along those lines. Can someone help me sort this out?

ZIP and PDF both use Deflate, which is based on LZ77, while LZW is based on LZ78. --Matthäus Wander 00:26, 19 November 2005 (UTC)[reply]
Did some research: PDF used LZW in version 2.1: [1]

IBM Patent

[edit]

The article claimed that the IBM US patent expired on 11 Aug 2006, which is seven months in the future. I'm not sure when it will expire or if it has already (based on the patent I'd say it expired in 2005), so I removed the claim. If someone can verify that, I'd appreciate it. --DCrazy talk/contrib 18:39, 17 January 2006 (UTC)[reply]

Just caught the comment above. Seems there is a bit of dispute on this point, unless I'm misunderstanding it. --DCrazy talk/contrib 18:41, 17 January 2006 (UTC)[reply]

Encoding Example

[edit]

The encoding example uses codes of 5 and 6 bits, but the very beginning of this article states that the algorithm uses fixed-length strings. It might be that I'm just not grasping the concept, as I'm not familiar with the algorithm, but is it really possible to change the length of the code words on the fly? If so, how can one deduce in the decoding phase if the following code word is 5 or 6 bits long? 09:41, 7 November 2006 (UTC)

I think the example is flawed as stated above. I've struggled with understanding the concept from this page, and it seems to me that the only practical implementation would indeed use fixed-length codes. There's no mention in the example of how the decoder would know that some codes are 5-bit and some are 6-bit. This example should use all 6-bit codes, and mention that the encoder specifies at the beginning of the data that they are 6-bits long. Therefore, "Total Length = 5*5 + 12*6 = 97 bits." just doesn't make any sense, it's 5*6 + 12*6, and probably an extra byte (8 bits) at the beginning to specify the code lengths. I haven't done any research, but I'm almost positive this has to be right. —The preceding unsigned comment was added by 4.152.213.104 (talk) 03:08, 6 April 2007 (UTC).[reply]
re: the above comment, as far as i'm aware, the number of bits used is the number of bits needed to encode the highest entry in the dictionary so far. As the decoder builds the same dictionary that the encoder used, the decoder is always aware of the number of bits needed to store an entry from the dictionary.
In for instance GIF, they start at 9 bits. The lower 8 bit is normal content, the upper half is dictionary. It expands uptil something like 12 or 16bit then resets. Carewolf 07:36, 24 April 2007 (UTC)[reply]
Notice in the first encoding example that when the bit values in the extended dictionary reach 31 (highest value representable by 5 binary digits), the algorithm immediately switches to 6-bit mode and outputs ALL bit codes as 6-bit codes, so the R character, which previously had a 5-bit value of 10010, from now on has the 6-bit value of 010010. The description accompanying the first encoding example in this article should explain that better. In decoding, the algorithm starts building the extended dictionary right after reading the first character, and the bit values in the extended dictionary will reach 31 at the same time that the code it is decoding switches to 6-bit mode. The algorithm then uses the size of its extended dictionary as a cue to switch to the next-higher bit mode. —The preceding unsigned comment was added by 65.26.235.222 (talk) 14:39, 9 May 2007 (UTC).[reply]

I think there's an error in the encoding example: R shouldn't be 6-bit, only the next string (it is N in the example) should. When we output code for R, we add to the dictionary a string that exceeds current code length. But it is obvious that if we are adding the first string with 6-bit code, any string that is written to output at this step has 5-bit code. Then why should we use 6-bit code for R if 5-bit can still suffice? So I think the example should be corrected. By the way, this was very misleading when I tried to build my GIF encoder. HotXRock 18:22, 2 August 2008 (UTC) —Preceding unsigned comment added by HotXRock (talkcontribs)

There were many errors in the example (both encoding and decoding). I think I've cleared them all up. I also added a couple of sections before the example to lay the groundwork, and to explain "Early Change" (which the previous versions of the example may have used -- it's difficult to tell because of the miscounting of symbols. Elphion (talk) 02:39, 6 November 2009 (UTC)[reply]
There are still errors in the example. The encoded length should be 6*6 (not 6*5) + 11*6 = 102 (not 96) bits. Jellystones (talk) 16:09, 5 March 2010 (UTC)[reply]
No, the example is correct: the initial codes are 5 bits because the symbol alphabet does not exhaust the 5-bit codes. There is nothing in the basic algorithm that says the codes *must* start wider than the alphabet. The details of the variable encoding obviously depend on the conventions of the particular implementation. This example explicitly implements a scheme starting at 5 bits in order to illustrate what happens when the code length needs to be incremented. Elphion (talk) 15:59, 5 March 2010 (UTC)[reply]
Then there is a contradiction in the article: " A 5-bit code gives 25 = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm will have to start using 6-bit strings (for all codes, including those which were previously represented by only five bits).". Jellystones (talk) 16:09, 5 March 2010 (UTC)[reply]
Not a contradiction, just a misunderstanding: "... will have to start using 6-bit strings ... from that point in the encoding process." I.e., once the 33rd code is generated, the encoder must switch from 5-bit strings to 6-bit strings. I'll reword the paragraph to reflect that. Elphion (talk) 23:05, 6 March 2010 (UTC)[reply]
In other words, the encoder does not go back and adjust previous output -- which has conceptually, and in many cases physically, already been transmitted. The shift in code size applies to subsequent output, but applies even when emitting values < 32 that would not require 6 bits. I hope that's clearer now in the article, but it's not easy to say without getting really picky in the details like this! Elphion (talk) 23:29, 6 March 2010 (UTC)[reply]

Code comments

[edit]

If there is going to be code included in the page, it should at the very least be commented to show what is happening in relation to the description. —Preceding unsigned comment added by 82.20.1.85 (talkcontribs) 15:26, 8 May 2007

Decoding Example

[edit]

Shouldn't the "TO" entry in the decoding example have a bit value of 27, as stated in the encoding example? I don't know why the author of this article made such a big deal out of zero indexing; he/she seems to have confused himself/herself.

algorithm details

[edit]

I have added a code to describe what is the details of compressor and decompresser algorithms.

Python -> pseudocode

[edit]

I think the Python example should be translated to pseudocode by someone who knows how. Not all readers understand Python.

Also, the given python code is not entirely clear.

Additionally, I believe the given pseudocode is overly simplistic; I did a direct translation (and I mean "my code looks exactly like the pseudocode" direct), and it failed on "xxx". This is a known issue in the simplistic representation of LZW which is indicated at http://marknelson.us/1989/10/01/lzw-data-compression/.

If I get some time, I'll rework my code or clean up the given python to look more pseudocodey, then post that.


I have cleaned up the Python code a little bit:

  • removed unnecessary class
  • put doctsrings at correct place (after the def line)
  • removed unnecessary and "unpythonic" type checks
  • moved utf-8 encoding to the end of compress()
  • removed that ugly __join() by using the unicode strings directly as keys
  • some minor clean ups

The result can be seen here: http://paste.pocoo.org/show/4497/

Any objections against replacing the existing code in the article? Marrin 10:24, 25 September 2007 (UTC)[reply]

The new code is much, much better than the old one, and I'm all for changing it. 72.67.158.141 (talk) 01:26, 14 January 2008 (UTC)[reply]

Even the above Python code misses the point somewhat. The idea is that the algorithm knows the word size as the dictionary grows, and the output is a bitstring. Using UTF-8 encoding of integer keys pads 8- to 14-bit keys to 2 bytes. The use of unicode strings is incredibly ugly anyway - unicode is synonymous with text, and shouldn't feature in an algorithm that compresses a sequence of bytes. DanPope (talk) 00:26, 15 February 2008 (UTC)[reply]

def compress(uncompressed):
    """Compresses a string to a list of output symbols"""
    chars = 256
    dictionary = {}
    for i in range(chars):
        dictionary[chr(i)] = chr(i)
    buffer = ''
    result = []
    for c in uncompressed:
        xstr = buffer + c
        if xstr in dictionary:
            buffer = xstr
        else:
            result.append(dictionary[buffer])
            dictionary[xstr] = chars
            chars += 1
            buffer = c
    if buffer:
        result += buffer.split()
    return result

That's my compressor... just needs a corresponding decompressor. DanPope (talk) 09:42, 15 February 2008 (UTC)[reply]

Your code is the best one so far IMO, I have added it in along with my corresponding decompressor. Also, I fixed the buffer.split() line at the end; there is no separator between the characters (i.e., 'OT' in the example), so split doesn't have the desired effect.

I did like the meaningful variable names, but for the sake of consistency I've changed them to match the pseudocode and examples. It's much easier to follow along if the notation agrees everywhere. Arguably we should change the rest of the article to your names... Kiv (talk) 22:15, 1 May 2008 (UTC)[reply]

failure on xxx

[edit]

Which pseudo code failed on "xxx"? I tested the algorithm for compression and decompression on xxx and it worked!

I think the last line of the algorithm:

done output the code for w;

is not correct -- I implemented the algorithm and the resulting code had a duplicate of the penultimate code and no final code.

I suggest the last line should be:

done output the code for wk;

158.48.32.2 12:49, 4 August 2007 (UTC)[reply]

Well, I am almost sure that it should be 'output the code for w;'. As you can see if the first if-clause is satisfied it assigns the wk to w. However I'm going to implement a java version of the algorithm and see if it's wrong or not. Could you please review your implementation again? Thanks


table deletion

[edit]

That table that was deleted on Nov. 13th was the single most useful thing on the page for understanding how the LZW worked. Deleting REALLY hurts how easy it is to understand the LZW for a novice user. 68.207.120.151 (talk) 00:39, 15 November 2008 (UTC)[reply]

It was removed by User:Thumperward at 13:25, 13 November 2008 in this edit with edit summary "Algorithm: the actual algorithm, rewritten in English and not pseudocode, might be useful. this barrage of university homework is not." I think it would be useful to describe the algorithm in a general way, but it would also be helpful to have an illustrative example. Could I invite Thumperward to expand on their justification? Dcoetzee 02:16, 15 November 2008 (UTC)[reply]
I have added examples of LZW encoding and decoding at GIF of a simple image. Cuddlyable3 (talk) 16:13, 23 November 2008 (UTC)[reply]


Can I ask why the table was removed? It was really useful to understand the algorithm. Can we bring it back? Amin (talk) 17:24, 27 December 2008 (UTC)[reply]

Perverse results

[edit]

Any algorithm that transforms its input into a shorter output without loss of information must also transform some input into a larger data stream.

The article should indicate what types of input result in a larger output.

It should also indicate ranges of typical effectiveness on various types of input (eg, ASCII English text, bitmapped graphics, x86 binaries, etc.).

Coach 10025 (talk) 16:56, 24 February 2009 (UTC)[reply]

The purpose of the LZW algorithm is to exploit redundancy present in its input. Removing that redundancy is no loss of information. Cuddlyable3 (talk) 17:43, 24 February 2009 (UTC)[reply]
This .gif file example demonstrates a slight compression in the LZW coding of a tiny image. However the .gif format requires in addition a comparatively large palette and headers that comprise uncompressed bytes. It therefore achieves overall compression only for much larger images.
LZW coding of a short run of random data is likely to generate an expanded data stream. This is because input characters are represented from the start by codes that are one bit longer than natural (in order to allocate space for control code(s) and future table entries) and few "profitable" string repetitions can be found and then exploited. Cuddlyable3 (talk) 20:22, 11 March 2009 (UTC)[reply]
As another example, generally, "compression" of encrypted data will generally enlarge it a bit. (This is just because encrypted data appears random in the absence of the decryption algorithm.) --Doradus (talk) 15:50, 13 August 2009 (UTC)[reply]

This is not a concern in practice. It is trivial to ensure that a lossless compression scheme adds no more than 1 bit to the compressed length. Simply start the compressed representation with a flag indicating whether or not the data were compressed. Try compressing; if it reduces the size by at least 1 bit, set the flag and append the compressed data; otherwise, clear the flag and append the uncompressed data. --Doradus (talk) 16:19, 28 September 2009 (UTC)[reply]

Clear codes

[edit]

The article doesn't mention this but clear code handling is especially arcane. After a clear code some 'padding' bits are expected to follow. And it is not padding to the next byte boundary, but rather to a boundary that is a multiple of the clear code's length multiplied by eight. The sourceforge ncompress project (based on original BSD sources), and the lzw decompressor from GNU gzip both contain impenetrable bit buffer positioning logic with what appears to be the possibility of seeking backwards into the bit stream under some conditions. If anyone has more information on the padding expected after a Clear code, please ammend this article! DLeonard (talk) 21:15, 12 May 2009 (UTC)[reply]

The clear code is mentioned as one of the special codes in the Encoder and Decoder diagrams. The article is not exclusively about the implementation of LZW coding in the GIF file format but should be consistent with it. A GIF encoder may deliver a clear code at any time. Its effect is to reset the code table to root+special codes only and to reset the packing of code into bytes. Packing varies during encoding as the table builds with longer codes so that only the essential number of bits is stored.Cuddlyable3 (talk) 06:35, 13 May 2009 (UTC)[reply]
I know of no requirement in the algorithm itself to add padding after a clear code. That would certainly break GIF encoding, which treats the clear code as just another code. It resets the code width back to the original value as Cuddlyable3 indicates, but inserts no padding, and there's no particular point to inserting any padding. (Of course there are many flavors of LZW; as long as the encoder and decoder agree on the details things ought to work out.) Elphion (talk) 02:50, 6 November 2009 (UTC)[reply]

Diagrams

[edit]

I find the diagrams incomprehensible and unhelpful. Can we remove them, or explain them better? --Doradus (talk) 16:13, 14 July 2009 (UTC)[reply]

The diagrams should be kept until improvements are made. Anyone who understands the article is free to do that. Cuddlyable3 (talk) 08:28, 5 August 2009 (UTC)[reply]
I respectfully disagree. There are times when no diagram is better than a poor one. --Doradus (talk) 15:48, 13 August 2009 (UTC)[reply]
Is your plan to criticize or to contribute material ? Cuddlyable3 (talk) 19:42, 13 August 2009 (UTC)[reply]
I don't see how my plans are relevant here, but since you asked, it is my plan to improve the article. In this case, I was trying to do so by removing a confusing diagram. --Doradus (talk) 17:05, 25 August 2009 (UTC)[reply]
I didn't realize until now that you are the creator of this diagram. I apologize if I have been brusque. But your pronouncement that "diagrams should be kept until improvements are made" is a personal opinion; it is not Wikipedia policy, and it doesn't in any way explain how you believe my edit harmed the quality of the article, so it is not a reasonable justification for reverting my edit. --Doradus (talk) 19:27, 25 August 2009 (UTC)[reply]
I agree. The diagrams are incomprehensible and add no value to the article. 81.152.169.67 (talk) 16:21, 28 August 2009 (UTC).[reply]

If "diagrams" refers to the encoding and decoding examples, I think I've fixed them. Elphion (talk) 02:52, 6 November 2009 (UTC)[reply]

Oh, those diagrams. (Links: http://commons.wikimedia.org/wiki/File:GIFDRAWE.png , http://commons.wikimedia.org/wiki/File:GIFDRAWD.png ) I too am trying not to be brusque, but I know LZW cold and find the diagrams completing befuddling. I agree that they do more harm than good. Elphion (talk) 03:49, 6 November 2009 (UTC)[reply]

Encoding question

[edit]

IP 124.124.105.162 left in the article the following question, which I have moved here:

someone plz give an explanation for "Encoded length = 6*5 + 11*6 = 96 bits."..how did 6 and 11 come?

Answer: in the example, 6 codes were emitted with width 5 bits, and 11 codes were emitted at 6 bits. (Just count the codes under "Output -- bits".) -- Elphion (talk) 06:10, 30 April 2010 (UTC)[reply]

Hmm. When I count the number of codes built in the extended dictionary, I see codes 27-31 added. These numbers are encoded in 5 bits. So wouldn't that be 5 5-bit codes? Likewise, the 6-bit codes are 32-41. So wouldn't that be 10 codes? — Preceding unsigned comment added by Netskink (talkcontribs) 12:23, 22 May 2017 (UTC)[reply]
No, the "encoded length" refers to the number of bits sent in the output, not to codes in the dictionary (many of which will not be transmitted). The point is the degree of compression of the encoded text, not the details of the compression machinery. -- Elphion (talk) 13:04, 22 May 2017 (UTC)[reply]


Given this is the output in terms of decimal digits with spaces to separate the writes: 20 15 2 5 15 18 14 15 20 27 29 31 36 30 32 34 0

So there are two possible methods for the code count, here is what I came up with

(This table is wrong but what makes sense. ie. you know the code, so you emit the specified number of bits based upon encoding)

Emitted code as decimal emitted code as bit encoding bit width Notes
20 10100 5
15 01111 5
2 00010 5
5 00101 5
15 01111 5
18 10010 5
14 01110 5
15 01111 5
20 10100 5
27 11011 5
29 11101 5
31 11111 5
36 100100 6 from here on its six bit codes
30 011110 6 so this 5-bit code is now 5
32 100000 6
34 100010 6
0 000000 6

(This table appears to be what is described in the text)

Emitted code as decimal emitted code as bit encoding bit width Notes
20 10100 5
15 01111 5
2 00010 5
5 00101 5
15 01111 5
18 10010 5 when this code is emitted, the dict has added 32 which requires 6 bits
14 001110 6 so the next codes are 6-bits
15 001111 6
20 010100 6
27 011011 6
29 011101 6
31 011111 6
36 100100 6
30 011110 6
32 100000 6
34 100010 6
0 000000 6

Sorry if I am being pedantic. I'm just trying to figure out what is done before I convert to probability based coding. ie. more frequent occurring symbols are code with shorter length codes. — Preceding unsigned comment added by Netskink (talkcontribs) 13:52, 22 May 2017 (UTC)[reply]

This is perhaps the source of the misunderstanding: when the encoder adds code 32 to the dictionary, it issues all following codes with 6 bits rather than 5 bits even though the codes it transmits may not require the wider width. That is, it increases the code width immediately after generating a wider code in the dictionary, not later (when it first transmits a code requiring the wider width). This is required so that the decoder can keep track of the width of the codes: how is it to know when code width increases? Answer: when the decoder itself generates a wider code in its own dictionary, which it generates in parallel with the encoder. (Added: There's a subtle difference: since the decoder is always one step behind the encoder, it increases the code width from p to p+1 when it generates the last code of width p, not the first code of length p+1, as the encoder does.) -- Elphion (talk) 14:15, 22 May 2017 (UTC)[reply]
Elphion, thank you for clearing that up.  That makes sense.  Perhaps the second table and your explanation should be added to the section.  — Preceding unsigned comment added by Netskink (talkcontribs) 15:53, 22 May 2017 (UTC)[reply] 

Not Optimal

[edit]

The second sentence of this article on 01-jan-2011 says: "The algorithm is designed to be fast to implement but is not usually optimal because it performs only limited analysis of the data". This got flagged for citation needed, but I am taking it out. The LZW implementation of LZ78 is known to asymptotically optimal, so an examination of its performance needs to be a bit more nuanced than this.

A good discussion of performance of LZW wrt other compressors would be great, and a rewrite of that second sentence would be nice too. --Snorkelman 12:54, 21 January 2011 (UTC) — Preceding unsigned comment added by Snorkelman (talkcontribs)

Commercial falsification of history supported by wikipedia falsification

[edit]

This is what this article about LZW compression is. A great shame! It's ugly to present a technically irrelevent commercial trick (some arbitrary details and complications) by W's company as a contribution to the historical break through and achievement by ZL. We see here wikipedia at it's very bad, very low Wlod (talk) —Preceding undated comment added 08:52, 5 December 2011 (UTC).[reply]

Your characterization of the history above is not historical. Welch (who worked in the Sperry research facility, not in the commercial arm of the company), added the hash table as a speed up of the algorithm -- a detail that made the algorithm fast enough to make it useful on the fly, and therefore in situations where LZ itself was not. As with most technical innovations from the research office (or indeed from any software company), it was routinely patented; but this was not a nefarious scheme to snag LZ, as your remarks seem to imply. Sperry (later Unisys) was largely unaware of its own patent until the algorithm's use in GIF was noticed many years later.
Unisys did not handle the patent controversy particularly well [insert irony emoticon here], but it made no attempt to sock ordinary users with license fees (unlike many SW patent holders). It did attempt to enforce licensing from developers using the algorithm, as any SW patent holder would -- that, after all, is what patents are all about. Whether such things should be patentable is a different issue; but until the law changes, software companies will continue to patent innovations like this simply because, for better or worse, that is currently the prevailing business model for intellectual property.
But patent issues aside, if (as you claim) W truly was not a significant addition to LZ, why do we continue to use LZW in preference to LZ?
-- Elphion (talk) 16:21, 5 December 2011 (UTC)[reply]

Putative "Late Change" convention

[edit]

The following paragraph was added to the article by user:One.Ouch.Zero. It's not entirely clear how the convention is supposed to work; and more importantly I've never seen this implemented. I think we need some documentation before including it (including formats that use it). -- Elphion (talk) 20:27, 19 January 2012 (UTC)[reply]

An advanced type of LZW-style encoders implement the opposite, "Late Change." They write the low p − 1 bits first and then decide if the most significant bit can be 1 at that stage. If it cannot, there is no need to store the most significant bit at all, and the decoder can decide as well whether a 1 MSB at that stage could be meaningful or not. This kind of encoding is quite obfuscated with MSB-first packing order, and developers must take care not to fall for the "Early Change" trap again.

Lossy LZW

[edit]

http://www.stevenpigeon.com/Publications/publications/LossyLZW.pdf talks about at least three ways of maximizing compression by allowing lzw to introduce small errors. Should we include that in the article? --Peterpall (talk) 19:21, 22 July 2014 (UTC)[reply]

Neither the scheme introduced in the paper (which clearly has not been through any editing process, let alone a reviewing process) nor the earlier schemes mentioned in the paper look very practical. Do these have any real-world application? If lossy compression is acceptable, JPG will get far better results, both in terms of compression and visual quality. -- Elphion (talk) 18:28, 24 July 2014 (UTC)[reply]

Contrast wth LZ78

[edit]

The lede states that LZW is an improved implementation of LZ78, but the article does not describe what these improvements are. Could someone who knows the difference clarify what they are exactly? --Doradus (talk) 13:04, 12 December 2014 (UTC)[reply]

Proposed merge

[edit]

Re: merging Reduced Offset Lempel Ziv: If it's got to go somewhere, LZ77 and LZ78 is probably a better choice. -- Elphion (talk) 21:29, 9 December 2015 (UTC)[reply]

Not a good enough description of the encoding process.

[edit]

It says: Buffer input characters in a sequence ω until ω + next character is not in the dictionary. Emit the code for ω, and add ω + next character to the dictionary. Start buffering again with the next character.

But this is missing some key information.

After the first sentence (which is fine as it is), the second sentence is "Emit the code for ω, and add ω + next character to the dictionary". Now while this describes what to do if no match is found, it never describes what to do if a match IS found. If a match is found, what code should be emitted? Should no code be emitted if a match is found? Please explain. I'm trying to implement LZW in my software, and have turned to Wikipedia for the info that I will need to implement this in my software. Please provide a good enough description that a programmer can use that description to actually implement it in software. Benhut1 (talk) 20:38, 22 August 2016 (UTC)[reply]

Hi, @Benhut1. Depending on what language you're using, you may already find it at the Rosetta Code page about LZW, which has its implementation in a number of languages. While the article here will need updating, that should help you in the interim. Hope that helps! — Sasuke Sarutobi (talk) 13:05, 23 August 2016 (UTC)[reply]
I disagree somewhat with OP. The high-level description of the encoding algorithm in section 1.1 makes it pretty clear that no output is generated while the buffered string matches entries in the dictionary. (The detailed example in section 2.1 shows this explicitly.) And the premise that a WP article should necessarily provide enough information to implement the algorithm in solid code goes way beyond what WP aims to do. This is an encyclopedia, not a text book. -- Elphion (talk) 14:50, 25 August 2016 (UTC)[reply]

Encoding -Rule 4 in conflict with Rule 5 (indirectly with rule 2)

[edit]

Rule 4 says to include W and the following input into the dictionary Rule 5 says to go to 2 Rule 2 says to find the longest string that matches current input Wasn't the current input just added to the dictionary added to the dictionary? Rule 2 will always match what was just included in Rule 4. I'm not exactly certain of the algorithm, so am not confident rewriting this section or I would. Sorry. 139.102.14.207 (talk) 17:55, 25 February 2019 (UTC)[reply]

Rule 3 removes W from the input, so when you loop back to rule 2, you will start scanning at the first character beyond W. -- Elphion (talk) 21:46, 25 February 2019 (UTC)[reply]

Comparison to other archive formats, in particular .tar.xz files

[edit]

As a *nix user it would be interesting to see if wikipedia could show some comparisons with .tar.lz to other formats, in particular compression rate, memory usage, time to create and extract such archives etc..

A simple table should suffice, but of course it has to be correct. 2A02:8388:1641:8380:3AD5:47FF:FE18:CC7F (talk) 17:24, 10 May 2020 (UTC)[reply]

Over-detailed algorithms

[edit]

I have reverted edits to the description of the encryption and decryption algorithms, which have made the process more difficult to understand. The goal of WP is not to provide code for algorithms, but to describe them in enough detail that the reader understands how they work. -- Elphion (talk) 14:31, 2 May 2024 (UTC)[reply]

Good job; WP:NOTHOWTO. TJRC (talk) 18:31, 2 May 2024 (UTC)[reply]