Add LZ77 compression algorithm #8059
Merged
Changes from all commits (19 commits):
848143b  - add "lz77_compressor" class with compress and decompress methods us…  (LuciaHarcekova)
2f48936  Merge pull request #1 from LuciaHarcekova/lz77  (LuciaHarcekova)
31181ec  Merge branch 'TheAlgorithms:master' into master  (LuciaHarcekova)
ee44d71  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
86c2bb3  - use "list" instead of "List", formatting  (LuciaHarcekova)
f593693  - fix spelling  (LuciaHarcekova)
ee06ca0  - add Python type hints  (LuciaHarcekova)
3198b33  - add 'Token' class to represent triplet (offset, length, indicator)  (LuciaHarcekova)
41c5a0f  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
63f28c6  - add test, change type from List to list  (LuciaHarcekova)
76b22a2  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
dd40cf3  - remove extra import  (LuciaHarcekova)
153ed96  - remove extra types in comments  (LuciaHarcekova)
7bf9096  - better test  (LuciaHarcekova)
52af3cf  - edit comments  (LuciaHarcekova)
3298284  - add return types  (LuciaHarcekova)
b530862  - add tests for __str__ and __repr__  (LuciaHarcekova)
b891abf  Update lz77.py  (cclauss)
bda07c5  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
lz77.py (new file):

@@ -0,0 +1,227 @@
""" | ||
LZ77 compression algorithm | ||
- lossless data compression published in papers by Abraham Lempel and Jacob Ziv in 1977 | ||
- also known as LZ1 or sliding-window compression | ||
- form the basis for many variations including LZW, LZSS, LZMA and others | ||
|
||
It uses a “sliding window” method. Within the sliding window we have: | ||
- search buffer | ||
- look ahead buffer | ||
len(sliding_window) = len(search_buffer) + len(look_ahead_buffer) | ||
|
||
LZ77 manages a dictionary that uses triples composed of: | ||
- Offset into search buffer, it's the distance between the start of a phrase and | ||
the beginning of a file. | ||
- Length of the match, it's the number of characters that make up a phrase. | ||
- The indicator is represented by a character that is going to be encoded next. | ||
|
||
As a file is parsed, the dictionary is dynamically updated to reflect the compressed | ||
data contents and size. | ||
|
||
Examples: | ||
"cabracadabrarrarrad" <-> [(0, 0, 'c'), (0, 0, 'a'), (0, 0, 'b'), (0, 0, 'r'), | ||
(3, 1, 'c'), (2, 1, 'd'), (7, 4, 'r'), (3, 5, 'd')] | ||
"ababcbababaa" <-> [(0, 0, 'a'), (0, 0, 'b'), (2, 2, 'c'), (4, 3, 'a'), (2, 2, 'a')] | ||
"aacaacabcabaaac" <-> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')] | ||
|
||
Sources: | ||
en.wikipedia.org/wiki/LZ77_and_LZ78 | ||
""" | ||
|
||
|
||
from dataclasses import dataclass | ||
|
||
__version__ = "0.1" | ||
__author__ = "Lucia Harcekova" | ||
|
||
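The triples above can be decoded by hand. A minimal sketch using plain tuples (independent of the `Token` class and `decompress` method defined in this file):

```python
# Minimal sketch: decode (offset, length, indicator) triples by hand.
# Each copied character is appended before the next one is read, so a
# match may overlap the position currently being written.
def decode(tokens: list[tuple[int, int, str]]) -> str:
    out: list[str] = []
    for offset, length, char in tokens:
        for _ in range(length):
            out.append(out[-offset])  # copy one character from `offset` back
        out.append(char)              # then emit the literal indicator
    return "".join(out)

print(decode([(0, 0, "a"), (0, 0, "b"), (2, 2, "c"), (4, 3, "a"), (2, 2, "a")]))
# → ababcbababaa
```

The inputs and expected outputs are taken directly from the examples in the docstring above.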
@dataclass
class Token:
    """
    Dataclass representing a triplet, called a token, consisting of offset,
    length and indicator. This triplet is used during LZ77 compression.
    """

    offset: int
    length: int
    indicator: str

    def __repr__(self) -> str:
        """
        >>> token = Token(1, 2, "c")
        >>> repr(token)
        '(1, 2, c)'
        >>> str(token)
        '(1, 2, c)'
        """
        return f"({self.offset}, {self.length}, {self.indicator})"
class LZ77Compressor:
    """
    Class containing compress and decompress methods using LZ77 compression algorithm.
    """

    def __init__(self, window_size: int = 13, lookahead_buffer_size: int = 6) -> None:
        self.window_size = window_size
        self.lookahead_buffer_size = lookahead_buffer_size
        self.search_buffer_size = self.window_size - self.lookahead_buffer_size
    def compress(self, text: str) -> list[Token]:
        """
        Compress the given string text using LZ77 compression algorithm.

        Args:
            text: string to be compressed

        Returns:
            output: the compressed text as a list of Tokens

        >>> lz77_compressor = LZ77Compressor()
        >>> str(lz77_compressor.compress("ababcbababaa"))
        '[(0, 0, a), (0, 0, b), (2, 2, c), (4, 3, a), (2, 2, a)]'
        >>> str(lz77_compressor.compress("aacaacabcabaaac"))
        '[(0, 0, a), (1, 1, c), (3, 4, b), (3, 3, a), (1, 2, c)]'
        """

        output = []
        search_buffer = ""

        # while there are still characters in text to compress
        while text:
            # find the next encoding phrase
            # - triplet with offset, length, indicator (the next encoding character)
            token = self._find_encoding_token(text, search_buffer)

            # update the search buffer:
            # - add new characters from text into it
            # - if its size exceeds the max search buffer size, drop the
            #   oldest elements
            search_buffer += text[: token.length + 1]
            if len(search_buffer) > self.search_buffer_size:
                search_buffer = search_buffer[-self.search_buffer_size :]

            # update the text
            text = text[token.length + 1 :]

            # append the token to output
            output.append(token)

        return output
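The buffer bookkeeping in the loop can be seen in isolation. A sketch assuming the class defaults (window size 13, lookahead 6, so a search buffer of 7); the consumption counts are token.length + 1 for the first five tokens of the "cabracadabrarrarrad" example from the module docstring:

```python
# Sketch of the search-buffer maintenance: each token consumes
# token.length + 1 characters of the text, and only the newest
# `search_buffer_size` characters are kept for future matches.
search_buffer_size = 13 - 6  # window_size - lookahead_buffer_size (class defaults)
text, search_buffer = "cabracadabrarrarrad", ""
for step in [1, 1, 1, 1, 2]:  # token.length + 1 for the first five tokens
    search_buffer = (search_buffer + text[:step])[-search_buffer_size:]
    text = text[step:]
print(search_buffer, text)  # → cabrac adabrarrarrad
```

Note that the resulting state ("adabrarrarrad" against buffer "cabrac") is exactly one of the doctest inputs used for `_find_encoding_token` below.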
    def decompress(self, tokens: list[Token]) -> str:
        """
        Convert the list of tokens into an output string.

        Args:
            tokens: list containing triplets (offset, length, char)

        Returns:
            output: decompressed text

        Tests:
            >>> lz77_compressor = LZ77Compressor()
            >>> lz77_compressor.decompress([Token(0, 0, 'c'), Token(0, 0, 'a'),
            ... Token(0, 0, 'b'), Token(0, 0, 'r'), Token(3, 1, 'c'),
            ... Token(2, 1, 'd'), Token(7, 4, 'r'), Token(3, 5, 'd')])
            'cabracadabrarrarrad'
            >>> lz77_compressor.decompress([Token(0, 0, 'a'), Token(0, 0, 'b'),
            ... Token(2, 2, 'c'), Token(4, 3, 'a'), Token(2, 2, 'a')])
            'ababcbababaa'
            >>> lz77_compressor.decompress([Token(0, 0, 'a'), Token(1, 1, 'c'),
            ... Token(3, 4, 'b'), Token(3, 3, 'a'), Token(1, 2, 'c')])
            'aacaacabcabaaac'
        """

        output = ""

        for token in tokens:
            for _ in range(token.length):
                output += output[-token.offset]
            output += token.indicator

        return output
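The character-by-character copy in `decompress` is what makes self-overlapping references legal: a token may reference more characters than currently exist in the output, which produces runs. A small sketch of decoding a hypothetical token with offset 1 and length 5 against a one-character history:

```python
# Decoding (offset=1, length=5, indicator='b') against the history "a":
# each copied character becomes available to the very next copy,
# so an offset-1 reference repeats the newest character.
out = ["a"]
for _ in range(5):       # length = 5, offset = 1
    out.append(out[-1])  # always re-copies the newest character
out.append("b")          # the indicator
print("".join(out))  # → aaaaaab
```

A block copy (e.g. slicing five characters at once) would fail here, because four of the five referenced characters do not exist yet.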
    def _find_encoding_token(self, text: str, search_buffer: str) -> Token:
        """Finds the encoding token for the first character in the text.

        Tests:
            >>> lz77_compressor = LZ77Compressor()
            >>> lz77_compressor._find_encoding_token("abrarrarrad", "abracad").offset
            7
            >>> lz77_compressor._find_encoding_token("adabrarrarrad", "cabrac").length
            1
            >>> lz77_compressor._find_encoding_token("abc", "xyz").offset
            0
            >>> lz77_compressor._find_encoding_token("", "xyz").offset
            Traceback (most recent call last):
                ...
            ValueError: We need some text to work with.
            >>> lz77_compressor._find_encoding_token("abc", "").offset
            0
        """

        if not text:
            raise ValueError("We need some text to work with.")

        # Initialise result parameters to default values
        length, offset = 0, 0

        if not search_buffer:
            return Token(offset, length, text[length])

        for i, character in enumerate(search_buffer):
            found_offset = len(search_buffer) - i
            if character == text[0]:
                found_length = self._match_length_from_index(text, search_buffer, 0, i)
                # if the found length is bigger than the current, or if it's equal
                # (which means its offset is smaller): update offset and length
                if found_length >= length:
                    offset, length = found_offset, found_length

        # keep at least one character unmatched to serve as the indicator
        length = min(length, len(text) - 1)

        return Token(offset, length, text[length])
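The scan above can be sketched standalone with plain tuples (`find_token` is a hypothetical helper, not part of the class). Note the role of `>=`: when two matches are equally long, the later scan position wins, and a later position means a smaller offset:

```python
# Standalone sketch of the match search: find the longest (possibly
# overlapping) match of the text prefix anywhere in the search buffer.
def find_token(text: str, search_buffer: str) -> tuple[int, int, str]:
    offset = length = 0
    for i, ch in enumerate(search_buffer):
        if ch != text[0]:
            continue
        n, window = 0, search_buffer
        # the window grows with each matched character, so the match may
        # run past the end of the search buffer into the text itself
        while n < len(text) - 1 and window[i + n] == text[n]:
            window += text[n]
            n += 1
        if n >= length:  # on ties, a later i means a smaller offset
            offset, length = len(search_buffer) - i, n
    return (offset, length, text[length])

print(find_token("abrarrarrad", "abracad"))  # → (7, 4, 'r')
```

The test input mirrors the first doctest of `_find_encoding_token`: "abr a" matches 4 characters starting 7 positions back, and 'r' becomes the indicator.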
    def _match_length_from_index(
        self, text: str, window: str, text_index: int, window_index: int
    ) -> int:
        """Calculate the longest possible match of text and window characters from
        text_index in text and window_index in window.

        Args:
            text: text to be compressed
            window: sliding window
            text_index: index of character in text
            window_index: index of character in sliding window

        Returns:
            The maximum match between text and window, from given indexes.

        Tests:
            >>> lz77_compressor = LZ77Compressor(13, 6)
            >>> lz77_compressor._match_length_from_index("rarrad", "adabrar", 0, 4)
            5
            >>> lz77_compressor._match_length_from_index("adabrarrarrad",
            ...     "cabrac", 0, 1)
            1
        """
        # stop when the text is exhausted or the characters stop matching
        if text_index >= len(text) or text[text_index] != window[window_index]:
            return 0
        return 1 + self._match_length_from_index(
            text, window + text[text_index], text_index + 1, window_index + 1
        )
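The recursion appends each matched character to the window, which is exactly what lets a match extend past the end of the search buffer. An equivalent iterative sketch (`match_length` is a hypothetical standalone helper, not part of the class):

```python
# Count how many characters of `text` match `window` starting at
# window_index, growing the window with each matched character so the
# match can overlap into the text itself.
def match_length(text: str, window: str, window_index: int) -> int:
    n = 0
    while n < len(text) and window[window_index + n] == text[n]:
        window += text[n]
        n += 1
    return n

print(match_length("rarrad", "adabrar", 4))  # → 5
```

The example matches the first doctest above: "rarra" (5 characters) can be copied even though only "rar" sits in the original window, because each copied character extends the window.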
if __name__ == "__main__":
    from doctest import testmod

    testmod()

    # Initialize compressor class
    lz77_compressor = LZ77Compressor(window_size=13, lookahead_buffer_size=6)

    # Example
    TEXT = "cabracadabrarrarrad"
    compressed_text = lz77_compressor.compress(TEXT)
    print(lz77_compressor.compress("ababcbababaa"))
    decompressed_text = lz77_compressor.decompress(compressed_text)
    assert decompressed_text == TEXT, "The LZ77 algorithm returned an invalid result."