[Javascript] Huffman tree - Softwareontwikkeling

dinsdag 25 juli 2006 21:05

Acties:

Topicstarter

Ik wil met javascript een comprimeer script maken gebaseerd op de techniek door Huffman beschreven. Simpel gezegd worden deze stappen gevolgd:

Zoek alle unieke tekens op uit de input tekst;
Tel hoe vaak deze unieke tekens voorkomen;
Sorteer de tekens aan de hand van hoe vaak ze voorkomen, hoe vaker een teken voorkomt, hoe lager die moet komen;
Maak een boom diagram waarbij de minst voorkomende tekens zo laag mogelijk moeten komen;
Ga de input tekst coderen. Vervang elk teken met de waarde die je kunt achterhalen uit de boomdiagram (hoe dit precies zit weet ik wel maar kan ik niet uitleggen);

Alle stappen behalve stap 4 lukken me. Ik heb geen idee hoe de boom er precies uit moet komen te zien, aangezien de "layout" ervan afhankelijk is van het aantal tekens dat voorkomt in de oorspronkelijke tekst.

Zo ziet mijn letter/tekenfrequentietabel eruit als ik alle loops achter de rug heb. Hoe lager een teken in de array, hoe vaker die voorkomt. "f" komt dus vaker voor dan "w", "w" vaker dan "k", enzovoorts.

code:

1	var arrChars = new Array("d", "e", "g", "a", "k", "w", "f");

Weet iemand hoe zo'n Huffman-tree in elkaar gezet moet worden aan de hand van een frequentietabel? Het script dat ik nu heb kan in ongeveer 10 seconde 5 MB aan data comprimeren als ik gewoon lege waarden neem voor de vervangende tekens, dus mét Huffman-tree zal het hooguit 15 seconde worden. Het uitpakken van data gaat naar verwachting 3 keer sneller, aangezien er dan geen 5 MB maar rond de 50% of minder ingelezen hoeft te worden én omdat je geen grote loop hoeft te doorlopen om de letterfrequentietabel op te bouwen.

dinsdag 25 juli 2006 21:29

Acties:

eamelink

Droptikkels

Wat lukt er niet als je bijvoorbeeld deze methode gebruikt?

A fast way to create a Huffman tree is to use the heap data structure, which keeps the nodes in partially sorted order according to a predetermined criterion. In this case, the node with the lowest weight is always kept at the root of the heap for easy access.

Creating the tree:
Start with as many leaves as there are symbols.
Push all leaf nodes into the heap.
While there is more than one node in the heap:
Remove two nodes with the lowest weight from the heap.
If the heap was storing copies of node data rather than pointers to nodes in final storage for the tree, move these nodes to final storage.
Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
Update the parent links in the two just-removed nodes to point to the just-created parent node.
Push the new node into the heap.
The remaining node is the root node; the tree has now been generated.

Bron : http://en.wikipedia.org/wiki/Huffman_coding#Basic_technique

[ Voor 6% gewijzigd door eamelink op 25-07-2006 21:31 ]

woensdag 26 juli 2006 12:16

Acties:

Skit3000

Topicstarter

Ik krijg bij mijn boom diagram alleen maar 01, 001, 0001, 00001, enz. als binaire waarden. Volgens mij heb ik alle stappen gevolgd...

JavaScript:

<script language="javascript">

var text = "een test tekst";
createTree(text);

function createTree(strText)
{
    // Variables used for creating a frequenty table
    var arrChars = new Array();
    var strChar = "";
    var intChar = "";
    // Create the frequenty table
    for (var i=0; i < strText.length; i++)
    {
        strChar = strText.charAt(i);
        intChar = strText.charCodeAt(i);
        if (typeof arrChars[intChar] == "undefined")
        {
            arrChars[intChar] = new treeNode("LEAF");
            arrChars[intChar].value = strChar;
        }
        arrChars[intChar].count++;
    }
    // Calculate weight in %
    for (var intRank in arrChars)
    {
        arrChars[intRank].weight = arrChars[intRank].count / strText.length * 100;
    }
    // Sort, based on weight
    arrChars = arrChars.sort(function(x,y){return y.weight-x.weight});
    // Put characters in a heap (this will remove all empty items which are created while
    // creating the frequenty table, since javascript doesn't like array index numbers like
    // "143" to be used)
    var arrHeap = new Array();
    for (var i in arrChars)
    {
        arrHeap.push(arrChars[i]);
    }
    // Create a Huffman table
    var objLeft;
    var objRight;
    var objParent;
    while (arrHeap.length > 1)
    {
        // Remove the two lowest nodes from the heap
        objLeft = arrHeap.pop();
        objRight = arrHeap.pop();
        // Create a new parent node which will contain both child nodes
        objParent = new treeNode("BRANCH");
        objParent.weight = objLeft.weight + objRight.weight;
        // Assign the child nodes to the parent node and put the parent node on the heap
        objParent.childLeft = objLeft;
        objParent.childRight = objRight;
        arrHeap.push(objParent);
        // Sort, based on weight
        arrChars = arrChars.sort(function(x,y){return y.weight-x.weight});
    }
    
    // Calculate the binary value of each node and output it
    document.write("<pre>");
    document.write("<hr>");
    document.write("<b>Calculated binary values</b>");
    document.write("<hr>");
    document.write(objParent.setBinary(""));
    document.write("</pre>");
    
    // Write created tree to screen
    document.write("<pre>");
    document.write("<hr>");
    document.write("<b>Created tree</b>");
    document.write("<hr>");
    document.write(dump(objParent));
    document.write("</pre>");
}

// Object which will be created for every node in the binary Huffman tree
function treeNode(strType)
{
    this.type = strType;
    this.weight = null;
    this.binary = null;
    // Only add properties which are used by the specified type
    if (strType == "LEAF")
    {
        this.value = null;
        this.count = 0;
    }
    else if (strType == "BRANCH")
    {
        this.childLeft = null;
        this.childRight = null;
    }
    // Function which updates its binary value ánd which executes itself at its child nodes (if
    // the current node is of the BRANCH type)
    this.setBinary = function(strBinary)
    {
        var BRANCH = 1;
        var strTemp = "";
        this.binary = strBinary;
        if (this.type == "BRANCH")
        {
            strTemp = this.childLeft.setBinary(strBinary + "0");
            strTemp += this.childRight.setBinary(strBinary + "1");
            return strTemp;
        }
        else
        {
            return this.value + " = " + strBinary + "\n";
        }
    }
}

// Function used for debugging
function dump(arr, level)
{
    var dumped_text = "";
    if (!level)
    {
        level = 0;
    }
    var level_padding = "";
    for(var j=0; j < level + 1; j++)
    {
        level_padding += "    ";
    }
    if( typeof arr == 'object')
    {
        for(var item in arr)
        {
            var value = arr[item];
            if (typeof value == 'object')
            {
                dumped_text += level_padding + "'" + item + "' ...<br>";
                dumped_text += dump(value, level + 1);
            }
            else
            {
                if (item != "setBinary")
                {
                    dumped_text += level_padding + "'" + item + "' => \"" + value + "\"<br>";
                }
            }
        }
    }
    else
    {
        dumped_text = "===>" + arr + "<===(" + typeof(arr) + ")";
    }
    return dumped_text;
}

</script>

[ Voor 10% gewijzigd door moto-moi op 26-07-2006 14:26 . Reden: ff de code tag aangepast zodat hij de highlighting gebruikt :) ]

woensdag 26 juli 2006 12:36

Acties:

WormLord

Devver

Op regel 56 zit je fout. Je sorteerd daar de verkeerde lijst.

woensdag 26 juli 2006 13:05

Acties:

Skit3000

Topicstarter

Dat ik dat niet eens heb gezien

Bedankt!!!

woensdag 26 juli 2006 13:24

Acties:

zwippie

Electrons at work

Ik krijg bij mijn boom diagram alleen maar 01, 001, 0001, 00001, enz. als binaire waarden. Volgens mij heb ik alle stappen gevolgd...

Als ik me niet vergis is dat toch ook de bedoeling van het algoritme? Je krijgt zo een binary tree met links steeds de 0-node en rechts de 1-node.

code:

How much can you compute with the "ultimate laptop" with 1 kg of mass and 1 liter of volume? Answer: not more than 10^51 operations per second on not more than 10^32 bits.

woensdag 26 juli 2006 13:40

Acties:

Skit3000

Topicstarter

Ja, maar als je alleen maar op een reeks van 01, 001, 0001, 00001 uitkomt dan heb je dus alleen bij alle binaire waarden die kleiner zijn dan 8 tekens winst door je compressie en dat zijn er in dat geval niet zoveel. Met de opmerking die WormLord maakte werkt het wél gewoon goed en gebruikt het script een veel groter scala aan combinaties. Verander regel 56 maar eens van:

code:

1	arrChars = arrChars.sort(function(x,y){return y.weight-x.weight});

code:

1	arrHeap = arrHeap.sort(function(x,y){return y.weight-x.weight});

woensdag 26 juli 2006 14:00

Acties:

zwippie

Electrons at work

Ok, ik dacht dat je er een gecomprimeerde file van wilde maken, want in dat geval moet je wél de reeksen met de trailing zeroes hebben, anders zou je de verschillende elementen niet meer terug kunnen vinden als alle bitreeksen achter elkaar staan. Door de nullen weet je wanneer het volgende element begint.

Disclaimer: Heel erg goed ken ik het algoritme nou ook weer niet.

How much can you compute with the "ultimate laptop" with 1 kg of mass and 1 liter of volume? Answer: not more than 10^51 operations per second on not more than 10^32 bits.

woensdag 26 juli 2006 14:23

Acties:

Skit3000

Topicstarter

Dat is juist de grap van dit algoritme, door op een speciale manier die binaire codes samen te stellen heb je al die binaire rommel niet nodig

Als je naar het voorbeeld kijkt dan zie je dat "een test tekst" met de volgende binaire boom:

code:

  = 00
n = 0100
k = 0101
s = 011
t = 10
e = 11

Wordt omgezet in:

code:

1 2	e e n t e s t t e k s t 11 11 0100 00 10 11 011 10 00 10 11 0101 011 10

Ofwel aan elkaar:

code:

1	1111010000101101110001011010101110

Je kan die string nu gewoon opdelen in stukjes van 8, en die wegschrijven als normaal ASCII teken.Waar je normaal dus voor "een test tekst" 14 bytes nodig hebt, heb je voor de gecomprimeerde versie maar (35 bits : 8 bits/byte = ) 5 bytes nodig. En je kan de comprimeerde string met de waarden die door de tree zijn gemaakt maar op één manier weer uitpakken...

zaterdag 29 juli 2006 20:48

Acties:

Skit3000

Topicstarter

Oke, ik heb alles draaien en het werkt opzich nog goed ook! Ik kan 50 KB in 1 seconde comprimeren tot ongeveer 45% van de oorspronkelijke grootte, bij grotere bestanden is de compressie zelfs nog groter (maar duurt het dus ook langer). Het enige nadeel is dat uitpakken van dezelfde hoeveelheid gegevens 4 tot 5 seconde duurt. Ik wil de uitpakloop daarom wat versnellen maar ik weet even niet meer hoe.

Onderstaande functie is het deel van de code die het langste nodig heeft om uitgevoerd te worden (80% van de totale tijd):

JavaScript:

document.write(convert());

function convert()
{
    // The string to be converted
    strBits = "100101110001101010000100110010111000110101000010010100101000011011100101100";
    // An array which holds all the values and their replacements
    arrLookup = new Array()
    arrLookup["100"] = "a";
    arrLookup["101"] = "b";
    arrLookup["1100"] = "c";
    arrLookup["01101"] = "d";
    arrLookup["01000"] = "e";
    arrLookup["01001"] = "f";

    // Array to keep the converted values
    var arrOutput = new Array();
    // Temporary variable which remembers the bits read by strBits.charAt(i)
    var strTempBits = "";
    
    var i = 0;
    intLoopLength = strBits.length;
    while (i < intLoopLength)
    {
        // Read a bit and put it at the end of strTempBits
        strTempBits += strBits.charAt(i);
        i++;
        // If the bit sequence in strTempBits is the same as a key name of the 
        // arrLookup array, output the value of that array item to arrOutput
        if (arrLookup[strTempBits])
        {
            arrOutput.push(arrLookup[strTempBits]);
            strTempBits = strBits.charAt(i);
            i++;
        }
    }
    // Put the array into a string variable
    var strOutput = arrOutput.join("");
    // Output
    return strOutput;
}

De grootste tegenstribbelaar is de "if (arrLookup[strTempBits])" regel. Weet iemand anders een manier om een soort van Lookup-table te maken waarbij je sneller kunt controleren of het opgevraagde item bestaat dan bovenstaande methode?

[ Voor 6% gewijzigd door Skit3000 op 29-07-2006 20:48 ]

zaterdag 29 juli 2006 20:50

Acties:

Skit3000

Topicstarter

Ik heb trouwens ook een Run Length Encoding gemaakt die heel snel werkt en als ik die combineer met de huffman-methode die ik nu gebruik, kan de compressieratio oplopen van 60 tot 80%. Als alles af is maak ik de code trouwens open-source, bij mijn weten is er namelijk nog geen compressietechniek voor javascript die zulke resultaten kan behalen

zaterdag 29 juli 2006 20:58

Acties:

frickY

Skit3000 schreef op woensdag 26 juli 2006 @ 14:23:
.Waar je normaal dus voor "een test tekst" 14 bytes nodig hebt, heb je voor de gecomprimeerde versie maar (35 bits : 8 bits/byte = ) 5 bytes nodig.

En hoeveel extra bytes heb je nodig om de tree in op te slaan? Want ik meen dat je die ook nodig hebt om je plaintext weer terug te krijgen?

zaterdag 29 juli 2006 21:10

Acties:

Janoz

Moderator Devschuur®

!litemod

Bij het uitlezen moet je juist weer de boom gebruiken. Je leest gewoon het bestand bit voor bit in. Bij een 0 neem je het ene kind en bij een 1 neem je het andere kind. Als je bij een leaf uitkomt zet je het bijbehorende teken in je output en begin je weer in de root. OP die manier kun je in O(N) je hele bestand decoderen.

frickY schreef op zaterdag 29 juli 2006 @ 20:58:
[...]

En hoeveel extra bytes heb je nodig om de tree in op te slaan? Want ik meen dat je die ook nodig hebt om je plaintext weer terug te krijgen?

De bytes die je nodig hebt om die boom op te slaan ligt tussen de n en 2n bytes waarbij n het aantal verschillende tekens in je document is (en het aantal tekens niet boven de 256 uitkomt, in dat geval heb je per teken immers meer ruimte nodig). Bij teksten van meer dan 1 regel kan het dus al uit. Wat de post waarop je reageert echter wil laten zien is dat je niet perse een fixed width per teken of stopteken nodig hebt om de verschillende tekens te kunnen onderscheiden.

[ Voor 54% gewijzigd door Janoz op 29-07-2006 21:15 ]

Ken Thompson's famous line from V6 UNIX is equaly applicable to this post:
'You are not expected to understand this'

zaterdag 29 juli 2006 21:32

Acties:

Skit3000

Topicstarter

frickY schreef op zaterdag 29 juli 2006 @ 20:58:
[...]

En hoeveel extra bytes heb je nodig om de tree in op te slaan? Want ik meen dat je die ook nodig hebt om je plaintext weer terug te krijgen?

Bij mijn manier heb je per uniek teken in de originele tekst voor de tree 8 bits nodig voor het oorspronkelijke teken, 4 bits om de lengte van het vervangende teken aan te geven en daarna 1 tot 16 bits voor dat vervangende teken. Dus tussen de 13 en 24 bits per teken. Ik heb expres voor deze wijze gekozen (en niet voor het reconstructueren van de originele tree) omdat voor het maken van de tree best veel code nodig is. Omdat het in javascript is (en de client dus altijd voor het uitpakken van data eerst het uitpak algoritme moet downloaden) denk ik dat het op deze wijze beter en sneller is dan het zelf elke keer opnieuw genereren van de benodigde tree. Bij grote bestanden merk je het verschil in grootte zelfs haast niet eens, omdat het maar om een paar bits gaat op vaak vele kilobytes.

zaterdag 29 juli 2006 21:42

Acties:

TheBlasphemer

Als je nog snelheid wilt optimaliseren kun je dit hele stukje weggooien:

code:

    // Calculate weight in %
    for (var intRank in arrChars)
    {
        arrChars[intRank].weight = arrChars[intRank].count / strText.length * 100;
    }

en simpelweg ipv "count" "weight" gebruiken. Je gebruikt die weight immers alleen om te sorteren, en je wilt het hoogste percentage bovenop hebben. Aangezien het hoogste percentage toch altijd het hoogst aantal keren geteld is, hoef je die hele percentage berekening niet te doen

[img=http://www.web2messenger.com/smallstatus/w2m/theblasp.png]

zaterdag 29 juli 2006 21:51

Acties:

H!GHGuY

Try and take over the world...

er bestaat bovendien een instant-methode zodat je niet 2 maal door je bestand moet lopen.

je bouwt daarin je trie tijdens het doorlopen dynamisch op. Deze voorkomt ook dat je je huffman trie moet doorzenden. Ze kan ook het bestand nog wat verder inkorten (doordat codes maar tijdens de loop van het bestand langer worden)

ASSUME makes an ASS out of U and ME

zaterdag 29 juli 2006 22:20

Acties:

Skit3000

Topicstarter

TheBlasphemer schreef op zaterdag 29 juli 2006 @ 21:42:
Als je nog snelheid wilt optimaliseren kun je dit hele stukje weggooien:
code:
1
2
3
4
5
    // Calculate weight in %
    for (var intRank in arrChars)
    {
        arrChars[intRank].weight = arrChars[intRank].count / strText.length * 100;
    }
en simpelweg ipv "count" "weight" gebruiken. Je gebruikt die weight immers alleen om te sorteren, en je wilt het hoogste percentage bovenop hebben. Aangezien het hoogste percentage toch altijd het hoogst aantal keren geteld is, hoef je die hele percentage berekening niet te doen

Ik had het ook al door ja en had die aanpassing zelf al gemaakt. Toch bedankt

Het inpakken is het probleem ook niet zozeer, dat heb ik al redelijk geoptimaliseerd maar het probleem is dat het uitpakken van dezelfde data 4 keer langer duurt door de lookup-tabel die ik gebruik.

HIGHGuY schreef op zaterdag 29 juli 2006 @ 21:51:
er bestaat bovendien een instant-methode zodat je niet 2 maal door je bestand moet lopen.

je bouwt daarin je trie tijdens het doorlopen dynamisch op. Deze voorkomt ook dat je je huffman trie moet doorzenden. Ze kan ook het bestand nog wat verder inkorten (doordat codes maar tijdens de loop van het bestand langer worden)

Voor de duidelijkheid hier nog even de laatste stand van zaken van m'n script. Graag niet ergens anders neerkwakken en zeggen dat je het zelf hebt gemaakt

HTML:

<pre>
Data:                   [Header (16)][Lookup table (max 7168)][Compressed data]

Header:                 [Unused bits at last compressed byte (3)][Length of lookup table (13)]
Lookup table:           [Character (13 to 28)][Character (13 to 28)][...]
Character:              [Length of binary code (4)][Original character (8)][Binary code (1 to 16)]
Compressed data:        [10010101011011010101...]

<i>Numbers between brackets are length values in BITS</i>
<hr>
<script language="javascript">

// Use a long text to get a good look at the needed time
var text = "Five sco1re year0s ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity. But one hundred years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later, the Negro is still languished in the corners of American society and finds himself an exile in his own land. And so we've come here today to dramatize a shameful condition. In a sense we've come to our nation's capital to cash a check. When the architects of our republic wrote the magnificent words of the Constitution and the Declaration of Independence, they were signing a promissory note to which every American was to fall heir. This note was a promise that all men, yes, black men as well as white men, would be guaranteed the unalienable Rights of Life, Liberty and the pursuit of Happiness. It is obvious today that America has defaulted on this promissory note, insofar as her citizens of color are concerned. Instead of honoring this sacred obligation, America has given the Negro people a bad check, a check which has come back marked insufficient funds.But we refuse to believe that the bank of justice is bankrupt. We refuse to believe that there are insufficient funds in the great vaults of opportunity of this nation. And so, we've come to cash this check, a check that will give us upon demand the riches of freedom and the security of justice.We have also come to this hallowed spot to remind America of the fierce urgency of Now. This is no time to engage in the luxury of cooling off or to take the tranquilizing drug of gradualism. Now is the time to make real the promises of democracy. Now is the time to rise from the dark and desolate valley of segregation to the sunlit path of racial justice. Now is the time to lift our nation from the quicksands of racial injustice to the solid rock of brotherhood. Now is the time to make justice a reality for all of God's children.It would be fatal for the nation to overlook the urgency of the moment. This sweltering summer of the Negro's legitimate discontent will not pass until there is an invigorating autumn of freedom and equality. Nineteen sixty-three is not an end, but a beginning. And those who hope that the Negro needed to blow off steam and will now be content will have a rude awakening if the nation returns to business as usual. And there will be neither rest nor tranquility in America until the Negro is granted his citizenship rights. The whirlwinds of revolt will continue to shake the foundations of our nation until the bright day of justice emerges.But there is something that I must say to my people, who stand on the warm threshold which leads into the palace of justice: In the process of gaining our rightful place, we must not be guilty of wrongful deeds. Let us not seek to satisfy our thirst for freedom by drinking from the cup of bitterness and hatred. We must forever conduct our struggle on the high plane of dignity and discipline. We must not allow our creative protest to degenerate into physical violence. Again and again, we must rise to the majestic heights of meeting physical force with soul force.The marvelous new militancy which has engulfed the Negro community must not lead us to a distrust of all white people, for many of our white brothers, as evidenced by their presence here today, have come to realize that their destiny is tied up with our destiny. And they have come to realize that their freedom is inextricably bound to our freedom.";
text += text;
text += text;
text += text;

begin = new Date();
// Compress string
compressed = compress(text);
end = new Date();
alert("Ingepakt in: " + (end - begin) + " miliseconde");

document.write("Compressed length: " + compressed.length + "\n");
document.write("Original length: " + text.length + "\n");
document.write("Saved: " + parseInt(100 - compressed.length / text.length * 100) + "%\n")
document.write("Original text: " + text);
document.write("\n");

begin = new Date();

// Decompress here
decompressed = decompress(compressed);

end = new Date();
alert("Uitgepakt in: " + (end - begin) + " miliseconde");

document.write("Unpacked text: " + decompressed);

function decompress(strCompressed)
{
    var strBits = "";
    // Create binary code of strCompressed
    var arrBits = new Array();
    for (var i=0; i < strCompressed.length; i++)
    {
        arrBits.push(toBinary(strCompressed.charCodeAt(i), 8));
    }
    strBits = arrBits.join("");
    arrBits = null;
    // Extract header bits
    var intHeaderUnusedBits = parseInt(strBits.substr(0, 3), 2);
    var intHeaderLengthLookupTable = parseInt(strBits.substr(3, 13), 2);
    // Extract lookup table and create lookup table array
    var i = 16;
    var intBinaryCodeLength;
    var strOriginalCharacter;
    var intBinaryCode;
    var arrLookup = new Array();
    var intLoopLength = intHeaderLengthLookupTable + 16;
    while (i < intLoopLength)
    {
        intBinaryCodeLength = parseInt(strBits.substr(i, 4), 2);
        strOriginalCharacter = String.fromCharCode(parseInt(strBits.substr(i + 4, 8), 2));
        intBinaryCode = strBits.substr(4 + 8 + i, intBinaryCodeLength);
        arrLookup[intBinaryCode] = strOriginalCharacter;
        i += 4 + 8 + intBinaryCodeLength;
    }
    // Extract compressed data and replace compressed data with data from lookup table array
    i = 16 + intHeaderLengthLookupTable;
    var strOutput = "";
    var arrOutput = new Array();
    intLoopLength = strBits.length - intHeaderUnusedBits;
    var strTempBits = "";
    while (i < intLoopLength)
    {
        strTempBits += strBits.charAt(i);
        i++;
        if (arrLookup[strTempBits])
        {
            arrOutput.push(arrLookup[strTempBits]);
            strTempBits = strBits.charAt(i);
            i++;
        }
    }
    strOutput = arrOutput.join("");
    arrOutput = null;
    // Output
    return strOutput;
}

function compress(strText)
{
    // Variables used for creating a frequenty table
    var arrChars = new Array();
    var strChar = "";
    var strRegExpChar = "";
    var strTempText = strText;
    var intTempLength;
    var objTempTreeNode;
    // Create the frequenty table
    while (strTempText.length > 0)
    {
        strChar = strTempText.charAt(0);
        intTempLength = strTempText.length;
        // Not all characters can be used with the RegExp object, escape them
        if ("[\\^$.|?*+()".indexOf(strChar) > -1)
        {
            strRegExpChar = "\\" + strChar;
        }
        else
        {
            strRegExpChar = strChar;
        }
        strTempText = strTempText.replace(new RegExp(strRegExpChar, "g"), "");
        objTempTreeNode = new treeNode("LEAF");
        objTempTreeNode.value = strChar;
        objTempTreeNode.regex = strRegExpChar;
        // Calculate weight in % of total
        objTempTreeNode.weight = (intTempLength - strTempText.length) / strText.length * 100;
        // Add the tree node object to the arrChars array
        arrChars.push(objTempTreeNode);
    }
    // Sort, based on weight
    arrChars.sort(function(x,y){return y.weight-x.weight});
    // Put characters in a heap (this will remove all empty items which are created while
    // creating the frequenty table, since javascript doesn't like array index numbers like
    // "143" to be used)
    var arrHeap = new Array();
    for (var i in arrChars)
    {
        arrHeap.push(arrChars[i]);
    }
    // Create a Huffman table
    var objLeft;
    var objRight;
    var objParent;
    while (arrHeap.length > 1)
    {
        // Remove the two lowest nodes from the heap
        objLeft = arrHeap.pop();
        objRight = arrHeap.pop();
        // Create a new parent node which will contain both child nodes
        objParent = new treeNode("BRANCH");
        objParent.weight = objLeft.weight + objRight.weight;
        // Assign the child nodes to the parent node and put the parent node on the heap
        objParent.childLeft = objLeft;
        objParent.childRight = objRight;
        arrHeap.push(objParent);
        // Sort, based on weight
        arrHeap.sort(function(x,y){return y.weight-x.weight});
    }
    // Create the tree which will be outputted and added to the compressed code
    objParent.setBinary("");
    // Create the header and lookup table and put the bits of it in front of strBits
    var strBits = strText;
    // Replace 0 and 1 before the actual loop, to prevent them from messing
    // up all other bits. This is only an internal matter and does not change
    // anything to the outputted bit sequence
    strBits = strBits.replace(new RegExp("0", "g"), "\0\0ZERO\0\0");
    strBits = strBits.replace(new RegExp("1", "g"), "\0\0ONE\0\0");
    for (var i in arrChars)
    {
        if (arrChars[i].value == "0")
        {
            strBits = strBits.replace(new RegExp("\0\0ZERO\0\0", "g"), arrChars[i].binary);
            arrChars[i].regex = "\0\0ZERO\0\0";
        }
        else if (arrChars[i].value == "1")
        {
            strBits = strBits.replace(new RegExp("\0\0ONE\0\0", "g"), arrChars[i].binary);
            arrChars[i].regex = "\0\0ONE\0\0";
        }
    }
    // Create the header and lookup table and put the bits of it in front of strBits
    var strHeaderLookupTable = "";
    for (var i in arrChars)
    {
        // Replace this character with its bit sequence
        strBits = strBits.replace(new RegExp(arrChars[i].regex, "g"), arrChars[i].binary);
        // Length of binary code
        strHeaderLookupTable += toBinary(arrChars[i].binary.length, 4);
        // Original character
        strHeaderLookupTable += toBinary(arrChars[i].value.charCodeAt(0), 8);
        // Binary code
        strHeaderLookupTable += arrChars[i].binary;
    }
    var strHeaderLengthLookupTable = toBinary(strHeaderLookupTable.length, 13);
    var strHeader = strHeaderLengthLookupTable + strHeaderLookupTable;
    strBits = strHeader + strBits;
    var strHeaderUnusedBits = toBinary(8 - ((strBits.length + 3) % 8), 3);
    if (strHeaderUnusedBits.length > 3)
    {
        strHeaderUnusedBits = "000";
    }
    strBits = strHeaderUnusedBits + strBits;
    // Append empty zeros to strBits to make sure it ends with an octet
    while (strBits.length % 8 > 0)
    {
        strBits += "0";
    }
    // Create normal ASCII characters of the binary string in strBits
    var strCompressed = "";
    var arrCompressed = new Array();
    for (var i = 0; i < strBits.length; i = i + 8)
    {
        arrCompressed.push(String.fromCharCode(parseInt(strBits.substr(i, 8), 2)));
    }
    strCompressed = arrCompressed.join("");
    arrCompressed = null;
    // Return the compressed string
    return strCompressed;
}

// Object which will be created for every node in the binary Huffman tree
function treeNode(strType)
{
    this.type = strType;
    this.weight = null;
    this.binary = null;
    // Only add properties which are used by the specified type
    if (strType == "LEAF")
    {
        // LEAF type
        this.value = null;
        this.regex = null;
    }
    else
    {
        // BRANCH type
        this.childLeft = null;
        this.childRight = null;
    }
    // Function which updates its binary value ánd which executes itself at its child nodes (if
    // the current node is of the BRANCH type)
    this.setBinary = function(strBinary)
    {
        this.binary = strBinary;
        if (this.type == "BRANCH")
        {
            return this.childLeft.setBinary(strBinary + "0") + this.childRight.setBinary(strBinary + "1");
        }
        else
        {
            return this.value + strBinary + "\0";
        }
    }
}

// Function to convert a number to base 2. The second parameter can adjust the minimal length
// of the outputted value, which is default to 8.
function toBinary(intNumber, intLength)
{
    intNumber = "0000000000000000" + intNumber.toString(2);
    return intNumber.substr(intNumber.length - (intLength || 8));
}

</script>

[ Voor 5% gewijzigd door Skit3000 op 29-07-2006 22:21 ]

zondag 30 juli 2006 00:33

Acties:

Janoz

Moderator Devschuur®

!litemod

Skit3000 schreef op zaterdag 29 juli 2006 @ 21:32:
Omdat het in javascript is (en de client dus altijd voor het uitpakken van data eerst het uitpak algoritme moet downloaden) denk ik dat het op deze wijze beter en sneller is dan het zelf elke keer opnieuw genereren van de benodigde tree.

A;s je het in een appart js bestand zet kan het keurig worden gecached door de browser. Grootte maakt in dat geval ook niet zo heel erg veel meer uit.

Ken Thompson's famous line from V6 UNIX is equaly applicable to this post:
'You are not expected to understand this'

zondag 30 juli 2006 13:29

Acties:

Skit3000

Topicstarter

Janoz schreef op zondag 30 juli 2006 @ 00:33:
[...]

A;s je het in een appart js bestand zet kan het keurig worden gecached door de browser. Grootte maakt in dat geval ook niet zo heel erg veel meer uit.

Behalve als je maar één bestand wilt uitpakken. Bij meerdere bestanden heb je inderdaad gelijk, maar die paar bits die je bespaard door op de "echte" manier de tree door te sturen wegen niet op tegen de extra hoeveelheid code (en tijd!) die je nodig hebt om die tree dan weer te genereren.

zondag 30 juli 2006 17:35

Acties:

XangadiX

trepanatie is zóó kinderachtig

offtopic:
is dit dezeflde techniek als die voor gif compressie wordt gebruikt?

Stoer; Marduq

zondag 30 juli 2006 17:54

Acties:

Skit3000

Topicstarter

XangadiX schreef op zondag 30 juli 2006 @ 17:35:

offtopic:
is dit dezeflde techniek als die voor gif compressie wordt gebruikt?

Deels denk ik. Ik heb géén idee hoe GIF compressie precies werkt, maar ik weet wel dat er minder kleuren te gebruiken zijn (dus heb je per kleur minder bits nodig) en volgens mij werkt het ook door horizontale herhaling "af te korten", dus al zijn er 10 pixels achter elkaar zwart, in plaats van 10 keer de code voor zwart neer te zetten, één speciale code neerzetten, dan de code van zwart en daarna de code voor "10 keer herhalen". Compressie bij afbeeldingen is sowieso anders dan van gegevens als tekst, aangezien het bij een plaatje vaak niet veel uitmaakt als de kleuren iets anders zijn. Maar moet je je eens indenken dat je een document waar je uren aan hebt gewerkt opslaat, met een compressietechniek als GIF opslaat en dat je er later achterkomt dat alleen de letters van A tot en met de M nog maar zichtbaar zijn...

zondag 30 juli 2006 18:02

Acties:

crisp

Devver

Pixelated

Gif compressie is een vorm van LZW; ik heb daar een implementatie voor geschreven in javascript

Voor plaatjes met weinig kleuren is het in ieder geval vziw nog steeds 1 van de beste compressie-technieken. Huffman is bij dat soort data toch een stuk minder efficient (en bovendien trager).

Skit3000: jij beschrijft RLE (runlength encoding), dat is niet het geval bij GIF

[ Voor 70% gewijzigd door crisp op 30-07-2006 18:06 ]

Intentionally left blank

zondag 30 juli 2006 20:52

Acties:

Janoz

Moderator Devschuur®

!litemod

Gif gebruikt idd LZW (vandaar ook dat patenten geneuzel aangezien lzw compressie). Dat komt niet overeen met dit algoritme. Huffman kijkt enkel naar de losse tekens. Bij lzw wordt juist naar tekenreeksen gekeken.

Ken Thompson's famous line from V6 UNIX is equaly applicable to this post:
'You are not expected to understand this'

zondag 30 juli 2006 21:27

Acties:

Skit3000

Topicstarter

[b][message=26217252,noline]Skit3000: jij beschrijft RLE (runlength encoding), dat is niet het geval bij GIF

Ja, ik wist dat het zo heette maar ik hield het maar even op Jip en Janneke taal

Ik heb een RLE-encoding gemaakt die snel werkt en die ik zo bij mijn huidige compressie in kan bouwen. Het enige nadeel is dat het alleen goed werkt bij tekstbestanden, maar dat is ook het enige soort data dat je via javascript zal versturen/ontvangen.

zondag 30 juli 2006 22:53

Acties:

crisp

Devver

Pixelated

Skit3000 schreef op zondag 30 juli 2006 @ 21:27:
[...]

...maar dat is ook het enige soort data dat je via javascript zal versturen/ontvangen.

javascript kan prima overweg met binaire data hoor

Intentionally left blank

maandag 31 juli 2006 14:29

Acties:

Skit3000

Topicstarter

crisp schreef op zondag 30 juli 2006 @ 22:53:
[...]

javascript kan prima overweg met binaire data hoor

Ja, maar hoe kan je javascript binaire data laten weergeven op je scherm?

En mijn RLE-encoding kan ook werken met binaire data, maar het algoritme dat ik gebruik om herhaling te ontdekken is gewoon een RegExp die herhalende woorden herkent.

Pagina: 1

Reageer