11.30.07
Posted in Computer Science, Computer Science III, Programming at 1:45 am by Nick
While researching Red-Black trees during my original coding homework sometime ago I stumbled upon the Arne Anderson (AA) tree. Looking back I don’t see why I didn’t push off the Red-Black trees and go with AA Trees. Who knows? Here is an analysis of my work with AA Trees:
AA Tree’s are very similar to Red-Black trees. They add one rule in addition to the traditional Red-Black tree rules, and that is that a Red node may only be a Right child. This eliminates a substantial amount of special cases, making this algorithm far easier to implement. Interestingly enough, AA trees perform as well as Red-Black trees. Since the AA algorithm is simpler, it takes less time to compute, but since it makes more rotations then Red-Black trees the performance evens out. Due to the nature of the algorithm (which shapes the tree), AA trees also tend to have better retrieval times then Red-Black trees. For those unfamiliar with the tree (which is probably most people), there are only two operations required for maintaining the tree (compared to the red-black tree’s five). A right rotation occurs when an insertion or deletion creates a red left child; this is known as a Skew. Split’s perform a (conditional) left rotation when a function creates two red links in a row.
My program implements a string to integer map using AA Trees. As an example, it counts word frequencies in a file. The string handling is rather rudimentary (converts to lower case and tosses out punctuation), so some of the words are kind of funky looking, but it works pretty well. I was surprised at how fast this program ran. I didn’t time it exactly, but it scanned a 11.7MB text file full of text in less than 30 seconds.
During my coding I hit several roadblocks. Coding the algorithm went well, but when I was programming my insertion function it all went screwy. I rewrote it three or four times only to realize that my initialization in my constructor was wrong (while cascading assignments for pointers I forgot the NULL at the end). While stepping through the program for an hour using GDB, I was rather clueless as to what was going on. Everything was wrong. None of the pointers were right, none had key data stored, and it crashed randomly. After fixing this I ran into an error with Skew running on my root pointer, which should never happen. Once I got passed these problems all remaining changes were pretty cosmetic.
I would love to package this up using templates and reuse this tree implementation elsewhere. It is pretty applicable, but again, the STL has a very good built-in map class.
About the code:
I did not flesh this code out much. The tree lacks a deletion function, and I show no examples of how to retrieve a word’s count using Lookup, though it is pretty straight forward. I use Lookup primarily as a wrapper for Insert (and for clarity), differentiating the names for their intended use within the code. There is also a large amount of pointer manipulation in this code, which I tried to document, but it may still be confusing to those new to pointers. Interestingly enough, while the algorithm itself contains a significant amount of references to red and black nodes (which this algorithm is based on), there is no reference to such in the code. The code assumes all Right links are red nodes, and instead everything is based off of the “level” of a node, which is the number of left links that must be taken to reach a NULL pointer.
It comes with a few text files as well. Data.txt is a smaller text file I used to test, being about 500KB or so, which resulted in out.txt. The file test.txt was a small test, and bigout.txt is the result of scanning the whole 11.7MB file.
Code -> http://www.nick-cash.com/download/cs3/tree
Note: A wonderful summary, analysis, and comparison between some of the most common efficient data types can be found at http://www.upgrade-cepis.org/issues/2004/5/up5-5Mosaic.pdf. I recommend at least a glance at the diagrams. It is obvious that AA trees are outperformed on many levels, but they hold steady and beat data types on various challenges. Considering the simplicity of the algorithm and ease of implementation compared to the others, I’d say its far more useful.
Permalink
11.26.07
Posted in Computer Science, Computer Science III, Programming at 11:31 pm by Nick
It was interesting reading about a data type I had been using ever since I began coding. Little did I know that SMAUG and SWR made use of sequentially linked lists! Experimenting with lists has been fun and interesting. Once I got into programming it was hard to stop adding features. Lists are very easy to manipulate, and once the basics are in its easy to add things.
What I ended up with after this assignment was a class that replicated most features of STL’s built-in list class. It is templated to allow for any data type, and given a couple tests it seemed to work well. I purposefully avoided the use of any STL data types to drive the class, as that would be counter productive in the learning and showing of knowledge about lists.
I experienced many oddities with lists while I programed. After, again, fighting with template syntax, and I had an initial implementation I realized I was indeed missing a vital feature of a list data type: the ability to traverse it. When I gave the Element class the ability to cough up its pointers, it broke all list class functions that needed to set them since that data was now private. I am not totally happy with this implementation because it gives everyone the ability to modify a given elements pointers, not just its data. This is not good since anyone using those functions could potentially screw up the list, and things like that can be very, very hard to track down (nothing like spending hours stepping through a program to evaluate pointers at every step).
However, this was a very vital feature, so once I added it I found I was able to do quite a bit more with the list. An interesting side effect of my implementation is that you can add and remove things to the list without adversely affecting any current iteration (supposing some pointers are handled correctly). I demonstrate this slightly in the example programs provided by inserting an element into a list when a certain point is reached (and then that element is eventually processed like the rest).
When implementing the ability to traverse the list I found it made the code far more obscure. Lines that were already strange looking, like:
x->next->prev = x->prev;
became
(x->GetNext())->SetPrev(x->GetPrev());
And that, as far as readability of code, is very bad. However, its fairly simple to reason out, and with comments I feel no need to separate it into multiple lines of code. Also, each member function of the class is pretty specific and short, and they would normally appear multiple times within code, so each function is inline. I have no real way to tell if this is faster then outline code (is that even what its called?), but it seems reasonable that it should be inline.
Note: The code below is separated into different files for readability. It should be noted that all the code in list.h and list.cpp -NEED- to be in whatever file they will be used by. Therefore, all of the test files (main_int.cpp and main_string.cpp) would need this code copied and pasted into them to work. This has to do with the nature of how C++ uses templates and the code must be present in the source file it is used by. Due to the way I program this is awfully inconvenient, as I love my larger number of smaller file type projects. Also, the output from the two given test programs is provided in text format for examination.
Code -> http://www.nick-cash.com/download/cs3/list
Permalink
11.20.07
Posted in Computer Science, Computer Science III, Programming at 4:53 pm by Nick
While I am in the midst of Thanksgiving break from school, I find myself not taking a break at all. I’m working, and when not working I’m attempting to get school work done, as I have much to do (as always).
In any event, I decided to jump into hashing, and hash tables more specifically. Hash tables are something I’m very used to because when I started programming with SMAUG it had a hash table built into it (if you have a thousand instances of a mob it is an awful waste of space to allocate the same string a thousand times). So I learned the in’s and out’s required to use it, but like all of my knowledge at the time, I knew how to make it work without knowing -why- it worked. That has changed.
Data Structure Techniques by Thomas A. Standish had a lot of good information on hash tables. It taught me many things, even with my preconceived notions of what a hash table did and what its purpose was. I never even considered adding a key by a method other then division (modulo). His text on the multiplicative method is very interesting, though it seems a little too narrow to be useful in a large number of instances.
I initially had trouble testing my different hashing functions for one very simple reason: my computer. Since my last computer bit the dust (the eMachine lasted 5 years with no problems, I’d say thats pretty good quality), I’ve finally invested the money into a -real- computer (more on that here). Simply put, with the original test cases my independent, worst-case variable, the list structure, performed so fast I had to come up with another solution.
As I was in node mood to figure out and implement the use of a high performance timer (known as QueryPerformanceCounter on Windows), I merely used google and grabbed an easy implementation (doesn’t get easier then one structure!). After implementing that, I increased the number of items hashed by about 100 fold, bring most of the searches up to 15 million items. This gave a sufficient look at the differences between hashing functions and their performance compared to a list.
To compare hash functions, I went to Wikipedia (I think) and grabbed a recommended hash function, which happened to be Jenkin’s One-At-A-Time Hash. I then implemented two more very basic hashes. One was based off the SMAUG version, which hashes a string based on its length. The other is based on the first letter of the string. Oddly enough I named that CaseHash, when really it has nothing to do with the case of the first letter, but the letter itself.
The data was rather astounding. Jenkin’s hash almost always performed as well as or better then all other methods. Sometimes it would take a little longer (we are talking less then hundredths of a second longer though), but almost always was the top performer. Note in the final screen shot how it outperforms the other has functions, or performs as well. When doing a particularly hard lookup it would perform much, much better.
Provided via the link below is the source and a couple of screen shots (some have different output because I was busy tinkering). The first time is provided using the time() function, the second time is using the performance timer (with a precision of 15).
Code and screen shots -> http://www.nick-cash.com/download/cs3/hash
Notes:
1) I intended for this hash table implementation to be far more flexible, but after fighting with templates for quite some time I decided to merely call it quits and hard code the strings in.
2) In screen shots 1,2, and 3 it appears to be sorting numbers. These are not numbers, but in fact strings that contain incrementing numbers. This was the product of the program I used to generate the data.
3) The programs I used to generate the data are also included in the link above. They are small and simple.
4) All code listed above is public domain. Use it as you like, or don’t use it. Your choice.
Permalink