README.txt Alper Sarikaya - Section AA Jung Son - Section AB CSE326A Project 3 1. Who is in your group? Jung Son (section AB) and Alper Sarikaya (section AA) were the two people in the group working on this project. 2. Acknowledgment of any assistance you received from anyone or anywhere but your team members, the 326 staff, or the Weiss book. We did not receive any assistance from anyone or anywhere other than the lecture note on sorting (Jung) and the Weiss book (Alper). 3. How long do we think the project will take? (answer before starting) We believe this project will take approximately 5 hours. 4. How long did the project take? It took Jung about 2 hours to finish parts Vb and Vc, and Alper about 3 hours to finish Va. The README and benchmarking together took around 2 hours. 5. Which sorting algorithm did you implement for WordCount? Why? Under what circumstances might other sorting algorithms work better? The sorting algorithm implemented in the WordCount was the Radix sort. Generally, compared to comparison sorts which runs in O(nlogn) time, the Radix sort is faster with O(n*k) time, where n is the number of elements and k is the length of the key space. Other sorting algorithms would work better in cases where the key (in our case an integer) is long. Since Radix sort has to compare each digit in the key for all digits, even if there are few keys, if the length of keys are long, it will take longer to sort than other comparison algorithms. 6. Which data structure do you expect will be the fastest? (answer before starting) We believe the AVL tree will be the fastest. 7. Which data structure is the fastest? Why were you right or wrong? From the benchmarking test, the Binary Search Tree was the fastest. This could be because the BST was simple and to-the-point, there were no secondary helping functions, the call stack was smaller, and the text was probably random enough so that the tree was not too unbalanced. 8. In general, which StringCounter dictionary implementation was "better": trees or hash tables? Note that you will need to define "better" (ease of coding, ease of debugging, memory usage, disk access patterns, runtime for average input, runtime for all input, etc). The trees were "better" implementation of the StringCounter dictionary. Compared to the hash tables, the trees were harder to code thus harder to debug problems during coding; however, it definitely had advantage over the hash table due to the fact that it used less memory and only as much as the number of items in the tree. From our benchmarking analysis, the BST, AVL tree and the Splay tree showed faster runtime than the hash table, and the runtime for the hash table was almost twice that of the trees. Although it was simpler and easier to implement, the slow performance of the hash table makes the trees a better candidate for the StringCounter dictionary. 9. Are there cases in which a particular data structure performs really well or badly in the correlator? Enumerate the cases for each data structure. From our benchmarking analysis, the hash table was the slowest data structure with BST being fastest. In between, the AVL tree peformed better than the Splay tree on atlantis document but the opposite on the hamlet document; however for both documents, the difference in runtime was minimal. 10. Give a one to two paragraph explanation of whether or not you think Bacon wrote Shakespeare's plays based on the data you collected. No fancy statistical analysis here (formal analysis comes later); keep it fun and simple. The first ten most frequent words in Hamlet are the, and, to, of, I, you, a, my, it, in; while the top ten for Atlantis are the, of, and, to, we, that, a, in, for, as. These are pretty wildly different from each other for being the top ten in each of their own stories that could possibly be related. The most notable differences are 'we' (it appears 343 times [1.93%] in Atlantis and only 161 times [0.44%] in Hamlet) and 'I' (631 times in Hamlet [1.8%]and only 73 times [0.4%] in Atlantis). The words that we would attribute to Shakespeare's writing can be thou [107 times, 0.3%], thy [86, 0.25%], hath [65, 0.19%], and thee [59, 0.17%]. By comparison, thou, thy, hath, and thee are used no more than 5 times each in the text of Atlantis. Maybe Shakespeare just took Bacon's text and adapted it by adding his crazy 'olden English', but the evidence seems to point towards Shakespeare getting the ideas for Hamlet from either his wonderful mind or a passer-by frog. :) 11. Give a detailed description of all Above and Beyond projects which you implemented. What did you find most challenging about the projects you implemented? What did you find most interesting? We did not do any of the above and beyond segments of this project. 12. Writeup your benchmarks. BST: hamlet: 0.500, 0.484, 0.500, 0.484, 0.484 avg: 0.490 per word: 0.0000140 atlantis: 0.250, 0.266, 0.265, 0.265, 0.266 avg: 0.262 per word: 0.0000148 AVL: hamlet: 0.609, 0.578, 0.563, 0.641, 0.531 avg: 0.584 per word: 0.0000167 atlantis: 0.297, 0.281, 0.297, 0.297, 0.297 avg: 0.294 per word: 0.0000166 Hash table: hamlet: 0.969, 0.968, 0.984, 0.968, 0.985 avg: 0.975 per word: 0.0000278 atlantis: 0.469, 0.453, 0.438, 0.500, 0.438 avg: 0.460 per word: 0.0000260 Splay tree: hamlet: 0.531, 0.562, 0.579, 0.547, 0.547 avg: 0.553 per word: 0.0000152 atlantis: 0.313, 0.297, 0.297, 0.313, 0.312 avg: 0.306 per word: 0.0000173 count.sh: hamlet: 0.286, 0.284, 0.283, 0.289, 0.288 avg: 0.286 per word: 0.00000816 atlantis: 0.137, 0.141, 0.143, 0.136, 0.143 avg: 0.140 per word: 0.00000790 count.pl: hamlet: 0.215, 0.199, 0.210, 0.204, 0.209 avg: 0.207 per word: 0.00000590 atlantis: 0.100, 0.102, 0.103, 0.105, 0.098 avg: 0.102 per word: 0.00000575 In benchmarking, we are looking at how long it takes for the program to extract the data from the file, insert it into the appropiate ADT, sort in order of occurance (primarily) and lexographically (secondarily) and print the data to the user. This does not take the interpretation of the given arguments into account, just the core of the program. The measured times are over a broad range from 1 second to 0.1 seconds with each ADT and algorithm having their own trends and sticking around a specific running time. While 'better' is probably defined as faster than 0.5 for hamlet.txt. All of our constructed ADTs were beaten by the extremely simple, unbalanced BST, probably noting that our constructed ADTs were probably bloated and filled with extra algorithms. Five trials were taken for each ADT and text file on WordCount.java while calling -frequency (count.sh/count.pl equivalent). All four ADTs were tested on my computer (AMD64 3400+, 1gb RAM) while count.sh and count.pl were tested on attu.cs (multiple processors!). Herein lies an error in comparing these benchmarks: they were ran on separate computers! However, when running the java programs on attu, it actually takes at least 2x the time of running it on my own computer, probably due to the amount of resources that this program demands. The main source of error, therefore, is the running of other users and processes on the computer at the same time - our program never had 100% (or a constant amount of resources, for that matter) of the computers' resources to use at any single time, so this makes it difficult to ascertain how long exactly each ADT takes. Interpretting the results is a simple matter of compiling the given times with the type of data given and the ADT used. As can be seen from the results above, the regular BST kicked our butts. However, the both the splay tree and the AVL tree were very close behind and are comparable to the runtime of the BST. Although the AVL tree did slightly better than the splay tree with the atlantis document (due to the splay tree splaying so many times to add all the words in), the splay tree does a little better than the AVL tree with Hamlet - leading us to speculate that the splay tree does better than the AVL tree with large data sets. Our hash table implements quadratic probing and takes the longest (0.975s on my machine, 2.339s on attu!). However, looking at the time it takes for each per word, we can see that both the AVL tree and the splay tree get progressively worse as the data set gets larger and larger. The time per word actually *goes down* for the hash table, so we can see some constant time and/or efficient hashing going on there! We conclude that from our constructed ADTs (splay, AVL, and hash), the AVL tree is the best at handling small data sets, while the splay tree is exceptional at working on large data sets. 13. What literary character does Ruth most remind you of? Matt? Jonah? Ruth reminds me most of Trillian from Hitchhiker's Guide. I can just imagine her saying something like "hey, let's go to Madagascar!" Matt is a lot like Atticus from To Kill a Mockingbird (sorry for being cliche!), wise and timid. Jong and I haven't seen enough of Jonah this quarter, so we'll liken him to Captain Haddock from Tintin for the heck of it. :) 14. What did you enjoy about this assignment? What did you hate? Could we have done anything better? Although word counting and constructing is always fun, it would be neat if we played a little game with our program - like guess the top ten occuring words or how many times 'the' would pop up. The internal work was great excercise for our little fingers, this is really the best way to get down and dirty with the ADTs.