Hash Code Compression - Java
-
Hash Code Compression
I am currently working on a dictionary populating program. I currently
have a socket connection to my local news server and am trawling through
all of the articles looking for new words. Java's String class has a
method that hashes strings. I was wondering if I should still be using
it even though I have over two million words in the hash table,
although the hash table is currently Big O(4).
I am using the Multiply, Add and Divide (MAD) method for the compression
of the hash code. Does Java have any built-in functions (methods) that
will do this for me, or does anyone know of a more efficient way?
j1mb0jay
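For readers unfamiliar with the term, MAD compression is commonly written as ((a*h + b) mod p) mod n, where h is the hash code, n is the number of buckets, p is a prime larger than n, and a and b are constants with a not a multiple of p. The following is only a minimal sketch of that formula, not the poster's code; the class name and the values of A, B and P are assumptions chosen for illustration.

```java
class MadCompression {
    // Hypothetical constants, chosen for illustration only.
    static final long P = 2_147_483_647L; // 2^31 - 1, a prime; must exceed the bucket count
    static final long A = 1_000_003L;     // multiplier, must not be a multiple of P
    static final long B = 12_345L;        // offset, 0 <= B < P

    /** Maps a 32-bit hash code to a bucket index in [0, buckets). */
    static int compress(int hashCode, int buckets) {
        // Long arithmetic avoids int overflow; floorMod keeps the result non-negative.
        long scrambled = Math.floorMod(A * hashCode + B, P);
        return (int) (scrambled % buckets);
    }

    public static void main(String[] args) {
        int buckets = 1031; // assumed table size, not from the post
        System.out.println(compress("dictionary".hashCode(), buckets));
    }
}
```

Math.floorMod is used rather than the % operator because hashCode() can return a negative value.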
-
Re: Hash Code Compression
j1mb0jay wrote:
> I am currently working on a dictionary populating program. I currently
> have a socket connection my local news server and am trawling through
> all of the articles looking for new words. Java's String class has a
> method that hashes strings. I was wondering if i should still be using
> these even though I have over two million words in the hash table.
> Although the hash table is currently Big 0(4).
This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
by definition. What do you really mean?
> I am using the Multiply Add and Divide (MAD) method for the compression
> of the hash code, does Java have any built in functions(methods) that
> will do this for me, or does anyone know of a more efficient way?
The value delivered by hashCode -- for any class, not
just for String -- is a Java int, 32 bits wide. How (and why)
are you "compressing" this value?
--
Eric Sosman
esosman@ieee-dot-org.invalid
-
Re: Hash Code Compression
Eric Sosman wrote:
> j1mb0jay wrote:
>> I am currently working on a dictionary populating program. I currently
>> have a socket connection my local news server and am trawling through
>> all of the articles looking for new words. Java's String class has a
>> method that hashes strings. I was wondering if i should still be using
>> these even though I have over two million words in the hash table.
>> Although the hash table is currently Big 0(4).
>
> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
> by definition. What do you really mean?
>
>> I am using the Multiply Add and Divide (MAD) method for the
>> compression of the hash code, does Java have any built in
>> functions(methods) that will do this for me, or does anyone know of a
>> more efficient way?
>
> The value delivered by hashCode -- for any class, not
> just for String -- is a Java int, 32 bits wide. How (and why)
> are you "compressing" this value?
>
My hash table is made up of an array of n LinkedLists (where n is a
prime number that is roughly double the number of words in the dictionary).
I first call String.hashCode() on a given word. I then
compress this number so that I can use it as an index into the array of
LinkedLists, as the 32-bit number is often far too large. I then insert
the word into the LinkedList at the compressed index. (The
hash table is an array of LinkedLists so that it can handle
collisions.)
After inserting all of the words into the dictionary, the largest
LinkedList in the array has only four elements. I thought Big O(4) was
the correct way of describing this.
Would it help if I posted my classes here, or offered you a place to
download the program?
j1mb0jay
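The design described above (an array of linked lists indexed by a compressed hash code) might look roughly like the sketch below. The poster's classes were never posted, so the class and method names here are hypothetical, and a plain Math.floorMod stands in for the MAD compression step to keep the example short.

```java
import java.util.LinkedList;

class ChainedWordSet {
    private final LinkedList<String>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    ChainedWordSet(int bucketCount) {
        buckets = new LinkedList[bucketCount]; // chains are created lazily
    }

    /** Compresses the 32-bit hash code to an index in [0, buckets.length). */
    private int indexFor(String word) {
        return Math.floorMod(word.hashCode(), buckets.length);
    }

    /** Adds a word; returns false if it was already stored. */
    boolean add(String word) {
        int i = indexFor(word);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        if (buckets[i].contains(word)) {
            return false;
        }
        buckets[i].add(word);
        size++;
        return true;
    }

    boolean contains(String word) {
        LinkedList<String> chain = buckets[indexFor(word)];
        return chain != null && chain.contains(word);
    }

    int size() {
        return size;
    }

    /** Length of the longest chain, the "four elements" figure quoted above. */
    int longestChain() {
        int longest = 0;
        for (LinkedList<String> chain : buckets) {
            if (chain != null) {
                longest = Math.max(longest, chain.size());
            }
        }
        return longest;
    }
}
```

A lookup only scans the one chain selected by indexFor, which is why the longest chain length bounds the number of words compared.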
-
Re: Hash Code Compression
On Jan 11, 3:05 pm, j1mb0jay <n...@none.com> wrote:
> Eric Sosman wrote:
> > j1mb0jay wrote:
> >> I am currently working on a dictionary populating program. I currently
> >> have a socket connection my local news server and am trawling through
> >> all of the articles looking for new words. Java's String class has a
> >> method that hashes strings. I was wondering if i should still be using
> >> these even though I have over two million words in the hash table.
> >> Although the hash table is currently Big 0(4).
>
> > This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
> > by definition. What do you really mean?
>
> >> I am using the Multiply Add and Divide (MAD) method for the
> >> compression of the hash code, does Java have any built in
> >> functions(methods) that will do this for me, or does anyone know of a
> >> more efficient way?
>
> > The value delivered by hashCode -- for any class, not
> > just for String -- is a Java int, 32 bits wide. How (and why)
> > are you "compressing" this value?
>
> My hash table is made up of an array of n LinkedLists (where n is a
> prime number that is roughly double the number of words in the dictionary).
>
> I firstly use the String.hashCode() method on a given word. I then
> compress this number so that i can use it as a index into the array of
> LinkedList; as this 32bit number is often far to large. I then insert
> the word into the LinkedList array at the compressed value index(The
> fact the hashTable is an array of LinkedLists is so that it handles
> collisions)
>
> After inserting all of the words into the dictionary the largest
> LinkedList in the array has only four elements. I thought Big O(4) was
> the correct way of describing this.
>
> Would it help if i posted my classes on here, or offer you a place to
> download the program.
>
> j1mb0jay
Why aren't you using the existing HashMap class?
If you want a compact representation of the words you come across,
consider a prefix tree data structure instead.
Just so you know, Big O measures the dominant term without
multipliers. For instance, if your algorithm takes N*N + N + 4
steps, then it is O(N*N). If it takes 4*N*N steps, it is still O(N*N).
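The prefix tree (trie) suggested above stores each word as a path of characters, so words with a common prefix share nodes. A minimal sketch follows, assuming a hypothetical WordTrie class. It uses java.util.HashMap for the child links purely for brevity; a coursework version would presumably swap in a hand-written structure such as an array of children.

```java
import java.util.HashMap;
import java.util.Map;

class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true if the path from the root to here spells a stored word
    }

    private final Node root = new Node();

    /** Inserts a word; returns true if it was not already present. */
    boolean add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        boolean isNew = !node.endOfWord;
        node.endOfWord = true;
        return isNew;
    }

    /** Returns true only for complete stored words, not for bare prefixes. */
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }
}
```

Lookup cost here depends on the length of the word rather than on the number of words stored.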
-
Re: Hash Code Compression
j1mb0jay wrote:
> Eric Sosman wrote:
>> j1mb0jay wrote:
>>> I am currently working on a dictionary populating program. I
>>> currently have a socket connection my local news server and am
>>> trawling through all of the articles looking for new words. Java's
>>> String class has a method that hashes strings. I was wondering if i
>>> should still be using these even though I have over two million words
>>> in the hash table. Although the hash table is currently Big 0(4).
>>
>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
>> by definition. What do you really mean?
>>
>>> I am using the Multiply Add and Divide (MAD) method for the
>>> compression of the hash code, does Java have any built in
>>> functions(methods) that will do this for me, or does anyone know of a
>>> more efficient way?
>>
>> The value delivered by hashCode -- for any class, not
>> just for String -- is a Java int, 32 bits wide. How (and why)
>> are you "compressing" this value?
>>
>
>
> My hash table is made up of an array of n LinkedLists (where n is a
> prime number that is roughly double the number of words in the dictionary).
>
> I firstly use the String.hashCode() method on a given word. I then
> compress this number so that i can use it as a index into the array of
> LinkedList; as this 32bit number is often far to large. I then insert
> the word into the LinkedList array at the compressed value index(The
> fact the hashTable is an array of LinkedLists is so that it handles
> collisions)
>
> After inserting all of the words into the dictionary the largest
> LinkedList in the array has only four elements. I thought Big O(4) was
> the correct way of describing this.
>
> Would it help if i posted my classes on here, or offer you a place to
> download the program.
This is very similar to the design of java.util.HashSet, except that
HashSet already has methods for mapping from hashCode to bucket number
that have been tested with Java String.
Is there some particular reason for rolling your own rather than using
the java.util class?
Patricia
-
Re: Hash Code Compression
Daniel Pitts wrote:
> On Jan 11, 3:05 pm, j1mb0jay <n...@none.com> wrote:
>> Eric Sosman wrote:
>>> j1mb0jay wrote:
>>>> I am currently working on a dictionary populating program. I currently
>>>> have a socket connection my local news server and am trawling through
>>>> all of the articles looking for new words. Java's String class has a
>>>> method that hashes strings. I was wondering if i should still be using
>>>> these even though I have over two million words in the hash table.
>>>> Although the hash table is currently Big 0(4).
>>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
>>> by definition. What do you really mean?
>>>> I am using the Multiply Add and Divide (MAD) method for the
>>>> compression of the hash code, does Java have any built in
>>>> functions(methods) that will do this for me, or does anyone know of a
>>>> more efficient way?
>>> The value delivered by hashCode -- for any class, not
>>> just for String -- is a Java int, 32 bits wide. How (and why)
>>> are you "compressing" this value?
>> My hash table is made up of an array of n LinkedLists (where n is a
>> prime number that is roughly double the number of words in the dictionary).
>>
>> I firstly use the String.hashCode() method on a given word. I then
>> compress this number so that i can use it as a index into the array of
>> LinkedList; as this 32bit number is often far to large. I then insert
>> the word into the LinkedList array at the compressed value index(The
>> fact the hashTable is an array of LinkedLists is so that it handles
>> collisions)
>>
>> After inserting all of the words into the dictionary the largest
>> LinkedList in the array has only four elements. I thought Big O(4) was
>> the correct way of describing this.
>>
>> Would it help if i posted my classes on here, or offer you a place to
>> download the program.
>>
>> j1mb0jay
>
> Why aren't you using the existing HashMap class?
>
> If you want a compact representation of the words you come across,
> consider a prefix tree data structure instead.
>
> Just so you know, Big O measures the dominant term without
> multipliers, For instance, if your algorithm takes N *n + N + 4
> steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)
I have been asked to create my own data structures to aid my
understanding of the course material for my degree module. (Check the
article header.)
Because I am currently building the dictionary file by trawling news
articles, each word I pull from an article needs to be checked against the
dictionary to see if we already have it (I don't want to store each word
more than once). My current approach means I only have to look at a
maximum of 4 words (out of 2.5 million) to see if I already have this
word stored in memory. Does this still mean my retrieval method is
Big O(N squared)?
j1mb0jay
-
Re: Hash Code Compression
j1mb0jay wrote:
> Daniel Pitts wrote:
>> On Jan 11, 3:05 pm, j1mb0jay <n...@none.com> wrote:
>>> Eric Sosman wrote:
>>>> j1mb0jay wrote:
>>>>> I am currently working on a dictionary populating program. I currently
>>>>> have a socket connection my local news server and am trawling through
>>>>> all of the articles looking for new words. Java's String class has a
>>>>> method that hashes strings. I was wondering if i should still be using
>>>>> these even though I have over two million words in the hash table.
>>>>> Although the hash table is currently Big 0(4).
>>>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
>>>> by definition. What do you really mean?
>>>>> I am using the Multiply Add and Divide (MAD) method for the
>>>>> compression of the hash code, does Java have any built in
>>>>> functions(methods) that will do this for me, or does anyone know of a
>>>>> more efficient way?
>>>> The value delivered by hashCode -- for any class, not
>>>> just for String -- is a Java int, 32 bits wide. How (and why)
>>>> are you "compressing" this value?
>>> My hash table is made up of an array of n LinkedLists (where n is a
>>> prime number that is roughly double the number of words in the
>>> dictionary).
>>>
>>> I firstly use the String.hashCode() method on a given word. I then
>>> compress this number so that i can use it as a index into the array of
>>> LinkedList; as this 32bit number is often far to large. I then insert
>>> the word into the LinkedList array at the compressed value index(The
>>> fact the hashTable is an array of LinkedLists is so that it handles
>>> collisions)
>>>
>>> After inserting all of the words into the dictionary the largest
>>> LinkedList in the array has only four elements. I thought Big O(4) was
>>> the correct way of describing this.
>>>
>>> Would it help if i posted my classes on here, or offer you a place to
>>> download the program.
>>>
>>> j1mb0jay
>>
>> Why aren't you using the existing HashMap class?
>>
>> If you want a compact representation of the words you come across,
>> consider a prefix tree data structure instead.
>>
>> Just so you know, Big O measures the dominant term without
>> multipliers, For instance, if your algorithm takes N *n + N + 4
>> steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)
>
> I have been asked to create my own data structures to help aid
> understanding for the course material for my degree module.(Check
> article header)
That is indeed a good reason to avoid using the standard classes.
Perhaps you should try a few different hash code to bucket number
mappings, and compare performance. In some situations I have found that
a really simple, quickly calculated mapping such as reduction modulo a
power of two had about the same collision rate as more complicated,
slower to compute, functions.
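One way to run the comparison suggested above is to count how full the fullest bucket gets under each mapping. The sketch below is an assumption-laden illustration: the class name, the synthetic "word" + i inputs and the table sizes 1024 and 1031 are all made up for the example, and a real test would feed in the actual dictionary.

```java
import java.util.function.IntUnaryOperator;

class BucketMappingComparison {
    /** Returns the size of the fullest bucket when each word is placed by the given mapping. */
    static int longestChain(String[] words, int bucketCount, IntUnaryOperator mapping) {
        int[] counts = new int[bucketCount];
        int longest = 0;
        for (String w : words) {
            int i = mapping.applyAsInt(w.hashCode());
            longest = Math.max(longest, ++counts[i]);
        }
        return longest;
    }

    public static void main(String[] args) {
        // Synthetic distinct "words", used only to make the sketch runnable.
        String[] words = new String[500];
        for (int i = 0; i < words.length; i++) {
            words[i] = "word" + i;
        }
        int powerOfTwo = 1024; // reduction modulo a power of two is a bit mask
        int prime = 1031;      // prime table size for comparison
        int maskLongest = longestChain(words, powerOfTwo, h -> h & (powerOfTwo - 1));
        int primeLongest = longestChain(words, prime, h -> Math.floorMod(h, prime));
        System.out.println("mask:  longest chain = " + maskLongest);
        System.out.println("prime: longest chain = " + primeLongest);
    }
}
```

Any other mapping, such as MAD, can be passed in as a third lambda and compared the same way.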
>
> Because i am currently building the dictionary file by trawling news
> articles each word I pull from an article needs to be checked in the
> dictionary to see if we already have it(I don't want to store each word
> more than once). My current methodology means I only have to look at a
> maximum of 4 words(out of 2.5 million) to see if I already have this
> word stored in memory. Does this still mean my retrieval method is Big(N
> Squared)
Note that the java.util hash-based classes do offer ways of controlling
the number of buckets.
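For comparison with a hand-rolled table, the bucket count of a java.util.HashSet can be influenced through its two-argument constructor, HashSet(int initialCapacity, float loadFactor). A small sketch, assuming a hypothetical helper name and a load factor of 0.5 to mirror the "roughly double" sizing described earlier in the thread:

```java
import java.util.HashSet;
import java.util.Set;

class HashSetCapacityExample {
    /** Creates a set sized so that expectedWords entries fit without resizing. */
    static Set<String> newWordSet(int expectedWords) {
        float loadFactor = 0.5f; // assumed: keep about two buckets per stored word
        int initialCapacity = (int) (expectedWords / loadFactor) + 1;
        return new HashSet<>(initialCapacity, loadFactor);
    }

    public static void main(String[] args) {
        Set<String> words = newWordSet(2_500_000);
        // add returns false when the word is already present, so no separate lookup is needed.
        System.out.println(words.add("news"));
        System.out.println(words.add("news"));
    }
}
```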
Big-O is about limits as the problem size tends to infinity. If you only
have to look at 4 words regardless of the number of words, then your
lookup performance would be O(1) (Equivalent to O(42), O(1e100) etc. but
O(1) is more conventional). If the number of words you have to examine
depends on an upper bound on the number of words you process, you would
need to examine the effect of increasing the number of words to get a
computational complexity.
Patricia
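For reference, the usual definition behind the remarks above can be written out as follows. It is the standard textbook definition, with f the step count and g the bound:

```latex
f(n) = O(g(n)) \iff \exists\, c > 0,\ n_0 \ \text{such that}\ |f(n)| \le c \cdot g(n) \ \text{for all}\ n \ge n_0

% A constant bound: f(n) = 4 satisfies 4 \le 4 \cdot 1, so O(4) = O(1).
% The example from earlier in the thread, for N \ge 1:
N^2 + N + 4 \le N^2 + N^2 + 4N^2 = 6N^2 \quad\Rightarrow\quad N^2 + N + 4 = O(N^2)
```

A maximum chain length of four is therefore O(1) only if it stays bounded as the number of words grows.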
-
Re: Hash Code Compression
Patricia Shanahan wrote:
> j1mb0jay wrote:
>> Eric Sosman wrote:
>>> j1mb0jay wrote:
>>>> I am currently working on a dictionary populating program. I
>>>> currently have a socket connection my local news server and am
>>>> trawling through all of the articles looking for new words. Java's
>>>> String class has a method that hashes strings. I was wondering if i
>>>> should still be using these even though I have over two million
>>>> words in the hash table. Although the hash table is currently Big 0(4).
>>>
>>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
>>> by definition. What do you really mean?
>>>
>>>> I am using the Multiply Add and Divide (MAD) method for the
>>>> compression of the hash code, does Java have any built in
>>>> functions(methods) that will do this for me, or does anyone know of
>>>> a more efficient way?
>>>
>>> The value delivered by hashCode -- for any class, not
>>> just for String -- is a Java int, 32 bits wide. How (and why)
>>> are you "compressing" this value?
>>>
>>
>>
>> My hash table is made up of an array of n LinkedLists (where n is a
>> prime number that is roughly double the number of words in the
>> dictionary).
>>
>> I firstly use the String.hashCode() method on a given word. I then
>> compress this number so that i can use it as a index into the array of
>> LinkedList; as this 32bit number is often far to large. I then insert
>> the word into the LinkedList array at the compressed value index(The
>> fact the hashTable is an array of LinkedLists is so that it handles
>> collisions)
>>
>> After inserting all of the words into the dictionary the largest
>> LinkedList in the array has only four elements. I thought Big O(4) was
>> the correct way of describing this.
>>
>> Would it help if i posted my classes on here, or offer you a place to
>> download the program.
>
> This is very similar to the design of java.util.HashSet, except it
> already has methods for mapping from hashCode to bucket number that have
> been tested with Java String.
>
> Is there some particular reason for rolling your own rather than using
> the java.util class?
>
> Patricia
I have to roll my own for the coursework in my data structures module.
j1mb0jay
-
Re: Hash Code Compression
On Jan 11, 3:20 pm, j1mb0jay <n...@none.com> wrote:
> Daniel Pitts wrote:
> > On Jan 11, 3:05 pm, j1mb0jay <n...@none.com> wrote:
> >> Eric Sosman wrote:
> >>> j1mb0jay wrote:
> >>>> I am currently working on a dictionary populating program. I currently
> >>>> have a socket connection my local news server and am trawling through
> >>>> all of the articles looking for new words. Java's String class has a
> >>>> method that hashes strings. I was wondering if i should still be using
> >>>> these even though I have over two million words in the hash table.
> >>>> Although the hash table is currently Big 0(4).
> >>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
> >>> by definition. What do you really mean?
> >>>> I am using the Multiply Add and Divide (MAD) method for the
> >>>> compression of the hash code, does Java have any built in
> >>>> functions(methods) that will do this for me, or does anyone know of a
> >>>> more efficient way?
> >>> The value delivered by hashCode -- for any class, not
> >>> just for String -- is a Java int, 32 bits wide. How (and why)
> >>> are you "compressing" this value?
> >> My hash table is made up of an array of n LinkedLists (where n is a
> >> prime number that is roughly double the number of words in the dictionary).
>
> >> I firstly use the String.hashCode() method on a given word. I then
> >> compress this number so that i can use it as a index into the array of
> >> LinkedList; as this 32bit number is often far to large. I then insert
> >> the word into the LinkedList array at the compressed value index(The
> >> fact the hashTable is an array of LinkedLists is so that it handles
> >> collisions)
>
> >> After inserting all of the words into the dictionary the largest
> >> LinkedList in the array has only four elements. I thought Big O(4) was
> >> the correct way of describing this.
>
> >> Would it help if i posted my classes on here, or offer you a place to
> >> download the program.
>
> >> j1mb0jay
>
> > Why aren't you using the existing HashMap class?
>
> > If you want a compact representation of the words you come across,
> > consider a prefix tree data structure instead.
>
> > Just so you know, Big O measures the dominant term without
> > multipliers, For instance, if your algorithm takes N *n + N + 4
> > steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)
>
> I have been asked to create my own data structures to help aid
> understanding for the course material for my degree module.(Check
> article header)
>
> Because i am currently building the dictionary file by trawling news
> articles each word I pull from an article needs to be checked in the
> dictionary to see if we already have it(I don't want to store each word
> more than once). My current methodology means I only have to look at a
> maximum of 4 words(out of 2.5 million) to see if I already have this
> word stored in memory. Does this still mean my retrieval method is Big(N
> Squared)
>
> j1mb0jay
Actually, it means it is O(1) (as hash table lookups tend to be).
Don't use the hash for the index in the linked list. Don't even
BOTHER with indexes in the linked list. The hash should be an index
into an array of lists (of whatever sort, linked or otherwise).
Then each list should be relatively small, so trivial to search or
insert into.
-
Re: Hash Code Compression
On Fri, 11 Jan 2008 22:11:20 +0000, j1mb0jay <none@none.com> wrote,
quoted or indirectly quoted someone who said :
>I have over two million words in the hash table
That is not very much compared with the size of the hashCode. However,
if you needed a bigger hashCode for some reason, there are several
popular digest algorithms that will give you various sized results.
See http://mindprod.com/jgloss/digest.html
The traditional way to prune a hash code to size without losing
"randomness" is to take the absolute value of the hash code modulo
the table size.
See http://mindprod.com/jgloss/hashcode.html
http://mindprod.com/jgloss/hashtable.html
http://mindprod.com/jgloss/hashmap.html
for a discussion.
--
Roedy Green, Canadian Mind Products
The Java Glossary, http://mindprod.com
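The "absolute value of the modulus" approach mentioned above can be sketched as below. The class and method names are hypothetical. The order of operations matters: Math.abs(Integer.MIN_VALUE) is still negative, so the remainder should be taken first. Math.floorMod (Java 8 and later) is shown as an alternative that yields the same range, though not necessarily the same index for negative hash codes.

```java
class ModulusPruning {
    /** Absolute value of the remainder: always in [0, buckets) for buckets > 0. */
    static int absOfModulus(int hashCode, int buckets) {
        return Math.abs(hashCode % buckets);
    }

    /** Alternative with the same range, using floorMod. */
    static int floorModIndex(int hashCode, int buckets) {
        return Math.floorMod(hashCode, buckets);
    }

    public static void main(String[] args) {
        int buckets = 1031; // assumed table size
        int h = Integer.MIN_VALUE;
        System.out.println(absOfModulus(h, buckets));
        System.out.println(floorModIndex(h, buckets));
        // Pitfall: abs is applied first here, so the result is negative for this input.
        System.out.println(Math.abs(h) % buckets);
    }
}
```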