We know that in the unordered associative containers in the STL, like std::unordered_map, the hash function is applied to the key (of the key-value pair) to decide where each element goes. std::unordered_map, for example, is said to have average constant time, O(1), for searching an element. But if the hash function is not designed well and many collisions happen, that constant time may degrade to O(n). When can we be sure about constant-time lookups if we use the STL containers, such as std::unordered_map?
I guess if we use built-in types like int, float/double, char or even std::string as the key, we can be sure that a good hash function is already provided by the STL for them, giving us average constant time for lookups; but if the key is a user-defined type, then it's up to us to write a good hash function that produces as few (ideally zero) collisions as possible.
Is my understanding right, please?
That sounds about right to me. The worst hash functor would be defined as:
struct Hash {
    std::size_t operator()(int) const { return 1; } // every key gets the same hash value, so lookups need much, much time
};
and could lead to runtime disasters.
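Plugged into the container, it would look roughly like this (a minimal sketch; every key lands in one bucket, so lookups degrade to a linear scan):

#include <cstddef>
#include <iostream>
#include <unordered_map>

struct Hash {
    std::size_t operator()(int) const { return 1; } // every key gets the same hash value
};

int main()
{
    std::unordered_map<int, int, Hash> m;  // the custom hasher goes in as the third template argument
    for (int i = 0; i < 1000; ++i)
        m[i] = i;
    // all 1000 keys share a single bucket, so every lookup walks that bucket linearly
    std::cout << "bucket holding key 0 has " << m.bucket_size(m.bucket(0)) << " elements\n";
}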
As you know, the data is hashed and thrown into buckets, which should be as small as possible; the standard library sets its maximum load factor to 1.0.
However, unordered containers are often only a little faster than the ordered ones and, because of their complexity and the danger of the runtime blowing up quickly, especially with custom data types, I would rather avoid them. It is therefore usually best not to implement your own hash function.
You can make your own hash function for your data. Assuming you know something about the data, you should be able to produce something that works well for it.
The above gives the worst hash function, so what is the 'best'? The absolute best can only be found if the data is static, e.g. loaded from a file of specific words or the like, and then it is one of the (probably infinitely many) functions that put exactly one item in each slot. For random/user-input data, the best you can do is a function that spreads out inputs that are only slightly different, e.g. 123, 1234, 124, 132 would all give unrelated hash results. A lot of real user data tends to be similar, like the first names of a large population, and getting a good split on small differences almost always gives superior results for real problems. The other thing is to use fields from your data that identify each entry uniquely, or as nearly so as possible. If your data is cars, their paint color is a bad choice here, while the VIN number is a good choice.
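To illustrate that last point, here is a minimal sketch of a hand-rolled hash for such a record (the Car type, its fields, and the made-up VIN are purely illustrative; the point is to hash the uniquely identifying field and ignore the rest):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct Car {
    std::string vin;    // (nearly) unique per car -> a good field to hash
    std::string color;  // shared by many cars    -> a poor field to hash
    bool operator==(const Car& other) const { return vin == other.vin; }
};

struct CarHash {
    std::size_t operator()(const Car& c) const {
        // hash only the field that identifies each entry uniquely
        return std::hash<std::string>{}(c.vin);
    }
};

int main()
{
    std::unordered_map<Car, int, CarHash> mileage;
    Car c{"1HGCM82633A004352", "red"};   // made-up VIN, purely illustrative
    mileage[c] = 120000;
}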
The bottom line is that high-performance hashing is often very hands-on: you need to spend some time understanding the data you have, what your use cases will be, and how best to handle them. The C++ default does pretty well, and starting there to see if it gives good results should always be your first stop. If it does not work well, then you have to roll up your sleeves and get in there with a DIY solution.
Thank you both for your posts, but are you aware that you didn't address my questions!? :) I asked two questions, one in each paragraph of my first post. I hope you'll take a look at them and give me your perspectives on them.
@LukeProducts
It's actually my first time reading through a hash function. Thank you for your code.
the standard library sets its maximum load factor to 1.0.
What do you mean by '1.0' please? Is it one byte, one big size, what?
@jonnin
Thank you for your perspectives. They're useful, although rather hard for me to follow due to my lack of experience with hash functions.
1) You can't be totally sure***. You have to test your code with use-case data that fits what you designed it to do. Most of the time the STL hash is fine. If it is degenerating, you need to write your own hash function. You can detect the problem live in your code, I guess, and re-hash if you find an issue, or write your own hash container that deals with it better (e.g. instead of a linear bucket, use another hash container as the bucket, or rehash if the buckets get above some N, or whatever strategy).
2) I am not sure what happens if you try to use a custom object as the key. I have never done that, since the key is almost always something you want the user to provide to go looking to see if the data is there, or some similar scenario where you have a small piece of the data as the key to search on... ? Sorry, I don't know what to say on this one. I'll bet someone has studied it on the web...
there are 3 scenarios here.
1) you are doing theory. In theory, hashing with linear buckets can deteriorate to O(n) searching. In theory, a binary search tree can become a linked list, quick sort can run in N*N time, and so on. Many of the binary algorithms in and of themselves can collapse this way. In practice, code is put in to make this exceedingly unlikely to happen and we don't generally worry about it much.
2) you are doing real time work where a screw up could be really bad: your code is going into the important parts of an aircraft or medical equipment or a nuclear power plant or something? In that case, you do need to figure out what the worst that can happen will be and ensure it works fine in that situation. You can do that by having it run so fast that when the worst case hits it is still fast enough, or by monitoring the data and fixing the problem when it reaches some threshold (see above on a couple of simple ideas there) or whatever else.
3) you have coded it and can show that the container isn't doing well for your code, and need some alternate idea (probably, just writing your own hash function).
1 is irrelevant to real code, 2 does not sound like a problem today, and 3 ... most likely you are worrying over nothing because of 1)... but if you can demonstrate a problem, show us...
*** unordered map has a lot of diagnostic tools -- load factor, number of buckets, count in buckets, and whatever else. You can tell if the container is messing up, but it's a moving target. It doesn't go from good to bad instantly, it happens over time. It also has tools to fix itself, like rehash(), which costs time but will probably solve your problem if used carefully. You need to monitor it if you care, and that costs as well, or write something that self-corrects and wraps the container. Someone out there has probably done a study on this, too.
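For example, a minimal sketch of that kind of monitoring (the check() helper and the threshold of 8 are made up for the sketch; pick whatever fits your use case):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <unordered_map>

// print the container's diagnostics and flag a degenerating bucket distribution
template <class Map>
void check(const Map& m)
{
    std::size_t worst = 0;
    for (std::size_t b = 0; b < m.bucket_count(); ++b)
        worst = std::max<std::size_t>(worst, m.bucket_size(b));
    std::cout << "size=" << m.size()
              << " buckets=" << m.bucket_count()
              << " load_factor=" << m.load_factor()
              << " worst_bucket=" << worst << '\n';
    if (worst > 8)   // arbitrary threshold, just for this sketch
        std::cout << "looks degenerate: consider a better hash, or rehash()\n";
}

int main()
{
    std::unordered_map<int, int> m;
    for (int i = 0; i < 10000; ++i)
        m[i * 7] = i;
    check(m);
}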
If the data structure needs to re-size its internal array, then it needs to move each element from the old buffer into a new buffer, then delete the old buffer. This scales with the number of elements in the buffer.
If you are very unlucky (or use a very bad hash) you can end up having all the elements being placed in the same bucket. Then when you insert a new element into that bucket it would have to loop through every element to check if the key is already there, because std::unordered_map does not allow duplicates, hence O(n).
It's the same question :)
When, as said above, all the elements end up in bucket 1 for example, searches take O(n) (the original question) and so do inserts, because (as said above) inserts do a search. And it goes back down as buckets increase: if there are 2 buckets it's N/2, then N/3, until it is N/B, which eventually becomes 1 or 0 (more than 1 bucket per data item and a strong hash function) -- when it's working as desired.
Here again, the worst case is largely theoretical. It's just not likely to happen in practice if you are using the default settings. It 'could', but I have never seen it get close enough to feel compelled to monitor it.
Ganado has a good point. If the number of buckets changes, it has to rehash all the keys to find the new buckets to put them in. That must be the reason why std::unordered_multimap::insert is also O(n) in the worst case.
jonnin wrote:
the worst case is largely theoretical
Not if you have a bad person feeding you values that have been carefully chosen to lead to hash collisions.
It's a huge point... it looks like it has vector-like behavior, auto-rehashing when it thinks it should. And like vector, it looks like you can take a stab at minimizing that activity.
Thank you for your posts, but let's narrow down the case to reach a clearer conclusion.
Suppose we have a std::unordered_map<key, value> in our program for which there's a good hash function, already designed and provided for us by the standard library, if the key is of a built-in type, e.g. int, float/double, char or even std::string.
I have two assumptions:
1) In such a case, since we're using the hash function provided by the standard library, and the key is of a built-in type, we can be sure to a good extent that we will always have constant time for insertion, deletion and lookup operations, I assume.
2) My other assumption is that since the bucket is an array, it has a size. When we insert many entries into the map above and go beyond the size of the bucket array, we may have a resize of that array and hence rehashing, therefore linear time complexity. So we may have linear time complexity for insertion regardless of the key type, whether it's int, for example, or a user-defined type.
Which of 1) and 2) is more correct to you, please?
It happens when the number of elements (size()) becomes greater than max_load_factor() * bucket_count(), i.e. when load_factor() would exceed max_load_factor().
max_load_factor() specifies an upper limit for how many elements you want per bucket on average. The default value is 1.0 (i.e. 1 element per bucket).
Note that this does not take actual collisions into account, so it won't automatically rehash just because you happened to be unlucky and got many collisions. All that matters is the number of elements.
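You can watch this happen; a small sketch that prints a line each time the bucket count jumps while inserting (the exact sizes and bucket counts you see depend on the implementation):

#include <cstddef>
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    std::size_t buckets = m.bucket_count();
    for (int i = 0; i < 100; ++i) {
        m[i] = i;
        if (m.bucket_count() != buckets) {   // a rehash happened on this insert
            buckets = m.bucket_count();
            std::cout << "after " << m.size() << " elements: bucket_count=" << buckets
                      << ", load_factor=" << m.load_factor() << '\n';
        }
    }
    // note: load_factor() never exceeds max_load_factor() (1.0 by default)
}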
---
One thing I still don't understand is: if insert() sometimes does a rehash, how come insert() is only O(n) in the worst case when rehash() is O(n²) in the worst case?
Suppose we have a std::unordered_map<key, value> in our program for which there's a good hash function, already designed and provided for us by the standard library, if the key is of a built-in type, e.g. int...
The standard hash for integers is often implemented by simply casting the value to a std::size_t. Both libstdc++ (GCC's implementation) and libc++ (Clang's own implementation) work like this.
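You can check what your own implementation does with a couple of lines (on libstdc++ and libc++ the values typically come back unchanged; the standard doesn't require that):

#include <functional>
#include <iostream>

int main()
{
    std::hash<int> h;
    for (int key : {0, 1, 42, 1109, 2218})
        std::cout << key << " hashes to " << h(key) << '\n';
}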
since we're using the hash function provided by the standard library, and the key is of a built-in type, we can be sure to a good extent that we will always have constant time for insertion, deletion and lookup operations, I assume.
You could still run into problems. But under normal usage where there is no pattern in the key values you're unlikely to run into problems.
frek wrote:
My other assumption is that since the bucket is an array, it has a size.
The buckets are stored in an array but each bucket is essentially a linked list.
frek wrote:
When we insert many entries into the map above and go beyond the size of the bucket array, we may have a resize of that array and hence rehashing, therefore linear time complexity.
To calculate the bucket where an element should go they use Hash(key) % bucket_count() (I don't know if the standard says so but it seems to be how it's usually implemented). For example, if the number of buckets is 10 and the hash values of three elements are 1, 15 and 23, they would go into buckets 1, 5 and 3 respectively. The important thing to note is that the bucket each element belongs to depends on the number of buckets. If the number of buckets changes, then it needs to go through each element again to calculate its new bucket.
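Here is a small sketch that compares that formula against what the container actually reports (on libstdc++ and libc++ the two columns agree; other implementations could map hash values to buckets differently):

#include <functional>
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    for (int key : {1, 15, 23})
        m[key] = 0;

    for (int key : {1, 15, 23})
        std::cout << "key " << key
                  << ": Hash(key) % bucket_count() = " << std::hash<int>{}(key) % m.bucket_count()
                  << ", m.bucket(key) = " << m.bucket(key) << '\n';
}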
frek wrote:
So we may have linear time complexity for insertion regardless of the key type, whether it's int, for example, or a user-defined type.
Yes. But if you don't have a lot of hash collisions the average insertion time should still be pretty good.
You could still run into problems
But under normal usage where there is no pattern in the key values you're unlikely to run into problems.
1) I still don't clearly understand what issues may cause an std::unordered_map with a built-in type key to take linear time complexity for insertion, deletion or lookup operations. Would you explain it in simpler language, please?
2) Also, what do you mean by "pattern in the key", please?
If you know how many elements you're going to have at most, you can avoid the rehashing altogether by using reserve.
3) So it can be a good strategy to avoid rehashing, which is actually expensive (n²). That is, if the number of entries is known, we can use the reverse function to extend the bucket array before starting to put pairs into it. Right?
4) By the way, is the default size of the bucket array known?
1) I still don't clearly understand what issues may cause an std::unordered_map with a built-in type key to take linear time complexity for insertion, deletion or lookup operations. Would you explain it in simpler language, please?
Pay special attention to the upper output window (for GCC/libstdc++):
Number of buckets: 1109
Bucket 0 contains 1000 elements.
All the elements ended up in one bucket. This is because I intentionally made the keys a multiple of 1109 which happens to be the same as the number of buckets.
It's easy to see why if you remember the formula for finding the bucket that I showed earlier:
Hash(key) % bucket_count()
If Hash(key) just returns the key and all keys are exactly a multiple of bucket_count() apart from each other then all of them will end up in the same bucket.
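The code behind those output windows isn't reproduced here, but a minimal sketch along the same lines looks like this (1109 matches libstdc++'s bucket count for 1000 elements, as quoted above; other implementations pick different counts, so the multiplier would need adjusting):

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    for (int i = 0; i < 1000; ++i)
        m[i * 1109] = i;   // every key is a multiple of 1109

    std::cout << "Number of buckets: " << m.bucket_count() << '\n';
    std::cout << "Bucket 0 contains " << m.bucket_size(0) << " elements.\n";
}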
If you instead pick a multiple that matches libc++'s bucket count, you'll see that the lower output window (for Clang/libc++) has the problem instead, for the same reason.
Note that if you keep adding more elements until the number of buckets is increased the problem is likely to go away: https://godbolt.org/z/zvaa4nxGr
2) Also, what do you mean by "pattern in the key", please?
I admit this is a bit "hand-wavy". What I meant was something like the example above where every key was a multiple of some number. I'm no mathematical genius who can tell you whether other "patterns" are likely to lead to problems. I guess it depends on the hashing function and whether a different method of mapping hash values to buckets is used than what I showed above.
3) So it can be a good strategy to avoid rehashing, which is actually expensive (n²). That is, if the number of entries is known, we can use the reverse function to extend the bucket array before starting to put pairs into it. Right?
Right, although I wouldn't expect huge performance improvements in practice.
I'm still confused about whether it's O(n²) or O(n) (see my previous post),
and it's "reserve", not "reverse".
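For completeness, using it looks like this (a minimal sketch):

#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    m.reserve(100);   // make the bucket array big enough for at least 100 elements up front

    for (int i = 0; i < 100; ++i)
        m[i] = i;     // no rehash is triggered during these inserts
}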
4) By the way, is the default size of the bucket array known?
It seems to depend on the implementation.
GCC (libstdc++) says 1.
Clang (with libc++) says 0. https://godbolt.org/z/4n3PaE9Gd
Update: Strangely enough GCC doesn't seem to use that one bucket for anything because when I insert one element bucket_count() is immediately bumped up to 13. https://godbolt.org/z/dhe9Tb745
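If you want to check your own standard library, something like this is enough (the numbers printed will differ between implementations):

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    std::cout << "default bucket_count(): " << m.bucket_count() << '\n';
    m[1] = 1;
    std::cout << "after one insert:       " << m.bucket_count() << '\n';
}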
All the elements ended up in one bucket. This is because I intentionally made the keys a multiple of 1109 which happens to be the same as the number of buckets.
1) Yes, in that situation you've used the built-in hash function and a built-in data type, yet there are many collisions in the way GCC executes it, which by the way is unlike Clang and MSVC! I thought in such a situation we're very unlikely to have collisions even if there might be a pattern in the key! :(
2) Moreover, as you've stated at the end of your post above, the default size of the bucket array is something less than 10 in all three compilers above, so how did you figure out the default size is 1109 there!?
I guess it depends on the hashing function and whether a different method of mapping hash values to buckets is used than what I showed above.
3) I guess the compilers don't all implement the same hash function, and that's why GCC behaves differently from Clang and MSVC for the patterned-key example. Agree?
4) One thing about reserving that I still believe is that we can reserve the same size as the number of elements that are going to be put into the map, and it's much better in terms of performance. For example, here https://godbolt.org/z/vo76o4axa GCC does 4 rehashes (O(4n²)) and Clang does 7 (O(7n²)) for 100 entries, but when we reserve (O(1n²)) that amount before the loop, m.reserve(100);, no rehashing is required.
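A sketch of one way to take such a count (treating any change in bucket_count() as one rehash; the count_rehashes helper is just for the sketch, and the exact numbers depend on the standard library):

#include <cstddef>
#include <iostream>
#include <unordered_map>

// count how often the bucket array changes size while inserting n elements
std::size_t count_rehashes(std::size_t n, bool reserve_first)
{
    std::unordered_map<int, int> m;
    if (reserve_first)
        m.reserve(n);
    std::size_t rehashes = 0;
    std::size_t buckets = m.bucket_count();
    for (std::size_t i = 0; i < n; ++i) {
        m[static_cast<int>(i)] = 0;
        if (m.bucket_count() != buckets) {
            ++rehashes;
            buckets = m.bucket_count();
        }
    }
    return rehashes;
}

int main()
{
    std::cout << "without reserve: " << count_rehashes(100, false) << " rehashes\n";
    std::cout << "with reserve:    " << count_rehashes(100, true) << " rehashes\n";
}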
I thought in such a situation we're very unlikely to have collisions even if there might be a pattern in the key! :(
I think you're quite unlikely to run into these issues by chance, especially with many elements, which is when this really matters. Note that both libstdc++ and libc++ choose the number of buckets to be prime, which I also think has some good properties for reducing the chance of collisions.
frek wrote:
how did you figure out the default size is 1109 there!?
I just inserted 1000 elements and looked at the bucket_count().
frek wrote:
I guess the compilers don't all implement the same hash function, and that's why GCC behaves differently from Clang and MSVC for the patterned-key example. Agree?
I don't know about MSVC but Clang (libc++) seems to work very similar to GCC (libstdc++). The main difference seems to be in the choice of bucket counts.
If you wonder why I keep mentioning libc++ and libstdc++ it's because Clang can be used with both libc++ and libstdc++. If both GCC and Clang used libstdc++ (which is the default on Compiler Explorer) then std::unordered_map would work exactly the same.