
hash map worst case insertion/search is O(N) so this is potentially worst case O(N^2). IIRC a typical implementation would be more likely to be O(N log N) for some high log factor in the limit, but still worse than O(ND). And this is before we account for the fact that hashing a line probably has an implicit factor of D in it.


Hmm -- the assumed time complexity of get/put operations for well-built hash functions is O(1), is it not? Sure, the absolute worst case means every operation collides. Hash functions themselves are O(1)... I'm not sure what you mean by "implicit factor of D"? Like hashing is going to inflate the runtime by a small factor basically equivalent to the edit distance in Myers' approach?
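For concreteness, here's roughly how the per-line hashing cost tends to be paid in practice (a sketch, not from any particular diff implementation): hash/intern every line once up front, so the O(total characters) cost is a one-time preprocessing pass, and the diff algorithm then compares small integer ids instead of strings.

```python
# Sketch: intern lines to integer ids so line comparisons in the diff
# loop become integer comparisons. The dict is shared across both texts
# so identical lines get identical ids. Names are illustrative.
def intern_lines(texts):
    ids = {}
    return [[ids.setdefault(line, len(ids)) for line in t.splitlines()]
            for t in texts]

a, b = intern_lines(["foo\nbar\nbaz", "foo\nbaz"])
# a == [0, 1, 2], b == [0, 2]: equal lines share an id
```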

I've been looking through a little bit more literature and it's clear to me that my implementation as it stands is not equipped to solve the Longest Common Substring problem [0]. Right now it's too greedy. I think with a tweak it could.

I think I'm mostly just surprised because when I sat down at my computer today, I had the impression that diffing was a "hard problem" -- to be able to implement something that (apparently) works in a short timeframe made me feel as though I did something wrong. (I have this code running in a staging environment now with a code editor, and it seems to be working smoothly and as expected.)

[0] https://en.wikipedia.org/wiki/Longest_common_substring_probl...


Replying here because I can't edit my original comment. But I stand by what I said -- in the pathological case all of your keys map to the same bucket, and you hit O(N^2) or O(N log N) behaviour (depending on what exactly happens within your bucket).

I appreciate that in the typical case this won't happen, but big O denotes limiting behaviour, which in this case is O(N^2) or O(N log N) depending on your hash map implementation.
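A quick sketch of that pathological case (CPython specifics, but the effect is general): a key type whose hash is constant forces every entry to collide, so building a dict of N keys does O(N) comparisons per insert, O(N^2) total.

```python
# Every BadKey hashes to the same value, so all entries land in the same
# probe sequence; each insert must compare against existing colliding
# keys. Lookups still return correct results -- just slowly.
class BadKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # constant hash: every key collides

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

table = {BadKey(i): i for i in range(1000)}  # quadratic, not linear
assert table[BadKey(500)] == 500
```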


My understanding is that from a purely theoretical point of view you can call the operations Θ(1), but not O(1) -- since the worst case must account for the pathological case.


by definition Θ(1) implies O(1).

A sandwich means you have a slice of bread on the upper side.


I have just been interviewing at several companies in London that run highly technical interviews, and about 80% of the interviewers who asked questions about hash maps expected me to assume that hashmaps imply constant-time lookups.

People tend to dismiss the statistics related to hash collisions because they think they are absolutely irrelevant to “real-life scenarios”.


This is because hash tables resize when they are filled with too many elements. Assuming a suitably good hash function, this gives O(1) lookup w.h.p. and insertion stays O(1) amortized. The mathematics of hash collision probabilities is factored into the resizing thresholds.
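A toy sketch of the resizing argument (illustrative names, not any particular library's implementation): when the load factor crosses a threshold, double the bucket count and re-insert. Any single insert might pay for a full rebuild, but the total rebuild work over n inserts is O(n), so inserts are O(1) amortized.

```python
# Minimal chained hash table with load-factor-triggered resizing.
class ToyHashTable:
    def __init__(self):
        self.buckets = [[] for _ in range(8)]
        self.count = 0

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count > 0.75 * len(self.buckets):  # load factor threshold
            self._resize()

    def _resize(self):
        new = [[] for _ in range(2 * len(self.buckets))]
        for bucket in self.buckets:
            for k, v in bucket:
                new[hash(k) % len(new)].append((k, v))
        self.buckets = new

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ToyHashTable()
for i in range(100):
    t.put(i, i * i)
assert t.get(7) == 49
```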

Many standard library implementations of hash tables (such as rust's) also include special hash functions which are salted with random values to prevent DoS attacks which create large numbers of hash collisions.
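The salting idea, sketched (this mirrors the spirit of CPython's string-hash randomisation and Rust's RandomState, but the mixing function here is a deliberately simplified FNV-1a variant, not what either actually uses):

```python
import secrets

# Chosen once per process: an attacker who can't observe the salt can't
# precompute a set of keys that all collide.
_SALT = secrets.randbits(64)

def salted_hash(data: bytes) -> int:
    h = _SALT
    for b in data:
        h = ((h ^ b) * 0x100000001b3) % 2**64  # FNV-1a style mixing
    return h

# Deterministic within a process, different across processes.
assert salted_hash(b"hello") == salted_hash(b"hello")
```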


If you are really paranoid, you can even use a salted cryptographic hash function. But the constant factors for those are usually worse than for simpler salted hashes, so they are not worth it for hashtables.

Your fallback for collisions could also be something with O(log size_of_bucket) runtime, instead of linked lists. But again, when you don't have many collisions, that's going to be slower than something simpler.

(I half remember a result that if you tune your parameters right, you can get away with a hashtable that only has enough room to store one element in each slot; and if a collision happens, you replace that element with a tombstone, and put the elements involved in a single global linked list.

Basically, for that to work you need to keep your total number of collisions smaller than a constant with high probability.)


Once your hashes collide, you have no other mechanism to use besides equality. How can you use something other than a list in order to find the actual value? This being for the general case where items don't have an "order/rank/sort" property between them.


Most of the time, when we talk about collisions we mean that two different keys get hashed to the same bucket.

What mechanisms you have available depends on your implementation.

For example Cuckoo Hashing (https://en.wikipedia.org/wiki/Cuckoo_hashing) relies on having two different hash functions available.

And yes, for having something like a balanced tree, having a comparison would be useful.

In general, most hash functions work by consuming something like a stream of bits. So for your implementation it makes a lot of sense for the datatypes you want to use as keys to export a way to convert themselves into that stream, and leave the actual hashing down to a fixed-size value as a detail of your hashtable.

That way you can eg do fall-back comparisons directly on stream of bits (including for equality). Or you can transparently support multiple hash functions.

Even in languages where the hashable interface works by giving you a method to spit out eg a 64 bit number only, you still have to map that number to one of your buckets. So for your fall-back, you can choose a different mapping.
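Here's roughly what that split looks like, sketched in Python with illustrative names (Rust's Hash/Hasher traits work along these lines): keys know how to feed their bytes to a hasher, and the table owns the choice of hash function.

```python
import hashlib

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def feed(self, hasher):
        # The key exposes its bytes; it does not pick the hash function.
        hasher.update(self.x.to_bytes(8, "little", signed=True))
        hasher.update(self.y.to_bytes(8, "little", signed=True))

def bucket_of(key, nbuckets, algorithm="blake2b"):
    # The table picks the algorithm -- and can swap it transparently.
    hasher = hashlib.new(algorithm)
    key.feed(hasher)
    return int.from_bytes(hasher.digest()[:8], "little") % nbuckets

primary = bucket_of(Point(1, 2), 64)
fallback = bucket_of(Point(1, 2), 64, algorithm="sha256")
```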


My point was "what else can you use besides a linked list to store hash-colliding values?"

If you have a second (or a third, or a fourth...) hashing algorithm, then make it hash tables all the way down. At the end, you still need some data structure to store hash-colliding values. And if so, what other structure could you possibly use besides a list (linked-, array-, etc.) ?


> If you have a second (or a third, or a fourth...) hashing algorithm, then make it hash tables all the way down. At the end, you still need some data structure to store hash-colliding values.

Why? You can have just two layers: the primary hash table, with buckets that are each a small secondary hashtable. If there's a collision in a bucket's hashtable, pick a new random hash function for that bucket and re-hash everything in the bucket.

If that fails after trying a few times, pick a new random hash for the primary hash table and consider resizing.

I bet you can make that scheme workable in O(1) expected amortised time for inserts and lookups.
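Sketched out, that two-layer scheme is essentially FKS perfect hashing; all names here are illustrative, and this static version skips the resizing logic:

```python
import random

def build_two_level(keys, nbuckets=None):
    nbuckets = nbuckets or max(1, len(keys))
    buckets = [[] for _ in range(nbuckets)]
    for k in keys:
        buckets[hash(k) % nbuckets].append(k)
    tables = []
    for bucket in buckets:
        # Quadratic space per bucket makes a collision-free salt easy
        # to find, so the expected number of re-draws is O(1).
        size = len(bucket) ** 2 or 1
        while True:
            salt = random.getrandbits(64)
            slots = [None] * size
            for k in bucket:
                i = hash((salt, k)) % size
                if slots[i] is not None:
                    break  # collision: re-draw this bucket's salt only
                slots[i] = k
            else:
                break  # every key landed in its own slot
        tables.append((salt, slots))
    return nbuckets, tables

def lookup(structure, key):
    nbuckets, tables = structure
    salt, slots = tables[hash(key) % nbuckets]
    return slots[hash((salt, key)) % len(slots)] == key

s = build_two_level(list(range(100)))
assert lookup(s, 5)
```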

Cuckoo hashing (https://en.wikipedia.org/wiki/Cuckoo_hashing) is a related idea: you just have two hash functions. If you ever get three elements that collide under both hash functions, you re-hash your entire table.

(From Wikipedia:)

> When a new key is inserted, and one of its two cells is empty, it may be placed in that cell. However, when both cells are already full, it will be necessary to move other keys to their second locations (or back to their first locations) to make room for the new key. A greedy algorithm is used: The new key is inserted in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found, completing the algorithm. However, it is possible for this insertion process to fail, by entering an infinite loop or by finding a very long chain (longer than a preset threshold that is logarithmic in the table size). In this case, the hash table is rebuilt in-place using new hash functions: [...]
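A minimal sketch of that displacement loop plus the rebuild-on-failure step (illustrative, not a production table):

```python
import random

class CuckooTable:
    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]
        # Two salted hash functions, one per table.
        self.salts = [random.getrandbits(64), random.getrandbits(64)]

    def _slot(self, which, key):
        return hash((self.salts[which], key)) % self.size

    def insert(self, key):
        if self.contains(key):
            return
        which = 0
        for _ in range(32):  # bounded displacement chain
            i = self._slot(which, key)
            if self.tables[which][i] is None:
                self.tables[which][i] = key
                return
            # "Kick out" the occupant; it retries in its other table.
            key, self.tables[which][i] = self.tables[which][i], key
            which = 1 - which
        self._rehash(key)  # chain too long: rebuild with fresh salts

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None]
        self.__init__(self.size * 2)
        for k in keys + [pending]:
            self.insert(k)

    def contains(self, key):
        # Lookup only ever probes two slots: O(1) worst case.
        return any(self.tables[w][self._slot(w, key)] == key
                   for w in (0, 1))

t = CuckooTable()
for i in range(50):
    t.insert(i)
assert t.contains(25)
```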

You say:

> And if so, what other structure could you possibly use besides a list (linked-, array-, etc.) ?

Any datastructure you feel like. You can also use a balanced search tree, if you want to. See eg https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_w...


But in the pathological case where all the inputs collide to the same hash, I don't see how you avoid degrading to at least O(N log N), even with resizing -- you must account for the resizing as an O( g(N) ) operation.


Well, it depends on your hashtable implementation, and where your data comes from.

Python's old hashtables used to be good enough for most practical uses; but when the input data was controlled by an adversary, it was easy to run a Denial of Service attack.

See https://bugs.python.org/issue13703 They started randomising the hash function at runtime.

Hashtables can be made to run in O(1) with arbitrarily high probability for insert and lookup, if you are willing to be very careful with the implementation, and willing to endure slightly worse constant factors in some cases. (Of course, it also depends on what model of computation you are using. At some scale your keys have to grow like O(log n) of the number of elements, just so that you can distinguish all elements. But we usually abstract away from that.)

But yes, just programming any old hashtable doesn't magically give you O(1) performance, you need to work for it.


Another common one I see is conflating constant-time with no-time. It's near enough to true for a single lookup, but do 30,000 individual lookups in a tight loop and you've got some real performance issues that are relatively invisible.
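A small illustration of constant-time vs no-time: both loops below are O(1) per access, but hoisting the lookup out of the tight loop removes 30,000 hash-and-probe operations (names are just for illustration):

```python
config = {"scale": 3}

def looped():
    total = 0
    for i in range(30_000):
        total += i * config["scale"]  # hash + probe on every iteration
    return total

def hoisted():
    total = 0
    scale = config["scale"]  # one lookup, reused
    for i in range(30_000):
        total += i * scale
    return total

assert looped() == hoisted()  # same answer, fewer dict operations
```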


Hash tables also effectively randomize your access patterns.

That can be bad for memory locality, especially if you outgrow your main memory (or just various smaller caches before that.) All without changing the O(1) w.h.p. asymptotics.


Usually you pay the memory locality price twice too, once for the index and once for the actual data, especially in OO languages. Thankfully fitting in main RAM has never been an issue for me, but I've seen the results when some of those fetches are in a database or external memory cache.


One standard example I use is finding duplicates in a big collection of items, either via a hashtable or via merge sort. The collection would be big enough not to fit into RAM, so you get swapping onto a hard drive.

Merge sort will mostly just work, because all its memory access patterns are sequential. Hashtables would need to pay for seeks all the time.

So the O(n log n) mergesort would be faster than the O(n) hashtable approach in this case.

(There are ways to improve the absolute running time further compared to these naive approaches.)
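The sort-based duplicate finder, sketched in-memory for brevity (the same neighbour-comparison structure carries over to an external merge sort over chunks):

```python
# After sorting, any duplicates sit next to each other, so one
# sequential pass comparing neighbours finds them all. Both the sort's
# merges and the final pass are sequential -- friendly to disks.
def find_duplicates(items):
    duplicates = set()
    ordered = sorted(items)  # O(n log n), merge-sort friendly
    for a, b in zip(ordered, ordered[1:]):
        if a == b:
            duplicates.add(a)
    return duplicates

assert find_duplicates([3, 1, 4, 1, 5, 9, 2, 6, 5]) == {1, 5}
```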



