
JIRA ticket CLJ-1065 has been created for this.

See this thread on the Clojure Google group: https://groups.google.com/forum/?fromgroups=#!topic/clojure/AG667ACBd3I

Note especially Chas Emerick's detailed analysis of how we arrived at the current state, posted Aug 5, 2012, and Mark Engelberg's argument posted Sep 4, 2012 in favor of reverting to the older, pre-exception-throwing behavior; both are duplicated below.

 


RH Feedback Zone

Bugs

  • sorted-set duplicate handling behavior differs from hash-set (which throws)
  • sorted-map duplicate handling behavior differs from hash-map (which throws)

Set 'Problems'

  • set literals throw on duplicate keys
    • Is it a user error?
      • open question
      • is there a purpose to writing #{42 42}?
        • must every reader deal with that?
      • if yes, then checking for dupes might be penalizing correct programs (perf-wise)
        • not checking means maybe creating an invalid object
      • if hash-set didn't throw, you would have an alternative for entries not known to be unique
    • arguably there should be no problem, since conflict free
      • yet, a user seeing
        • #{a b}
      • should expect a set with 2 entries
    • this behavior is just an artifact of sharing implementation with map
  • hash-set throws on duplicate keys
    • same reasons
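The non-throwing alternative alluded to above can be sketched as a hypothetical constructor (the name `hash-set*` is illustrative, not an existing Clojure function) that builds the set by repeated conj, so duplicates are simply absorbed:

```clojure
;; Hypothetical non-throwing set constructor: builds the set by
;; repeated conj, so duplicate entries are silently absorbed
;; rather than throwing IllegalArgumentException.
(defn hash-set* [& entries]
  (reduce conj #{} entries))

(hash-set* 42 42)    ;=> #{42}
(hash-set* 1 2 2 3)  ;=> #{1 2 3}
```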

Map 'Problems'

  • map literals throw on duplicate keys
    • this is not conflict-free, as values for same key might differ
    • is it a user error?
      • Yes
        • This is, inarguably and apparently, a bad map:
          • {:a 1 :b 2 :a 3}
        • Using keys not known to be unique in a literal is bad form
          • a user seeing this:
            • {a 1 b 2 c 3}
          • should expect a map with 3 entries
      • if hash-map did not throw, you would have an alternative when keys are not known to be unique
    • auto-resolving implies an order-of-consideration for map literal entries
      • and there should not be one
      • a complete semantic mess
    • non-resolving alternatives:
      • throw
        • checking for dupes might be penalizing correct programs
      • don't check
        • means maybe creating an invalid object
  • hash-map throws on duplicate keys
    • here there might be an implicit order due to argument order
    • could make repeated assoc promise
      • that's the behavior of sorted-map
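The repeated-assoc promise mentioned for sorted-map can be sketched as a hypothetical non-throwing map constructor (the name `hash-map*` is illustrative): the last value supplied for a duplicate key wins, exactly as with assoc:

```clojure
;; Hypothetical non-throwing map constructor: as-if-by-repeated-assoc,
;; so the last value supplied for a duplicate key wins.
(defn hash-map* [& kvs]
  (reduce (fn [m [k v]] (assoc m k v))
          {}
          (partition 2 kvs)))

(hash-map* :a 1 :b 2 :a 3)  ;=> {:a 3, :b 2}
```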

Opinions

  • I don't think there is any merit whatsoever to supporting duplicates, evident or not, in literal sets and maps
    • such programs are at worst broken and at best anti-social
    • so, what should happen?
      • there are read-time and runtime considerations
    • first step - declare such things are user errors
    • second step - decide on a reporting strategy
  • Don't penalize correct programs!
    • unchecked array-based map constructors are a critical way to competitive perf for object-like use of maps
  • I think hash-set and hash-map should not throw on dupes
    • and that hash/sorted-set/map should make an explicit as-if-by-repeated assoc promise
  • If you think a month is too long to get a response to your needs, from a bunch of very busy volunteers, you need to chill out
    • just because you decided to bring it up doesn't mean everyone else needs to drop what they are doing
  • This page was useful, thanks.

Recommendations

  1. hash/sorted-set/map should make an explicit as-if-by-repeated-conj/assoc promise
    1. thus will never throw, and be consistent
    2. if you don't know that you have unique keys, use these!
  2. Document "Duplicate keys in map/set literals, evident or not, are user errors"
    1. saying that is not the same as guaranteeing they will generate exceptions!
    2. generate exceptions for now
    3. eventually move the (non-reader) check into debug mode, or otherwise provide runtime control
  3. Restore the fastest path possible for those cases where the keys are compile-time detectable unique constants
    1. high perf for known correct programs at least
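For reference, several construction paths in 1.4 already behave as-if-by-repeated-conj/assoc and are the safe choice when keys are not known to be unique (a sketch based on the current behavior shown further down this page):

```clojure
;; Construction paths that already absorb duplicates today,
;; with the last value for a duplicate key winning.
(reduce conj #{} [28 28])  ;=> #{28}
(set [28 28])              ;=> #{28}
(assoc {} :a 1 :a 3)       ;=> {:a 3}
(sorted-map :a 1 :a 3)     ;=> {:a 3}
```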

end RH Feedback Zone



Current behavior of Clojure 1.4.0

;; Sets
user=> #{28 28}
IllegalArgumentException Duplicate key: 28  clojure.lang.PersistentHashSet.createWithCheck (PersistentHashSet.java:68)

;; It is when set literals contain variables that are unexpectedly equal
;; that some do not want an exception thrown.
user=> (def a 28)
#'user/a
user=> (def b 28)
#'user/b
user=> #{a b}
IllegalArgumentException Duplicate key: 28  clojure.lang.PersistentHashSet.createWithCheck (PersistentHashSet.java:68)

;; This is one way to construct a set that allows duplicates.
user=> (set [a b])
#{28}


;; Maps

;; Similar to sets, except that only keys must be distinct.
;; However, in this case the construction functions array-map
;; and hash-map also disallow duplicate keys, whereas
;; sorted-map permits them.

user=> {a 5 b 7}
IllegalArgumentException Duplicate key: 28  clojure.lang.PersistentArrayMap.createWithCheck (PersistentArrayMap.java:70)
user=> (array-map a 5 b 7)
IllegalArgumentException Duplicate key: 28  clojure.lang.PersistentArrayMap.createWithCheck (PersistentArrayMap.java:70)
user=> (hash-map a 5 b 7)
IllegalArgumentException Duplicate key: 28  clojure.lang.PersistentHashMap.createWithCheck (PersistentHashMap.java:92)
user=> (sorted-map a 5 b 7)
{28 7}

;; assoc is one way to create a map that silently eliminates duplicate keys
user=> (assoc {} a 5 b 7)
{28 7}
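For completeness, into and merge are two more ways to build a map without throwing when keys collide (continuing the a/b defs above; later values win):

```clojure
;; Non-throwing map construction from possibly-duplicate keys.
(def a 28)
(def b 28)

(into {} [[a 5] [b 7]])  ;=> {28 7}
(merge {a 5} {b 7})      ;=> {28 7}
```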

 

Arguments for changing it back to never throwing exceptions on duplicates

1. "It's a bug that should be fixed."  The change to throw-on-duplicate behavior for sets in 1.3 was a breaking change that causes a runtime error in previously working, legitimate code.

Looking through the history of the issue, one can see that no one was directly asking for throw-on-duplicate behavior.  The underlying problem was that array-maps with duplicate keys returned nonsensical objects; surely it would be more user-friendly to just block people from creating such nonsense by throwing an error.  This logic was extended to other types of maps and sets.

It's not entirely clear the degree to which the consequences of these changes were considered, but it seems likely that there was an implicit assumption that throw-on-duplicate behavior would only come into play in programs with some sort of syntactic error, when in fact it has semantic implications for working programs.  When a new "feature" causes unintentional breakage in working code, this is arguably a bug and needs to be reconsidered.  

2. "The current way of doing things is internally inconsistent and therefore complex."

(def a 1)
(def b 1)

(set [a b])       -> good
(hash-set a b)    -> error
#{a b}            -> error
(sorted-set a b)  -> good
(into #{} [a b])  -> good

The cognitive load from having to remember which constructors do what is a bad thing.  

3. "Current behavior conflicts with the mathematical and intuitive notion of a set."

In math, {1, 1} = {1}.  In programming, sets are used as a means to eliminate duplicates.

Arguments for leaving things as is

Now let's summarize the arguments that have been raised here in support of the status quo.

1. "Changing everything to throw-on-duplicate would be just as logically consistent as changing everything to use-last-in."

True, but that doesn't mean both approaches would be equally useful.  Gracefully absorbing duplicates is central to the idea of a set, so at least one construction method that does so is essential.  On the other hand, we can get along just fine without sets throwing errors on duplicate values.  So if you're looking for consistency, there's really only one practical option.

2.  "I like the idea that Clojure will protect me from accidentally making this kind of syntax error."

Clojure, as a dynamically typed language, is unable to protect you from the vast majority of data-entry syntax errors you're likely to make.

Let's say you want to type in {:apple 1, :banana 2}.  Even if Clojure can catch your mistake if you type {:apple 1, :apple 2}, there's no way it's ever going to catch you if you type {:apple 1, :banano 2}, and frankly, the latter error is one you're far more likely to make.

This is precisely why there's little evidence that anyone was asking for this kind of syntax error protection, and little evidence that anyone has benefited significantly from its addition -- its real-world utility is fairly minimal and dwarfed by the other kinds of errors one is likely to make.

3.  "Maybe we can do it both ways."

It's laudable to want to make everyone happy.  The danger, of course, is that such sentiment paints a picture that it would be a massive amount of work to please everyone, and therefore, we should do nothing.  Let's be practical about what is easily doable here with the greatest net benefit.  The current system has awkward and inconsistent semantics with little benefit.  Let's focus on fixing it. The easiest patch -- revert to 1.2 behavior, but bring array-map's semantics into alignment with the other associative collections.
