Dict sharing vs. duplication - TCL
This is a discussion on Dict sharing vs. duplication - TCL ; Hi,
Now that 8.5 is out thanks to intensive work from our beloved
Conspicuous Tcl Illuminati, I'm getting prepared to use dicts
really...
Looking at the dict manpage, I first discovered that no obvious naming
scheme tells in-place operations from ...
-
Dict sharing vs. duplication
Hi,
Now that 8.5 is out thanks to intensive work from our beloved
Conspicuous Tcl Illuminati, I'm getting prepared to use dicts
really...
Looking at the dict manpage, I first discovered that no obvious naming
scheme tells in-place operations from functional ones. Of course
that's not worse than lappend/lreplace, and reusing the same names
makes sense !
Then, I wondered about what copy-on-write optimization became in dict
world.
Indeed, if $x is a dict value,
set y $x
copies it, in an efficient way (by reference, incrementing just the
refcount).
Now, the dict value is "shared" (refcount>=2). So as soon as someone
touches it:
dict incr x foo
it gets duplicated, so that $x and $y can have separate destinies.
That's copy-on-write.
Now suppose I have a huge dict, stored in some global, and that I pass
its value back and forth to various routines. Its refcount will go up
with each level of proc call where it is passed as an argument, since
the argument is a local holding a bona fide reference. Suddenly, under
a stack depth of (say) three, I decide to update the dict, eg with a
[dict incr] on the global variable. Won't the huge dict be immediately
duplicated ?
If the above story holds true, then passing a dict value is not a
direct replacement for passing an array name for upvarring: it has
stringent restrictions, and careful refcounting is needed as soon as r/
w operations occur...
Disclaimer: this is no criticism. This semantics is traditional[*] for
all first-class values in Tcl; however if again it's the case, it
should just be advertised so that people don't get caught by
surprise...
Or am I missing something ?
-Alex
[*] But is it necessary ? What would break in Tcl if we allowed
mutable shared values, eg lists or even strings ?
Granted, one would need to backpropagate strep invalidation to all
referers. This would have a cost. But are there other drawbacks ?
-
Re: Dict sharing vs. duplication
Alexandre Ferrieux schrieb:
> Hi,
>
> Now that 8.5 is out thanks to intensive work from our beloved
> Conspicuous Tcl Illuminati, I'm getting prepared to use dicts
> really...
>
> Looking at the dict manpage, I first discovered that no obvious naming
> scheme tells in-place operations from functional ones. Of course
> that's not worse than lappend/lreplace, and reusing the same names
> makes sense !
>
> Then, I wondered about what copy-on-write optimization became in dict
> world.
> Indeed, if $x is a dict value,
>
> set y $x
>
> copies it, in an efficient way (by reference, incrementing just the
> refcount).
> Now, the dict value is "shared" (refcount>=2). So as soon as someone
> touches it:
>
> dict incr x foo
>
> it gets duplicated, so that $x and $y can have separate destinies.
> That's copy-on-write.
>
> Now suppose I have a huge dict, stored in some global, and that I pass
> its value back and forth to various routines. Its refcount will go up
> with each level of proc call where it is passed as an argument, since
> the argument is a local holding a bona fide reference. Suddenly, under
> a stack depth of (say) three, I decide to update the dict, eg with a
> [dict incr] on the global variable. Won't the huge dict be immediately
> duplicated ?
The Tcl_Obj representing your 'foo' entry at least needs to be
duplicated and incremented. And as this changes the whole dict, you also
get a duplicate of the hash table structure.
DictIncrCmd in generic/tclDictObj.c tells you the details about this,
there is some optimization in place so that huge string reps aren't
copied. The code in DupDictInternalRep then tells you that the other
values aren't copied, but only refcount increased.
So yes, there is a copy, but its kept close to the minimum possible. You
need a new hashtable struct...
Michael
-
Re: Dict sharing vs. duplication
On Dec 20, 3:42 pm, Michael Schlenker <schl...@uni-oldenburg.de>
wrote:
> Alexandre Ferrieux schrieb:
>
> > Hi,
>
> > Now that 8.5 is out thanks to intensive work from our beloved
> > Conspicuous Tcl Illuminati, I'm getting prepared to use dicts
> > really...
>
> > Looking at the dict manpage, I first discovered that no obvious naming
> > scheme tells in-place operations from functional ones. Of course
> > that's not worse than lappend/lreplace, and reusing the same names
> > makes sense !
>
> > Then, I wondered about what copy-on-write optimization became in dict
> > world.
> > Indeed, if $x is a dict value,
>
> > set y $x
>
> > copies it, in an efficient way (by reference, incrementing just the
> > refcount).
> > Now, the dict value is "shared" (refcount>=2). So as soon as someone
> > touches it:
>
> > dict incr x foo
>
> > it gets duplicated, so that $x and $y can have separate destinies.
> > That's copy-on-write.
>
> > Now suppose I have a huge dict, stored in some global, and that I pass
> > its value back and forth to various routines. Its refcount will go up
> > with each level of proc call where it is passed as an argument, since
> > the argument is a local holding a bona fide reference. Suddenly, under
> > a stack depth of (say) three, I decide to update the dict, eg with a
> > [dict incr] on the global variable. Won't the huge dict be immediately
> > duplicated ?
>
> The Tcl_Obj representing your 'foo' entry at least needs to be
> duplicated and incremented. And as this changes the whole dict, you also
> get a duplicate of the hash table structure.
>
> DictIncrCmd in generic/tclDictObj.c tells you the details about this,
> there is some optimization in place so that huge string reps aren't
> copied. The code in DupDictInternalRep then tells you that the other
> values aren't copied, but only refcount increased.
>
> So yes, there is a copy, but its kept close to the minimum possible. You
> need a new hashtable struct...
>
> Michael
This is interesting but I would like to ask: is this just a
performance convenience that we should not think about and rely upon?
It seems that if you want to use a dict as a single reference, you
should just pass around the name of the dict, or use upvar, etc. Dict
offers the potential to make a copy, which is cool, it seems great
that the implementation tries to minimize copying, I just wonder if
Tcl'ers should think about this. In Tcl the best way to not duplicate
objects is to refer to them directly or through upvar.
-
Structures vs. Dicts Re: Dict sharing vs. duplication
Alexandre Ferrieux wrote:
> Hi,
>
> Now that 8.5 is out thanks to intensive work from our beloved
> Conspicuous Tcl Illuminati, I'm getting prepared to use dicts
> really...
I think dicts have a place in Tcl for some data patterns. They
serialize automatically to a string rep, and that's quite useful for
some things, but I can't help but desire to have [structure] be in the
core too, and I think others may see a reason for them too.
structure is the basis for NexTk.
proc Foo-check {inst arg} {
#When this returns 1 then the structure key will be set.
#When this returns 0 or an error the structure key will not be set.
expr {$arg eq "foo" || $arg eq "bar"}
}
proc Foo-dance {inst args} {
puts "Foo is dancing with $args"
}
proc Foo {inst args} {
structure $inst
$inst -text "" -metadata ""
structure-make-method $inst dance[list Foo-dance $inst]
#This prevents typos from creating new data structure keys.
structure-lock-creation $inst
if {[llength $args]} {
if {[catch {$inst {*}$args} err]} {
structure-free $inst ;#or rename $inst {}
return -code error $err
}
}
structure-key-callback $inst -metadata[list Foo-check $inst]
return $inst
}
% Foo .f -text "Hello"
..f
% .f -text
Hello
% .f -metadata
% .f -metadata foo
1
% .f -metadata
foo
% .f -metadata bozovalue
0
% .f -metadata
foo
% .f dance Happy Elves
Foo is dancing with Happy Elves
There are other neat things, like traces for keys, and per key/structure
locking. Structure traces are more like Tcl traces, in that they run
after the variable/key has been set. The structure callbacks make the
common pattern of verifying an object's valid input patterns much
easier.
I basically use [structure] as a prototype object system (prototype in
OO terms, rather than as in a prototype that is for testing.) Each
ntk_window is a structure instance from which a window/widget tree is
formed based on a _children[list] member.
George
-
Re: Dict sharing vs. duplication
On Dec 21, 12:42 am, Michael Schlenker <schl...@uni-oldenburg.de>
wrote:
> Alexandre Ferrieux schrieb:
>
> > Won't the huge dict be immediately
> > duplicated ?
>
> So yes, there is a copy, but its kept close to the minimum possible. You
> need a new hashtable struct...
Yeah, the "minimum possible" here is still way too much.
The point of my post was to check whether a smart trick existed to
avoid duplication entirely, just like there is never any automatic
duplication of an array. It seems that no such trick exists, right ?
-Alex
-
Re: Dict sharing vs. duplication
On Dec 21, 3:29 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
>
> In Tcl the best way to not duplicate
> objects is to refer to them directly or through upvar.
What we've just shown is that [array/upvar] is the *only* way to be
sure to avoid duplications. As soon as your dict value gets bound to a
formal parameter or other variable, you're at risk.
-Alex
-
Re: Dict sharing vs. duplication
On Dec 20, 11:39 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> On Dec 21, 3:29 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
> > In Tcl the best way to not duplicate
> > objects is to refer to them directly or through upvar.
>
> What we've just shown is that [array/upvar] is the *only* way to be
> sure to avoid duplications. As soon as your dict value gets bound to a
> formal parameter or other variable, you're at risk.
I admit I was somewhat confused by dict, but they still have names
just like anything else in Tcl, and if you pass the name only, it
shouldn't 'bind' anything. I'm actually more confused by the
terminology of bound values and formal parameters applied to Tcl.
The way I seem to look at this is that a dict is somewhat like an
array, but you can easily get its value and pass it around. I'm not a
fan (yet) by any means, but the ability to easily get the value could
be important in certain situations. For one, I'm thinking of the error
code stuff. It occurs to me that these are array-like, but if an array
were created with an inner proc, it would go out of scope and
evaporate, whereas a dict could be passed back to the caller and added
to as things unwind. I'm just guessing here, but it seems like the
ability to pass array-link stuff back and add to it is something
unique.
It is pretty surprising that you could get partial non-copy of data
with a dict (or partial reference, whatever). Maybe it saves memory or
whatever, but it seems to me it is much safer to decide where you are
going to keep the single copy of your data, and just refer to it by
name (or via upvar). Thinking about saving space by guessing what
internals are doing, or not, and when, seems like a diversion. Arrays
work very well, their behavior is well understood. Dicts are pretty
new and I think it will take some time to understand when to use them.
I'm a pretty verbose writer, but dicts seem way too verbose for my
taste, and using 'for' when you get 'foreach' is somewhat strange.
As a data structure I don't find much of a use for this one. I use an
array to enforce unique keys and I use lists to enforce order. I
combine the two in a number of ways: array of lists, list of array
names, etc. But dict is similar to a structure, but is also somewhat
squishy for the way I think about data. The one thing that I like is
that they are easy to serialize, so you can easily write them out or
pass them around.
-
Re: Dict sharing vs. duplication
Alexandre Ferrieux wrote:
> On Dec 21, 12:42�am, Michael Schlenker <schl...@uni-oldenburg.de>
> wrote:
> > Alexandre Ferrieux schrieb:
> >
> > > Won't the huge dict be immediately
> > > duplicated ?
> >
> > So yes, there is a copy, but its kept close to the minimum possible. You
> > need a new hashtable struct...
>
> Yeah, the "minimum possible" here is still way too much.
> The point of my post was to check whether a smart trick existed to
> avoid duplication entirely, just like there is never any automatic
> duplication of an array. It seems that no such trick exists, right ?
K combinator when passing the value, so you do not retain a copy...
Michael
-
Re: Dict sharing vs. duplication
schlenk@uni-oldenburg.de wrote:
> Alexandre Ferrieux wrote:
>> On Dec 21, 12:42�am, Michael Schlenker <schl...@uni-oldenburg.de>
>> wrote:
>>> Alexandre Ferrieux schrieb:
>>>
>>>> Won't the huge dict be immediately
>>>> duplicated ?
>>> So yes, there is a copy, but its kept close to the minimum possible. You
>>> need a new hashtable struct...
>> Yeah, the "minimum possible" here is still way too much.
>> The point of my post was to check whether a smart trick existed to
>> avoid duplication entirely, just like there is never any automatic
>> duplication of an array. It seems that no such trick exists, right ?
>
> K combinator when passing the value, so you do not retain a copy...
Sadly that will not work, as TWO refCounts are added when passing an arg
to a proc:
- one for the ref in the objc/objv array (Tcl stack within TEBC)
- one for the ref of the proc's variable
That means that having more refs in the caller's context (which you
remove with the K trick) is irrelevant to the COW behaviour.
Avoiding this double-ref is not easy, and will not happen soon (except
if someone comes up with a great idea). It would involve either of
(a) a non-trivial change in api semantics (callee manages objc/objv,
as opposed to caller)
(b) some elaborate (aka cliff-hanging, aka hacky) special handling
for argument variables at runtime (as opposed to all other vars). This
would have a (perceptible?) cost in runtime, and a (definite!) cost in
added complexity.
I have been thinking about changing the api only for procs (a), but not
as a hi-prio project.
-
Re: Dict sharing vs. duplication
On Dec 21, 2:39 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
>
> What we've just shown is that [array/upvar] is the *only* way to be
> sure to avoid duplications. As soon as your dict value gets bound to a
> formal parameter or other variable, you're at risk.
Weird - I thought the whole reason for dict was so that we had a data
structure that was treated as a first class object - instead, we have
Just Another Second Class Structure (JASCS)?