Dict sharing vs. duplication - TCL
This is a discussion on Dict sharing vs. duplication - TCL ; On Dec 21, 7:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> Bottom line for newbies: if your dict is large and r/w, better
> make it an array, or count references carefully.
>
> Do we agree on the diagnosis ?
...
-
Re: Dict sharing vs. duplication
On Dec 21, 7:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> Bottom line for newbies: if your dict is large and r/w, better
> make it an array, or count references carefully.
>
> Do we agree on the diagnosis ?
I was going to hold off on the language discussion, hoping it would go
away, but it isn't.
The newbie should completely forget about reference counts, period.
This is not a script level concept in Tcl and will only confuse anyone
trying to think in Tcl. Go find some other language to talk about
references, but don't confuse newbie Tcl'ers with reference counts,
first-class-objects, lazy copying, etc. This is real bs in Tcl. If the
developers who wrote the code tried to optimize performance by not
doing an immediate copy, who cares, this has nothing to do with what
the script level code says to do, which is to make a copy:
[set a $b] makes a copy of the value of b and store it in another
place called a, when or how this is done should never be depended upon
by anyone at the script level.
A dict has a value, a real value which can be passed around. A dict
also has a name, a real name which can be passed around. What is the
difference between this and anything else? An array has a value which
can be accessed with [array get myarray].
But to understand the difference between how dict and array are used
is more instructive than looking at their implementation. They are
used for different types of data. Arrays are limited to one-to-one key-
value relationships. This makes the values independent from each
other, they are orthogonal, at right-angles, whatever. Being able to
address the array entries as individuals makes a lot of sense. But a
dict is not like this. A dict is more like a grouped list of lists.
The grouping may be added to by many different somewhat independent
actions, but the dict itself is a single entity of dependent content.
However, array and dict are are also commands, not just structures.
Talking about a dict or array as if they are mere datatypes completely
misses the point, and an important concept in Tcl: we let the
developer decide what to do. dict and array are ways of accessing and
manipulating data, and in pure Tcl you can invent hundreds of other
ways of manipulating data, or storing it.
In addition, any developer who can't figure out how to use globals or
namespace variables to store and maintain large amounts of reusable
data should start at that point and really forget completely about
reference counts. It is usually a bug to copy data if you just need to
manipulate it, regardless of how small or large the data is in memory.
The reason is that your code then makes it clear that you want the
actual global value of a variable at a particular point in your code,
not what it was when your procedure started to run, but the current
value. If you want the value at the time your procedure starts to run,
get a copy.
In addition, [upvar] is not inelegant! This statement can only result
from a complete misunderstanding of the Tcl language. If we need to
help newbies, the best advice is for them to assume that the Tcl
language works the way it works for very good reasons, and there is
nothing inelegant with something that works very well. (How can
someone like reference counting and misunderstand upvar?) Newbies need
to learn the language of Tcl, and suggesting that they think about
things they cannot use in terms which don't apply to the Tcl language
is a bad way to help them out.
Another statement above is misleading to the newbie: that things
should be handled differently with procs/commands"
"I have been thinking about changing the api only for procs..."
Whaa? Lets go back to syntax rule #1:
"A Tcl script is a string containing one or more commands."
The desire to special-case stuff is always a huge danger in
programming; don't fall into the trap of trying to invent a cool
special shiny object for Tcl, it'll stick out like a wart on your
nose.
Reference counting is not a solution to poor programming.
-
Re: Dict sharing vs. duplication
tom.rmadilo wrote:
> Another statement above is misleading to the newbie: that things
> should be handled differently with procs/commands"
>
> "I have been thinking about changing the api only for procs..."
>
> Whaa? Lets go back to syntax rule #1:
>
> "A Tcl script is a string containing one or more commands."
>
> The desire to special-case stuff is always a huge danger in
> programming; don't fall into the trap of trying to invent a cool
> special shiny object for Tcl, it'll stick out like a wart on your
> nose.
Sorry, I was obviously not clear. No changes planned at the script
level, this refers at worst to the C-level api, at best to only internal
apis of the bytecode subsystem (compiler and engine).
I was replying to Alexandre, definitely not a newbie and clearly worried
about refcounts. He is programming at the script level, but worrying
about performance and how things are implemented below: not a newbie
concern, and a valid one.
The point I was trying to make (again, talking about the C level and
definitely not to newbies):
(a) currently the command api results in our adding two refcounts to
each value that is passed as an argument to a proc: one by the caller,
one by the proc itself.
(b) this invalidates the proposed "solution" of clearing the var in the
caller using the K combinator, at least when the callee is a proc
(c) two extra refcounts is one-too-many: proper management could be done
with just one extra refcount (assuming some coordination between caller
and callee, ie, an api change). Today the caller cleans up the arguments
on return; we'd need the callee to take over that job.
(d) I have been thinking of ways to avoid the one-too-many in order to
simplify some of the code that is run at each proc call, when setting up
and then cleaning up the proper environment for the proc body. Whatever
is done here, it will NOT introduce a difference between procs and
commands at the script level - possibly not even at the C extension
level. It may yet turn out to be a confined to just the bytecode subsystem.
No, I am not conspiring to destroy the language. One of the most basic
features of Tcl is that all commands are equal, that stays. If it got to
choosing between "that stays" or "miguel stays", even I would vote for
my prompt removal.
Miguel
-
Re: Dict sharing vs. duplication
On Dec 21, 9:50 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
> The desire to special-case stuff is always a huge danger in
> programming; don't fall into the trap of trying to invent a cool
> special shiny object for Tcl, it'll stick out like a wart on your
> nose.
The delayed copy of a dictionary into a value seems to have an
immediate usefulness in the [dict] commands themselves. From the
manpage there is this example:
dict set capital en [dict get $capital C]
If $capital isn't instantly copied, then it probably doesn't double in
size with the above operation. That would be good. But why the syntax?
For instance, why not:
dict set capital en [dict get capital C] ????
My guess is that in [dict get capital ...] 'capital' isn't guaranteed
to be a dict, it could be a list:
% dict get {a b c d} a
b
The [dict get] command says: "treat arg1 as a dictionary value". If
this wasn't the case, then you would always need to initialize a
dictionary before you could use any [dict] commands on it. Talk about
copying too much, you would likely end up with the original list, plus
a dictionary object, even within a single context.
-
Re: Dict sharing vs. duplication
On Dec 21, 10:39 am, miguel <mso...@users.sf.net> wrote:
> tom.rmadilo wrote:
> > Another statement above is misleading to the newbie: that things
> > should be handled differently with procs/commands"
>
> > "I have been thinking about changing the api only for procs..."
>
> > Whaa? Lets go back to syntax rule #1:
>
> > "A Tcl script is a string containing one or more commands."
>
> > The desire to special-case stuff is always a huge danger in
> > programming; don't fall into the trap of trying to invent a cool
> > special shiny object for Tcl, it'll stick out like a wart on your
> > nose.
>
> Sorry, I was obviously not clear. No changes planned at the script
> level, this refers at worst to the C-level api, at best to only internal
> apis of the bytecode subsystem (compiler and engine).
>
> I was replying to Alexandre, definitely not a newbie and clearly worried
> about refcounts. He is programming at the script level, but worrying
> about performance and how things are implemented below: not a newbie
> concern, and a valid one.
>
> The point I was trying to make (again, talking about the C level and
> definitely not to newbies):
>
> (a) currently the command api results in our adding two refcounts to
> each value that is passed as an argument to a proc: one by the caller,
> one by the proc itself.
>
> (b) this invalidates the proposed "solution" of clearing the var in the
> caller using the K combinator, at least when the callee is a proc
>
> (c) two extra refcounts is one-too-many: proper management could be done
> with just one extra refcount (assuming some coordination between caller
> and callee, ie, an api change). Today the caller cleans up the arguments
> on return; we'd need the callee to take over that job.
>
> (d) I have been thinking of ways to avoid the one-too-many in order to
> simplify some of the code that is run at each proc call, when setting up
> and then cleaning up the proper environment for the proc body. Whatever
> is done here, it will NOT introduce a difference between procs and
> commands at the script level - possibly not even at the C extension
> level. It may yet turn out to be a confined to just the bytecode subsystem.
>
> No, I am not conspiring to destroy the language. One of the most basic
> features of Tcl is that all commands are equal, that stays. If it got to
> choosing between "that stays" or "miguel stays", even I would vote for
> my prompt removal.
Very well explained, and I agree completely. I do appreciate the level
of non-newbie-ness of everyone contributing to this thread, no
question about that. I also fully support the C level developers of
the core commands doing whatever they think works as long as the
[tclsh] users don't have to care about it. One of the key benefits of
these two distinct levels is that the C developers are free to
optimize all they want without informing the upper levels.
I think I mis-understood what Alexandre was getting at, although I'm
again guessing. And that is that if a newbie looked at some small
chunk of code to figure out how to use a dict, or pass around
references to it, it might not be obvious without a little more
explanation.
You are forced to use the name of an array which can't be passed
around by value. In fact, this is somewhat helpful from time to time
in discovering bugs due to naming. You can't have both $a and $a(a),
first one created wins.
But [dict] may actually improve the situation because a dict name will
hardly ever look like a dict value, and of course, you don't even have
to have a name for a dict, so you should be able to apply several
layers of filtering without any intermediate variables. And this is
what is going to be confusing.
-
Re: Dict sharing vs. duplication
Alexandre Ferrieux wrote:
> Yes, but it means you always pass it by name ([global] being a special
> case of [upvar]), losing the beauty of first-class-ness...
Same as with lists.
> Moreover, if you write a generic procedure like an iterator, taking an
> arbitrary script as argument, then the dict must be passed by name to
> the iterator, just in case the arbitrary script needs write access.
I have no idea what you are talking about there. You seem to be totally
confused and determined to ascribe magic to everything in sight, whereas
there is damn little magic (and all of that is purely a matter of some
performance optimizations). Given that, what do you think this does?
proc iterDict {dict varKey varValue script} {
upvar 1 $varKey key $varValue val
foreach key [dict keys $dict] {
set val [dict get $dict $key]
uplevel 1 $script
}
}
Also note that [dict for] does pretty much the same thing. ;-)
Donal.
-
Re: Dict sharing vs. duplication
On Dec 21, 2:20 pm, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:
> there is damn little magic (and all of that is purely a matter of some
> performance optimizations).
Regardless of any magic, I am more interested in surprise behavior.
The only surprise I see here is that (according to some of the
comments in this thread) the $mydict construct doesn't necessarily
create an instant copy of the data. This is a different kind of
surprise, if true. It means that the [dict] commands can operate on
things which are not dicts in the formal sense, but just have the
structure of a dict, yet when an actual dict is passed around by
value, it is a somewhat thin value, which could fatten up if
necessary. I hope this is a correct statement, because it means that
filtering of a fat dict into a much smaller dict will be nearly ideal.
For instance, I've been working on an nvlist, and the first question
was to pass by value or by reference. I started by value and switched
to by reference. However I lost the ability to chain filtering
(without intermediate results) due to the need to reference the data
by name. Dict seems to neatly solve this difficult choice. I really
hope this ****ysis is right, but either way a dict is somewhere
between a list and an array, pretty much right in line with Tcl
structures.
-
Re: Dict sharing vs. duplication
Dnia 21.12.2007 tom.rmadilo <tom.rmadilo@gmail.com> napisa³/a:
> As a data structure I don't find much of a use for this one. I use an
> array to enforce unique keys and I use lists to enforce order. I
> combine the two in a number of ways: array of lists, list of array
> names, etc.
A good example would be a typical "personal database" (or just any other):
- array of lists won't be suitable for this one
- list of array names isn't comfortable: if an array has to contain personal
data (say: arr(name), arr(address), arr(phone) and so on), so although you
have to use different array names, like f.e. arr001, arr002 or similar way
- but the real index will be list index, pointing to the array name (and
not array name itself). One can even just generate array names at random,
because they have just to be different, and will not have any other meaning.
And so there's an "intermediate level" when accessing the data. Of course,
one can live with that - but it isn't convenient neither elegant.
Much easier will this be solved using dictionaries, which are kind of records
(or C-structures). For example, such way:
set tmpDict [dict create name John address Washington phone 123455678]
set dBase[list]
lappend dBase $tmpDict
set tmpDict [dict create name Mary address Chicago phone 3453453453]
lappend dBase $tmpDict
...and so on...
Such way we've got a replacement for "ordinary", integer-indexed linear
array, which holds records as data, with record fields accessible just by
field names - very common structure. You can then sort $dBase list whichever
way you want, and the list index will be real pointer to particular record.
Just my thoughts since "my" thread "Still no records". BTW: take a look at
an example http://pinds.com/2000/09/18/datastructures-in-tcl - the guy had
to use "clever naming", and it's quite interesting solution - but pay
attention, how natural solution would be just using dictionaries (not
present at that time).
--
ZB
-
Re: Dict sharing vs. duplication
On Dec 21, 7:05 pm, ZB <zbREMOVE_THIS@AND_THISispid.com.pl> wrote:
> Dnia 21.12.2007 tom.rmadilo <tom.rmad...@gmail.com> napisa³/a:
>
> > As a data structure I don't find much of a use for this one. I use an
> > array to enforce unique keys and I use lists to enforce order. I
> > combine the two in a number of ways: array of lists, list of array
> > names, etc.
>
> A good example would be a typical "personal database" (or just any other):
>
> - array of lists won't be suitable for this one
>
> - list of array names isn't comfortable: if an array has to contain personal
> data (say: arr(name), arr(address), arr(phone) and so on), so although you
> have to use different array names, like f.e. arr001, arr002 or similar way
> - but the real index will be list index, pointing to the array name (and
> not array name itself). One can even just generate array names at random,
> because they have just to be different, and will not have any other meaning.
> And so there's an "intermediate level" when accessing the data. Of course,
> one can live with that - but it isn't convenient neither elegant.
>
> Much easier will this be solved using dictionaries, which are kind of records
> (or C-structures). For example, such way:
>
> set tmpDict [dict create name John address Washington phone 123455678]
> set dBase[list]
> lappend dBase $tmpDict
> set tmpDict [dict create name Mary address Chicago phone 3453453453]
> lappend dBase $tmpDict
> ...and so on...
>
> Such way we've got a replacement for "ordinary", integer-indexed linear
> array, which holds records as data, with record fields accessible just by
> field names - very common structure. You can then sort $dBase list whichever
> way you want, and the list index will be real pointer to particular record..
>
> Just my thoughts since "my" thread "Still no records". BTW: take a look at
> an example http://pinds.com/2000/09/18/datastructures-in-tcl- the guy had
> to use "clever naming", and it's quite interesting solution - but pay
> attention, how natural solution would be just using dictionaries (not
> present at that time).
Of course Mr. Pind doesn't use Tcl anymore (od'd on ruby then
crashed), so who cares? Using clever names for arrays which you will
just put into a list is moronic. Lists are ordered structures and the
only requirement for arrays is that each needs a unique name. If you
use some kind of non-opaque name for the array names that relates to
order, you instantly lose the ability to insert, delete or reorder the
elements (like lreverse). This is the typical solution you find from
developers who can't figure out how to use the data structures which
exist, basically: arrays have unique keys and lists have unique
order.
But records are not that difficult to create in Tcl. I think the
problem is that whenever something is not present in the base Tcl
language it is seen as a missing feature. Since you can write a record
API in a page or two of Tcl code, it isn't so much a missing feature
as a lack of effort by those who want it. The argument against writing
it in Tcl is always that everything needs to be fast, like as fast as
possible. Believe me, if you write it in plain ol' Tcl and people use
it, you can port it to C. But just complaining about the lack of a
data structure which can be coded up over a weekend or two doesn't
help anyone.
If you want a head-start, I'll give you my nvlist code, which is
generalized beyond the dict code in some (tiny) ways, but not quite to
the record stage. And it is interesting to note here that my nvlist
API is designed to be special-cased with thin wrappers to provide
meaningfully named operators. Trying to push too many features into
core API may seem like a good idea at first, but a thin core API has
the advantage of composition, specialization and slight renaming so
that API users will easily understand the meaning of an API, and you
will be able to search for uses of the API.
So here are some examples:
http://junom.com/gitweb/gitweb.perl?...=grep&s=nvlist
This shows a search of uses of nvlist in a larger API. Note that in
most cases here there is a relatively small list of data items, but
each has more than the simple name-value relationship implied by the
name of the structure. This is a typical complex situation which isn't
easily handled by a dictionary, which has indexes on keys and values
only. This nvlist code is built upon [lsearch] and can use any range
of list members as a key, for instance, a 5 member list {a b c d e},
could be selected based upon a search of {* b c d *}.
Okay, that is the nvlist example, how about lists of array names:
http://junom.com/gitweb/gitweb.perl?...7030b3d7929e3a
There are two sets of files here. One is the original pg-
browse.tcl/.adp for browsing a database using foreign keys as links
between the tables. If you are interested, the pg-browse code uses the
OpenACS database API which creates a new specialized data structure,
which can't be easily used without the other associated API.
The second set of files is pg-tmp.tcl/.tmpl Here the database API
returns a list of array names. I wrote this API several years ago and
never considered the case of appending (append) additional rows with
another query, or adding columns (extend) to each row using a loop.
The original OpenACS API has lots of switches, and if you look at the
code it is quite complex, many long files, and the result is a
completely inaccessible data structure.
The new API uses lists and arrays. You can create and modify these
using regular Tcl code, no special API. So, when I needed to port this
relatively complex sequence of queries, I was actually surprised that
my original code fully supported the 'extend' and 'append' features
with no additional code.
My main point here is that it is very important to not create problems
for yourself with new specialized data structures. I think that the
new dict structure very much follows this idea. It does so because
there is a text/string/list representation which can be created in any
number of ways, and many of the operations return values which can be
turned into lists or arrays or new dicts without any special code.
This is very important, much more important than adding a super
functional but isolated data structure.
And don't use Lars Pind as an example of good coding in Tcl, not even
close.
-
Re: Dict sharing vs. duplication
On Dec 21, 11:20 pm, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:
> Alexandre Ferrieux wrote:
> > Yes, but it means you always pass it by name ([global] being a special
> > case of [upvar]), losing the beauty of first-class-ness...
>
> Same as with lists.
Yes, but it's irrelevant to somebody who is hesitating between an
array and a dict.
> > Moreover, if you write a generic procedure like an iterator, taking an
> > arbitrary script as argument, then the dict must be passed by name to
> > the iterator, just in case the arbitrary script needs write access.
>
> I have no idea what you are talking about there. You seem to be totally
> confused and determined to ascribe magic to everything in sight, whereas
Totally confused ? You've made clear enough how stupid you believe I
am, but read on.
> there is damn little magic (and all of that is purely a matter of some
> performance optimizations). Given that, what do you think this does?
>
> proc iterDict {dict varKey varValue script} {
> upvar 1 $varKey key $varValue val
> foreach key [dict keys $dict] {
> set val [dict get $dict $key]
> uplevel 1 $script
> }
> }
It does exactly what I described in the first place: when $script is
executed, at least one extra reference to the dict is in the $dict
local in the iterDict frame (and even one more according to Miguel).
Hence, if the script inadvertently writes to the dict (by indirect
side-effect, eg by calling some other procs which access it through
the global (assuming we did iterDict $::d k v {...}) ans write), then
the unwanted, wasted duplication occurs.
> Also note that [dict for] does pretty much the same thing. ;-)
Pretty much or exactly ? (Since it's a C function I guess it doesn't
suffer from the ref-wasting issues of procs)
-Alex
PS: to help you avoid biting me once more, look at Miguel's ****ysis.
He does understand my concerns.
-
Re: Dict sharing vs. duplication
On Dec 21, 5:05 pm, miguel <mso...@users.sf.net> wrote:
>
> IMHO, it is wrong to stress these things in a "dict vs array" manner; it
> really is just a "value vs variable" comparison. Do not confuse the
> container (a variable) with its contents (a value). Once the general
> case is understood, things become simple. If you insist on seeing this
> as a multitude of special cases you just obscure the matter.
I'm not confusing anything. I do understand Tcl internals pretty well,
though I like to put myself in the scripter's seat at the same time to
verify how well we hide the internal complexity.
As you yourself said in a nearby message, my only concern is about
performance, in situations which may be hard to ****yse purely from
the scripter's POV.
The reason I'm putting this in terms of "dict vs array" is that, for a
scripter, there is a large functional overlap between the two (I did
not write "redundancy", I want to live through 2007), hence a natural
question like "when should I use one or the other ?".
> Bottom line for newbies: if your values are large and r/w, better store
> them in a variable and access it through upvar. No matter if list or
> dict or whatever.
Yes, but in case you need associative access, doing this is hardly
different (for the client code) from what you'd do with an array... I
have nothing against upvar, I'm just trying to enumerate pros and
cons, some of which are of a stylistic nature, and on this scale it's
an ex-aequo :-)
> I do agree that when the use case allows for either an array or a dict
> it might be valuable to have a specialized list of choice criteria
> available. In the wiki?
That would definitely help !
-Alex