Dict sharing vs. duplication - TCL

This is a discussion on Dict sharing vs. duplication - TCL ; On Dec 21, 7:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com> wrote: > Bottom line for newbies: if your dict is large and r/w, better > make it an array, or count references carefully. > > Do we agree on the diagnosis ? ...

+ Reply to Thread
Page 3 of 11 FirstFirst 1 2 3 4 5 ... LastLast
Results 21 to 30 of 104

Dict sharing vs. duplication

  1. Default Re: Dict sharing vs. duplication

    On Dec 21, 7:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
    wrote:

    > Bottom line for newbies: if your dict is large and r/w, better
    > make it an array, or count references carefully.
    >
    > Do we agree on the diagnosis ?


    I was going to hold off on the language discussion, hoping it would go
    away, but it isn't.

    The newbie should completely forget about reference counts, period.
    This is not a script level concept in Tcl and will only confuse anyone
    trying to think in Tcl. Go find some other language to talk about
    references, but don't confuse newbie Tcl'ers with reference counts,
    first-class-objects, lazy copying, etc. This is real bs in Tcl. If the
    developers who wrote the code tried to optimize performance by not
    doing an immediate copy, who cares, this has nothing to do with what
    the script level code says to do, which is to make a copy:

    [set a $b] makes a copy of the value of b and store it in another
    place called a, when or how this is done should never be depended upon
    by anyone at the script level.

    A dict has a value, a real value which can be passed around. A dict
    also has a name, a real name which can be passed around. What is the
    difference between this and anything else? An array has a value which
    can be accessed with [array get myarray].

    But to understand the difference between how dict and array are used
    is more instructive than looking at their implementation. They are
    used for different types of data. Arrays are limited to one-to-one key-
    value relationships. This makes the values independent from each
    other, they are orthogonal, at right-angles, whatever. Being able to
    address the array entries as individuals makes a lot of sense. But a
    dict is not like this. A dict is more like a grouped list of lists.
    The grouping may be added to by many different somewhat independent
    actions, but the dict itself is a single entity of dependent content.

    However, array and dict are are also commands, not just structures.
    Talking about a dict or array as if they are mere datatypes completely
    misses the point, and an important concept in Tcl: we let the
    developer decide what to do. dict and array are ways of accessing and
    manipulating data, and in pure Tcl you can invent hundreds of other
    ways of manipulating data, or storing it.

    In addition, any developer who can't figure out how to use globals or
    namespace variables to store and maintain large amounts of reusable
    data should start at that point and really forget completely about
    reference counts. It is usually a bug to copy data if you just need to
    manipulate it, regardless of how small or large the data is in memory.
    The reason is that your code then makes it clear that you want the
    actual global value of a variable at a particular point in your code,
    not what it was when your procedure started to run, but the current
    value. If you want the value at the time your procedure starts to run,
    get a copy.

    In addition, [upvar] is not inelegant! This statement can only result
    from a complete misunderstanding of the Tcl language. If we need to
    help newbies, the best advice is for them to assume that the Tcl
    language works the way it works for very good reasons, and there is
    nothing inelegant with something that works very well. (How can
    someone like reference counting and misunderstand upvar?) Newbies need
    to learn the language of Tcl, and suggesting that they think about
    things they cannot use in terms which don't apply to the Tcl language
    is a bad way to help them out.

    Another statement above is misleading to the newbie: that things
    should be handled differently with procs/commands"

    "I have been thinking about changing the api only for procs..."

    Whaa? Lets go back to syntax rule #1:

    "A Tcl script is a string containing one or more commands."

    The desire to special-case stuff is always a huge danger in
    programming; don't fall into the trap of trying to invent a cool
    special shiny object for Tcl, it'll stick out like a wart on your
    nose.

    Reference counting is not a solution to poor programming.


  2. Default Re: Dict sharing vs. duplication

    tom.rmadilo wrote:
    > Another statement above is misleading to the newbie: that things
    > should be handled differently with procs/commands"
    >
    > "I have been thinking about changing the api only for procs..."
    >
    > Whaa? Lets go back to syntax rule #1:
    >
    > "A Tcl script is a string containing one or more commands."
    >
    > The desire to special-case stuff is always a huge danger in
    > programming; don't fall into the trap of trying to invent a cool
    > special shiny object for Tcl, it'll stick out like a wart on your
    > nose.


    Sorry, I was obviously not clear. No changes planned at the script
    level, this refers at worst to the C-level api, at best to only internal
    apis of the bytecode subsystem (compiler and engine).

    I was replying to Alexandre, definitely not a newbie and clearly worried
    about refcounts. He is programming at the script level, but worrying
    about performance and how things are implemented below: not a newbie
    concern, and a valid one.

    The point I was trying to make (again, talking about the C level and
    definitely not to newbies):

    (a) currently the command api results in our adding two refcounts to
    each value that is passed as an argument to a proc: one by the caller,
    one by the proc itself.

    (b) this invalidates the proposed "solution" of clearing the var in the
    caller using the K combinator, at least when the callee is a proc

    (c) two extra refcounts is one-too-many: proper management could be done
    with just one extra refcount (assuming some coordination between caller
    and callee, ie, an api change). Today the caller cleans up the arguments
    on return; we'd need the callee to take over that job.

    (d) I have been thinking of ways to avoid the one-too-many in order to
    simplify some of the code that is run at each proc call, when setting up
    and then cleaning up the proper environment for the proc body. Whatever
    is done here, it will NOT introduce a difference between procs and
    commands at the script level - possibly not even at the C extension
    level. It may yet turn out to be a confined to just the bytecode subsystem.

    No, I am not conspiring to destroy the language. One of the most basic
    features of Tcl is that all commands are equal, that stays. If it got to
    choosing between "that stays" or "miguel stays", even I would vote for
    my prompt removal.

    Miguel

  3. Default Re: Dict sharing vs. duplication

    On Dec 21, 9:50 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
    > The desire to special-case stuff is always a huge danger in
    > programming; don't fall into the trap of trying to invent a cool
    > special shiny object for Tcl, it'll stick out like a wart on your
    > nose.


    The delayed copy of a dictionary into a value seems to have an
    immediate usefulness in the [dict] commands themselves. From the
    manpage there is this example:

    dict set capital en [dict get $capital C]

    If $capital isn't instantly copied, then it probably doesn't double in
    size with the above operation. That would be good. But why the syntax?

    For instance, why not:

    dict set capital en [dict get capital C] ????

    My guess is that in [dict get capital ...] 'capital' isn't guaranteed
    to be a dict, it could be a list:

    % dict get {a b c d} a
    b

    The [dict get] command says: "treat arg1 as a dictionary value". If
    this wasn't the case, then you would always need to initialize a
    dictionary before you could use any [dict] commands on it. Talk about
    copying too much, you would likely end up with the original list, plus
    a dictionary object, even within a single context.

  4. Default Re: Dict sharing vs. duplication

    On Dec 21, 10:39 am, miguel <mso...@users.sf.net> wrote:
    > tom.rmadilo wrote:
    > > Another statement above is misleading to the newbie: that things
    > > should be handled differently with procs/commands"

    >
    > > "I have been thinking about changing the api only for procs..."

    >
    > > Whaa? Lets go back to syntax rule #1:

    >
    > > "A Tcl script is a string containing one or more commands."

    >
    > > The desire to special-case stuff is always a huge danger in
    > > programming; don't fall into the trap of trying to invent a cool
    > > special shiny object for Tcl, it'll stick out like a wart on your
    > > nose.

    >
    > Sorry, I was obviously not clear. No changes planned at the script
    > level, this refers at worst to the C-level api, at best to only internal
    > apis of the bytecode subsystem (compiler and engine).
    >
    > I was replying to Alexandre, definitely not a newbie and clearly worried
    > about refcounts. He is programming at the script level, but worrying
    > about performance and how things are implemented below: not a newbie
    > concern, and a valid one.
    >
    > The point I was trying to make (again, talking about the C level and
    > definitely not to newbies):
    >
    > (a) currently the command api results in our adding two refcounts to
    > each value that is passed as an argument to a proc: one by the caller,
    > one by the proc itself.
    >
    > (b) this invalidates the proposed "solution" of clearing the var in the
    > caller using the K combinator, at least when the callee is a proc
    >
    > (c) two extra refcounts is one-too-many: proper management could be done
    > with just one extra refcount (assuming some coordination between caller
    > and callee, ie, an api change). Today the caller cleans up the arguments
    > on return; we'd need the callee to take over that job.
    >
    > (d) I have been thinking of ways to avoid the one-too-many in order to
    > simplify some of the code that is run at each proc call, when setting up
    > and then cleaning up the proper environment for the proc body. Whatever
    > is done here, it will NOT introduce a difference between procs and
    > commands at the script level - possibly not even at the C extension
    > level. It may yet turn out to be a confined to just the bytecode subsystem.
    >
    > No, I am not conspiring to destroy the language. One of the most basic
    > features of Tcl is that all commands are equal, that stays. If it got to
    > choosing between "that stays" or "miguel stays", even I would vote for
    > my prompt removal.


    Very well explained, and I agree completely. I do appreciate the level
    of non-newbie-ness of everyone contributing to this thread, no
    question about that. I also fully support the C level developers of
    the core commands doing whatever they think works as long as the
    [tclsh] users don't have to care about it. One of the key benefits of
    these two distinct levels is that the C developers are free to
    optimize all they want without informing the upper levels.

    I think I mis-understood what Alexandre was getting at, although I'm
    again guessing. And that is that if a newbie looked at some small
    chunk of code to figure out how to use a dict, or pass around
    references to it, it might not be obvious without a little more
    explanation.

    You are forced to use the name of an array which can't be passed
    around by value. In fact, this is somewhat helpful from time to time
    in discovering bugs due to naming. You can't have both $a and $a(a),
    first one created wins.

    But [dict] may actually improve the situation because a dict name will
    hardly ever look like a dict value, and of course, you don't even have
    to have a name for a dict, so you should be able to apply several
    layers of filtering without any intermediate variables. And this is
    what is going to be confusing.

  5. Default Re: Dict sharing vs. duplication

    Alexandre Ferrieux wrote:
    > Yes, but it means you always pass it by name ([global] being a special
    > case of [upvar]), losing the beauty of first-class-ness...


    Same as with lists.

    > Moreover, if you write a generic procedure like an iterator, taking an
    > arbitrary script as argument, then the dict must be passed by name to
    > the iterator, just in case the arbitrary script needs write access.


    I have no idea what you are talking about there. You seem to be totally
    confused and determined to ascribe magic to everything in sight, whereas
    there is damn little magic (and all of that is purely a matter of some
    performance optimizations). Given that, what do you think this does?

    proc iterDict {dict varKey varValue script} {
    upvar 1 $varKey key $varValue val
    foreach key [dict keys $dict] {
    set val [dict get $dict $key]
    uplevel 1 $script
    }
    }

    Also note that [dict for] does pretty much the same thing. ;-)

    Donal.

  6. Default Re: Dict sharing vs. duplication

    On Dec 21, 2:20 pm, "Donal K. Fellows"
    <donal.k.fell...@manchester.ac.uk> wrote:
    > there is damn little magic (and all of that is purely a matter of some
    > performance optimizations).


    Regardless of any magic, I am more interested in surprise behavior.
    The only surprise I see here is that (according to some of the
    comments in this thread) the $mydict construct doesn't necessarily
    create an instant copy of the data. This is a different kind of
    surprise, if true. It means that the [dict] commands can operate on
    things which are not dicts in the formal sense, but just have the
    structure of a dict, yet when an actual dict is passed around by
    value, it is a somewhat thin value, which could fatten up if
    necessary. I hope this is a correct statement, because it means that
    filtering of a fat dict into a much smaller dict will be nearly ideal.

    For instance, I've been working on an nvlist, and the first question
    was to pass by value or by reference. I started by value and switched
    to by reference. However I lost the ability to chain filtering
    (without intermediate results) due to the need to reference the data
    by name. Dict seems to neatly solve this difficult choice. I really
    hope this ****ysis is right, but either way a dict is somewhere
    between a list and an array, pretty much right in line with Tcl
    structures.

  7. Default Re: Dict sharing vs. duplication

    Dnia 21.12.2007 tom.rmadilo <tom.rmadilo@gmail.com> napisa³/a:

    > As a data structure I don't find much of a use for this one. I use an
    > array to enforce unique keys and I use lists to enforce order. I
    > combine the two in a number of ways: array of lists, list of array
    > names, etc.


    A good example would be a typical "personal database" (or just any other):

    - array of lists won't be suitable for this one

    - list of array names isn't comfortable: if an array has to contain personal
    data (say: arr(name), arr(address), arr(phone) and so on), so although you
    have to use different array names, like f.e. arr001, arr002 or similar way
    - but the real index will be list index, pointing to the array name (and
    not array name itself). One can even just generate array names at random,
    because they have just to be different, and will not have any other meaning.
    And so there's an "intermediate level" when accessing the data. Of course,
    one can live with that - but it isn't convenient neither elegant.

    Much easier will this be solved using dictionaries, which are kind of records
    (or C-structures). For example, such way:

    set tmpDict [dict create name John address Washington phone 123455678]
    set dBase[list]
    lappend dBase $tmpDict
    set tmpDict [dict create name Mary address Chicago phone 3453453453]
    lappend dBase $tmpDict
    ...and so on...

    Such way we've got a replacement for "ordinary", integer-indexed linear
    array, which holds records as data, with record fields accessible just by
    field names - very common structure. You can then sort $dBase list whichever
    way you want, and the list index will be real pointer to particular record.

    Just my thoughts since "my" thread "Still no records". BTW: take a look at
    an example http://pinds.com/2000/09/18/datastructures-in-tcl - the guy had
    to use "clever naming", and it's quite interesting solution - but pay
    attention, how natural solution would be just using dictionaries (not
    present at that time).
    --
    ZB

  8. Default Re: Dict sharing vs. duplication


    On Dec 21, 7:05 pm, ZB <zbREMOVE_THIS@AND_THISispid.com.pl> wrote:
    > Dnia 21.12.2007 tom.rmadilo <tom.rmad...@gmail.com> napisa³/a:
    >
    > > As a data structure I don't find much of a use for this one. I use an
    > > array to enforce unique keys and I use lists to enforce order. I
    > > combine the two in a number of ways: array of lists, list of array
    > > names, etc.

    >
    > A good example would be a typical "personal database" (or just any other):
    >
    > - array of lists won't be suitable for this one
    >
    > - list of array names isn't comfortable: if an array has to contain personal
    > data (say: arr(name), arr(address), arr(phone) and so on), so although you
    > have to use different array names, like f.e. arr001, arr002 or similar way
    > - but the real index will be list index, pointing to the array name (and
    > not array name itself). One can even just generate array names at random,
    > because they have just to be different, and will not have any other meaning.
    > And so there's an "intermediate level" when accessing the data. Of course,
    > one can live with that - but it isn't convenient neither elegant.
    >
    > Much easier will this be solved using dictionaries, which are kind of records
    > (or C-structures). For example, such way:
    >
    > set tmpDict [dict create name John address Washington phone 123455678]
    > set dBase[list]
    > lappend dBase $tmpDict
    > set tmpDict [dict create name Mary address Chicago phone 3453453453]
    > lappend dBase $tmpDict
    > ...and so on...
    >
    > Such way we've got a replacement for "ordinary", integer-indexed linear
    > array, which holds records as data, with record fields accessible just by
    > field names - very common structure. You can then sort $dBase list whichever
    > way you want, and the list index will be real pointer to particular record..
    >
    > Just my thoughts since "my" thread "Still no records". BTW: take a look at
    > an example http://pinds.com/2000/09/18/datastructures-in-tcl- the guy had
    > to use "clever naming", and it's quite interesting solution - but pay
    > attention, how natural solution would be just using dictionaries (not
    > present at that time).


    Of course Mr. Pind doesn't use Tcl anymore (od'd on ruby then
    crashed), so who cares? Using clever names for arrays which you will
    just put into a list is moronic. Lists are ordered structures and the
    only requirement for arrays is that each needs a unique name. If you
    use some kind of non-opaque name for the array names that relates to
    order, you instantly lose the ability to insert, delete or reorder the
    elements (like lreverse). This is the typical solution you find from
    developers who can't figure out how to use the data structures which
    exist, basically: arrays have unique keys and lists have unique
    order.

    But records are not that difficult to create in Tcl. I think the
    problem is that whenever something is not present in the base Tcl
    language it is seen as a missing feature. Since you can write a record
    API in a page or two of Tcl code, it isn't so much a missing feature
    as a lack of effort by those who want it. The argument against writing
    it in Tcl is always that everything needs to be fast, like as fast as
    possible. Believe me, if you write it in plain ol' Tcl and people use
    it, you can port it to C. But just complaining about the lack of a
    data structure which can be coded up over a weekend or two doesn't
    help anyone.

    If you want a head-start, I'll give you my nvlist code, which is
    generalized beyond the dict code in some (tiny) ways, but not quite to
    the record stage. And it is interesting to note here that my nvlist
    API is designed to be special-cased with thin wrappers to provide
    meaningfully named operators. Trying to push too many features into
    core API may seem like a good idea at first, but a thin core API has
    the advantage of composition, specialization and slight renaming so
    that API users will easily understand the meaning of an API, and you
    will be able to search for uses of the API.

    So here are some examples:

    http://junom.com/gitweb/gitweb.perl?...=grep&s=nvlist

    This shows a search of uses of nvlist in a larger API. Note that in
    most cases here there is a relatively small list of data items, but
    each has more than the simple name-value relationship implied by the
    name of the structure. This is a typical complex situation which isn't
    easily handled by a dictionary, which has indexes on keys and values
    only. This nvlist code is built upon [lsearch] and can use any range
    of list members as a key, for instance, a 5 member list {a b c d e},
    could be selected based upon a search of {* b c d *}.

    Okay, that is the nvlist example, how about lists of array names:

    http://junom.com/gitweb/gitweb.perl?...7030b3d7929e3a

    There are two sets of files here. One is the original pg-
    browse.tcl/.adp for browsing a database using foreign keys as links
    between the tables. If you are interested, the pg-browse code uses the
    OpenACS database API which creates a new specialized data structure,
    which can't be easily used without the other associated API.

    The second set of files is pg-tmp.tcl/.tmpl Here the database API
    returns a list of array names. I wrote this API several years ago and
    never considered the case of appending (append) additional rows with
    another query, or adding columns (extend) to each row using a loop.

    The original OpenACS API has lots of switches, and if you look at the
    code it is quite complex, many long files, and the result is a
    completely inaccessible data structure.

    The new API uses lists and arrays. You can create and modify these
    using regular Tcl code, no special API. So, when I needed to port this
    relatively complex sequence of queries, I was actually surprised that
    my original code fully supported the 'extend' and 'append' features
    with no additional code.

    My main point here is that it is very important to not create problems
    for yourself with new specialized data structures. I think that the
    new dict structure very much follows this idea. It does so because
    there is a text/string/list representation which can be created in any
    number of ways, and many of the operations return values which can be
    turned into lists or arrays or new dicts without any special code.
    This is very important, much more important than adding a super
    functional but isolated data structure.

    And don't use Lars Pind as an example of good coding in Tcl, not even
    close.

  9. Default Re: Dict sharing vs. duplication

    On Dec 21, 11:20 pm, "Donal K. Fellows"
    <donal.k.fell...@manchester.ac.uk> wrote:
    > Alexandre Ferrieux wrote:
    > > Yes, but it means you always pass it by name ([global] being a special
    > > case of [upvar]), losing the beauty of first-class-ness...

    >
    > Same as with lists.


    Yes, but it's irrelevant to somebody who is hesitating between an
    array and a dict.

    > > Moreover, if you write a generic procedure like an iterator, taking an
    > > arbitrary script as argument, then the dict must be passed by name to
    > > the iterator, just in case the arbitrary script needs write access.

    >
    > I have no idea what you are talking about there. You seem to be totally
    > confused and determined to ascribe magic to everything in sight, whereas


    Totally confused ? You've made clear enough how stupid you believe I
    am, but read on.

    > there is damn little magic (and all of that is purely a matter of some
    > performance optimizations). Given that, what do you think this does?
    >
    >    proc iterDict {dict varKey varValue script} {
    >       upvar 1 $varKey key $varValue val
    >       foreach key [dict keys $dict] {
    >          set val [dict get $dict $key]
    >          uplevel 1 $script
    >       }
    >    }


    It does exactly what I described in the first place: when $script is
    executed, at least one extra reference to the dict is in the $dict
    local in the iterDict frame (and even one more according to Miguel).
    Hence, if the script inadvertently writes to the dict (by indirect
    side-effect, eg by calling some other procs which access it through
    the global (assuming we did iterDict $::d k v {...}) ans write), then
    the unwanted, wasted duplication occurs.

    > Also note that [dict for] does pretty much the same thing. ;-)


    Pretty much or exactly ? (Since it's a C function I guess it doesn't
    suffer from the ref-wasting issues of procs)

    -Alex

    PS: to help you avoid biting me once more, look at Miguel's ****ysis.
    He does understand my concerns.

  10. Default Re: Dict sharing vs. duplication

    On Dec 21, 5:05 pm, miguel <mso...@users.sf.net> wrote:
    >
    > IMHO, it is wrong to stress these things in a "dict vs array" manner; it
    > really is just a "value vs variable" comparison. Do not confuse the
    > container (a variable) with its contents (a value). Once the general
    > case is understood, things become simple. If you insist on seeing this
    > as a multitude of special cases you just obscure the matter.


    I'm not confusing anything. I do understand Tcl internals pretty well,
    though I like to put myself in the scripter's seat at the same time to
    verify how well we hide the internal complexity.

    As you yourself said in a nearby message, my only concern is about
    performance, in situations which may be hard to ****yse purely from
    the scripter's POV.

    The reason I'm putting this in terms of "dict vs array" is that, for a
    scripter, there is a large functional overlap between the two (I did
    not write "redundancy", I want to live through 2007), hence a natural
    question like "when should I use one or the other ?".

    > Bottom line for newbies: if your values are large and r/w, better store
    > them in a variable and access it through upvar. No matter if list or
    > dict or whatever.


    Yes, but in case you need associative access, doing this is hardly
    different (for the client code) from what you'd do with an array... I
    have nothing against upvar, I'm just trying to enumerate pros and
    cons, some of which are of a stylistic nature, and on this scale it's
    an ex-aequo :-)

    > I do agree that when the use case allows for either an array or a dict
    > it might be valuable to have a specialized list of choice criteria
    > available. In the wiki?


    That would definitely help !

    -Alex

+ Reply to Thread
Page 3 of 11 FirstFirst 1 2 3 4 5 ... LastLast