MSXML - LINEAR cost taking a SINGLE step of IXMLDomNodeListenumerator?? - XML SOAP
This is a discussion on MSXML - LINEAR cost taking a SINGLE step of IXMLDomNodeListenumerator?? - XML SOAP ; When walking a large DomNodeList using the IEnumVariant interface
(I've included VB code below, but obviously it's the same in C++ using
COM enumerators), the cost of going to the next element seems to be
proportional to the TOTAL length ...
-
MSXML - LINEAR cost taking a SINGLE step of IXMLDomNodeListenumerator??
When walking a large DomNodeList using the IEnumVariant interface
(I've included VB code below, but obviously it's the same in C++ using
COM enumerators), the cost of going to the next element seems to be
proportional to the TOTAL length of the list!! Is this right?
I haven't found this mentioned anywhere, maybe I'm just not googling
right 
i.e. walking the first 1000 nodes of a 2000 node list is massively
cheaper than walking the first 1000 nodes of a 20000 node list. - in
practice this means if you're traversing the whole list your cost is
n^2, which on the large lists I'm dealing with is causing a bit of
pain....
This is pretty confusing - it doesn't look like the list is being
generated lazily (and even if it was I'd not expect linear behaviour
like this for accessing a prefix) - it almost looks like it's re-
evaluating the predicate each iteration step! 
Anyone know anything about this? Behaviour doesn't seem to differ
between msxml4 / 6...
Thanks,
Lee.
-----8<---------- sample VBA to demo, needs ref to msxml4/6.
Private Declare Function QueryPerformanceCounter& Lib
"kernel32.dll" (l As pbdw)
Private Declare Function QueryPerformanceFrequency& Lib
"kernel32.dll" (l As pbdw)
Private Type pbdw
bottom As Long
top As Long
End Type
Public Function tickspers() As Double
Static l As Long
If l = 0 Then
Dim ll As pbdw
Call QueryPerformanceFrequency(ll)
l = ll.bottom
End If
ticksperms = l
End Function
Public Function ticks() As Double
Dim ll As pbdw
Call QueryPerformanceCounter(ll)
ticks = ll.bottom
End Function
Private Sub add_node(ByRef s As String)
s = s & "<node><a>1</a><b>2</b></node>"
End Sub
Private Sub check_timing(ByVal s As String)
s = s & "</doc>"
Dim d As DOMDocument
Set d = New DOMDocument40 ' or domdocument60,
If Not d.loadXML(s) Then Call Err.Raise(10101, , "Didn't load")
Dim n As IXMLDOMNode
Set n = d.documentElement
Dim dn As IXMLDOMNodeList
Set dn = n.selectNodes("node")
If dn.Length < 1000 Then Call Err.Raise(10102, , "Odd...")
Dim dstart As Double, dstop As Double
dstart = ticks()
Dim i As Long
i = 0
Dim count As Long
For Each n In dn
i = i + 1
count = count + Len(n.Text)
If (i = 1000) Then
GoTo breakout
End If
Next n
breakout:
dstop = ticks()
Debug.Print Join(Array(dn.Length, (dstop - dstart) / tickspers,
count), " , ")
End Sub
Public Sub test()
Dim doc As String
doc = "<doc>"
Dim l As Long
For l = 1 To 1000
Call add_node(doc)
Next l
Dim step_ As Long
step_ = 1000
Dim stepstr As String
Dim x As Long
For x = 1 To step_
Call add_node(stepstr)
Next x
For l = 1000 To 40000 Step step_
Call check_timing(doc)
doc = doc & stepstr
Next l
End Sub
----8<---------- sample output #nodes in total list, #s to iterate
first 1000, #chars in first 100 nodes
1000 , 1.12989220697044E-02 , 2000
2000 , 2.13613741411269E-02 , 2000
3000 , 3.05105816521374E-02 , 2000
4000 , 4.01073320771215E-02 , 2000
5000 , 4.96102158235195E-02 , 2000
6000 , 5.94773916796688E-02 , 2000
7000 , 6.90878309952801E-02 , 2000
8000 , 7.86281496670666E-02 , 2000
9000 , 8.94758970763044E-02 , 2000
10000 , 9.86164442687548E-02 , 2000
11000 , 0.107572331120296 , 2000
12000 , 0.118514783303465 , 2000
13000 , 0.128915267163843 , 2000
14000 , 0.137097033282163 , 2000
15000 , 0.147874939412691 , 2000
16000 , 0.15577845787663 , 2000
17000 , 0.167114814871723 , 2000
18000 , 0.180319007024636 , 2000
19000 , 0.185794842640615 , 2000
20000 , 0.196314336039916 , 2000
21000 , 0.206365613506745 , 2000
22000 , 0.215159747956793 , 2000
23000 , 0.233987280506321 , 2000
24000 , 0.2455398661003 , 2000
25000 , 0.2592050106927 , 2000
26000 , 0.282794880354905 , 2000
27000 , 0.318379849952997 , 2000
28000 , 0.307542997783238 , 2000
29000 , 0.345976374092238 , 2000
30000 , 0.35645787383592 , 2000
31000 , 0.349666787259275 , 2000
32000 , 0.394941535865592 , 2000
33000 , 0.407140013605081 , 2000
34000 , 0.371464250344667 , 2000
35000 , 0.395082615248586 , 2000
36000 , 0.420603177219451 , 2000
37000 , 0.48740580157534 , 2000
38000 , 0.500694361992935 , 2000
39000 , 0.527952574978105 , 2000
40000 , 0.522535685401357 , 2000
-
Re: MSXML - LINEAR cost taking a SINGLE step of IXMLDomNodeList enumerator??
"lab27" <lee.benfield@gmail.com> wrote in message
news:795f1bf0-544a-481f-ba38-2649a6f67728@p69g2000hsa.googlegroups.com...
> When walking a large DomNodeList using the IEnumVariant interface
> (I've included VB code below, but obviously it's the same in C++ using
> COM enumerators), the cost of going to the next element seems to be
> proportional to the TOTAL length of the list!! Is this right?
>
> I haven't found this mentioned anywhere, maybe I'm just not googling
> right 
>
> i.e. walking the first 1000 nodes of a 2000 node list is massively
> cheaper than walking the first 1000 nodes of a 20000 node list. - in
> practice this means if you're traversing the whole list your cost is
> n^2, which on the large lists I'm dealing with is causing a bit of
> pain....
>
> This is pretty confusing - it doesn't look like the list is being
> generated lazily (and even if it was I'd not expect linear behaviour
> like this for accessing a prefix) - it almost looks like it's re-
> evaluating the predicate each iteration step! 
>
> Anyone know anything about this? Behaviour doesn't seem to differ
> between msxml4 / 6...
>
> Thanks,
>
> Lee.
>
> -----8<---------- sample VBA to demo, needs ref to msxml4/6.
>
> Private Declare Function QueryPerformanceCounter& Lib
> "kernel32.dll" (l As pbdw)
> Private Declare Function QueryPerformanceFrequency& Lib
> "kernel32.dll" (l As pbdw)
>
> Private Type pbdw
> bottom As Long
> top As Long
> End Type
>
> Public Function tickspers() As Double
> Static l As Long
> If l = 0 Then
> Dim ll As pbdw
> Call QueryPerformanceFrequency(ll)
> l = ll.bottom
> End If
> ticksperms = l
> End Function
>
> Public Function ticks() As Double
> Dim ll As pbdw
> Call QueryPerformanceCounter(ll)
> ticks = ll.bottom
> End Function
>
> Private Sub add_node(ByRef s As String)
> s = s & "<node><a>1</a><b>2</b></node>"
> End Sub
>
> Private Sub check_timing(ByVal s As String)
> s = s & "</doc>"
>
> Dim d As DOMDocument
> Set d = New DOMDocument40 ' or domdocument60,
> If Not d.loadXML(s) Then Call Err.Raise(10101, , "Didn't load")
>
> Dim n As IXMLDOMNode
> Set n = d.documentElement
> Dim dn As IXMLDOMNodeList
>
> Set dn = n.selectNodes("node")
> If dn.Length < 1000 Then Call Err.Raise(10102, , "Odd...")
>
> Dim dstart As Double, dstop As Double
> dstart = ticks()
>
> Dim i As Long
> i = 0
> Dim count As Long
> For Each n In dn
> i = i + 1
> count = count + Len(n.Text)
> If (i = 1000) Then
> GoTo breakout
> End If
> Next n
> breakout:
> dstop = ticks()
> Debug.Print Join(Array(dn.Length, (dstop - dstart) / tickspers,
> count), " , ")
> End Sub
>
> Public Sub test()
> Dim doc As String
> doc = "<doc>"
> Dim l As Long
> For l = 1 To 1000
> Call add_node(doc)
> Next l
>
> Dim step_ As Long
> step_ = 1000
>
> Dim stepstr As String
> Dim x As Long
> For x = 1 To step_
> Call add_node(stepstr)
> Next x
>
> For l = 1000 To 40000 Step step_
> Call check_timing(doc)
> doc = doc & stepstr
> Next l
> End Sub
>
>
The culprit is this line in your check_timing routine:-
count = count + Len(n.Text)
Specifically the use of n.Text. It also isn't related to the number of
nodes selected try this change:-
Set dn = n.selectNodes("node[position() < 1001]")
You'll notice that dn.Length is consitently 1000 yet the growth in time
taken is still proportional to the full set of nodes. IOW, the cost of
retrieving .Text is proportional to the size of document that the node
belongs to. You can try this change to keep it to a single simple element:-
Set dn = n.selectNodes("node[position() < 1001]/a")
It's still bad. Take it a step further lets get the text node itself:-
Set dn = n.selectNodes("node[position() < 1001]/a/text()")
Nope still just as bad.
Now make this change
count = count + Len(n.nodeValue)
Suddenly it much quicker and consistent. Make these changes:-
Set dn = n.selectNodes("node[position() < 1001]/a")
count = count + Len(n.firstChild.nodeValue)
Again consistent, lets go back to selecting the node and build the text
string ourselves.
Set dn = n.selectNodes("node[position() < 1001]")
count = count + Len(n.firstChild.firstChild.nodeValue &
n.firstChild.nextSibling.nodeValue)
Ugly but still a couple of orders of magnitude faster than .Text and
consistant despite document size.
Finally remove the position predicate. Just as fast.
Why would using the Text property be so costly? I haven't got the foggiest,
very odd and annoying.
--
Anthony Jones - MVP ASP/ASP.NET
Similar Threads
-
By Application Development in forum Python
Replies: 0
Last Post: 09-26-2007, 03:53 AM
-
By Application Development in forum Python
Replies: 0
Last Post: 01-12-2007, 02:55 AM
-
By Application Development in forum Theory
Replies: 1
Last Post: 09-17-2006, 12:12 AM
-
By Application Development in forum Compilers
Replies: 1
Last Post: 09-07-2004, 10:50 PM
-
By Application Development in forum Microsoft Exchange
Replies: 0
Last Post: 10-01-2003, 05:20 PM