#1
Does anyone have experience utilizing UIMA in a large processor cluster to handle farming documents out for analysis by different machines/processors?

Is there any documentation somewhere on doing this in the most straightforward manner?

Thanks in advance for any assistance.

Karin

--
Karin Verspoor, PhD
Research Assistant Professor
Center for Computational Pharmacology, University of Colorado Denver
PO Box 6511, MS 8303, Aurora, CO 80045 USA
karin.verspoor-nW+oXHvFcJb2fBVCVOL8/A@public.gmane.org / tel: (720) 279-4875 / campus: 4-3758
#2
hi karin

haven't tried this with uima but we're doing it in two projects:

- a commercial project with http://www.matrixware.com/ who have a terabyte of plain text
- a new research project with the WHO (http://www.iarc.fr/) for whom we're doing medline crunching amongst other things

happy to talk off-line if this is interesting for you

best
h

On Wed, 30 Jul 2008, Karin Verspoor wrote:
> Does anyone have experience utilizing UIMA in a large processor
> cluster to handle farming documents out for analysis by different
> machines/processors?
>
> Is there any documentation somewhere on doing this in the most
> straightforward manner?
#3
We're working with the sample UIMA-AS material with the following objective:

(1) Define an Aggregate with 4 stages ('A', 'B', 'C', 'D')
(2) Remote stage 'B' via use of the RemoteAnalysisEngine element of the deployment descriptor, with name/key of 'B'.

We have put timing messages on stdout in stage B, so we can see that it is running on each CAS.

We are running the sample AS application with the Aggregate (1) deployed with the -d flag. We are running the remote 'B' in its own cmd session on the same computer, with the broker in another cmd session.

When we run this setup we see the stdout timing messages for each CAS displayed in BOTH the main AS application's window and in remote 'B's window. This indicates that 'B' is running twice, once in the aggregate and once in the remote.

Our interpretation of the RemoteAnalysisEngine element of the deployment descriptor is that it should override the aggregate's in-line definition of 'B' with the definition in the deployment descriptor and send all CASes to remotes.

Are we understanding the architecture correctly?
Why are we seeing 'B' run twice on each CAS (embedded and remote)?

Thanks,

- Charles
#4
Hi Charles,

Yes, when a delegate is declared to be remote it is not deployed colocated within an aggregate, and the aggregate controller will send requests to the specified queue. From what you say, the remote service 'B' is getting process requests for each CAS.

Perhaps one of the other colocated delegates is printing the message to stdout? At any rate, please turn up the UIMA logging level to FINEST for the runRemoteAsyncAE process. This will trace the activities for all CASes in the aggregate. This log is more difficult to follow when multiple CASes are being processed by the aggregate at the same time; to simplify the trace, limit the casPool size for the aggregate to 1.

Eddie

On Thu, Sep 18, 2008 at 9:27 PM, Charles Proefrock <chas.pro-PkbjNfxxIARBDgjK7y7TUQ@public.gmane.org> wrote:
> Are we understanding the architecture correctly?
> Why are we seeing 'B' run twice on each CAS (embedded and remote)?
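Eddie's suggestion to raise the logging level to FINEST can be sketched in plain Java. This is a minimal sketch, assuming the UIMA logger in use is backed by java.util.logging and that UIMA classes log under the "org.apache.uima" namespace; both are assumptions to verify against your installation, and a logging properties file is an equally valid route. The class name FinestLogging is made up for illustration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

final class FinestLogging {
    // Assumption: UIMA classes log under this namespace.
    static final String UIMA_LOGGER_NAME = "org.apache.uima";

    // Strong reference so the configured logger is not garbage collected.
    private static Logger uimaLogger;

    static Logger enableFinest() {
        Logger logger = Logger.getLogger(UIMA_LOGGER_NAME);
        logger.setLevel(Level.FINEST);

        // Handlers filter as well; the default console handler stops at INFO.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINEST);
        logger.addHandler(handler);

        // Avoid duplicate lines from the root logger's own handlers.
        logger.setUseParentHandlers(false);

        uimaLogger = logger;
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = enableFinest();
        logger.finest("trace output is now visible");
    }
}
```

Both the logger and its handler need the lower level; setting only one of them is a common reason FINEST messages never show up.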
#5
We've previously run our CPE with a single CollectionReader and 2 processing units talking to two remote Vinci services. All of the stages are relatively quick except for the remotes, so running multiple threads in the same JVM on the same processor is sufficient for the CPE host, and in theory we would want to increase the processingUnitThreadCount to X over time.

As we transition to the AS architecture, it is clear that the main CPE sequence is declared as an aggregate, and that an instance of the aggregate is deployed (in the examples, via the -d flag of runRemoteAsyncAE). There are also numInstance elements that can be declared in the deploy descriptor, but they are qualified as only working for primitives and/or synchronous AEs ??

Simply ... if we want to have a single CollectionReader that services X pipelines, returning all CASes to the same Listener ... what is the deployment approach?

Thanks,

Charles
#6
The sample UIMA-AS deployment was operating just as our setup specified it should. In our initial haste to implement our own AS configuration, we simply removed the CollectionReader component and added the RoomNumberAnnotator.xml remote call into our Deploy_MeetingFinderAggregateNOCR.xml which, in addition to that, was also calling MeetingDetectorTAE.xml, which in turn locally called RoomNumberAnnotator (for the second time).

Once we were certain our understanding of 'remote' in the deploy descriptor was correct, it took little time to dig into what was happening.

Thanks,

Charles

> Date: Fri, 19 Sep 2008 09:17:36 -0400
> From: eaepstein-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> To: uima-user-d1GL8uUpDdXTxqt0kkDzDmD2FQJk+8+b@public.gmane.org
> Subject: Re: Basic UIMA-AS: RemoteAnalysisEngine override in deploy descriptor?
>
> Yes, when a delegate is declared to be remote it is not deployed colocated
> within an aggregate, and the aggregate controller will send requests to the
> specified queue.
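The double invocation Charles describes came from one annotator being reachable twice: once as the remote delegate and once nested inside another delegate. A rough sketch of how such a duplicate could be spotted, using a hypothetical in-memory tree rather than the UIMA descriptor API; the Node type and the tree shape are assumptions made for illustration, and only the component names come from the post.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateDelegates {
    /** Hypothetical stand-in for a descriptor: a name plus nested delegates. */
    static final class Node {
        final String name;
        final List<Node> delegates = new ArrayList<>();

        Node(String name, Node... children) {
            this.name = name;
            for (Node child : children) {
                delegates.add(child);
            }
        }
    }

    /** Counts how often each leaf (primitive) component is reachable from root. */
    static Map<String, Integer> countLeaves(Node root) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        collect(root, counts);
        return counts;
    }

    private static void collect(Node node, Map<String, Integer> counts) {
        if (node.delegates.isEmpty()) {
            counts.merge(node.name, 1, Integer::sum);
            return;
        }
        for (Node child : node.delegates) {
            collect(child, counts);
        }
    }

    /** Names of leaf components that appear more than once in the tree. */
    static List<String> duplicates(Node root) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : countLeaves(root).entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed shape of the misconfigured aggregate; other delegates omitted.
        Node aggregate = new Node("MeetingFinderAggregateNOCR",
                new Node("RoomNumberAnnotator"),           // declared as the remote delegate
                new Node("MeetingDetectorTAE",
                        new Node("RoomNumberAnnotator"))); // also called locally by the nested aggregate
        System.out.println(duplicates(aggregate));
    }
}
```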
#7
Charles,

Sounds like you are describing the "figure 4" scenario shown on http://incubator.apache.org/uima/doc-uimaas-what.html

As yet there is no distributed life cycle management facility for services in the UIMA AS package, so it is up to you to manage the launch of scaled out service instances.

The numInstance parameter for a service can be applied to any "single-threaded" UIMA analysis engine. A primitive UIMA component is considered single-threaded. A UIMA aggregate is by default deployed by UIMA AS single-threaded, unless async=true is explicitly requested, or async is implicitly requested by adding UIMA AS properties to one of the delegates (such as being remote, or replicated, or having error handling).

For each of the numInstances, the single-threaded AE will be instantiated in the same process and its initialize method called. Each instance will have a JMS listener connected to the same service request queue. Replicating AE instances in the same process requires user code to be thread-safe with respect to its shared static or other singleton objects.

Regards,
Eddie

On Tue, Sep 23, 2008 at 9:17 PM, Charles Proefrock <chas.pro-PkbjNfxxIARBDgjK7y7TUQ@public.gmane.org> wrote:
> Simply ... if we want to have a single CollectionReader that services X
> pipelines, returning all CASes to the same Listener ... what is the
> deployment approach?
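Eddie's closing point about thread safety can be illustrated without any UIMA code. The sketch below is a JDK-only simulation, not the UIMA AS API: a BlockingQueue stands in for the shared service request queue, each worker thread stands in for one replicated instance with its own listener, and CountingAnnotator is a hypothetical component whose static counter is shared by every instance in the JVM.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class ReplicatedInstances {
    /** Hypothetical annotator; its static state is shared across all instances. */
    static final class CountingAnnotator {
        // AtomicLong keeps the shared update safe. A plain "static long" bumped
        // with ++ could lose updates when several instances run concurrently.
        static final AtomicLong PROCESSED = new AtomicLong();

        void process(String document) {
            PROCESSED.incrementAndGet();
        }
    }

    private static final String STOP = "__STOP__"; // end-of-work marker, one per worker

    /** Runs numInstances workers against one queue; returns the shared count. */
    static long run(int numInstances, int numDocuments) {
        CountingAnnotator.PROCESSED.set(0);
        BlockingQueue<String> requestQueue = new LinkedBlockingQueue<>();
        for (int i = 0; i < numDocuments; i++) {
            requestQueue.add("doc-" + i);
        }
        for (int i = 0; i < numInstances; i++) {
            requestQueue.add(STOP);
        }

        ExecutorService pool = Executors.newFixedThreadPool(numInstances);
        for (int i = 0; i < numInstances; i++) {
            CountingAnnotator instance = new CountingAnnotator(); // one instance per listener
            pool.execute(() -> {
                try {
                    for (String doc = requestQueue.take(); !STOP.equals(doc); doc = requestQueue.take()) {
                        instance.process(doc);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return CountingAnnotator.PROCESSED.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000));
    }
}
```

The instance fields of each annotator are private to its thread here; only static and singleton state needs the extra care, which matches the distinction Eddie draws.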
#8
I've reviewed Fig. 4 and Fig. 3. Our system seems closer to Fig. 3 (a single Collection Reader (CR) with CasPool size X used to push documents to X services). Assuming the "Service Instance" is an aggregate (AG) with multiple AE steps A..D, we are extending Fig. 3 with another level of remote AE for one of the steps:

Machine0: Broker + RunRemoteAsyncAE + 2 AG Service Instances
Machine1: RemoteStepB_AE Instance
Machine2: RemoteStepB_AE Instance

The AG descriptor is configured with A..D in-line, and the AG deployment descriptor has a remote 'B' override, possibly with error handling controls, etc.

    CR --- || --2-- A
                    B --- || --2-- remote 'B'
                    C
                    D (consumer)

If I'm following your guidance, we should not use numInstances in the AG deployment descriptor because we have decided to remote 'B'. Instead we need to deploy the 2 AG Service Instances via our own launch mechanism (as either multiple -d flags on RunRemoteAsyncAE, or independently in their own processes).

Let me know if I'm on track.

- Charles

> Date: Wed, 24 Sep 2008 09:28:28 -0400
> From: eaepstein-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> To: uima-user-d1GL8uUpDdXTxqt0kkDzDmD2FQJk+8+b@public.gmane.org
> Subject: Re: CPE to AS Transition ... Porting processingUnitThreadCount
>
> The numInstance parameter for a service can be applied to any
> "single-threaded" UIMA analysis engine. A primitive UIMA component is
> considered single-threaded.
#9
In order to optimize deployment, it is good to focus on where the work is being done and then what overhead is added by the framework.

In your case all the work for components CR, A, C and D is expected to be done on Machine 0. Separating the CR from the aggregate adds unnecessary CAS serialization overhead for every document, so it would be better to move the CR into the aggregate. Components A, C or D can be replicated as needed (using numInstances as appropriate for each) in the one aggregate instance.

Machines 1..N are then used to scale out multiple instances of B.

RunRemoteAsyncAE could just send an "empty" CAS to kick off the CR in the aggregate, or the CAS could contain information about the collection to be processed by the CR.

Note that RunRemoteAsyncAE is a fairly simple application, and it is the UIMA AS async API that optionally deploys colocated services and/or optionally instantiates a CR. My point is that RunRemoteAsyncAE could be replaced with a custom application that (via unspecified mechanisms) deploys B on remote machines, then deploys the aggregate in the same JVM, runs it, and shuts everything down at the end.

Eddie

On Thu, Sep 25, 2008 at 6:54 AM, Charles Proefrock <chas.pro-PkbjNfxxIARBDgjK7y7TUQ@public.gmane.org> wrote:
> If I'm following your guidance, we should not use numInstances in the AG
> deployment descriptor because we have decided to remote 'B'. Instead we
> need to deploy the 2 AG Service Instances via our own launch mechanism
> (as either multiple -d flags on RunRemoteAsyncAE, or independently in
> their own processes). Let me know if I'm on track.
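The custom application Eddie outlines has a simple ordering: deploy B remotely, deploy the aggregate in the local JVM, run, then shut everything down. The following is only a sketch of that ordering in plain Java. The Deployment type is hypothetical and records events in a list instead of calling the UIMA AS API, and the "via unspecified mechanisms" part of remote deployment stays unspecified here too.

```java
import java.util.ArrayList;
import java.util.List;

final class PipelineLauncher {
    /** Hypothetical handle for something that was deployed and must be shut down. */
    static final class Deployment implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Deployment(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("deploy " + name);
        }

        @Override
        public void close() {
            log.add("undeploy " + name);
        }
    }

    /** Deploys remotes, then the aggregate, runs once, and tears down in reverse order. */
    static List<String> runOnce(int remoteInstancesOfB) {
        List<String> log = new ArrayList<>();
        List<Deployment> remotes = new ArrayList<>();
        try {
            for (int i = 1; i <= remoteInstancesOfB; i++) {
                remotes.add(new Deployment("B@machine" + i, log));
            }
            // The aggregate holds CR, A, C and D in one JVM, as suggested above.
            try (Deployment aggregate = new Deployment("aggregate(CR,A,C,D)", log)) {
                log.add("process collection via " + aggregate.name);
            }
        } finally {
            // Shut the remotes down last, newest first, even if processing failed.
            for (int i = remotes.size() - 1; i >= 0; i--) {
                remotes.get(i).close();
            }
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : runOnce(2)) {
            System.out.println(event);
        }
    }
}
```

Putting the teardown in a finally block means a failure during processing still releases the remote services, which matters when nothing else manages their life cycle.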
#10
We've made progress on our transition and now have an app and a configuration that we believe should meet our needs. We have an application derived from the RunRemoteAsyncAE with the CR and 1 deployed Aggregate. The deployed aggregate uses a mixture of numInstances and remoteAsyncEngines to achieve the objectives outlined in my original message. Not too much different than originally described, but a better understanding of the AS approach to scale-up, surpassing what we had achieved with the processingUnitThreadCount in the CPE application.

We are experiencing one problem at this point which I didn't expect: a null CAS returned on error conditions.

In our CPE implementation we manage a database queue by accessing the queue with a Collection Reader and then completing the queue update via the StatusCallbackListener::entityProcessComplete() call. Independent of whether the call ends with success or error, the CAS reference is valid, so we can access specific information about the CAS and properly update the database status.

Our first attempt at error detection in our AS implementation results in a null CAS passed into the entityProcessComplete() call. This occurs on a remote delegate timeout exception. I did not test other error scenarios to determine if the CAS is valid.

Please let me know how to configure the system to return a valid CAS reference on error to the StatusCallbackListener.

Thanks,

Charles

> Date: Thu, 25 Sep 2008 09:27:18 -0400
> From: eaepstein-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> To: uima-user-d1GL8uUpDdXTxqt0kkDzDmD2FQJk+8+b@public.gmane.org
> Subject: Re: CPE to AS Transition ... Porting processingUnitThreadCount
>
> My point is that RunRemoteAsyncAE could be replaced with a custom
> application that (via unspecified mechanisms) deploys B on remote machines,
> then deploys the aggregate in the same JVM, runs it, and shuts everything
> down at the end.
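The thread does not include an answer to this question. One defensive pattern, offered only as a sketch, is to avoid depending on the CAS at completion time: remember the database queue row when a document is sent, keyed by a request identifier, and look it up when the callback fires. Everything below is hypothetical and JDK-only: onSend and onComplete are not the UIMA AS listener signatures, and whether your UIMA AS version exposes a usable per-request identifier in the callback is an assumption to check.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InFlightTracker {
    enum Outcome { DONE, FAILED }

    // Request id -> database queue row, recorded at send time.
    private final Map<String, Long> queueRowByRequest = new ConcurrentHashMap<>();

    // Stand-in for the real database status table.
    private final Map<Long, Outcome> database = new ConcurrentHashMap<>();

    /** Call when a document is handed to the pipeline. */
    void onSend(String requestId, long queueRowId) {
        queueRowByRequest.put(requestId, queueRowId);
    }

    /** Hypothetical completion callback; cas may be null on error. */
    void onComplete(String requestId, Object cas, boolean failed) {
        Long rowId = queueRowByRequest.remove(requestId);
        if (rowId == null) {
            return; // unknown request, or already completed
        }
        // The row id never depends on the CAS, so a null CAS is still recorded.
        database.put(rowId, (failed || cas == null) ? Outcome.FAILED : Outcome.DONE);
    }

    /** Returns the recorded outcome, or null if the row is still in flight. */
    Outcome statusOf(long queueRowId) {
        return database.get(queueRowId);
    }

    public static void main(String[] args) {
        InFlightTracker tracker = new InFlightTracker();
        tracker.onSend("req-1", 42L);
        tracker.onComplete("req-1", null, true); // simulated timeout with a null CAS
        System.out.println(tracker.statusOf(42L));
    }
}
```

Treating a null CAS as a failure is a policy choice made for this sketch; a real application might want to distinguish timeouts from other errors.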