Simulation and embedding request interface, a review

General introduction

Following the requirements gathering, design, implementation and delivery of a new simulation requests interface (fully functional at the beginning of 2010), it is now time for a review of its functionalities and operational procedures, and to identify possible problems. Findings were presented to the principal persons involved in the process (Renee Fatemi and Jason Webb as end-coordinators for requests), rationale and explanations were provided, and iterative discussions followed to attempt to identify the underlying reasons and how to solve the problems. STAR's PAC, Bedanga Mohanty, was CC-ed.

Quick jump list

  1. Definitions - states
  2. Monitoring & latency identification
  3. Observations
    1. Requirements bootstrapping & tuning
    2. Operational Issues
  4. Action items - summary

 

Definitions - states

 

The explanations of the states, as intended, follow (the term "simulation" below is to be taken in a broad sense):

  1. Opened - Coordinator is considering this request for production. This phase would include discussion with PWG on missing basic info such as library, helper, production, number of events, ... This state is considered as "draft". The request should NOT move forward if the description remains unclear.
     
  2. Assigned to Deputy - The request's basic parameters are set. A deputy has been assigned and WOULD BE able to run a kumac, a macro, etc ... if provided. There may be minor adjustments. In the case of the chain options, the indications and information provided by the form should allow the deputy to understand which detectors are to be run in the simulation and which parameters are needed. In this phase however, the chain may need to be adjusted - validation would be done by the deputy. The workflow should not proceed forward until "a" chain is validated.
     
  3. Opened by deputy, producing QA sample - At least ONE job worked. A sample is being produced for QA. We should NOT reach this state if the request is unclear and/or incomplete.
     
  4. QA sample Available - This state should be used to indicate to the PWG that the QA sample is done (follow-up comments may indicate where the files are located) and that the request is waiting for feedback. In case information is missing in the simulation, the state should be reverted to either 2 (the chain needs rework) or 3 (the missing option is understood and the chain is tried and re-validated on at least one job).
     
  5. Base QA done, ... - This state should be used once the feedback has been received and, where needed, discussed. If the QA does not pass, the request should NOT move to that stage. If there is confusion on the PWG side (like the 7.7 GeV example above) the request should be moved BACK to state 1.
     
  6. Sample done, ... - The deputy indicates the sample is done and matters are no longer in his/her hands. At this stage, the PWG MUST re-validate and confirm that the sample is done and correct (explicit feedback for example on the # of events, samples found, all detectors present).
     
  7. PWG QA done - this state seems redundant as the next step would be to close (the proposal would be to remove it)
     
  8. Closed
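
The state list above can be read as a small state machine. The following is a minimal sketch and not the interface's actual implementation: the names State, BACKWARD_MOVES, allowed_next and can_move are hypothetical. It assumes a request normally advances one state at a time and that the only backward moves are those named in the descriptions (4 back to 2 or 3, and back to 1 on PWG confusion, modeled here from state 5 since the text does not say from which state that move starts).

```python
from enum import IntEnum


class State(IntEnum):
    """Request states, numbered as in the list above."""
    OPENED = 1
    ASSIGNED_TO_DEPUTY = 2
    PRODUCING_QA_SAMPLE = 3
    QA_SAMPLE_AVAILABLE = 4
    BASE_QA_DONE = 5
    SAMPLE_DONE = 6
    PWG_QA_DONE = 7
    CLOSED = 8


# Backward moves mentioned in the state descriptions.
# Assumption: the move back to state 1 is attached to state 5 here.
BACKWARD_MOVES = {
    State.QA_SAMPLE_AVAILABLE: {State.ASSIGNED_TO_DEPUTY, State.PRODUCING_QA_SAMPLE},
    State.BASE_QA_DONE: {State.OPENED},
}


def allowed_next(state):
    """Return the set of states reachable from `state` in one step."""
    targets = set(BACKWARD_MOVES.get(state, set()))
    if state < State.CLOSED:
        # Assumption: forward progress is strictly one state at a time.
        targets.add(State(state + 1))
    return targets


def can_move(current, target):
    """True if moving from `current` to `target` is a single allowed step."""
    return target in allowed_next(current)
```

Under these assumptions, a direct jump from state 3 to state 8 would be rejected, while reverting a request from state 4 to state 2 would be accepted.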

 

 

Monitoring & identification of latency in the simulation workflow

First off, please note that since the interface was created, a side project was conceived and launched to monitor where we spend our time in the simulation request handling. This seemed to be useful information, as in past years the STAR collaboration tended to speculate a lot on who is doing or not doing what and when (with diverging opinions).

Leaving the past polemic aside, I view the result as a measure and indicator of where we need to improve to streamline and speed up the process. The general functionality and goals of the monitoring interface (not yet public) are described in Simulation request visual accounting.

NB: we are not too far into the new interface handling so the result should be taken with a grain of salt (some "states" not marked on time may bias the result).

Under the understanding that we have the following states for advancing the simulation requests (embedding or the multiple kinds of simulation):

And the result:

In other words:

  • 33% spent in discussing and tuning the request (would not be lower than 18% for recent requests hence still high)
  • 12% in moving it along by the simulation team (embedding or simulation deputy) and creating the first sample QA
  • 12% in doing, presenting and digesting the base QA
  • 30% in processing
  • 14% additional in waiting for PWG feedback upon completion

About 50% of a request's lifetime is then spent on the PWG side while the other 50% is on the S&C related team side (including processing). There are a few immediate concluding remarks:

  • The first phase especially (getting the request understood and tuned) seems very long and should be understood and streamlined
  • The time spent to close a request (getting explicit feedback from the PWGC final QA) is long as well but not dramatic and does not affect operations (the samples are by then ready and done). Explicit actions will be listed in the next section to help encourage feedback and close the requests faster.
  • We could consider getting the base QA done faster (perhaps with more automation)

More comments and improvements are listed below for consideration.

 

Observations

The observations pertain to the simulation request interface available here. Depending on your privileges to access this interface, features may or may not appear.

Requirements coverage, possible functional enhancements

In the document 2010/01 new requirements and features for the embed/simu module, we had a second wave of requirements leading to practical implementations (there was an initial functional requirements gathering in Ongoing activities, not documented here).

From the list of requirements, we infer that most features are implemented. For the most recent features:

  • Human readable requestID - the proposal was to make it of the form YYYYWWnn. This new format fits both embedding and simulation coordinators.
  • Filter, sorting and selectors - the current selectors and filters are functionally sufficient but improvements are needed. Especially:
    • "killed" and "closed" requests are shown by default
      • "Killed" and "Closed" requests should be hidden by default.
      • If selected (all requests option), they should be listed at the very end of a priority-sorted listing. Currently, being assigned priority 0, they appear at the head of the list. A simple solution for this case would be to assign them (internally) the largest priority + 1 and then sort.
    • Similarly, sorting by EC priority would show priority 0 first, which also includes the "new" requests
      • A selector should allow including/excluding the "new" requests.
      • A more comprehensive selector system ("and"/"or" and operations such as ">", "<", "=", "!=") would be best but is conceptually difficult (especially mixing "and" and "or" operations) and hence not immediate to implement. It would, though, cover both issues above.
    • There are occasions where several filters lead to no selection at all (and there is no way to clear the filters).
      This needs a revisit and perhaps a "clear all filters" button. An example is illustrated below
       
    • Clicking on sorting by PWG ordering would show the priorities sorted as "Low" -> "High" -> "Normal" (a seemingly random order, as it is not alphanumerical) - need to verify and fix (perhaps a misfeature and an interaction with other sorted fields).
    • Filters by PWG are not yet implemented but needed (should be for convenience)
  • For some simulations, adding a comment or a text beyond changing the state from one level to another is needed. Currently, this is not possible (so we may miss information)
    • We may implement an RT-like approach whereby an impersonated message (from something like "Standard-Emb@www.star.bnl.gov") would be sent and could be answered. Upon answering, the reply (stripped of the embedded message from the second level of indentation) would be appended to the request as a comment.
  • Clone feature - this is seldom used.
    • We may need to create a tutorial and help - this feature is heavily used by the Spin PWG (making their subsequent requests that much easier and that much faster)
  • New feature request: auto-scaling of priorities required
    • So that the range of priorities in use rescales, the EC would like to request that when all priority 1 requests are closed, all other priorities are auto-scaled, that is, priorities 2 become 1, 3 become 2 etc ...
      • Implementation: In other words, when all requests with priority P=v are closed, any priority P>v is rescaled to P-1.
    • Additional explicit Email notice between stages was suggested as a way to clarify the process
      • For example, when the QA is requested from the PWG, an automated Email would state "We will commence the simulation once the PWG has reviewed the QA and the PWG conveners sign off" making it clear we would not proceed without this information
      • An added comment would send the Email to the PWG and could be of the form "Please indicate which detectors need to be part of this simulation" followed by the disclaimer "This request will not proceed without the requested information".  This latest feature improvement would imply that a comment would be assigned a category such as "information feedback needed" with each category triggering a standard message.
  • Several fields may become mandatory (next sections will analyze the operational problems encountered). Some fields may become mandatory at particular stages in the process and not earlier.
  • Some fields would need to be renamed: for example, "EC priority" is not relevant for an interface covering both simulation and embedding
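
Two of the suggestions in the list above, sorting killed/closed requests last and auto-scaling priorities, are simple enough to sketch. This is only an illustration under assumptions: the Request class, its field names and the helper functions are hypothetical and do not reflect the interface's actual data model. It also assumes priority 0 means "not yet prioritized" and is left untouched by the rescaling.

```python
from dataclasses import dataclass

# States treated as no longer active (labels taken from the text above).
INACTIVE = {"Killed", "Closed"}


@dataclass
class Request:
    request_id: str   # hypothetical field names
    priority: int     # 0 = not prioritized (assumption)
    state: str


def sort_by_priority(requests):
    """Sort by priority, listing killed/closed requests at the very end.

    Inactive requests are given (internally, for sorting only) the
    largest priority + 1, as suggested in the text.
    """
    largest = max((r.priority for r in requests), default=0)

    def key(r):
        return largest + 1 if r.state in INACTIVE else r.priority

    return sorted(requests, key=key)


def rescale_after_close(requests, v):
    """If no active request still holds priority v, shift every P > v to P - 1.

    Returns True when a rescale happened, False otherwise.
    """
    active = [r for r in requests if r.state not in INACTIVE]
    if any(r.priority == v for r in active):
        return False
    for r in active:
        if r.priority > v:
            r.priority -= 1
    return True
```

Note that the sort key does not change the stored priority, so the "largest priority + 1" trick stays internal to the listing. New requests at priority 0 would still sort first; hiding them would need the separate selector discussed above.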

 

Operational issues

Operational issues relate to the way we are actually using and filling the form, and to the problems that follow. This section is particularly relevant to the PWG EH and PWGC and should lead to improvements on both sides of STAR (PWG and S&C) for increasing efficiency. Here are several observations followed by comments and action items.

  • We seem to have a plethora of "helpers" and channels of communication, and the streamlining through the EH may not be all in place.
    • Observations:
      • In the embedding list alone, we count a dozen interaction points while only 4 EH are identified
      • EH names - we have discrepancies between the various sources of information
        • The current list of EH in the management interface is composed of users: pibero, xwq1985 and jhchen
        • Document Organization lists seven helpers (Pibero Djawhoto, Wenqin Xu, Barbara Trzeciak, Jinhui Chen, Geraldo Vasconcelos, Josh Kronzer, Yuri Gorbunov)
        • The list provided by the EC is: Wenqin Xu (to be replaced by Mustafa Mustafa in November), Jinhui, Pibero and Yury.
      • Simulation requests are using POC for communication and not EH
      • Many requests are submitted WITHOUT an EH (it seems in fact to be typical of some PWG - one example from a PWG usually filling the requests with accuracy can be found here). In some cases, there are two EH per PWG and an assignment MUST be made, as this is the basic criterion to achieve a single POC per request (as agreed by the PWG).
    • Requests & suggestions:
      • The standard simulation requests would be encouraged to go through an EH (extending the definition to incorporate a broader "Simulation Helper" or SH). The streamlining of the communication channel (a large amount of time is spent in communicating with the PWGC) was identified in 2009/10 as one of the most important action items, as communication, accountability and responsibilities would otherwise be chaotic and unclear. Assuming an "SH" is acceptable ...
      • The list of SH would be maintained in the simulation interface by the EC and SC
      • The list of SH would be dropped from the Organization (later improvement may be to add a script extracting and displaying the information automatically)
      • If a request is submitted without an SH, the request would NOT proceed - provided the comment feature mentioned above is implemented, a clear message would be sent to the PWG.
         
  • Fields from the request form are NOT filled consistently. This prevents moving forward with full automation and generation of jobs in the back-end.
    • Observation & Example:
      • An example could be found here. In this request we have several problems identified.
      • The PWG are NOT using the form to enter some base parameters such as library or geometry. Instead, text is being "dumped" with the information available somewhere in the text field (often completed by the EC/SC).
      • Chain parameters are not filled in this request while this request was in an advanced stage (5). This request should NOT have moved forward without the information saved.
      • On this request, neither a deputy nor a helper was identified when reviewed. Why? This should not be happening
      • This request is special as it reached a point where one of the requested samples is believed to be questionable (the use of the vertex selection may not match the experimental cuts hence, direct comparison may not be valid). But the request was not moved back to stage 1 (opened discussion with PWG) nor closed.
    • Requests & suggestions:
      • We MUST assign deputy and helpers or the request shall not proceed beyond stage 1. Fields such as library, magnetic field, number of events should be present. Helper should be present. Failure to have this information would indicate the request cannot (and should not) proceed forward beyond stage 2.
      • We suggest to study where and when those fields become mandatory and to enforce field requirements accordingly, to avoid imposing a rule which may hinder progress: for example
        • a submission cannot proceed and states cannot be changed without a helper assigned
        • one cannot move to "assign to deputy" without actually having a deputy assigned to the request by the form fields
        • one cannot move to "produce a sample for QA" without a chain option
        • etc ...
      • It is possible that some fields are confusing to the requesters - The addition of a (mouse-over pop-up)  tooltip box is proposed - its purpose would be to pop-up a quick explanation (Jason/Renee also suggested to add instructions on how to "get" this information from the catalog for example).
        Action item: PWGC should be queried on which fields are currently confusing, if any, before this implementation takes place (a time-consuming implementation requires more homework)
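
The idea of fields becoming mandatory at a given stage, as suggested above, could be sketched as a per-state gate. This is a hedged illustration only: the field names (helper, deputy, library, magnetic_field, number_of_events, chain_options) and the mapping of fields to states are assumptions drawn from the examples in this section, not the form's actual schema.

```python
# Fields that must be filled before a request may ENTER a given state.
# Hypothetical field names; state numbers follow the definitions section.
REQUIRED_TO_ENTER = {
    1: ["helper"],                      # submission needs a helper
    2: ["deputy"],                      # "assigned to deputy" needs a deputy
    3: ["library", "magnetic_field",    # producing a QA sample needs the
        "number_of_events",             # base parameters and a chain option
        "chain_options"],
}


def missing_fields(request, target_state):
    """Return the required fields still empty for every gate up to target_state."""
    missing = []
    for state in sorted(REQUIRED_TO_ENTER):
        if state > target_state:
            break
        for name in REQUIRED_TO_ENTER[state]:
            if not request.get(name):
                missing.append(name)
    return missing


def can_enter(request, target_state):
    """True if the request has everything needed to enter target_state."""
    return not missing_fields(request, target_state)
```

Returning the list of missing fields, rather than a bare yes/no, would let the interface tell the submitter exactly what blocks the request, which fits the "this request will not proceed without the requested information" message proposed earlier.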
         
  • Requests without a PWG assigned ...
    • Observations:
      • Even better, several requests do not even have a PWG assigned. Some are tests and some are not.
      • Examples include the Pbar 9 Gev Au+Au, Proton 9 GeV Au+Au, K in Au+Au 9 GeV and Pion Au+Au 9 GeV
      • None of those requests are hence advertised to the PWG (all in this case correspond to an "activity group" and may slip out of the control of the PWG)
    • Procedure change:
      • Similar to the previous suggestion, the PWG field will be made mandatory. Without it, the request will simply fail submission.
      • Current mitigation - EC/SC please, address those requests as applies - do not proceed without the information filled.
      • Future? A suggestion is to make the PWG selection a mandatory checkbox, as a request may belong to multiple PWG. However, there is already provision for a virtual PWG "common" (embedding of common interest) for holding base studies.
         
  • States are not always marked accurately.
    • Observations:
      • States and their meaning are made explicit in the previous section.
      • While the initial states/stages are marked, there seems to be a fast roll-through from 3->8, skipping quite a bit of information on the progress and workflow.
    • Request:
      • EC and SC are requested to provide renewed attention to the stages and their meaning as well as to the procedural requirements that prevent a request from moving forward (missing information especially). Our monitoring (and hence identification of show stoppers and problems in the workflow) will be all the more accurate as we pay attention to such details.
    • Other actions:
      • State 7 seems redundant (since when we get the final feedback, there is no need to say "we got feedback" and then close the request) and would be removed.
      • There was a discussion on why moving from "sample available" to a request closure is not happening (the final 14% of a total request length). The PWG do not seem to care to provide feedback when the sample is provided and often do not close the loop on the request. We discussed a few incentives to help move this forward:
        • For embedding, the current scheme is that samples are removed from disk two weeks after the feedback request (that is, if there is no feedback, the files are removed, so QA should happen within that time period). If feedback is provided, the samples are marked for deletion four weeks after the request is closed.
        • For simulation, we would NOT Catalog the files until the QA feedback is provided. Space could be reshaped at any time similarly to embedding (after two weeks, the sample may be considered for deletion; the HPSS copy would then, and only then, be indexed so users may retrieve the sample in a user area).
        • Without the final QA and by procedure requirements, the simulation is not vetted (hence publication should be held). PAC should agree on this to preserve the coordinator's authority.
        • If feedback is not provided, the coordinators are free to hold further simulation requests from that PWG (and reshuffle priorities accordingly).
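
The embedding retention scheme described above (two weeks without feedback, four weeks after closure otherwise) reduces to a small date computation. The sketch below is an illustration only: the function and constant names are hypothetical, and it assumes that a closure date is supplied only when feedback was actually provided.

```python
from datetime import date, timedelta
from typing import Optional

FEEDBACK_WINDOW = timedelta(weeks=2)    # no feedback: files may be removed after this
POST_CLOSE_WINDOW = timedelta(weeks=4)  # feedback given: deletion this long after closure


def embedding_deletion_date(feedback_requested: date,
                            closed: Optional[date] = None) -> date:
    """Earliest date at which an embedding sample may be removed from disk.

    `closed` is the request closure date, passed only if the PWG
    provided feedback (assumption of this sketch).
    """
    if closed is None:
        return feedback_requested + FEEDBACK_WINDOW
    return closed + POST_CLOSE_WINDOW
```

For example, a feedback request sent on 2010-10-01 with no answer would give 2010-10-15, whereas the same request closed with feedback on 2010-10-10 would give 2010-11-07.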
           
  • The "clone" feature is seldom used
    • Observation:
      • Apart from the Spin PWG, which gets good mileage out of this feature, it is not used - its intent was to split a base request into multiple requests with minimal parameter change between requests.
      • Right now, PWG make a monster request by dumping everything into the text field - this should not be happening (making one huge request may bias the priority assignment and disadvantage PWG working with the system).
    • Suggestions:
      • Since large "bundle" requests usually do not specify the parameters of the simulation from the form (again, preventing further automation), it could be considered as invalid in the interim
      • Instructions and tutorials may be needed for clone requests
      • Request relation (requests with relation to sub-requests) may be needed (with a clearer display interface)

 

 

Action items - summary

The actions and suggestions are outlined in the text wherever applies. Action items beyond S&C include:

  • The PWGC should be polled once again on which fields are unclear
  • Submitters (via PWGC) should be polled on why they submitted requests in a very incomplete manner - fundamental problems should be identified and passed on for implementation considerations. Otherwise, they should be reminded of the need for a complete and clear request submission.
  • I would like acceptance of the broadening of the EH notion to an SH notion, hence streamlining the simulation process all the way