Offline QA Shifts (Y2000 run)

Peter Jacobs, July 11, 2000

This document is a first try at describing procedures for the Offline QA shift crew. As you will see, there are a number of open questions concerning what should be done during this shift and how to do it, whose answers we will have only after we gain experience with real data. Please give feedback to the STAR QA links and contacts on what you find confusing, what could be done better, and what doesn't make any sense to you.

  1. Scope of the Offline QA shift activities
    The proposed scope of the Offline QA shift is to assess the quality of the DSTs being produced by the Offline Production team. There are several classes of data to be examined:
    • Large scale production of real data: data that will be used for physics analysis
    • Large scale production of MC data: MC data that will be used for detailed physics studies and corrections for data analysis.
    • Nightly tests of real and MC data: a limited number of events run in the DEV or NEW branches of the library. These are used to test the libraries and validate them prior to a new release and migration DEV->NEW->PRO.
    • Express queue of real data: a small fraction (~5%?) of real data will be channeled to an express production queue during the running of the experiment, to serve as feedback to the crews running the experiment. The results of this production should be reported as soon as possible, typically at the 5 p.m. meeting in the counting house.
    The autoQA system can apply arbitrary sets of "tests" to the scalars extracted from the data by the QA macros, raising errors or warnings when these tests are failed. Which tests and what cuts to apply to real data are complex issues that can only be addressed after we gain some experience. Consequently, the automated testing aspects of the autoQA framework will be applied only at a very low level for real data for this summer's run. The decision about the quality of the data will have to be made by the shift crew, i.e. you, by looking at the data in detail.
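
    To make the idea of these automated tests concrete, here is a minimal sketch, in Python, of a threshold-style check on QA scalars. The scalar names, cut values and severity levels are invented for illustration; they are not the actual autoQA test configuration.

      # Minimal sketch of a threshold-style QA test on extracted scalars.
      # The scalar names and cut values are hypothetical examples, not the
      # real autoQA configuration.

      def run_tests(scalars, cuts):
          """Return (name, level, message) tuples for failed tests."""
          failures = []
          for name, (low, high, level) in cuts.items():
              value = scalars.get(name)
              if value is None:
                  failures.append((name, "error", "scalar missing from the DST branch"))
              elif not (low <= value <= high):
                  failures.append((name, level, "value %g outside [%g, %g]" % (value, low, high)))
          return failures

      # Hypothetical scalars extracted by the QA macros for one run.
      scalars = {"n_primary_tracks_mean": 850.0, "vertex_z_mean_cm": 2.3}

      # Hypothetical cuts: (lower bound, upper bound, severity if failed).
      cuts = {
          "n_primary_tracks_mean": (100.0, 5000.0, "warning"),
          "vertex_z_mean_cm": (-30.0, 30.0, "error"),
          "n_vertex_mean": (0.5, 10.0, "warning"),   # missing above -> reported as error
      }

      for name, level, message in run_tests(scalars, cuts):
          print("%s: %s: %s" % (level.upper(), name, message))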

  2. Use of autoQA
    The principal tool for the Offline QA shift crew is the autoQA web page. Discussion of QA in general and detailed usage of that page can be found in STAR QA for Offline Software, which you should be familiar with before you read the rest of this document. Usage of autoQA version 2 is very similar to the old autoQA (version 1), so if you used that you should be able to understand the following.

    There have however been many changes behind the scenes. The major changes are

    • autoQA now interfaces to the MySQL databases. It queries the Production File Catalog for completed jobs and writes QA information back to a QA database. The latter can in future feed the tag DB or some other mechanism, once a reliable QA cycle is established. (A sketch of this catalog/QA-database round trip is given at the end of this section.)
    • autoQA can now handle the range of data classes specified in the introduction.
    • All QA ROOT jobs are now run on rcas under LSF. This change was necessary in anticipation of a large volume of QA processes once large scale data taking starts. This of course also introduces another layer of complexity into the QA framework, and monitoring of autoQA jobs on rcas will be part of the QA shift work.
    Some of the more complex displays of tables and documentation now start an auxiliary browser window. If you started this window once during an autoQA session and then minimized it to get it out of the way, you may wonder why the browser does not seem to respond to certain requests. It is in fact sending the data to the minimized window, which you should re-expand.
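
    As a rough picture of the catalog/QA-database round trip mentioned above, the Python sketch below polls a file catalog for completed jobs and records a QA entry for each. The host, database, table and column names (and the choice of the MySQLdb client module) are placeholders standing in for the actual STAR schema and tools.

      # Sketch of the database traffic autoQA generates: poll the Production
      # File Catalog for newly completed jobs, then record a QA summary in a
      # QA database.  All names below are placeholders, not the STAR schema.
      import MySQLdb

      catalog = MySQLdb.connect(host="db.example.bnl.gov", user="reader",
                                passwd="...", db="FileCatalog")
      qa_db = MySQLdb.connect(host="db.example.bnl.gov", user="qa_writer",
                              passwd="...", db="QA")

      cur = catalog.cursor()
      cur.execute("SELECT runID, dataset, path FROM ProductionFiles "
                  "WHERE jobStatus = 'done' AND qaStatus IS NULL")

      for run_id, dataset, path in cur.fetchall():
          # ... here the real framework would submit the QA macros for this
          # run as an LSF batch job on rcas ...
          qa_cur = qa_db.cursor()
          qa_cur.execute("INSERT INTO QAJobs (runID, dataset, status) "
                         "VALUES (%s, %s, %s)", (run_id, dataset, "submitted"))
          qa_db.commit()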

  3. Offline QA Shift Tasks
    1. Which runs to examine?
      Discuss the recent production with the Production Crew and establish a prioritized list of runs to QA. The express queue mechanism is still under discussion and is not set up yet, but once it is established it should receive highest priority for timely feedback to the counting house. Other criteria for setting priorities are whether urgent feedback is needed for a library release, or whether other runs require special attention. Otherwise, the shift crew should look at the most recent production that has been QA-ed under the various classes of data.

      Since the autoQA mechanism queries the File Catalog once an hour (for real data, less frequently for other data classes) and submits QA batch jobs on rcas, there may be a significant delay between when production is run and when the QA results become available. We will have to monitor this process and adjust the procedures as necessary. Feedback on this point from the shift crew is essential.

    2. How to look at a run
      I will specify how to look at a run in the data class "Real Data Production". Other data classes will have different selection procedures, reflecting the differences in the File Catalog structure for these different classes, but the changes should be obvious.
      1. Select "Real Data Production" from the pulldown menu in the banner.
      2. Use the pulldown menus to compose a DB query that includes the run you are interested in. The simplest procedure at the moment is to specify the runID and leave all other fields at "any". In the near future these selections will include trigger, calibration and geometry information. Note that the default for "QA status" is "done".
      3. Press "Display Datasets". A listing of all catalogued runs corresponding to your query will appear in the upper frame.
      4. To examine the QA histograms, press the "QA details" button. In the lower panel, a set of links to the histogram files will appear. The format is gzipped postscript. If your browser is set up to launch ghostview for files of type "ps", these files will be automatically unzipped and displayed. Otherwise, you will have to do something more complicated, such as saving the file and viewing it another way (a sketch of this is given after these steps). Note that if the macro "bfcread_hist_to_ps" is reported to have crashed, some or all histograms may be missing.
      5. To examine the QA scalars and tests, scroll past the histogram links in the lower panel and push the button. Tables of scalars for all the data branches will appear in the auxiliary window.
      6. To compare the QA scalars to similar runs, press the "Compare reports" button. Details on how to proceed are found in the autoQA documentation. Note that until more refined selections are available for real data (e.g. comparing runs with identical trigger conditions and processing chains), this facility will be of limited utility. Note also that the planned functionality of automatically comparing to a standard reference run has not yet been implemented, for similar reasons.
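
      If your browser is not set up to hand gzipped postscript to ghostview, the following Python sketch shows the "save the file and view it another way" route by hand. The file name is a placeholder for whatever you saved from the "QA details" link, and the viewer may be called "gv" rather than "ghostview" on some systems.

        # Decompress a saved QA histogram file (gzipped postscript) and hand
        # it to a postscript viewer.  The file name is a placeholder.
        import gzip
        import shutil
        import subprocess

        src = "qa_hist_run1234567.ps.gz"   # placeholder: file saved from "QA details"
        dst = src[:-3]                     # strip the .gz suffix

        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)

        subprocess.run(["ghostview", dst])  # "gv" on some systems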

    3. What QA data to examine
      This area needs significant discussion. What we are generally looking for is that all data are present and can be read (scalar values should appear in all branches) and that the results look physically meaningful (e.g. vertex distribution histograms). Comparison to previous, similar runs to check for stability is highly desirable, but it is not clear how to carry this out at present, for reasons described above. We should revisit this question as we gain more experience.
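
      As a minimal illustration of the "all data are present and can be read" part of this check, the sketch below flags any DST branch whose QA scalar table came back empty. The branch names and entry counts are invented.

        # Sketch of the "is everything there?" check: every branch read by
        # the QA macros should have produced some scalar entries.
        # Branch names and counts are invented examples.

        branch_scalars = {
            "dst/event_summary": 42,   # number of scalar entries reported
            "dst/globtrk": 117,
            "dst/vertex": 0,           # empty branch -> something to report
        }

        for branch, n_entries in sorted(branch_scalars.items()):
            if n_entries == 0:
                print("REPORT: branch %s has no QA scalars (missing or unreadable)" % branch)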

      The principal QA tools are the histograms generated by bfcread_hist_to_ps. The number of QA histograms has grown enormously over the past six months and needs to be pruned back to be useful to the non-expert. This work is going on now (week of July 10) and more information will be forthcoming.

      A description of all the macros run by autoQA can be found here. This documentation is important for understanding the meaning of the QA scalars.

      Here are some general guidelines on what to report:

      • Status of the run - completed or, if not, the error status (segmentation violation etc.)
      • Macros that crashed
      • Macros whose QA status is not "O.K." (At present, this means simply that there is no data in the branch that macro is trying to read. No additional tests are applied to the data.)
      • Anomalous histograms and scalars - this is necessarily vague at this point.
      More specific rules for what should be in the report will be forthcoming. Input on this question is welcome.
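
      Purely as an illustration of one way to keep reports uniform, the sketch below assembles the items listed above into a plain-text message for starqa-hn. The run number, macro name and findings are invented.

        # Illustrative only: collect the report items listed above into a
        # uniform plain-text message.  All values here are invented.

        report = {
            "run": "1234567 (placeholder run ID)",
            "status": "completed",                    # or the error status, e.g. segmentation violation
            "crashed_macros": [],                     # e.g. ["bfcread_hist_to_ps"]
            "not_ok_macros": ["bfcread_dst_vertex"],  # invented example; QA status not "O.K."
            "anomalies": "vertex z distribution much wider than in recent runs",
        }

        crashed = ", ".join(report["crashed_macros"]) or "none"
        not_ok = ", ".join(report["not_ok_macros"]) or "none"

        print("QA shift report, run %s" % report["run"])
        print("Run status: %s" % report["status"])
        print("Crashed macros: %s" % crashed)
        print("Macros not O.K.: %s" % not_ok)
        print("Anomalous histograms/scalars: %s" % report["anomalies"])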

    4. How to report results
      Once per shift you should send a status report to the QA hypernews forum:
      starqa-hn@www.star.bnl.gov
      If you are doing Offline QA shifts, you should subscribe to this forum.

      The autoQA framework has a "comment" facility that allows the user to annotate particular runs or to enter a "global comment" that will appear chronologically in the listing of all runs. These are displayed together with the datasets, and while not appropriate for lengthy reports, they can serve as flags for specific problems and supply hyperlinks to longer reports. Note that this is not a high-security system (anyone can alter or delete your messages).

      You do not need the QA Expert's password to use this facility. Press the button "Add or edit comments" in the upper right part of the upper panel. You will be asked for an identifying string that will be attached to your comments. Enter your name and press return. You will have to press "Display Datasets" again, at which point a button "Add global comment" will appear below the pulldown menus, and each run listing will have an "Add comment" button. Follow the instructions. Messages are interpreted as html, so links to other pages can be introduced. One possibility is to enter the hyperlink to the QA report you have sent to starqa-hn. This can obviously be automated, but it isn't yet, and doing it by hand should be straightforward.
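
      Since the comment box interprets html, the hyperlink to your QA report can be entered as an ordinary anchor tag. The Python sketch below just builds such a string to paste into the comment field; the URL and run number are placeholders for your actual starqa-hn posting.

        # Build an html comment pointing at the longer shift report.  The URL
        # and run number are placeholders.
        report_url = "http://..."  # placeholder: URL of your starqa-hn posting
        comment = '<a href="%s">QA shift report, run 1234567</a>' % report_url
        print(comment)  # paste the printed string into the "Add comment" box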

    5. Checking QA jobs on rcas
      Every two hours you should check the status of autoQA jobs running on rcas by clicking on "RCAS/LSF monitor" (upper right, under the "Add or Edit Comments" button). You cannot alter jobs using this browser unless you have the Expert's password, so there is no possibility of doing damage. Select jobs called QA_TEST. Each of these is a set of QA macros for a single run and should require up to 10 minutes of CPU time. The throughput of this system for QA is as yet unknown, but you should check that jobs are not sitting in the PENDING queue for more than an hour or two, and are not stalling while running (they should not take more than about 15 minutes of CPU). In case of problems, contact an expert.
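
      If you have an rcas login and prefer the command line to the web monitor, a sketch like the one below (assuming the standard LSF command-line tools are in your path) lists the QA_TEST jobs and their status, so you can spot jobs stuck in PENDING or running far too long.

        # Sketch of a command-line check of autoQA jobs under LSF, assuming
        # bjobs is available.  Jobs PENDING for more than an hour or two, or
        # running well beyond ~15 minutes of CPU, should be reported.
        import subprocess

        # -u all: all users, -J QA_TEST: select by job name, -w: wide output
        out = subprocess.run(["bjobs", "-u", "all", "-J", "QA_TEST", "-w"],
                             capture_output=True, text=True).stdout

        for line in out.splitlines()[1:]:          # skip the header line
            fields = line.split()
            job_id, user, status = fields[0], fields[1], fields[2]
            print("%10s  %-10s  %s" % (job_id, user, status))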

    Peter Jacobs
    Last modified: Tue Jul 11 02:35:05 EDT 2000