Ongoing activities
Internal projects and sub-systems task lists
Tasks and projects
Computing operation: IO performance measurements
Goals:
- Provide a documented reference of IO performance tests made across several configurations, in both disk formatting and RAID level space, under non-constrained hardware considerations.
- The baseline would help in making future configuration choices when it comes to hardware provisioning of servers (services) such as database servers, grid gatekeepers, network IO doors, etc.
Steps and tasks:
- Survey community work on the topic of IO performance of drives, especially topics concerning:
- Effect of disk format on performance
- Effect of parallelism on performance
- Effect of software RAID (Linux) on performance and responsiveness (load impact on node under stress)
- Software RAID level and performance impacts
- Kernel parameter tweaks impacting IO performance (good examples are efforts of DAQ group, review consequence)
- Prepare a baseline IO test suite for measuring IO performance (read and write) under the two modes listed below. A possible test suite could follow what was used in the IO performance page. Other tools are welcome, based upon survey recommendations.
- single stream IO
- multi stream IO (parallel IO)
- Use a test node and measure IO performance under the diverse reviewed configurations. A few constraints on the choice of hardware are needed to avoid biasing the performance results:
- The node should have sufficient memory to accommodate the tests (2 GB of memory or more is assumed to be sufficient for any test)
- OS must support software RAID
- Disks used for the test should be isolated from system drive to avoid performance degradation
- Node should have more than two drives (including system disk) and ideally, at least 4 (3+1)
- Present results as a function of disk formatting, RAID level and/or number of drives added, in both absolute values (values for each configuration) and differentials (gain when moving from one configuration to another).
Status: See results on Disk IO testing, comparative study 2008.
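As an illustration of the steps above, the following is a minimal sketch of a write test covering the single stream and multi stream modes, with results reported as absolute values and as differentials. It is not the test suite used for the study; the function names, block size, stream counts and file sizes are all assumptions chosen for illustration. A real measurement would target the isolated test drives and use files larger than the node memory so that caching does not dominate the result.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024 * 1024  # 1 MiB per write call (arbitrary, illustrative choice)


def write_stream(path, n_blocks):
    """Write n_blocks blocks sequentially to path and force them to disk."""
    buf = os.urandom(BLOCK)
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # include the flush to disk in the timing
    return n_blocks * BLOCK


def measure_write(directory, n_streams, n_blocks):
    """Return aggregate write throughput in MB/s for n_streams parallel writers."""
    paths = [os.path.join(directory, f"stream_{i}.dat") for i in range(n_streams)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        total = sum(pool.map(write_stream, paths, [n_blocks] * n_streams))
    elapsed = time.perf_counter() - start
    for p in paths:
        os.remove(p)  # clean up the test files
    return total / elapsed / 1e6


def differentials(results):
    """Map each configuration to its gain (MB/s) relative to the previous one."""
    names = list(results)
    return {f"{a} -> {b}": results[b] - results[a] for a, b in zip(names, names[1:])}


if __name__ == "__main__":
    # A temporary directory stands in for the mount point of the drive under test.
    with tempfile.TemporaryDirectory() as d:
        results = {
            "1 stream": measure_write(d, 1, 16),
            "4 streams": measure_write(d, 4, 16),
        }
    for name, mbps in results.items():
        print(f"{name}: {mbps:.1f} MB/s")
    for step, gain in differentials(results).items():
        print(f"{step}: {gain:+.1f} MB/s")
```

The same `differentials` helper could be applied to results keyed by disk format or RAID level instead of stream count; a read test would follow the same pattern but needs the page cache to be bypassed or exceeded to be meaningful.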
Open projects and activities listing
A summary of ongoing and incoming projects was sent to the software coordinators for feedback. The document refers to projects listed in this section under Projects and proposals.
The list below does NOT include general tasks such as those described as part of the S&C core team roles, as defined in the job descriptions documents. Examples would be global tracking with Silicon including HFT, geometry maintenance and updates, or calibration and production tasks as typically carried out over the past few years. Neither does this list include improvements we need in areas such as online computing (many infrastructure issues, including networking, an area of responsibility which has been unclear at best), nor activities such as the development and enhancement of the Drupal project (requirements and plans sent here).
The list includes:
- Closer look at Calorimetry issues, if any (2007 operation workshop feedback follow-up related to calibration being too "TPC centric" and not addressing Physics qualities). Proposed a workshop with goals to:
- gather requirements from the PWG (statements from the operation workshop in 2007 seemed to have taken the EMC coordinators by surprise as to what resolution was needed to achieve Physics goals)
- discuss with experts technical details and implementation, unrolling / deployment and timing
Status: Underway, see report from a review as PSN0465: EMC Calibrations Workshop report, fall 2008
- Db related: load balancing improvements, monitoring and performance measurements, resource discovery, distributed database
Status: underway.
- Trigger simulations (some fleshed out in May 2007 as mentioned in this S&C meeting and attached below). The general idea was to provide a framework allowing offline trigger emulation / simulation for studying rejection/selection effects, either by applying trigger algorithms on real data (minimum bias), via true simulation, or by re-applying trigger algorithms to a triggered sample (with a higher threshold, for example).
Status: nowhere close to where it should be
References: trigger simulation discussions meeting notes and Email communications.
- Embedding framework reshape.
Status: underway (need full eval with SVT and SSD integrated)
- Unified online/offline framework including integration of online reader offline and offline tools online (leveraging knowledge, minimizing work). This task would address comments and concerns that whenever code is developed online (for PPlot purposes for example), it also needs to be developed offline within separate and very different reader approaches. At a higher level, dramatic memory overwrite offline occurred in early 2007 due to the lack of synchronization between structure sizes (information did NOT propagate and was not adjusted offline by the software sub-system coordinator of interest; an entire production had to be re-run).
Status: tasked and underway, first version delivered in 2008, usage of "cons" and regression testing in principle in place (TBC in 2009 run)
- EventDisplay revisited
Status: underway (are we done? need new review follow-up after the pre-review meeting made in 2007)
- VMC - realistic geometry / geometry description
Status: Project on hold due to reconstruction issues, resumed July 2008.
- Forward tracking (radial field issue). May have importance for FGT project upon schedule understanding.
Status: depends on previous item and would be tasked once forward tracking needs are better defined.
- Old framework cleanup, table cleanup, drop old formats and historical baggage. In principle a framework task, this is bound to introduce instabilities during which assembling a production library would be challenging. This needs to be tasked outside major development projects.
Status: depends only on production of Year 7/8 start-up
- Multi-core CPU era - A task force assembled in 2007 reached the unfortunate conclusion that the work would be too hard, hence not necessary. However, market developments and the aggressive push by vendors toward ever more densely packed CPUs and cores indicate that the future must integrate this new paradigm. First attempts should target the "obvious".
Status: First status and proposal made at ACAT08 (changing chains to accommodate for possible parallelism). Investigated possibility of parallelism at library level and core algorithm (tracking). Talks at ACAT08 very informative.
- Automated QA (project draft available, Kolmogorov etc... discussed and summarized here)
Status: no project drafted yet, only live discussions and Email communications.
- Automated calibration. The main project objective is to move toward a more automated calibration framework whereby migration from one chain to another (distortion correction) would be triggered by a criterion (resolution, convergence) rather than by a manual change. This work may leverage the FastOffline framework (which was a first attempt to make automated calibration a reality; it is currently modified by hand and the trigger mechanism is not implemented).
Status: Project description available. Summer 08 service task.
- IO schema evolution (reduction of file size by dropping redundant variables but with full transparency to users)
Status: Project started as planned on July 16th with goals drafted on page Projects and proposals. Project deliverables were achieved (tested from a custom ROOT version now in the ROOT main CVS). Future release will include a fully functional schema evolution as specified in our document. Integration will be needed.
Project team: Jerome Lauret (coordination), Valeri Fine (STAR testing), Philippe Canal (ROOT team)
- Distributed storage improvement (efficient dynamic disk population). This project would aim to restore the dynamic disk population of datasets on distributed disk, as well as a prioritization mechanism (and possibly bandwidth throttling) so that users cannot over-subscribe storage, which in the past caused massive delete/restore cycles and a drop in efficiency.
Status: undergraduate thesis done; a model to improve IO in/out of HPSS is defined and needs implementation.
- Efficient multi-site data transfer (coordination of data movement). This project aims to address multi-Tier2 data transfer support and help organize / best utilize the bandwidth out of BNL. A second part of this project aims at data placement on the Grid, whereby a "task" working on a dataset is scheduled making use of existing staged files at sites, or of possible pre-staging or migration of files from any site to any site (a bit ambitious).
Status: Project started as a computer science PhD program (thesis submitted). Work is scheduled over a 3-year period and deliverables would need to be put in perspective with Grid project deliverables.
- Distributed production and monitoring system, job monitoring, centralized production requests interface
Status: work tasked within the production team.
- FileCatalog improvement. The FileCatalog in STAR was developed from in-house knowledge and support (starting from service work). The catalog now holds 15 million records (scalability beyond this is a concern) and its access is possibly inefficient. An initial design diverging from Meta-Data catalog, File Catalog, Replica Catalog has allowed for a quick start and the development of additional infrastructure, but has also led to the replication of the Meta-Data information, making it hard to maintain consistency of the Catalogs across sites. Federating the Catalogs and using all sites' information simultaneously has ranged from marginal to impossible, making a global namespace (replicas) impossible. The lack of this component will directly affect grid realities.
Status: Ongoing.
Wish list (for now):
- Online tracking & High Level trigger. This may depend on a trigger simulation framework (it would have benefited from it for sure) or may be an opportunity to revive the issue and shape a new, focused (and reduced in scope) project.
Status: How to fit this additional activity is under debate. First discussion held at BNL on 2008/07/10 and followed later by additional meetings. This activity moved to the "upgrade" activity.