Scalability Issue Troubleshooting at EC2
Updated on Mon, 2009-02-02 18:55. Originally created by lbhajdu on 2009-02-02 18:55.
Running jobs at EC2 shows some scalability issues when more than 20-50 jobs are submitted at once. The pathology can only be seen once the jobs have completed their run cycle, that is to say, after the jobs copy back the files they have produced and the local batch system reports the job as having finished. The symptoms are as follows:
No stdout from the job, as defined by “output=” in the .condorg file, comes back. No stderr from the job, as defined by “error=” in the .condorg file, comes back.
It should be noted that the stdout/stderr can be recovered from the gatekeeper at EC2 by scp'ing it back. The stdout/stderr resides in:
/home/torqueuser/.globus/job/[gk name]/*/stdout
/home/torqueuser/.globus/job/[gk name]/*/stderr
The command would be:
scp -r root@[gk name]:/home/torqueuser/.globus/job /star/data08/users/lbhajdu/vmtest/io/
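For example, the recovered output can be pulled back in one pass and inspected per job. This is a minimal sketch; the gatekeeper hostname and destination directory are the ones used in these tests:
GK=ec2-75-101-199-159.compute-1.amazonaws.com        # EC2 gatekeeper
DEST=/star/data08/users/lbhajdu/vmtest/io
scp -r root@$GK:/home/torqueuser/.globus/job $DEST   # copy the whole per-job tree back
ls $DEST/job/$GK/                                    # one hash-named directory per job
less $DEST/job/$GK/*/stdout                          # inspect the recovered stdout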
Jobs are still reported as running under condor_q on the submitting end long after they have finished, even though the batch system on the other end reports them as finished.
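A quick way to see the mismatch from both sides, assuming ssh access to the EC2 gatekeeper as in the scp command above:
condor_q -globus                                                  # submitting end: jobs still listed as running
ssh root@ec2-75-101-199-159.compute-1.amazonaws.com "qstat -a"    # EC2 end: PBS/Torque already shows them completed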
Below is a standard sample Condor-G submit file from a job:
[stargrid01] /<1>data08/users/lbhajdu/vmtest/> cat
globusscheduler= ec2-75-101-199-159.compute-1.amazonaws.com/jobmanager-pbs
output =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.log
error =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.err
log =schedC3A7967022377B3E5F2DCCE2C60CB79D_998.condorg.log
transfer_executable= true
notification =never
universe =globus
stream_output =false
stream_error =false
queue
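For reference, such a file is driven with the usual Condor-G commands (the submit file name and cluster id below are placeholders, not the actual names used here):
condor_submit job.condorg     # submit the job through Condor-G
condor_q                      # watch it on the submitting side
condor_rm 1234                # remove it if it gets stuck (placeholder cluster id)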
The job parameters:
Work flow:
Copy in event generator configuration
Run raw event generator
Copy back raw event file (*.fzd)
Run reconstruction on raw events
Copy back reconstructed files (*.root)
Clean Up
Work flow processes: globus-url-copy -> pythia -> globus-url-copy -> root4star -> globus-url-copy (a sketch of this sequence follows below)
Note: Some low-runtime processes not shown
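A rough sketch of the per-job script this corresponds to is shown here. Everything in it is illustrative: the generator and reconstruction command lines, the file names, and the gsiftp endpoints are not given on this page and are stand-ins.
#!/bin/sh
# Illustrative sketch of the work flow above; binaries, options, hostnames and paths are placeholders.
IN=gsiftp://stargrid01.rcf.bnl.gov/star/data08/users/lbhajdu/vmtest
OUT=gsiftp://stargrid01.rcf.bnl.gov/star/data08/users/lbhajdu/vmtest/data
globus-url-copy $IN/generator.cfg file://$PWD/generator.cfg    # copy in event generator configuration
run_pythia generator.cfg                                       # run raw event generator (placeholder command)
globus-url-copy file://$PWD/rcf1504_01_1000evts.fzd $OUT/      # copy back raw event file (*.fzd; the _01_ index is a placeholder)
root4star -b -q bfc.C                                          # run reconstruction on raw events (placeholder options)
for f in *.root; do                                            # copy back reconstructed files (*.root)
    globus-url-copy file://$PWD/$f $OUT/$f
done
rm -f *.fzd *.root                                             # clean up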
Run time:
23 hours @ 1000 events
1 hour @ 10-100 events
Output:
15M rcf1504_*_1000evts.fzd
18M rcf1504_*_1000evts.geant.root
400K rcf1504_*_1000evts.hist.root
1.3M rcf1504_*_1000evts.minimc.root
3.7M rcf1504_*_1000evts.MuDst.root
60K rcf1504_*_1000evts.tags.root
14MB stdout log, later reduced to 5KB by piping the output to a file and copying it back via globus-url-copy (sketched below).
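The change amounted to something like the following inside the job script. Again, this is only a sketch with placeholder command and file names; the destination matches the "STD redirect copied back to" path listed below.
run_pythia generator.cfg  > job.log 2>&1     # generator output now goes to a local file, not the job's stdout
root4star -b -q bfc.C    >> job.log 2>&1     # reconstruction output appended to the same file
globus-url-copy file://$PWD/job.log \
    gsiftp://stargrid01.rcf.bnl.gov/star/data08/users/starreco/prodlog/P08ie/log/job.log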
Paths:
Jobs submitted from:
/star/data08/users/lbhajdu/vmtest/
Output copied back to:
/star/data08/users/lbhajdu/vmtest/data
STD redirect copied back to:
/star/data08/users/starreco/prodlog/P08ie/log
The tests:
We first tested 100 nodes, with 14MB of text going to stdout. Failed with the symptoms above.
The next test was with 10 nodes, with 14MB of text going to stdout. This worked without any problems.
The next test was 20 nodes, with 14MB of text going to stdout. This worked without any problems.
The next test was 40 nodes, with 14MB of text going to stdout. Failed with the symptoms above.
Next we redirected (“>”) the output of the event generator and the reconstruction to a file and copied this file back directly with globus-url-copy after the job finished. We tested again with 40 nodes. The stdout is now only 15K. This time it worked without any problems. (Was this just coincidence?)
Next we tried with 75 nodes and the redirected output trick. This failed with symptoms above.
Next we tried with 50 nodes. This failed with symptoms above.
We have consulted Alain Roy, who has advised an upgrade of Globus and Condor-G. He says the upgrade of Condor-G is the most likely to help. Tim has upgraded the image with the latest version of Globus, and I will be submitting from stargrid05, which has a newer Condor-G version. The software versions are listed here:
Stargrid01
Condor/Condor-G 6.8.8
Globus Toolkit, pre web-services, client 4.0.5
Globus Toolkit, web-services, client 4.0.5
Stargrid05
$CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846
Globus Toolkit, pre web-services, client 4.0.7
Globus Toolkit, pre web-services, server 4.0.7
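These numbers can be re-checked on either submit host with the standard version queries (output format varies slightly between releases):
condor_version           # prints the $CondorVersion: ... $ string
globus-version           # prints the installed Globus Toolkit release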
We have tested on a five-node cluster (1 head node, 4 workers) and discovered a problem with stargrid05. Jobs do not get transferred over to the submitting side. The RCF has been contacted; we know this is on our side. It was decided we should not submit until we can try from stargrid05.