This page collects documents, documentation links, and help for Grid beginners and experts. Those documents are either created by us or gathered from the internet.
238,Purdue-Physics,grid.physics.purdue.edu:2119,compute,OSG,PASS,2006-08-21 19:16:25
237,Rice,osg-gate.rice.edu:2119,compute,OSG,FAIL,2006-08-21 19:17:07
13,SDSS_TAM,tam01.fnal.gov:2119,compute,OSG,PASS,2006-08-21 19:17:10
38,SPRACE,spgrid.if.usp.br:2119,compute,OSG,PASS,2006-08-21 19:17:51
262,STAR-Bham,rhilxs.ph.bham.ac.uk:2119,compute,OSG,PASS,2006-08-21 19:23:12
217,STAR-BNL,stargrid02.rcf.bnl.gov:2119,compute,OSG,PASS,2006-08-21 19:24:11
16,STAR-SAO_PAULO,stars.if.usp.br:2119,compute,OSG,PASS,2006-08-21 19:26:55
44,STAR-WSU,rhic23.physics.wayne.edu:2119,compute,OSG,PASS,2006-08-21 19:29:10
34,TACC,osg-login.lonestar.tacc.utexas.edu:2119,compute,OSG,FAIL,2006-08-21 19:30:23
19,TTU-ANTAEUS,antaeus.hpcc.ttu.edu:2119,compute,OSG,PASS,2006-08-21 19:30:54
#VORS text interface (grid = All, VO = all, res = 217)
shortname=STAR-BNL
gatekeeper=stargrid02.rcf.bnl.gov
gk_port=2119
globus_loc=/opt/OSG-0.4.0/globus
host_cert_exp=Feb 24 17:32:06 2007 GMT
gk_config_loc=/opt/OSG-0.4.0/globus/etc/globus-gatekeeper.conf
gsiftp_port=2811
grid_services=
schedulers=jobmanager is of type fork
  jobmanager-condor is of type condor
  jobmanager-fork is of type fork
  jobmanager-mis is of type mis
condor_bin_loc=/home/condor/bin
mis_bin_loc=/opt/OSG-0.4.0/MIS-CI/bin
mds_port=2135
vdt_version=1.3.9c
vdt_loc=/opt/OSG-0.4.0
app_loc=/star/data08/OSG/APP
data_loc=/star/data08/OSG/DATA
tmp_loc=/star/data08/OSG/DATA
wntmp_loc=: /tmp
app_space=6098.816 GB
data_space=6098.816 GB
tmp_space=6098.816 GB
extra_variables=MountPoints SAMPLE_LOCATION default /SAMPLE-path SAMPLE_SCRATCH devel /SAMPLE-path
exec_jm=stargrid02.rcf.bnl.gov/jobmanager-condor
util_jm=stargrid02.rcf.bnl.gov/jobmanager
sponsor_vo=star
policy=http://www.star.bnl.gov/STAR/comp/Grid
QuickStart.pdf is for Globus version 1.1.3 / 1.1.4.
For GRAM error codes, follow this link.
The purpose of this document is to outline common errors encountered after the installation and setup of the Globus Toolkit.
The gatekeeper is on a non-standard port
Make sure the gatekeeper is being launched by inetd or xinetd. Review the Install Guide if you do not know how to do this. Check to make sure that ordinary TCP/IP connections are possible; can you ssh to the host, or ping it? If you cannot, then you probably can't submit jobs either. Check for typos in the hostname.
Try telnetting to port 2119. If you see an "Unable to load shared library" message, the gatekeeper was not built statically and does not have an appropriate LD_LIBRARY_PATH set. If that is the case, either rebuild it statically, or set the environment variable for the gatekeeper. In inetd, use /usr/bin/env to wrap the launch of the gatekeeper, or in xinetd, use the "env=" option.
Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log if it exists. It may tell you that the private key is insecure, so it refuses to start. In that case, fix the permissions of the key to be read only by the owner.
If the gatekeeper is on a non-standard port, be sure to use a contact string of host:port.
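A minimal sketch of these client-side checks (the hostname is the STAR-BNL gatekeeper from the VORS listing above; the alternate port 3119 is a made-up example, not a real STAR setting):
# Basic reachability checks for a gatekeeper
ping -c 3 stargrid02.rcf.bnl.gov
ssh stargrid02.rcf.bnl.gov /bin/true
telnet stargrid02.rcf.bnl.gov 2119    # should show a gatekeeper banner, not "Unable to load shared library"
# If the gatekeeper listens on a non-standard port, include it in the contact string:
globus-job-run stargrid02.rcf.bnl.gov:3119/jobmanager-fork /bin/date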
LD_LIBRARY_PATH is not set.
If you receive this as a client, make sure to read in either $GLOBUS_LOCATION/etc/globus-user-env.sh (if you are using a Bourne-like shell) or $GLOBUS_LOCATION/etc/globus-user-env.csh (if you are using a C-like shell).
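For example (a minimal sketch; it assumes GLOBUS_LOCATION is already set in your environment):
# Bourne-like shells (sh, bash, ksh):
. $GLOBUS_LOCATION/etc/globus-user-env.sh
# C-like shells (csh, tcsh):
source $GLOBUS_LOCATION/etc/globus-user-env.csh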
You are running globus-personal-gatekeeper as root, or did not run grid-proxy-init.
Don't run globus-personal-gatekeeper as root. globus-personal-gatekeeper is designed to allow an ordinary user to establish a gatekeeper using a proxy from their personal certificate. If you are root, you should setup a gatekeeper using inetd or xinetd, and using your host certificates. If you are not root, make sure to run grid-proxy-init before starting the personal gatekeeper.
Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote server. You will probably see something like:
Authenticated globus user: /O=Grid/O=Globus/OU=your.domain/OU=Your Name
Failure: globus_gss_assist_gridmap() failed authorization. rc =1
This indicates that your account is not in the grid-mapfile. Create the grid-mapfile in /etc/grid-security (or wherever the -gridmap flag in $GLOBUS_LOCATION/etc/globus-gatekeeper.conf points to) with an entry pairing your subject name to your user name. Review the Install Guide if you do not know how to do this. If you see "rc = 7", you may have bad permissions on the /etc/grid-security/. It needs to be readable so that users can see the certificates/ subdirectory.
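As a sketch, a grid-mapfile entry is one mapping per line, pairing a quoted certificate subject with a local account (the subject and account below are placeholders), and the permissions can be checked as shown:
# Example /etc/grid-security/grid-mapfile entry (placeholder subject and account)
"/O=Grid/O=Globus/OU=your.domain/OU=Your Name" youraccount
# /etc/grid-security must be readable so users can reach the certificates/ subdirectory
ls -ld /etc/grid-security /etc/grid-security/certificates
chmod a+rx /etc/grid-security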
This indicates that the remote host has a date set greater than five minutes in the future relative to the local host.
Try typing "date -u" on both systems at the same time to verify this. (The "-u" specifies that the time should be displayed in universal time, also known as UTC or GMT.)
Ultimately, synchronize the hosts using NTP. Otherwise, unless you are willing to set the client host date back, you will have to wait until your system believes that the remote certificate is valid. Also, be sure to check your shell environment to see if you have any time zone variables set.
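A quick check and one-time fix might look like this (a sketch, assuming ntpdate is available and run as root; the NTP server name is just an example):
# Compare clocks -- run on both hosts at roughly the same moment
date -u
# One-time synchronization against an NTP server of your choice (as root)
ntpdate pool.ntp.org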
This indicates that the remote host has an expired certificate.
To double-check, you can use grid-cert-info or grid-proxy-info. Use grid-cert-info on /etc/grid-security/hostcert.pem if you are dealing with a system level gatekeeper. Use grid-proxy-info if you are dealing with a personal gatekeeper.
If the host certificate has expired, use grid-cert-renew to get a renewal. If your proxy has expired, create a new one with grid-proxy-init.
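For example (a sketch; exact option names may differ slightly between Globus versions):
# Expiration date of a system-level host certificate
grid-cert-info -file /etc/grid-security/hostcert.pem -enddate
# Remaining lifetime (in seconds) of your current proxy
grid-proxy-info -timeleft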
Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote server. You will probably see something like:
Authenticated globus user: /O=Grid/O=Globus/OU=your.domain/OU=Your Name
Failure: globus_gss_assist_gridmap() failed authorization. rc =1
This indicates that your account is not in the grid-mapfile. Create the grid-mapfile in /etc/grid-security (or wherever the -gridmap flag in $GLOBUS_LOCATION/etc/globus-gatekeeper.conf points to) with an entry pairing your subject name to your user name. Review the Install Guide if you do not know how to do this.
New installations will often see errors like the above where the expected target subject name has just the unqualified hostname but the returned target subject name has the fully qualified domain name (e.g. expected is "hostname" but returned is "hostname.domain.edu").
This is usually because the client looks up the target host's IP address in /etc/hosts and only gets the simple hostname back.
The solution is to edit the /etc/hosts file so that it returns the fully qualified domain name. To do this, find the line in /etc/hosts that has the target host listed and make sure it looks like:
xx.xx.xx.xx hostname.domain.edu hostname
Where "xx.xx.xx.xx" should be the numeric IP address of the host and hostname.domain.edu should be replaced with the actual hostname in question. The trick is to make sure the full name (hostname.domain.edu) is listed before the nickname (hostname).
If this only happens with your own host, see the explanation of the failed to open stdout error, specifically about how to set the GLOBUS_HOSTNAME for your host.
You do not have a valid proxy.
Run grid-proxy-init
Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote host. It probably says "remote certificate not yet valid". This indicates that the client host has a date set greater than five minutes in the future relative to the remote host.
Try typing "date -u" on both systems at the same time to verify this. (The "-u" specifies that the time should be displayed in universal time, also known as UTC or GMT.)
Ultimately, synchronize the hosts using NTP. Otherwise, unless you are willing to set the client host date back, you will have to wait until the remote server believes that your proxy is valid. Also, be sure to check your shell environment to see if you have any time zone variables set.
Or GRAM Job submission failed because the job manager failed to open stderr (error code 74)
It is also possible that the CA that issued your Globus certificate is not trusted by your local host. Running 'grid-proxy-init -verify' should detect this situation.
Install the trusted CA for your certificate on the local system.
You submitted a job which specifies an RSL substitution which the remote jobmanager does not recognize. The most common case is using a 2.0 version of globus-job-get-output with a 1.1.x gatekeeper/jobmanager.
Currently, globus-job-get-output will not work between a 2.0 client and a 1.1.x gatekeeper. Work is in progress to ensure interoperability by the final release. In the meantime, you should be able to modify the globus-job-get-output script to use $(GLOBUS_INSTALL_PATH) instead of $(GLOBUS_LOCATION).
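A possible sketch of that edit (assuming the script contains the RSL substitution literally; keep a backup of the original):
# Swap the RSL substitution name in the client script (backup first)
cd $GLOBUS_LOCATION/bin
cp globus-job-get-output globus-job-get-output.orig
sed 's/\$(GLOBUS_LOCATION)/$(GLOBUS_INSTALL_PATH)/g' globus-job-get-output.orig > globus-job-get-output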
The 530 Login incorrect usually indicates that your account is not in the grid-mapfile, or that your shell is not in /etc/shells.
If your account is not in the grid-mapfile, make sure to get it added. If it is in the grid-mapfile, check the syslog on the machine, and you may see the /etc/shells message. If that is the case, make sure that your shell (as listed in finger or chsh) is in the list of approved shells in /etc/shells.
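A quick way to check both (a sketch; substitute the actual account name and shell):
# Which shell is the account mapped to?
getent passwd youraccount | cut -d: -f7
# Is that shell in the approved list?
grep -x /bin/tcsh /etc/shells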
This error message usually indicates that the server you are connecting to doesn't trust the Certificate Authority (CA) that issued your Globus certificate.
Or globus_gsi_callback.c:424: globus_i_gsi_callback_cred_verify: Can't get the local trusted CA certificate: Cannot find issuer certificate for local credential (error code 7)
This error message indicates that your local system doesn't trust the certificate authority (CA) that issued the certificate on the resource you are connecting to.
You need to ask the resource administrator which CA issued their certificate and install the CA certificate in the local trusted certificates directory.
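A minimal sketch of that installation (the hash-named files come from the CA's distribution; <hash> is a placeholder):
# Install the CA certificate and its signing policy into the trusted directory
cp <hash>.0 <hash>.signing_policy /etc/grid-security/certificates/
# Re-check that your credentials now validate against the installed CAs
grid-proxy-init -verify -debug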
This error message indicates that the name in the certificate for the remote party is not legal according to the local signing_policy file for that CA.
Globus replica catalog was installed along with MDS/Information Services.
Do not install the replica bundle into a GLOBUS_LOCATION containing other Information Services. The Replica Catalog is also deprecated - use RLS instead.
The FNAL_FERMIGRID site policy and some documentation can be found here:
http://fermigrid.fnal.gov/policy.html
All users with STAR VOMS proxies are mapped to a single user account ("star").
Technical note: (Quoting from an email that Steve Timm sent to Levente) "Fermigrid1.fnal.gov is not a simple jobmanager-condor. It is emulating the jobmanager-condor protocol and then forwarding the jobs on to whichever clusters have got free slots, 4 condor clusters and actually one pbs cluster behind it too." For instance, I noticed jobs submitted to this gatekeeper winding up at the USCMS-FNAL-WC1-CE site in MonAlisa. (What are the other sites?)
You can use SUMS to submit jobs to this site (though this feature is still in beta testing) following this example:
star-submit-beta -p dynopol/FNAL_FERMIGRID jobDescription.xml
where jobDescription.xml is the filename of your job's xml file.
Hostname: fermigrid1.fnal.gov
condor queue is available (fermigrid1.fnal.gov/jobmanager-condor)
If no jobmanager is specified, the job runs on the gatekeeper itself (jobmanager-fork, I’d assume)
[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov /bin/cat /etc/redhat-release
Scientific Linux Fermi LTS release 4.2 (Wilson)
[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov/jobmanager-condor /bin/cat /etc/redhat-release
Scientific Linux SL release 4.2 (Beryllium)
[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov/jobmanager-condor /usr/bin/gcc -v
Using built-in specs.
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-java-awt=gtk --host=i386-redhat-linux
Thread model: posix
gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)
There doesn't seem to be a GNU fortran compiler such as g77 on the worker nodes.
Here is an example to illustrate the difference between grid proxies and voms proxies (note that the WARNING and Error lines at the top don’t seem to preclude the use of the voms proxy – the fact is that I don’t know why those appear or what practical implications there are from the underlying cause – I hope to update this info as I learn more):
[stargrid02] ~/> voms-proxy-info -all
[stargrid02] ~/> grid-proxy-info -all
In order to obtain the proxy, the VOMS server for the requested VO must be contacted (a potential drawback, since it introduces a dependency on a working VOMS server that doesn't exist with a simple grid cert). It is worth further noting that either a VOMS or GUMS server (I should investigate this) will also be contacted by VOMS-aware gatekeepers to authenticate users at job submission time, behind the scenes. One goal (or at least one consequence) of this sort of usage is to eliminate static grid-mapfiles.
Something else to note (and investigate): the voms-proxy doesn't necessarily last as long as the basic grid cert proxy – the voms part can apparently expire independently of the grid-proxy. Consider this example, in which the two expiration times are different:
[stargrid02] ~/> voms-proxy-info -all
(Question: What determines the duration of the voms-proxy extension - the VOMS server or the user/client?)
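One way to probe this is to request both lifetimes explicitly and compare what comes back (a sketch; the -valid and -vomslife options exist in the voms-proxy-init versions I have seen, but check your installation, and note the VOMS server may silently cap the attribute lifetime):
# Request a 48 hour proxy and a 48 hour VOMS attribute lifetime
voms-proxy-init -voms star -valid 48:00 -vomslife 48:00
# Compare the proxy "timeleft" with the attribute (AC) validity
voms-proxy-info -all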
Technical note 1: on stargrid02, the “vomses” file, which lists the URLs for VOMS servers, was not in a default location used by voms-proxy-init, and thus it was not actually working (basically, it behaved just like grid-proxy-init). I have put an existing vomses file in /opt/OSG-0.4.1/voms/etc and it seems content to use it.
Technical note 2: neither stargrid03’s VDT installation nor the WNC stack on the rcas nodes has VOMS tools. I’m guessing that the VDT stack is too old on stargrid03 and that voms-proxy tools are missing on the worker nodes because that functionality isn't really needed on a worker node.
LSF job manager code below is from globus 2.4.3.
The steps:
login to stargrid01
Check that your ssh public key is in $HOME/.ssh/id_rsa.pub; if not, put it there.
Select the base image you wish to modify. You will find the name of the image you are currently using for your cluster by looking inside:
/star/u/lbhajdu/ec2/workspace-cloud-client-010/samples/[cluster description].xml
Open up this file and you will find a structure that looks something like the one below. There are two <workspace> blocks, one for the gatekeeper and one for the worker nodes. The name of the image for the worker node is in the second block, in between the <image> tags. So for the example below the name would be osgworker-012.
<workspace>
<name>head-node</name>
<image>osgheadnode-012</image>
<quantity>1</quantity>
.
.
.
</workspace>
<workspace>
<name>compute-nodes</name>
<image>osgworker-012</image>
<quantity>3</quantity>
<nic interface="eth1">private</nic>
.
.
.
</workspace>
To make a modification to the image we have to mount/deploy that image. Once we know the name, simply type:
./bin/cloud-client.sh --run --name [image name] --hours 50
Where [image name] is the name we found in step 3. This image will be up for 50 hours. You will have to save the image before you run out of time, else all of your changes will be lost.
The output of this command will look something like:
[stargrid01] ~/ec2/workspace-cloud-client-010/> ./bin/cloud-client.sh --run --name osgworker-012 --hours 50
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
SSH public keyfile contained tilde:
- '~/.ssh/id_rsa.pub' --> '/star/u/lbhajdu/.ssh/id_rsa.pub'
Launching workspace.
Workspace Factory Service:
https://tp-vm1.ci.uchicago.edu:8445/wsrf/services/WorkspaceFactoryService
Creating workspace "vm-003"... done.
IP address: 128.135.125.29
Hostname: tp-x009.ci.uchicago.edu
Start time: Tue Jan 13 13:59:04 EST 2009
Shutdown time: Thu Jan 15 15:59:04 EST 2009
Termination time: Thu Jan 15 16:09:04 EST 2009
Waiting for updates.
"vm-003" reached target state: Running
Running: 'vm-003'
It will take some time for the command to finish, usually a few minutes. Make sure you do not lose the output of this command. Inside the output there are two pieces of information you must note: the hostname and the handle. In this example the hostname is tp-x009.ci.uchicago.edu and the handle is vm-003.
Next log on to the host using the hostname from step 4. Note that your ssh public key will have been copied to /root/.ssh/id_rsa.pub. To log on type:
ssh root@[hostname]
Example:
ssh root@tp-x009.ci.uchicago.edu
Next, make the change(s) you wish to make to the image (this step is up to you).
To save the changes you will need the handle from step 2. And you will need to pick a name for the new image. Run this command:
./bin/cloud-client.sh --save --handle [handle name] --newname [new image name]
Where [handle name] is replaced with the name of the handle and [new image name] is replaced with the new image’s name. If you do not use the name option you will overwrite your image. Here is an example with the values from above.
./bin/cloud-client.sh --save --handle vm-003 --newname starworker-sl08f
The output will look something like this:
[stargrid01] ~/ec2/workspace-cloud-client-010/> ./bin/cloud-client.sh --save --handle vm-004 --newname starworker-sl08e
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
Saving workspace.
- Workspace handle (EPR): '/star/u/lbhajdu/ec2/workspace-cloud-client-010/history/vm-004/vw-epr.xml'
- New name: 'starworker-sl08e'
Waiting for updates.
"Workspace #919": TransportReady, calling destroy for you.
"Workspace #919" was terminated.
This is an optional step: because the images can be several GB in size, you may want to delete the old image with this command:
./bin/cloud-client.sh --delete --name [old image name]
This is what it would look like:
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
Deleting: gsiftp://tp-vm1.ci.uchicago.edu:2811//cloud/56441986/starworker-sl08f
Deleted.
To start up a cluster with the new image you will need to modify one of the
/star/u/lbhajdu/ec2/workspace-cloud-client-010/samples/[cluster description].xml
files. Inside the <workspace> block of the worker node, replace <image> with the name of your own image from step 7. You can also set the number of worker node images you wish to bring up by setting the number in the <quantity> tag.
Note: Be careful, and remember there are usually at least two <workspace> blocks in each xml file.
Next just bring up the cluster like any other VM cluster. (See my Drupal documentation)
Fig 1. General scheme of allocation of resources and connections between machines
Fig 2. Specific implementation of the propagation of STAR DB snapshots, external DB monitoring is abandoned
Task | In Charge | To Do | Blocks | ERT
---|---|---|---|---
Increase /mnt to 100GB | Iwona | Check if possible and reconfigure Eucalyptus | Done | 
Establish # and target to scp data from | Iwona | Test scp from /global/scratch to VM against possible carver nodes supporting /global/scratch. | Move Eucalyptus to the set of public IPs registered in DNS. | 2011/01/14
Integration of transfer of DAQ file using FastOffline workflow | Jerome | Current scheme restores 100 files max every 6 hours. Need transfer from BNL->Cloud + delete files etc ... | None | 2011/02/08
These links describe how to do bulk file transfers from RCF to PDSF.
I suggest creating your own subdirectory ~/hrm_g1 similar to ~hjort/hrm_g1. Then copy from my directory to yours the following files:
setup
hrm
pdsfgrid1.rc
hrm_rrs.rc
Catalog.xml (coordinate permissions w/me)
Substitute your username for “hjort” in these files and then start the HRM by doing “source hrm” (a sketch of this substitution appears after the process listing below). Note that you need to run in redhat8, and since your .chos file is ignored on grid nodes you need to chos to redhat8 manually. If successful you should see the following 5 tasks running:
pdsfgrid1 149% ps -u hjort
PID TTY TIME CMD
8395 pts/1 00:00:00 nameserv
8399 pts/1 00:00:00 trm.linux
8411 pts/1 00:00:00 drmServer.linux
8461 pts/1 00:00:00 rrs.linux
8591 pts/1 00:00:00 java
pdsfgrid1 150%
Note that the “hrm” script doesn’t always work depending on the state things are in but it should always work if the 5 tasks shown above are all killed first.
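As a concrete sketch of the username substitution described above (assuming GNU sed and the ~/hrm_g1 file list; chos starts a new shell, so run the remaining commands from inside it):
# Replace "hjort" with your own username in the copied configuration files
cd ~/hrm_g1
for f in setup hrm pdsfgrid1.rc hrm_rrs.rc Catalog.xml; do
    sed "s/hjort/$USER/g" $f > $f.tmp && mv $f.tmp $f
done
# chos to redhat8 manually (your .chos is ignored on grid nodes), then:
source hrm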
I suggest creating your own subdirectory ~/hrm_grid similar to ~hjort/hrm_grid. Then copy from my directory to yours the following files:
srm.sh
hrm
bnl.rc
drmServer.linux (create the link)
trm.linux (create the link)
Substitute your username for “hjort” in these files and then start the HRM by doing “source hrm”. If successful you should see the following 3 tasks running:
[stargrid03] ~/hrm_grid/> ps -u hjort
PID TTY TIME CMD
13608 pts/1 00:00:00 nameserv
13611 pts/1 00:00:00 trm.linux
13622 pts/1 00:00:01 drmServer.linux
[stargrid03] ~/hrm_grid/>
Scalability Issue Troubleshooting at EC2
Running jobs at EC2 shows some scalability issues when more than 20-50 jobs are submitted at once. The pathology can only be seen once the jobs have completed their run cycle, that is to say, after the jobs copy back the files they have produced and the local batch system reports the job as having finished. The symptoms are as follows:
No stdout from the job as defined in the .condorg file as “output=” comes back. No stderr from the job as defined in the .condorg file as “error=” comes back.
It should be noted that the std output/error can be recovered from the gatekeeper at EC2 by scp'ing it back. The std output/error resides in:
/home/torqueuser/.globus/job/[gk name]/*/stdout
/home/torqueuser/.globus/job/[gk name]/*/stderr
The command would be:
scp -r root@[gk name]:/home/torqueuser/.globus/job /star/data08/users/lbhajdu/vmtest/io/
Jobs are still reported as running under condor_q on the submitting end long after they have finished, and the batch system on the other end reports them as finished.
Below is a standard sample Condor-G submit file from a job:
[stargrid01] /<1>data08/users/lbhajdu/vmtest/> cat [submit file]
globusscheduler = ec2-75-101-199-159.compute-1.amazonaws.com/jobmanager-pbs
output =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.log
error =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.err
log =schedC3A7967022377B3E5F2DCCE2C60CB79D_998.condorg.log
transfer_executable= true
notification =never
universe =globus
stream_output =false
stream_error =false
queue
The job parameters:
Work flow:
Copy in event generator configuration
Run raw event generator
Copy back raw event file (*.fzd)
Run reconstruction on raw events
Copy back reconstructed files(*.root)
Clean Up
Work flow processes : globus-url-copy -> pythia -> globus-url-copy -> root4star -> globus-url-copy (a simplified sketch follows after the note below)
Note: Some low runtime processes not shown
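A heavily simplified sketch of that workflow as it might appear in the job script; every file name, URL, and helper command below is illustrative only (the real generator and reconstruction invocations are not shown here):
# 1. Copy in the event generator configuration
globus-url-copy gsiftp://stargrid02.rcf.bnl.gov/star/data08/users/lbhajdu/vmtest/gen.config file://$PWD/gen.config
# 2. Run the raw event generator (the pythia step), piping its large stdout to a local log file
./run_generator gen.config > generator.log 2>&1
# 3. Copy back the raw event file
globus-url-copy file://$PWD/rcf1504_998_1000evts.fzd gsiftp://stargrid02.rcf.bnl.gov/star/data08/users/lbhajdu/vmtest/data/
# 4. Run reconstruction (the root4star step) on the raw events, again logging to a file
./run_reconstruction rcf1504_998_1000evts.fzd > reco.log 2>&1
# 5. Copy back the reconstructed *.root files and the logs
globus-url-copy file://$PWD/rcf1504_998_1000evts.MuDst.root gsiftp://stargrid02.rcf.bnl.gov/star/data08/users/lbhajdu/vmtest/data/
globus-url-copy file://$PWD/generator.log gsiftp://stargrid02.rcf.bnl.gov/star/data08/users/starreco/prodlog/P08ie/log/
# 6. Clean up the working directory
rm -f *.fzd *.root *.log *.config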
Run time:
23 hours @ 1000 events
1 hour @ 10-100 events
Output:
15M rcf1504_*_1000evts.fzd
18M rcf1504_*_1000evts.geant.root
400K rcf1504_*_1000evts.hist.root
1.3M rcf1504_*_1000evts.minimc.root
3.7M rcf1504_*_1000evts.MuDst.root
60K rcf1504_*_1000evts.tags.root
14MB stdout log, later changed to 5KB by piping output to a file and copying it back via globus-url-copy.
Paths:
Jobs submitted from:
/star/data08/users/lbhajdu/vmtest/
Output copied back to:
/star/data08/users/lbhajdu/vmtest/data
STD redirect copied back to:
/star/data08/users/starreco/prodlog/P08ie/log
The tests:
We first tested 100 nodes, with 14MB of text going to stdout. Failed with symptoms above.
Next test was with 10 nodes, with 14MB of text going to stdout. This worked without any problems.
Next test was 20 nodes, with 14MB of text going to stdout. This worked without any problems.
Next test was 40 nodes, with 14MB of text going to stdout. Failed with symptoms above.
Next we redirected (">") the output of the event generator and the reconstruction to a file and copied this file back directly with globus-url-copy after the job finished. We tested again with 40 nodes. The stdout is now only 15K. This time it worked without any problems. (Was this just a coincidence?)
Next we tried with 75 nodes and the redirected output trick. This failed with symptoms above.
Next we tried with 50 nodes. This failed with symptoms above.
We have consulted Alain Roy who has advised an upgrade of globus and condor-g. He says the upgrade of condor-g is most likely to help. Tim has upgraded the image with the latest version of globus and I will be submitting from stargrid05 which has a newer condor-g version. The software versions are listed here:
Stargrid01
Condor/Condor-G 6.8.8
Globus Toolkit, pre web-services, client 4.0.5
Globus Toolkit, web-services, client 4.0.5
Stargrid05
$CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846
Globus Toolkit, pre web-services, client 4.0.7
Globus Toolkit, pre web-services, server 4.0.7
We have tested on a five node cluster (1 head node, 4 workers) and discovered a problem with stargrid05. Jobs do not get transferred over to the submitting side. The RCF has been contacted; we know this is on our side. It was decided we should not submit until we can try from stargrid05.
The following is an independently developed grid efficiency framework that will be consolidated with Lidia’s framework.
The point of this work is to be able to add wrappers around the job that report back key parameters about the job, such as the time it started, the time it stopped, the type of node it ran on, whether it was successful, and so on. These commands execute and return strings in the job's output stream. These can be parsed by an executable (I call it the job scanner) that extracts the parameters and writes them into a database. Later, other programs use this data to produce web pages and plots of any parameter we have recorded.
The attached image shows the relation between elements in my database and commands in my CSH script. The commands in my CSH script will be integrated into SUMS soon. This will make it possible for any framework to parse out these parameters.
The steps:
1) login to stargrid01
2) Check that your ssh public key is at $HOME/.ssh/id_rsa.pub. This will be the key the client package copies to the gatekeeper and client nodes under the root account, allowing local password-free login as root, which you will need to install grid host certs.
a. Note the file name and location must be exactly as above or you must modify the path and name in the client configuration at ./workspace-cloud-client-009/conf/cloud.properties (more on this later).
b. If you're using a PuTTY-generated ssh public key it will not work directly. You can simply edit it with a text editor to get it into this format (one way to convert it is sketched after the examples below). Below is an example of the right format A and the wrong format B. If it has multiple lines then it is the wrong format.
Right format A:
ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAySIkeTLsijvh1U01ass8XvfkBGocUePTkuG2F8TwRilq1gIcuTP5jBFSCF0eYXOpfNcgkujIsRj/+xS1QqM7c5Fs0hrRyLzyxgZrCKeXojVUFYfg9QuokqoY2ymgjxAdwNABKXI2IKMvM0UGBtmxphCuxUSUpMzNfmWk9H4HIrE=
Wrong format B:
---- BEGIN SSH2 PUBLIC KEY ----
Comment: "imported-openssh-key"
AAAAB3NzaC1yc2EAAAABJQAAAIEAySIkeTLsijvh1U01ass8XvfkBGocUePTkuG2
F8TwRilq1gIcuTP5jBFSCF0eYXOpfNcgkujIsRj/+xS1QqM7c5Fs0hrRyLzyxgZr
CKeXojVUFYfg9QuokqoY2ymgjxAdwNABKXI2IKMvM0UGBtmxphCuxUSUpMzNfmWk
9H4HIrE=
---- END SSH2 PUBLIC KEY ----
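If you have a key in the wrong (SSH2/RFC4716) format, one way to convert it to the single-line OpenSSH format is ssh-keygen's import option (a sketch; the input file name is a placeholder):
# Convert an SSH2-format public key into the one-line OpenSSH format
ssh-keygen -i -f putty_exported_key.pub > ~/.ssh/id_rsa.pub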
3) Get the grid client by copying the folder /star/u/lbhajdu/ec2/workspace-cloud-client-009 to your area. It is recommended you execute your commands from inside workspace-cloud-client-009. The manual describes all commands and paths relative to this directory; I will do the same.
a. This grid client is almost the same as the one you download from globus except it has the ./samples/star1.xml, which is configured to load STAR’s custom image.
4) cd to the workspace-cloud-client-009 directory and type:
./bin/grid-proxy-init.sh -hours 100
The output should look like this:
[stargrid01] ~/ec2/workspace-cloud-client-009/> ./bin/grid-proxy-init.sh
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-009/lib/globus')
Your identity: DC=org,DC=doegrids,OU=People,CN=Levente B. Hajdu 105387
Enter GRID pass phrase for this identity:
Creating proxy, please wait...
Proxy verify OK
Your proxy is valid until Fri Aug 01 06:19:48 EDT 2008
5) To start the cluster type:
./bin/cloud-client.sh --run --hours 1 --cluster samples/star1.xml
Two very important things you will want to make a note of from this output are the cluster handle (usually something like “cluster-025”) and the gatekeeper name. It will take about 10 minutes to launch this cluster. The cluster will have one gatekeeper and one worker node. The maximum lifetime of the cluster is set in the command line arguments; more parameters are in the xml file (you will want to check with Tim before changing these).
If the command hangs up really quickly (after about a minute) and says something like “terminating cluster”, this usually means that you do not have a sufficient number of slots to run. It should look something like this:
[stargrid01] ~/ec2/workspace-cloud-client-009/> ./bin/cloud-client.sh --run --hours 1 --cluster samples/star1.xml
6) But hold on, you can't submit yet: even though the grid map file has our DNs in it, the gatekeeper is not trusted. We will need to install an OSG host cert on the other side. Not just anybody can do this. Doug and Leve can do this at least (and I am assuming Wayne). Open up another terminal and log on to the newly instantiated gatekeeper as root. Example here:
[lbhajdu@rssh03 ~]$ ssh root@tp-x009.ci.uchicago.edu
The authenticity of host 'tp-x009.ci.uchicago.edu (128.135.125.29)' can't be established.
RSA key fingerprint is e3:a4:74:87:9e:69:c4:44:93:0c:f1:c8:54:e3:e3:3f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'tp-x009.ci.uchicago.edu,128.135.125.29' (RSA) to the list of known hosts.
Last login: Fri Mar 7 13:08:57 2008 from 99.154.10.107
7) Create a .globus directory:
[root@tp-x009 ~]# mkdir .globus
8) Go back to the stargrid node and copy over your grid cert and key:
[stargrid01] ~/.globus/> scp usercert.pem root@tp-x009.ci.uchicago.edu:/root/.globus
[stargrid01] ~/.globus/> scp userkey.pem root@tp-x009.ci.uchicago.edu:/root/.globus
9) Move over to /etc/grid-security/ on the gatekeeper:
cd /etc/grid-security/
10) Create a host cert here:
[root@tp-x009 grid-security]# cert-gridadmin -host 'tp-x002.ci.uchicago.edu' -email lbhajdu@bnl.gov -affiliation osg -vo star -prefix tp-x009
11) Change the rights on the credentials:
[root@tp-x009 grid-security]# chmod 644 tp-x009cert.pem
[root@tp-x009 grid-security]# chmod 600 tp-x009key.pem
12) Delete the old host credentials:
[root@tp-x009 grid-security]# rm hostcert.pem
[root@tp-x009 grid-security]# rm hostkey.pem
13) Rename the credentials:
[root@tp-x009 grid-security]# mv tp-x009cert.pem hostcert.pem
[root@tp-x009 grid-security]# mv tp-x009key.pem hostkey.pem
14) Check grid functionality back on stargrid01:
[stargrid01] ~/admin_cert/> globus-job-run tp-x009.ci.uchicago.edu /bin/date
Thu Jul 31 18:23:55 CDT 2008
15) Do your grid work.
16) When it's time for the cluster to go down (if there is unused time remaining) run the command below. Note that you will need the cluster handle from the command used to bring up the cluster.
./bin/cloud-client.sh --terminate --handle cluster-025
If there are problems:
If there are problems try this web page:
http://workspace.globus.org/clouds/cloudquickstart.html
If there are still problems try this mailing list:
workspace-user@globus.org
If there are still problems contact Tim Freeman (tfreeman at mcs.anl.gov).
>Thanks for the -dbg+TCP logs! I posted them in a new ticket at http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5190
>The response from the GridFTP team, posted there, is:
>"""
>What this report shows me is that the client (globus-url-copy) successfully forms a TCP control channel connection with the server. It then successfully reads the 220 Banner message from the server. The client then attempts to authenticate with the server. It sends the AUTH GSSAPI command and posts a read for a response. It is this second read that times out.
>From what I see here, both sides believe the TCP connection is formed successfully, enough so that at least 1 message is sent from the server to the client (220 banner) and possibly 1 from the client to the server (AUTH GSSAPI; since we don't have server logs we cannot confirm the server actually received it).
>I think the next step should be looking at the gssapi authentication logs on the gridftp server side to see what commands were actually received and what replies sent. I think debugging at the TCP layer may be premature and may be introducing some red herrings.
>To get the desired logs set the env
>export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename
>"""
>So, is it possible to get this set in the env of the server you're using, trigger the problem, then send the resulting gridftp.log?
I have done that and a sample log file (including log_level ALL) is attached as "gridftp-auth.xio_gssapi_ftp_debug.log" This log file covers a sample test of 11 transfers in which 1 failed.
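Since the gsiftp server here runs under xinetd, one way to set that variable on the server side is xinetd's env option (a sketch; the service file path and the surrounding options are assumptions, only the env line is the addition):
# excerpt of /etc/xinetd.d/gsiftp -- add the env line, then have xinetd reload its configuration
service gsiftp
{
        ...
        env += GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,/var/log/gridftp-auth.xio_gssapi_ftp_debug.log
}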
[Long ago,] Eric Hjort did some testing with 1 second delays between successive
connections and found no failures. In recent limited testing with shorter
delays, it appears that there is a threshold at about 0.1 sec. With delays longer than 0.1 sec, I've not seen any failures of this sort.
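A sketch of the kind of test loop used for this (the source file and destination URL are placeholders; adjust the sleep to probe the threshold):
# 50 back-to-back transfers of a small file with a fixed delay between them
for i in $(seq 1 50); do
    globus-url-copy file:///tmp/testfile_2.5KB gsiftp://stargrid02.rcf.bnl.gov/tmp/testfile_$i || echo "transfer $i failed"
    sleep 0.1    # failures show up when the delay drops below about 0.1 sec
done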
I installed the OSG-0.6.0 client package on presley.star.bnl.gov, which is between the RACF and BNL firewalls. It also experiences failures when connecting to stargrid02 (inside the RACF firewall).
We've made additional tests with different server and client systems and collected additional firewall logs and tcpdumps. For instance, using the g-u-c client on stargrid01.rcf.bnl.gov (inside both the RACF and BNL perimeter firewalls) and a gsiftp server on netmon.usatlas.bnl.gov (outside both firewalls) we see failures that appear to be the same. I have attached firewall logs from both the RACF firewall ("RACF_fw_logs.txt") and the BNL firewall ("BNL_perimeter_fw_logs.txt") for a test with 4 failures out of 50 transfers (using a small 2.5 KB file). Neither log shows anything out of the ordinary, with each expected connection showing up as permitted. Tcpdumps from the client and server are also attached ("stargrid01-client.pcap" and "netmon-server.pcap" respectively). They show a similar behaviour as in the previous dumps from NERSC and stargrid02, in which the failed connections appear to break immediately, with the client's first ACK packet somehow not quite being "understood" by the server.
RACF and ITD networking personnel have looked into this a bit. To
make a long story short, their best guess is "kernel bug,
probably a race condition". This is a highly speculative guess, with
no hard evidence. The fact that the problem has only been noticed when
crossing firewalls at BNL casts doubt on this. For instance, using a
client on a NERSC host connecting to netmon, I've seen no failures, and I need to make this clear to them. Based on tests with other clients (eg. presley.star.bnl.gov) and servers (eg. rftpexp01.rhic.bnl.gov), there is additional evidence that the problem only occurs when crossing firewalls at BNL, but I would like to quantify this, rather than relying on ad hoc testing by hand, with the hope of removing any significant possibility of statistical flukes in the test results so far.
In testing this week, I have focused on eliminating a couple of suspects. First, I replaced gsiftpd with a telnetd on stargrid03.rcf.bnl.gov. The telnetd was set up to run under xinetd using port 2811 -- thus very similar to a stock gsiftp service (and conveniently punched through the various firewalls). Testing this with connections from PDSF quickly turned up the same sort of broken connections as with gsiftp. This seems to exonerate the globus/VDT/OSG software stack, though it doesn't rule out the possibility of a bug in a shared library that is used by the gsiftp server.
By building xinetd from the latest source (v 2.3.14, released Oct. 24, 2005) and replacing the executable from the stock Red Hat RPM on stargrid02 (with prior testing on stargrid03), the connection problems disappeared. (minor note: I built it with the libwrap and loadavg options compiled in, as I think Red Hat does.)
For the record, here is some version information for the servers used in various testing to date:
stargrid02 and stargrid03 are identical as far as relevant software versions:
Linux stargrid02.rcf.bnl.gov 2.4.21-47.ELsmp #1 SMP Wed Jul 5 20:38:41 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2 (the most recent update from Red Hat for this package for RHEL 3. Confusingly enough, the CHANGELOG for this package indicates it is version 2:2.3.***13***-6.3E.2 (not 2.3.***12***))
Replacing this with xinetd-2.3.14 built from source has apparently fixed the problem on this node.
rftpexp01.rhic.bnl.gov (between the RACF and BNL firewalls):
Linux rftpexp01.rhic.bnl.gov 2.4.21-47.0.1.ELsmp #1 SMP Fri Oct 13 17:56:20 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2
netmon.usatlas.bnl.gov (outside the firewalls at BNL):
Linux netmon.usatlas.bnl.gov 2.6.9-42.0.8.ELsmp #1 SMP Tue Jan 23 13:01:26 EST 2007 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 4 (Nahant Update 4)
xinetd-2.3.13-4.4E.1 (the most recent update from Red Hat for this package in RHEL 4.)
If you want to run the GridCat Python client, there is a problem on some nodes at BNL related to BNL's proxy settings. Here are some notes that may help.
First, you'll need to get the gcatc.py Python script itself and put it somewhere that you can access. Here is the URL I used to get it, though apparently others exist:
http://gdsuf.phys.ufl.edu:8080/releases/gridcat/gridcat-client/bin/gcatc.py
(I used wget on the node on which I planned to run it, you may get it any way that works.)
Now, the trick at BNL is to get the proxy set correctly. Even though nodes like stargrid02.rcf.bnl.gov have a default "http_proxy" environment variable, it seems that Python's httplib doesn't parse it correctly and thus it fails. But it is easy enough to override as needed.
For example, here is one way in a bash shell:
[root@stargrid02 root]# http_proxy=192.168.1.4:3128; python gcatc.py --directories STAR-WSU
griddir /data/r23b/grid
appdir /data/r20g/apps
datadir /data/r20g/data
tmpdir /data/r20g/tmp
wntmpdir /tmp
Similarly in a tcsh shell:
[stargrid02] ~/> env http_proxy=192.168.1.4:3128 python /tmp/gcatc.py --gsiftpstatus STAR-BNL
gsiftp_in Pass
gsiftp_out Pass
Doug's email of November 3, 2005 contained a more detailed shell script (that requires gcatc.py) to query lots of information:
http://lists.bnl.gov/mailman/private/stargrid-l/2005-November/002426.html.
You could add the proxy modification into that script, presumably as a local variable.
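For example, a small wrapper in that spirit, with the proxy set as a local variable (the site list and queries are just examples):
#!/bin/sh
# Query a few GridCat attributes for a list of sites through the BNL web proxy
http_proxy=192.168.1.4:3128
export http_proxy
for site in STAR-BNL STAR-WSU; do
    echo "=== $site ==="
    python gcatc.py --directories $site
    python gcatc.py --gsiftpstatus $site
done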