PBS/Globus Data Staging: User Interaction Notes (2020-03-16 6:40am)
Caveat: Scenarios and implementations are for discussion purposes only and do not represent official AIG recommendations
Table of Contents
- Background and Context
- The purpose of this work is to demonstrate the interactions of a user who wants to stage in files, run a job, then stage out files.
- While the demo user's primary institutional affiliation is not ALCF, the person does have an ALCF allocation/account. After successfully authenticating with ALCF OTP credentials the user is authorized by the facility to move files across wide-area networks using the ALCF Theta DTN.
- Demo Deployment Details
- Clone the master pbs-vagrant box in https://xgitlab.cels.anl.gov/AIG/pbs-vagrant
- Create a pbs deployment with one service node (pbs-sn1) and two compute clients (pbs-cc1 & pbs-cc2)
- export VAGRANT_PBS_CLIENTS=2
- vagrant up
- vagrant provision
- vagrant ssh
- sudo qmgr -c "create node pbs-cc1"
- sudo qmgr -c "create node pbs-cc2"
- Quick hack enabling user vagrant to issue qsub, qrls, qdel from a job running on pbs-cc2
- sudo qmgr -c "set server flatuid=true"
- sudo qmgr -c "set server acl_roots+=vagrant@*"
- sudo qmgr -c "set server operators+=vagrant@*"
- Install Globus Connect Personal (GCP) on the demo laptop
- Install the Globus client CLI on pbs-cc2
virtualenv "$HOME/.globus-cli-virtualenv"
source "$HOME/.globus-cli-virtualenv/bin/activate"
pip install globus-cli
deactivate
export PATH="$PATH:$HOME/.globus-cli-virtualenv/bin"
echo 'export PATH="$PATH:$HOME/.globus-cli-virtualenv/bin"' >> "$HOME/.bashrc"
- Ignorable demo setup scribbles for Lisa's laptop:
- cd /Users/childers/code/pbs/pbs-20191209/pbs-vagrant
- vagrant ssh
- ssh pbs-cc2
- GLOBUS_CLI_INSTALL_DIR="$(python -c 'import site; print(site.USER_BASE)')/bin"
- export PATH="$GLOBUS_CLI_INSTALL_DIR:$PATH"
- globus logout
- Logical System Design
- A user job system based on PBS Pro (v19.1):
- One login node (pbs-sn1 serves as the login node)
- One compute node (pbs-cc1)
- One dedicated node for submitting and monitoring Globus transfer requests (pbs-cc2)
- A public Globus tutorial endpoint is used as a source for some file transfers
- The ALCF Theta DTN is used as a source for some file transfers
- The laptop Globus Connect Personal (GCP) service is used as a destination for all file transfers
- https://app.globus.org/activity?show=history is used to show the Globus view of the user's activity
- Globus authorization details
- Globus transfers involve multiple distinct administrative domains across wide-area networks
- Client authorizations
- The Globus Web App client is authorized after the user successfully logs in to the Globus website (by authenticating with one of the identity providers supported by Globus); resulting OAuth tokens are stored in browser cookies
- For Globus CLI clients, the user's OAuth tokens are stored on the local filesystem (after successfully authenticating with a supported identity provider)
- Overview of the CLI client OAuth token creation process using the Google OAuth identity provider:
- Using the ALCF OAuth identity provider:
- Endpoint authorizations
- To access a Globus endpoint, users must be pre-authorized by the endpoint administrator (via ACL) to use the endpoint
- Near the time of desired access, authorized users must "activate" the endpoint (Globus terminology) by supplying login information, thereby enabling short-lived access to the file system associated with the endpoint
- Only endpoint creators/administrators have the ability to delete a given Globus endpoint definition; Andrew Cherry administers ALCF Globus endpoints
- A list of the user's endpoints can be viewed in the Globus Web app: https://app.globus.org/endpoints
- The user can activate/deactivate ALCF endpoints using the web app: https://app.globus.org/file-manager/collections/08925f04-569f-11e7-bef8-22000b9a448b
- Endpoint activation can be initiated using the Globus CLI, though a web browser is ultimately required to complete the process:
[vagrant@pbs-cc2 ~]$ globus endpoint activate 08925f04-569f-11e7-bef8-22000b9a448b
The endpoint could not be auto-activated.
This endpoint supports the following activation methods: web, oauth, delegate proxy
For web activation use:
'globus endpoint activate --web 08925f04-569f-11e7-bef8-22000b9a448b'
For oauth activation use web activation:
'globus endpoint activate --web 08925f04-569f-11e7-bef8-22000b9a448b'
For delegate proxy activation use:
'globus endpoint activate --delegate-proxy X.509_PEM_FILE 08925f04-569f-11e7-bef8-22000b9a448b'
Delegate proxy activation requires an additional dependency on cryptography. See the docs for details:
https://docs.globus.org/cli/reference/endpoint_activate/
- Live Demo
- Enable Globus CLI client interactions on the PBS Node running on Lisa's laptop:
The Globus Web app is running on Lisa's laptop browser window
The Globus CLI is running on the dedicated PBS Node running on Lisa's laptop
The Source Endpoint is the ALCF DTN (Globus Data Transfer Node)
The Destination Endpoint is a Globus Connect Personal server running on Lisa's laptop
- On pbs-cc2 (the node used to execute Globus client requests):
- Execute a "globus login" to create user OAuth tokens (needed to submit transfer requests)
- Perform OAuth2 authentication/consent process
- ls -l ~/.globus.cfg to show token storage
- Example globus.cfg file with obscured data:
[vagrant@pbs-cc2 ~]$ cat SampleOAuthTokenFile
[cli]
client_id = exxxx171-6xx6-xxxx-xx15-5xxxxxx627e5
client_secret = xxxxxxxxxxxTzxCxJxx2xqxixKxVxrxBxxxxxxxxMEg=
transfer_refresh_token = Agxxxxxxxxxxxxxxkxxxxxxxxyn9DBx96xxxxxxxxxxxxxxolxxxxxxxxxxxoKerw3e0V4MxxxxxxxxxxxxxxxxxV51gX
transfer_access_token = AgBj40OxxxxxxxxxxxxxxxVxG4KO8NJJxxxxxxxxxxxxxxxxx8hyCaPkMYmgvxxxxxxxxxxxxxxxxxxxxxxxxSqJ6N
transfer_access_token_expires = 1580489674
auth_refresh_token = AgxxxxxxxxxxxxxxxxxxxDaVBgDq4xxxxxxxxxxxxxxxxxxxxxxxxxxxxeyQz4Mqq6PJxxxxxxxxxxxxxxxxxo50XBMy3
auth_access_token = Agbe1QmMxxxxxxxxxxxxxxxxxxxVopg9266BxxxxxxxxxxxxxxH5Clyyb5PNYKX1xxxxxxxxxxxqNoUD8QKvnHxxxxxxxxxxxxx7to9Kw
auth_access_token_expires = 1580489674
- (Note: if there were a shared filesystem, the user could create tokens from any ALCF machine)
- On pbs-sn1 (the node used to submit jobs):
- Show V1 data staging wrapper script
- Show "afterok" and "afternotok" dependencies on the V1 qsub commands
- Show V1 control branching pic
- Show V2 control branching pic
- Show V2 data staging wrapper script if they want to see it
- Let them choose their own adventure; show demo implementation they want to see
- Show env variables being passed
- Walk through each of the job scripts, showing Globus CLI commands
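As a dry-run sketch of the afterok/afternotok wiring mentioned above (the job id and script names are illustrative placeholders; the qsub commands are echoed, not executed):

```shell
#!/bin/bash
# Dry-run sketch of V1-style branching: the compute job depends afterok on
# the stagein job, and the cleanup job depends afternotok on it. The
# stagein id below is a placeholder for the id 'qsub stageWait.sh' returns.
stagein_id="1041.pbs-sn1"
compute_dep="depend=afterok:$stagein_id"      # released only on stagein success
cleanup_dep="depend=afternotok:$stagein_id"   # released only on stagein failure
echo "qsub -W $compute_dep computeJob.sh"
echo "qsub -W $cleanup_dep cleanup.sh"
```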
- Scenario A: Stagein-compute-stageout w/successful stagein job (compute and stageout jobs execute after successful stagein)
- Pause laptop endpoint; explain that pause is needed because demo transfers are small and fast
- Execute data staging wrapper script
- Execute qstat
- Show stagein request on https://app.globus.org/activity?show=history
- Unpause laptop endpoint
- Wait for stageout job to appear on https://app.globus.org/activity?show=history
- Show output files; note that the cleanup job did not run
- Scenario B: Stagein-compute-stageout w/failed stagein job (cleanup job executes in response to stagein failure)
- Pause laptop endpoint
- Execute data staging wrapper script
- Execute qstat
- Show stagein request on https://app.globus.org/activity?show=history
- Terminate Globus stagein task (induce fatal stagein error)
- Execute qstat
- Show output files; note that the compute and stageout jobs did not run
- Scenario C: Stagein-compute-stageout w/expired ALCF DTN user credential (stagein waits until user provides credentials)
- Unpause laptop endpoint
- Deactivate user's Theta DTN credential (induce ephemeral error)
- Execute data staging wrapper script
- Show stagein request on https://app.globus.org/activity?show=history
- Execute qstat
- Show Globus endpoint activation needed email
- Reactivate user's Theta DTN credential
- Show output files
- Scenario D: Stagein-compute-stageout w/simulated machine outage and job resubmission (stagein workflow reconstructed after queue is nuked)
- Pause laptop endpoint
- Execute data staging wrapper script
- Execute qstat
- Show stagein request on https://app.globus.org/activity?show=history
- Delete all the jobs in the queue (simulate loss of PBS state): qselect -u vagrant | xargs qdel
- Execute qstat
- Show orphaned Globus request on https://app.globus.org/activity?show=history
- Show transfer submission output; copy orphaned task id
- Execute data staging wrapper script, passing orphaned task id
- Unpause laptop endpoint
- Execute qstat
- Show output files
- Scenario E: Stagein-compute-stageout w/out-of-band Globus stagein initiation (existing transfer grafted onto a workflow)
- Unpause laptop endpoint
- Submit out-of-band transfer request on https://app.globus.org/file-manager; get task id
- Execute data staging wrapper script, passing task id from above request
- Execute qstat
- Show output files
- Key features demonstrated
- Allows users to construct custom data staging workflows with success and failure branching
- Built on public/supported/documented APIs from PBS and Globus; low ALCF maintenance costs, future-proof
- Accepted practice for OAuth2 token management
- Assuming shared fs, user only directly interacts w/ALCF login nodes and OAuth server
- Issues (in no particular order)
- How best to manage staging jobs in the face of planned system outages?
Issue pending; waiting for info
Notes:
- Reminder:
- The V1 implementation involves a single long-lived stagein job, then a compute job, followed by a short-lived stageout job. If the stagein job ends with a nonzero exit code the compute and stageout jobs are deleted and a cleanup job runs.
- The V2 implementation involves a short-lived stagein request job, then a series of short-lived stagein monitoring jobs, followed by a compute job, then a short-lived stageout job. If a monitoring job ends with a nonzero exit code the compute and stageout jobs are deleted and a cleanup job runs.
- What happens to staging workflow jobs during Preventive Maintenance?
- For a definitive recommendation we need to know: 1) how PM reservations will be implemented, and 2) where the data staging client jobs will be routed/executed
- How will PM reservations be implemented in April 2020?
- How will the implementation change, if at all, after 2020?
- If long-lived (2+ weeks) stagein jobs are supported by ALCF then V1 might be a reasonable option for the user. During PM all V1 jobs in the queue could be put on admin hold and resumed once the system was released. Assuming the job ids don't change, PBS qsub dependencies should enable the workflow to proceed as intended.
- If only short-lived stagein jobs are supported by ALCF then V2 might be an option. During PM all V2 jobs in the queue could be put on admin hold and then resumed (with originally-specified user holds in place) once the system was released. Assuming the job ids don't change, the workflow should proceed as intended.
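The admin-hold idea in the two bullets above might look like the following sketch (not a tested procedure; 'o' is the PBS operator hold type, which requires operator privilege, and the script degrades to a dry run when PBS client commands are absent):

```shell
#!/bin/bash
# Sketch: before a PM outage, put all of a user's queued workflow jobs on
# operator hold; release them once the system comes back. Falls back to a
# dry run if the PBS client commands are not installed.
hold_user=vagrant
if command -v qselect >/dev/null 2>&1; then
  qselect -u "$hold_user" | xargs -r qhold -h o   # before the outage
  # ... maintenance window ...
  qselect -u "$hold_user" | xargs -r qrls -h o    # after the system is released
  mode=live
else
  mode=dry-run
fi
echo "PM hold/release sketch: $mode mode for user $hold_user"
```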
- What happens to staging workflow jobs during ad hoc planned outages?
- No known differences between ad hoc planned outages and PM outages (wrt data staging workflows)
- How best to manage staging jobs in the face of unplanned system outages? (Understand failure handling generally for PBS jobs)
Issue open
Notes:
- What happens when the PBS postgres db goes down while jobs are running?
[vagrant@pbs-sn1 ~]$ ./dataStagingWorkflow.sh
1041.pbs-sn1
1042.pbs-sn1
1043.pbs-sn1
1044.pbs-sn1
[vagrant@pbs-sn1 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1041.pbs-sn1 stageWait.sh vagrant 00:00:00 R workq
1042.pbs-sn1 cleanup.sh vagrant 0 H workq
1043.pbs-sn1 computeJob.sh vagrant 0 H workq
1044.pbs-sn1 stageNoWait.sh vagrant 0 H workq
[vagrant@pbs-sn1 ~]$ sudo pkill -u postgres
[vagrant@pbs-sn1 ~]$ qstat
Connection refused
qstat: cannot connect to server pbs-sn1 (errno=111)
- Server log
[vagrant@pbs-sn1 ~]$ sudo tail -15 /var/spool/pbs/server_logs/20200306
03/06/2020 18:47:04;0100;Server@pbs-sn1;Job;1044.pbs-sn1;enqueuing into workq, state 1 hop 1
03/06/2020 18:47:04;0008;Server@pbs-sn1;Job;1044.pbs-sn1;Job Queued at request of vagrant@pbs-sn1, owner = vagrant@pbs-sn1, job name = stageNoWait.sh, queue = workq
03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 0 request received from vagrant@pbs-sn1, sock=16
03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 49 request received from vagrant@pbs-sn1, sock=17
03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 21 request received from vagrant@pbs-sn1, sock=16
03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 19 request received from vagrant@pbs-sn1, sock=16
03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 0 request received from vagrant@pbs-sn1, sock=17
03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 49 request received from vagrant@pbs-sn1, sock=18
03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 21 request received from vagrant@pbs-sn1, sock=17
03/06/2020 18:47:30;0001;Server@pbs-sn1;Svr;Server@pbs-sn1;job_save, Failed to save job 1041.pbs-sn1 Transaction begin failed: FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
03/06/2020 18:47:30;0001;Server@pbs-sn1;Svr;Server@pbs-sn1;panic_stop_db, Panic shutdown of Server on database error. Please check PBS_HOME file system for no space condition.
03/06/2020 18:47:30;0002;Server@pbs-sn1;Svr;Log;Log closed
- Postgres log
[vagrant@pbs-sn1 ~]$ sudo tail -7 /var/spool/pbs/datastore/pg_log/postgresql-Fri.log
2020-03-06 14:08:01 UTC LOG: database system is ready to accept connections
2020-03-06 14:08:01 UTC LOG: autovacuum launcher started
2020-03-06 18:47:28 UTC LOG: autovacuum launcher shutting down
2020-03-06 18:47:28 UTC FATAL: terminating connection due to administrator command
2020-03-06 18:47:28 UTC LOG: received smart shutdown request
2020-03-06 18:47:28 UTC LOG: shutting down
2020-03-06 18:47:28 UTC LOG: database system is shut down
- Accounting log (note the S record for 1041.pbs-sn1)
[vagrant@pbs-sn1 ~]$ sudo cat /var/spool/pbs/server_priv/accounting/20200306
03/06/2020 15:00:00;L;license;floating license hour:0 day:0 month:0 max:0
03/06/2020 16:00:00;L;license;floating license hour:0 day:0 month:0 max:0
03/06/2020 17:00:00;L;license;floating license hour:0 day:0 month:0 max:0
03/06/2020 18:00:00;L;license;floating license hour:0 day:0 month:0 max:0
03/06/2020 18:47:04;Q;1041.pbs-sn1;queue=workq
03/06/2020 18:47:04;Q;1042.pbs-sn1;queue=workq
03/06/2020 18:47:04;S;1041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583520424 start=1583520424 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
03/06/2020 18:47:04;Q;1043.pbs-sn1;queue=workq
03/06/2020 18:47:04;Q;1044.pbs-sn1;queue=workq
The next day, after restarting the system and submitting a new workflow job, the old jobs were in the queue in addition to the new jobs. The new and old jobs were successfully executed, and corresponding stderr and stdout files were created. Accounting records show start and end records for both the new and old jobs.
[vagrant@pbs-sn1 ~]$ ./dataStagingWorkflow.sh
2041.pbs-sn1
2042.pbs-sn1
2043.pbs-sn1
2044.pbs-sn1
[vagrant@pbs-sn1 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1043.pbs-sn1 computeJob.sh vagrant 00:00:00 R workq
1044.pbs-sn1 stageNoWait.sh vagrant 0 H workq
2041.pbs-sn1 stageWait.sh vagrant 00:00:00 R workq
2042.pbs-sn1 cleanup.sh vagrant 0 H workq
2043.pbs-sn1 computeJob.sh vagrant 0 H workq
2044.pbs-sn1 stageNoWait.sh vagrant 0 H workq
[vagrant@pbs-sn1 ~]$ ls -lt
-rw-------. 1 vagrant vagrant 267 Mar 7 13:49 stageNoWait.sh.o2044
-rw-------. 1 vagrant vagrant 0 Mar 7 13:49 stageNoWait.sh.e2044
-rw-------. 1 vagrant vagrant 132 Mar 7 13:48 computeJob.sh.o2043
-rw-------. 1 vagrant vagrant 0 Mar 7 13:48 computeJob.sh.e2043
-rw-------. 1 vagrant vagrant 258 Mar 7 13:48 stageWait.sh.o2041
-rw-------. 1 vagrant vagrant 6 Mar 7 13:48 stageWait.sh.e2041
-rw-------. 1 vagrant vagrant 267 Mar 7 13:48 stageNoWait.sh.o1044
-rw-------. 1 vagrant vagrant 0 Mar 7 13:48 stageNoWait.sh.e1044
-rw-------. 1 vagrant vagrant 132 Mar 7 13:48 computeJob.sh.o1043
-rw-------. 1 vagrant vagrant 0 Mar 7 13:47 computeJob.sh.e1043
-rw-------. 1 vagrant vagrant 377 Mar 6 19:36 stageWait.sh.e1041
-rw-------. 1 vagrant vagrant 258 Mar 6 19:36 stageWait.sh.o1041
[vagrant@pbs-sn1 ~]$ sudo cat /var/spool/pbs/server_priv/accounting/20200307
03/07/2020 13:45:34;A;1042.pbs-sn1;Job deleted as result of dependency on job 1041.pbs-sn1
03/07/2020 13:45:34;E;1041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583520424 start=1583520424 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=6337 end=1583588734 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=33676kb resources_used.ncpus=1 resources_used.vmem=178868kb resources_used.walltime=00:00:10 run_count=1
03/07/2020 13:47:57;Q;2041.pbs-sn1;queue=workq
03/07/2020 13:47:57;Q;2042.pbs-sn1;queue=workq
03/07/2020 13:47:57;S;1043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588734 start=1583588877 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 resource_assigned.ncpus=1
03/07/2020 13:47:57;S;2041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588877 start=1583588877 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
03/07/2020 13:47:57;Q;2043.pbs-sn1;queue=workq
03/07/2020 13:47:57;Q;2044.pbs-sn1;queue=workq
03/07/2020 13:48:07;E;1043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588734 start=1583588877 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 session=5946 end=1583588887 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=3456kb resources_used.ncpus=1 resources_used.vmem=35892kb resources_used.walltime=00:00:10 run_count=1
03/07/2020 13:48:07;S;1044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588887 start=1583588887 exec_host=pbs-cc2/1 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
03/07/2020 13:48:10;E;1044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588887 start=1583588887 exec_host=pbs-cc2/1 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=6019 end=1583588890 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:01 run_count=1
03/07/2020 13:48:49;A;2042.pbs-sn1;Job deleted as result of dependency on job 2041.pbs-sn1
03/07/2020 13:48:49;E;2041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588877 start=1583588877 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=5975 end=1583588929 Exit_status=0 resources_used.cpupercent=7 resources_used.cput=00:00:01 resources_used.mem=33728kb resources_used.ncpus=1 resources_used.vmem=178868kb resources_used.walltime=00:00:51 run_count=1
03/07/2020 13:48:49;S;2043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588929 start=1583588929 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 resource_assigned.ncpus=1
03/07/2020 13:49:01;E;2043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588929 start=1583588929 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 session=5996 end=1583588941 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:10 run_count=1
03/07/2020 13:49:01;S;2044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588941 start=1583588941 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
03/07/2020 13:49:03;E;2044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588941 start=1583588941 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=6085 end=1583588943 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:01 run_count=1
- What happens when the PBS postgres db goes down prior to qsub?
[vagrant@pbs-sn1 ~]$ sudo pkill -u postgres
[vagrant@pbs-sn1 ~]$ ./dataStagingWorkflow.sh
qsub: Failed to save job/resv, refer server logs for details
...
[vagrant@pbs-sn1 ~]$ sudo cat /var/spool/pbs/server_logs/20200305
...
03/05/2020 21:20:09;0100;Server@pbs-sn1;Job;48.pbs-sn1;enqueuing into workq, state 1 hop 1
03/05/2020 21:20:09;0001;Server@pbs-sn1;Svr;Server@pbs-sn1;job_save, Failed to save job 48.pbs-sn1 Transaction begin failed: no connection to the server
03/05/2020 21:20:09;0100;Server@pbs-sn1;Job;48.pbs-sn1;dequeuing from workq, state 1
03/05/2020 21:20:09;0001;Server@pbs-sn1;Job;48.pbs-sn1;job_purge, Removal of job from datastore failed
03/05/2020 21:20:09;0080;Server@pbs-sn1;Req;req_reject;Reject reply code=15161, aux=0, type=5, from vagrant@pbs-sn1
03/05/2020 21:20:09;0040;Server@pbs-sn1;Svr;pbs-sn1;Scheduler sent command 1
03/05/2020 21:20:09;0040;Server@pbs-sn1;Svr;pbs-sn1;Scheduler sent command 0
...
- What happens when the PBS server goes down?
- What happens when a MOM goes down?
- What happens if the node running a data workflow script crashes?
- Ignorable test setup scribbles for Lisa's laptop
- (Failure test deployment dir: /Users/childers/code/pbs/pbs-20200220/pbs-vagrant)
- (ssh pbs-cc1 for xfer client machine on failure deployment only)
- PBS db files can be found in /var/spool/pbs/datastore on pbs-sn1
- Reset the pbs postgres datastore password:
sudo ./pbs_ds_password
- What should the mechanism be, if any, to track the state of data staging workflow jobs?
Issue open
Notes:
- PBS Pro has a built-in state tracking mechanism using an internal postgres db. What data are stored in the internal PBS db?
- Important state for data staging workflows might include:
- PBS job ids
- qsub job dependencies
- qsub job environment variable settings
- Job return codes
- Globus task ids
- Workflow dependencies/structure
- Job user hold setting
- Other data?
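One hypothetical way the wrapper script could persist the state listed above, as a small per-workflow record file (field names, file layout, and all ids are invented for illustration):

```shell
#!/bin/bash
# Sketch: persist workflow state so a post-outage resubmission can find the
# old PBS job ids and Globus task id. All ids below are placeholders.
state_dir="${state_dir:-$HOME/.staging_workflows}"
mkdir -p "$state_dir"
workflow_id="demo-$$"                 # one record per wrapper invocation
state_file="$state_dir/$workflow_id"
cat > "$state_file" <<EOF
compute_job=1043.pbs-sn1
cleanup_job=1042.pbs-sn1
stageout_job=1044.pbs-sn1
globus_task_id=placeholder-task-uuid
EOF
echo "workflow state written to $state_file"
```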
- What is the current mechanism for tracking job state on Theta? What state is tracked?
- What is the impact of user job crashes on data staging workflows?
Issue open
Notes:
- In general we implement a "let it crash" design for the services under ALCF's control (including the PBS server, scheduler, MOMs, etc.). However, jobs themselves will never be fully reliable because users create them.
- Might we create an optional "let it crash" monitoring service for users' PBS jobs?
- Such a service might allow users to specify how/what to monitor
- Investigate the trigger-action paradigm (e.g., IFTTT, Tasker, Google Rules)
- Simple rule: 1 trigger and 1 action
- More complex rules: 1+ triggers and 1+ actions
- "state" trigger example: PBS job substate==54 (job being aborted by PBS server)
- "sensor" trigger example: "disk quota exceeded"
- Action example: send email to user
- What triggers/actions might be of interest to ALCF users?
- PBS qsub currently has a limited trigger mechanism bound to an email action:
- Send mail when job is aborted: qsub -m a
- Send mail when job begins execution: qsub -m b
- Send mail when job terminates: qsub -m e
- PBS also has a server event trigger mechanism bound to a "hook" plugin system that launches (admin-privileged) scripts
- Side note: ALCF (via Eric and Brian) is contributing a new server management event to PBS. This contribution enables site-specific actions to be defined for server management operations (e.g., adding a node, importing a hook script, etc.)
- Any post-crash resubmission logic must not re-execute staging/compute jobs that have successfully completed
- Can resubmission be made idempotent by caching PBS job IDs and Globus task IDs?
- Workflows more complex than V1 and V2 should be considered
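A minimal sketch of the caching idea above, assuming the wrapper writes the Globus task id to a state file on first submission (submit_transfer is a stand-in for the real 'globus transfer' call, and the state-file path is an assumption):

```shell
#!/bin/bash
# Sketch: make stagein resubmission idempotent by caching the Globus task
# id. On first run, submit and cache the id; on resubmission, reuse the
# cached id instead of starting a duplicate transfer.
STATE_FILE="${STATE_FILE:-$HOME/.staging_task_id}"
submit_transfer() { echo "fake-task-$$"; }   # placeholder, not the real CLI

if [ -s "$STATE_FILE" ]; then
  tid=$(cat "$STATE_FILE")    # resubmission path: monitor the existing task
else
  tid=$(submit_transfer)      # first run: submit, then cache the id
  echo "$tid" > "$STATE_FILE"
fi
echo "monitoring Globus task $tid"
```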
- Additional use scenarios that would be good to add to the demo
Issue open
Notes:
- A two week preventative maintenance outage scenario
- A "walltime exceeded" scenario
- others?
- What about walltime? Can the staging jobs run "forever"? Will the scheduler start jobs if the walltime crosses preventative maintenance reservation boundaries? Find out more about the PBS walltime implementation.
Issue open
Notes:
- Figure out how to specify resources for staging jobs. Perhaps use a resource group specification (e.g., "staging_resources")? For preventative maintenance purposes it may be useful for all staging jobs to be routed to the same queue. Data staging jobs also need to be excluded from sbank accounting
Issue open
Notes:
- In addition to stdout/stderr, it would be good to show the PBS logs as part of the demo. Check out /var/spool/pbs/ (on both the service and client nodes) to find interesting data. (Also think about: is it worth it to recreate the Cobalt log additions? If we are the only ones who care about the Cobalt additions then maybe we don't need to migrate them all to PBS.)
Issue open
Notes:
- How best to collate stderr and stdout for the whole workflow?
Issue open
Notes:
- Notes
- While executing a 'globus login' CLI command, a Globus consent page appears during the OAuth2 process. Will this always happen or is this an artifact of the demo deployment? Can consents be made sticky on the Globus side so users will only need to give consent once?
Issue closed: Yes Globus consents are saved until explicitly rescinded by the user
Notes:
- When the user explicitly executes a 'globus logout', consents are rescinded and the user will be asked to provide a new consent at the subsequent interaction. Thus the consent at the beginning is an artifact of the demo script; in practice users will only be asked once.
- Are there any human-friendly names for Globus endpoint UUIDs?
Issue closed
Notes:
- Globus no longer supports human-friendly aliases as endpoint ids; UUIDs are required.
- Perhaps we might define ALCF environment variables for our Globus endpoints?
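For example (the variable names are hypothetical; the UUIDs are the ones used in the demo scripts in these notes):

```shell
# Sketch: site-defined convenience variables for common endpoint UUIDs
# (hypothetical variable names; UUIDs as used elsewhere in these notes).
export GLOBUS_TUTORIAL_ENDPOINT=ddb59aef-6d04-11e5-ba46-22000b92c6ec
export ALCF_THETA_DTN=08925f04-569f-11e7-bef8-22000b9a448b
# Usage: globus ls "$ALCF_THETA_DTN:/~/"
```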
- Todos
- Investigate open issues as time permits
- Talk to Balsam folks about data staging workflows
- Balsam: Near Real-time Experimental Data Analysis on Supercomputers
- Are Globus transfers represented as tasks in the Balsam service?
- How does balsam know when the Globus transfer finishes?
- How does balsam behave when a Globus request submission fails? When the transfer itself fails?
- Does balsam support long-lived transfers (2+ weeks)?
- What happens to balsam tasks if the balsam server node goes down for a planned/unplanned outage?
- Are balsam tasks idempotent or does the user need to guard against accidental resubmission of compute jobs?
- Are there plans to produce a Balsam launcher for PBS Pro?
- Show demo to operations folks
- Talk to Andrew
- Where should staging jobs (a.k.a. transfer clients) run? On the login nodes? On the DTNs? On a dedicated non-compute staging node? This is infrastructure's call.
- Talk to Brian about JLSE
- Would be good for ops folks to be able to play with a ready PBS deployment
- Show to others beyond ops (catalysts, ?)
- When we get this all figured out maybe make a video for users
- V2 Implementation (serial short-lived PBS jobs to monitor Globus stagein process)
- V1 Implementation (single long-lived PBS job to monitor Globus stagein process)
Used in Both V1 & V2
[vagrant@pbs-sn1 ~]$ cat computeJob.sh
#!/bin/bash
date
echo "Hello from computeJob running on host:"
hostname
echo "Sleeping 10 seconds... zzz"
sleep 10
date
exit 0
[vagrant@pbs-sn1 ~]$ cat stageNoWait.sh
#!/bin/bash
date
echo "Hello from stageNoWait running on host:"
hostname
echo "Submitting 'globus transfer' request..."
tid=$(~/.local/bin/globus transfer --jq "task_id" --format UNIX --skip-activation-check \
--label="$TRANSFER_LABEL" $SRC_DIR $DEST_DIR --recursive)
if [ -z "$tid" ]; then
echo "Transfer request submission failed!"
exit 1
fi
echo "Transfer request $tid submitted. To monitor progress, use the Globus service. Bye bye!"
date
exit 0
[vagrant@pbs-sn1 ~]$ cat cleanup.sh
#!/bin/bash
date
echo "Hello from cleanup running on host:"
hostname
echo "No data for you!"
date
exit 0
Used in V2 Only
[vagrant@pbs-sn1 ~]$ cat v2DataStagingWorkflow.sh
#!/bin/bash
# Set up convenience vars for oft-used Globus endpoint UUIDs
GlobusTutorialEndpoint=ddb59aef-6d04-11e5-ba46-22000b92c6ec
LaptopEndpoint=61d8a676-447d-11ea-ab4d-0a7959ea6081
ThetaDTN=08925f04-569f-11e7-bef8-22000b9a448b
# Set up convenience vars for source/dest directories at each of the Globus endpoints
GlobusDirectory=/share/godata/
LaptopDirectory=/~/data/globus/xfer/
ThetaDirectory=/~/dmtestdata/
# Put the compute job in the queue with a user hold; if stagein succeeds the hold should be released
computeJob=$(qsub -l nodes=pbs-cc1 -h computeJob.sh)
if [ $? -ne 0 ]; then
echo "Compute job qsub failed! Aborting..."
exit 1
fi
echo $computeJob
# Put the cleanup job in the queue with a user hold; if stagein fails the hold should be released
cleanupJob=$(qsub -l nodes=pbs-cc1 -h cleanup.sh)
echo $cleanupJob
# If compute job succeeds, run stageout job
stageOut=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stageout request',SRC_DIR=$GlobusTutorialEndpoint:$GlobusDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory -W depend=afterok:$computeJob stageNoWait.sh)
echo $stageOut
# The first step in this data staging workflow is to run the stagein job and monitor progress
globusXferMonitor=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,ComputeJob=$computeJob,CleanupJob=$cleanupJob,StageoutJob=$stageOut,tid=$1 globusXferPlusMonitor.sh)
echo $globusXferMonitor
exit 0
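v2DataStagingWorkflow.sh submits the compute and cleanup jobs with user holds (`qsub -h`) and leaves it to the monitor chain to release exactly one of them. That hold/release decision can be sketched on its own with shell functions standing in for the PBS commands (the stubs and job ids below are hypothetical; no real qsub/qrls/qdel is run):

```shell
#!/bin/bash
# Stubs standing in for the PBS CLI; they only echo what they would do.
qrls() { echo "released $1"; }
qdel() { echo "deleted $1"; }

computeJob="101.pbs-sn1"   # submitted with 'qsub -h' (user hold)
cleanupJob="102.pbs-sn1"   # submitted with 'qsub -h' (user hold)

stagein_ok=true            # stand-in for the Globus stagein outcome
if $stagein_ok; then
    qrls "$computeJob"     # stagein succeeded: release the compute job
    qdel "$cleanupJob"     # ...and discard the cleanup job
else
    qrls "$cleanupJob"     # stagein failed: run cleanup instead
    qdel "$computeJob"
fi
```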
[vagrant@pbs-sn1 ~]$ cat globusXferPlusMonitor.sh
#!/bin/bash
date
echo "Hello from globusXferPlusMonitor running on host:"
hostname
if [ -z "$tid" ]; then
echo "Submitting 'globus transfer' request..."
tid=$(~/.local/bin/globus transfer --jq "task_id" --format UNIX --skip-activation-check \
--label="$TRANSFER_LABEL" $SRC_DIR $DEST_DIR --recursive)
if [ -z "$tid" ]; then
echo "Transfer request submission failed!"
qrls "$CleanupJob"
qdel "$ComputeJob"
qdel "$StageoutJob"
date
exit 1
fi
fi
# Run a job to monitor the stagein
globusTaskMonitor=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,ComputeJob=$ComputeJob,CleanupJob=$CleanupJob,StageoutJob=$StageoutJob,MonitorSeconds="150",tid=$tid globusTaskMonitor.sh)
echo $globusTaskMonitor
date
echo "First globusTaskMonitor jobid is $globusTaskMonitor; Globus task id is $tid"
exit 0
[vagrant@pbs-cc2 ~]$ cat globusTaskMonitor.sh
#!/bin/bash
date
echo "Hello from globusTaskMonitor running on host:"
hostname
echo "globusTaskMonitor jobid is $PBS_JOBID"
echo "jobid on hold is $ComputeJob"
if [ -z "$MonitorSeconds" ]; then
MonitorSeconds=300
fi
LoopEnd=$((SECONDS+MonitorSeconds))
echo "globusTaskMonitor will run for approximately $MonitorSeconds seconds or until Globus task $tid terminates."
while [ $SECONDS -lt $LoopEnd ]
do
status=$(~/.local/bin/globus task show --jq "status" --format=UNIX "$tid")
if [ "$status" == "FAILED" ]; then
echo "No longer waiting... Transfer failed!"
qrls "$CleanupJob"
qdel "$ComputeJob"
qdel "$StageoutJob"
date
exit 1
elif [ "$status" == "SUCCEEDED" ]; then
echo "No longer waiting... Transfer succeeded!"
qrls "$ComputeJob"
qdel "$CleanupJob"
date
exit 0
fi
sleep 10
done
# The task has not terminated within the requested monitor interval; launch a new monitor and exit
globusTaskMonitor=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,ComputeJob=$ComputeJob,CleanupJob=$CleanupJob,StageoutJob=$StageoutJob,MonitorSeconds=$MonitorSeconds,tid=$tid globusTaskMonitor.sh)
echo "New monitor is jobid $globusTaskMonitor"
date
exit 0
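The bounded-polling pattern above (poll for roughly $MonitorSeconds seconds, then hand off to a fresh monitor job) can be reduced to the following self-contained sketch; `task_status` is a hypothetical stub for `globus task show`:

```shell
#!/bin/bash
# task_status stands in for 'globus task show --jq status'; here it
# reports immediate success so the sketch terminates on its first pass.
task_status() { echo "SUCCEEDED"; }

MonitorSeconds=${MonitorSeconds:-300}   # default the window if unset
LoopEnd=$((SECONDS + MonitorSeconds))   # SECONDS is bash's elapsed-time counter

result=""
while [ "$SECONDS" -lt "$LoopEnd" ]; do
    case "$(task_status)" in
        FAILED)    result="failed";    break ;;
        SUCCEEDED) result="succeeded"; break ;;
    esac
    sleep 10
done
# An empty result here means the window expired without the task
# terminating; this is where a fresh monitor job would be resubmitted.
echo "monitor outcome: ${result:-window-expired}"
```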
Used in V1 Only
[vagrant@pbs-sn1 ~]$ cat ./dataStagingWorkflow.sh
#!/bin/bash
# Set up convenience vars for oft-used Globus endpoint UUIDs
GlobusTutorialEndpoint=ddb59aef-6d04-11e5-ba46-22000b92c6ec
LaptopEndpoint=61d8a676-447d-11ea-ab4d-0a7959ea6081
ThetaDTN=08925f04-569f-11e7-bef8-22000b9a448b
# Set up convenience vars for source/dest directories at each of the Globus endpoints
GlobusDirectory=/share/godata/
LaptopDirectory=/~/data/globus/xfer/
ThetaDirectory=/~/dmtestdata/
# Run stagein job
stageIn=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,tid=$1 stageWait.sh)
echo $stageIn
# If stagein job fails, run clean-up job
executeFailedRequestCleanup=$(qsub -l nodes=pbs-cc1 -W depend=afternotok:$stageIn cleanup.sh)
echo $executeFailedRequestCleanup
# If stagein job succeeds, run real job
executeComputeJob=$(qsub -l nodes=pbs-cc1 -W depend=afterok:$stageIn computeJob.sh)
echo $executeComputeJob
# If both stagein & desired jobs succeed, run the stageout job
stageOut=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stageout request',SRC_DIR=$GlobusTutorialEndpoint:$GlobusDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory -W depend=afterok:$stageIn:$executeComputeJob stageNoWait.sh)
echo $stageOut
exit 0
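The V1 chain is driven entirely by qsub dependency lists rather than holds. A minimal sketch of how the `-W depend=...` arguments above are composed (the job ids are hypothetical stand-ins for what qsub would return):

```shell
#!/bin/bash
# Hypothetical job ids, as qsub would print them
stageIn="201.pbs-sn1"
computeJob="202.pbs-sn1"

cleanupDepend="depend=afternotok:$stageIn"            # cleanup runs only if stagein fails
computeDepend="depend=afterok:$stageIn"               # compute runs only if stagein succeeds
stageOutDepend="depend=afterok:$stageIn:$computeJob"  # stageout needs both to succeed

echo "$stageOutDepend"
```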
[vagrant@pbs-sn1 ~]$ cat stageWait.sh
#!/bin/bash
date
echo "Hello from stageWait running on host:"
hostname
if [ -z "$tid" ]; then
echo "Submitting 'globus transfer' request..."
tid=$(~/.local/bin/globus transfer --jq "task_id" --format UNIX --skip-activation-check \
--label="$TRANSFER_LABEL" $SRC_DIR $DEST_DIR --recursive)
if [ -z "$tid" ]; then
echo "Transfer request submission failed!"
date
exit 1
fi
fi
echo "Waiting for transfer $tid to finish..."
# Wait until the transfer request succeeds or fails
while :
do
~/.local/bin/globus task wait --polling-interval '7' -H "$tid"
if [ $? -eq 0 ]; then
status=$(~/.local/bin/globus task show --jq "status" --format=UNIX $tid)
if [ "$status" == "FAILED" ]; then
echo "No longer waiting... Transfer failed!"
date
exit 1
fi
if [ "$status" == "SUCCEEDED" ]; then
echo "No longer waiting... Transfer succeeded!"
date
exit 0
fi
else
status=$(~/.local/bin/globus task show --jq "status" --format=UNIX $tid)
if [ "$status" == "FAILED" ]; then
echo "No longer waiting... wait command returned an error!"
date
exit 1
fi
fi
done
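The wait-then-verify loop in stageWait.sh reduces to the following self-contained sketch; `globus_task_wait` and `globus_task_show` are hypothetical stubs for the corresponding CLI calls:

```shell
#!/bin/bash
# Stubs standing in for the Globus CLI; these model a task that has
# already terminated with a SUCCEEDED status.
globus_task_wait() { return 0; }          # exit 0 means the task terminated
globus_task_show() { echo "SUCCEEDED"; }  # terminal status of the task

outcome=""
while [ -z "$outcome" ]; do
    if globus_task_wait; then
        # Wait returned normally: distinguish success from failure
        status=$(globus_task_show)
        [ "$status" = "FAILED" ]    && outcome="failed"
        [ "$status" = "SUCCEEDED" ] && outcome="succeeded"
    else
        # Wait itself errored; give up only if the task actually failed,
        # otherwise keep looping (matching stageWait.sh's behavior)
        status=$(globus_task_show)
        [ "$status" = "FAILED" ] && outcome="wait-error"
    fi
done
echo "stagein outcome: $outcome"
```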