PBS/Globus Data Staging: User Interaction Notes (2020-03-16 6:40am)

Caveat: Scenarios and implementations are for discussion purposes only and do not represent official AIG recommendations

Section A: Demo

  1. Background and Context
    1. The purpose of this work is to demonstrate the interactions of a user who wants to stage in files, run a job, and then stage out files.
    2. While the demo user's primary institutional affiliation is not ALCF, the person does have an ALCF allocation/account. After successfully authenticating with ALCF OTP credentials the user is authorized by the facility to move files across wide-area networks using the ALCF Theta DTN.
    3. Demo Deployment Details
    4. Logical System Design
    5. Globus authorization details
      1. Globus transfers involve multiple distinct administrative domains across wide-area networks
      2. Client authorizations
        • The Globus Web App client is authorized after the user successfully logs in to the Globus website (by authenticating with one of the identity providers supported by Globus); resulting OAuth tokens are stored in browser cookies
        • For Globus CLI clients, the user's OAuth tokens are stored on the local filesystem (after successfully authenticating with a supported identity provider)
        • Overview of the CLI client OAuth token creation process using the Google OAuth identity provider:
        • Using the ALCF OAuth identity provider:
      3. Endpoint authorizations
        • To access a Globus endpoint, users must be pre-authorized by the endpoint administrator (via ACL) to use the endpoint
        • Near the time of desired access, authorized users must "activate" the endpoint (Globus terminology) by supplying login information, thereby enabling short-lived access to the file system associated with the endpoint
        • Only endpoint creators/administrators have the ability to delete a given Globus endpoint definition; Andrew Cherry administers ALCF Globus endpoints
        • A list of the user's endpoints can be viewed in the Globus Web app: https://app.globus.org/endpoints
        • The user can activate/deactivate ALCF endpoints using the web app: https://app.globus.org/file-manager/collections/08925f04-569f-11e7-bef8-22000b9a448b
        • Endpoint activation can be initiated using the Globus CLI, though a web browser is ultimately required to complete the process (a scriptable activation check is sketched after the output below):
          [vagrant@pbs-cc2 ~]$ globus endpoint activate 08925f04-569f-11e7-bef8-22000b9a448b
          
          The endpoint could not be auto-activated.
          This endpoint supports the following activation methods: web, oauth, delegate proxy
          For web activation use:
          'globus endpoint activate --web 08925f04-569f-11e7-bef8-22000b9a448b'
          For oauth activation use web activation:
          'globus endpoint activate --web 08925f04-569f-11e7-bef8-22000b9a448b'
          For delegate proxy activation use:
          'globus endpoint activate --delegate-proxy X.509_PEM_FILE 08925f04-569f-11e7-bef8-22000b9a448b'
          Delegate proxy activation requires an additional dependency on cryptography. See the docs for details:
          https://docs.globus.org/cli/reference/endpoint_activate/
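
        • A minimal sketch of how a staging script might verify activation before submitting a transfer (assumes the "globus endpoint is-activated" subcommand present in the 2020-era CLI; exit status 0 means activated):
          #!/bin/bash
          # Hypothetical pre-flight check: refuse to submit a transfer against an
          # inactive endpoint and point the user at the web activation URL instead
          ep=08925f04-569f-11e7-bef8-22000b9a448b   # ALCF Theta DTN endpoint UUID
          if ! ~/.local/bin/globus endpoint is-activated "$ep" > /dev/null; then
              echo "Endpoint $ep is not activated; activate it at:"
              echo "https://app.globus.org/file-manager/collections/$ep"
              exit 1
          fi
          echo "Endpoint $ep is activated; OK to submit transfer requests."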
          
          
  2. Live Demo
    1. Enable Globus CLI client interactions on the PBS Node running on Lisa's laptop:
      1. Demo roles:
        • The Globus Web app runs in a browser window on Lisa's laptop
        • The Globus CLI runs on the dedicated PBS node hosted on Lisa's laptop
        • The source endpoint is the ALCF DTN (Globus Data Transfer Node)
        • The destination endpoint is a Globus Connect Personal server running on Lisa's laptop
      2. On pbs-cc2 (the node used to execute Globus client requests):
        • Execute a "globus login" to create user OAuth tokens (needed to submit transfer requests); the command sequence is sketched at the end of this list
        • Perform OAuth2 authentication/consent process
        • ls -l ~/.globus.cfg to show token storage
        • Example globus.cfg file with obscured data:
        • [vagrant@pbs-cc2 ~]$ cat SampleOAuthTokenFile
          
          [cli]
          client_id = exxxx171-6xx6-xxxx-xx15-5xxxxxx627e5
          client_secret = xxxxxxxxxxxTzxCxJxx2xqxixKxVxrxBxxxxxxxxMEg=
          transfer_refresh_token = Agxxxxxxxxxxxxxxkxxxxxxxxyn9DBx96xxxxxxxxxxxxxxolxxxxxxxxxxxoKerw3e0V4MxxxxxxxxxxxxxxxxxV51gX
          transfer_access_token = AgBj40OxxxxxxxxxxxxxxxVxG4KO8NJJxxxxxxxxxxxxxxxxx8hyCaPkMYmgvxxxxxxxxxxxxxxxxxxxxxxxxSqJ6N
          transfer_access_token_expires = 1580489674
          auth_refresh_token = AgxxxxxxxxxxxxxxxxxxxDaVBgDq4xxxxxxxxxxxxxxxxxxxxxxxxxxxxeyQz4Mqq6PJxxxxxxxxxxxxxxxxxo50XBMy3
          auth_access_token = Agbe1QmMxxxxxxxxxxxxxxxxxxxVopg9266BxxxxxxxxxxxxxxH5Clyyb5PNYKX1xxxxxxxxxxxqNoUD8QKvnHxxxxxxxxxxxxx7to9Kw
          auth_access_token_expires = 1580489674
          
          
        • (Note: if there were a shared filesystem, the user could create tokens from any ALCF machine)
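        • Command-sequence sketch for this step (consent flow output omitted; "globus whoami" is an optional sanity check):
          [vagrant@pbs-cc2 ~]$ globus login          # prints an OAuth2 consent URL to open in a browser
          [vagrant@pbs-cc2 ~]$ globus whoami         # confirm the authenticated identity
          [vagrant@pbs-cc2 ~]$ ls -l ~/.globus.cfg   # resulting OAuth tokens are stored here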
      3. On pbs-sn1 (the node used to submit jobs):
        • Show V1 data staging wrapper script
        • Show "afterok" and "afternotok" dependencies on the V1 qsub commands
        • Show V1 control branching pic
        • Show V2 control branching pic
        • Show V2 data staging wrapper script if they want to see it
        • Let them choose their own adventure; show demo implementation they want to see
        • Show env variables being passed (a qsub -v sketch follows this list)
        • Walk through each of the job scripts, showing Globus CLI commands
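        • Minimal sketch of the env-variable passing shown in the demo (variable names follow the Section C scripts; endpoint/directory values come from the wrapper script):
          # Pass transfer parameters to a staging job via qsub -v; the job script reads
          # them as ordinary environment variables ($TRANSFER_LABEL, $SRC_DIR, $DEST_DIR)
          qsub -l nodes=pbs-cc2 \
               -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory \
               stageWait.sh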
    2. Scenario A: Stagein-compute-stageout w/successful stagein job (compute and stageout jobs execute after a successful stagein)
      1. Pause laptop endpoint; explain that pause is needed because demo transfers are small and fast
      2. Execute data staging wrapper script
      3. Execute qstat
      4. Show stagein request on https://app.globus.org/activity?show=history
      5. Unpause laptop endpoint
      6. Wait for stageout job to appear on https://app.globus.org/activity?show=history
      7. Show output files; note that the cleanup job did not run
    3. Scenario B: Stagein-compute-stageout w/failed stagein job (cleanup job executes in response to stagein failure)
      1. Pause laptop endpoint
      2. Execute data staging wrapper script
      3. Execute qstat
      4. Show stagein request on https://app.globus.org/activity?show=history
      5. Terminate Globus stagein task (induce fatal stagein error)
      6. Execute qstat
      7. Show output files; note that the compute and stageout jobs did not run
    4. Scenario C: Stagein-compute-stageout w/expired ALCF DTN user credential (stagein waits until user provides credentials)
      1. Unpause laptop endpoint
      2. Deactivate user's Theta DTN credential (induce ephemeral error)
      3. Execute data staging wrapper script
      4. Show stagein request on https://app.globus.org/activity?show=history
      5. Execute qstat
      6. Show Globus endpoint activation needed email
      7. Reactivate user's Theta DTN credential
      8. Show output files
    5. Scenario D: Stagein-compute-stageout w/simulated machine outage and job resubmission (stagein workflow reconstructed after queue is nuked)
      1. Pause laptop endpoint
      2. Execute data staging wrapper script
      3. Execute qstat
      4. Show stagein request on https://app.globus.org/activity?show=history
      5. Delete all the jobs in the queue (simulate loss of PBS state): qselect -u vagrant | xargs qdel
      6. Execute qstat
      7. Show orphaned Globus request on https://app.globus.org/activity?show=history
      8. Show transfer submission output; copy orphaned task id
      9. Execute data staging wrapper script, passing the orphaned task id (see the recovery sketch after this scenario)
      10. Unpause laptop endpoint
      11. Execute qstat
      12. Show output files
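      A recovery sketch for steps 8 and 9 (assumes the "globus task list" subcommand and its --filter-status option; the task id shown is a placeholder):
        # Find the still-running ("orphaned") stagein transfer, then re-run the wrapper
        # with its task id so the rebuilt workflow grafts onto the existing transfer
        ~/.local/bin/globus task list --filter-status ACTIVE
        ./dataStagingWorkflow.sh 12345678-90ab-cdef-1234-567890abcdef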
    6. Scenario E: Stagein-compute-stageout w/out-of-band Globus stagein initiation (existing transfer grafted onto a workflow)
      1. Unpause laptop endpoint
      2. Submit out-of-band transfer request on https://app.globus.org/file-manager; get task id
      3. Execute data staging wrapper script, passing task id from above request
      4. Execute qstat
      5. Show output files
    7. Key features demonstrated

Section B: Issues, Notes, Todos

  1. Issues (in no particular order)
    1. How best to manage staging jobs in the face of planned system outages?
      Issue pending; waiting for info
      Notes:
      1. Reminder:
        • The V1 implementation involves a single long-lived stagein job, then a compute job, followed by a short-lived stageout job. If the stagein job ends with a nonzero exit code the compute and stageout jobs are deleted and a cleanup job runs.
        • The V2 implementation involves a short-lived stagein request job, then a series of short-lived stagein monitoring jobs, followed by a compute job, then a short-lived stageout job. If a monitoring job ends with a nonzero exit code the compute and stageout jobs are deleted and a cleanup job runs.
      2. What happens to staging workflow jobs during Preventive Maintenance?
        • For a definitive recommendation we need to know: 1) how PM reservations will be implemented, and 2) where the data staging client jobs will be routed/executed
          • How will PM reservations be implemented in April 2020?
          • How will the implementation change, if at all, after 2020?
        • If long-lived (2+ weeks) stagein jobs are supported by ALCF then V1 might be a reasonable option for the user. During PM all V1 jobs in the queue could be put on admin hold and resumed once the system was released (a hold/release sketch follows this list). Assuming the job ids don't change, PBS qsub dependencies should enable the workflow to proceed as intended.
        • If only short-lived stagein jobs are supported by ALCF then V2 might be an option. During PM all V2 jobs in the queue could be put on admin hold and then resumed (with originally-specified user holds in place) once the system was released. Assuming the job ids don't change, the workflow should proceed as intended.
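        • Hold/release sketch for the two options above (assumes operator holds via qhold/qrls -h o and job-name selection via qselect -N; the exact hold type and selection criteria are open questions):
          # Before the outage: put queued V1 stagein jobs (stageWait.sh) on an operator hold
          qselect -N stageWait.sh | xargs --no-run-if-empty qhold -h o
          # After the system is released: drop the operator hold (for V2, the
          # originally-specified user holds would remain until the workflow itself runs qrls)
          qselect -N stageWait.sh | xargs --no-run-if-empty qrls -h o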
      3. What happens to staging workflow jobs during ad hoc planned outages?
        • No known differences between ad hoc planned outages and PM outages (wrt data staging workflows)
    2. How best to manage staging jobs in the face of unplanned system outages (understand failure handling generally for PBS jobs)
      Issue open
      Notes:
      1. What happens when the PBS postgres db goes down while jobs are running?
      2. [vagrant@pbs-sn1 ~]$ ./dataStagingWorkflow.sh
        
        1041.pbs-sn1
        1042.pbs-sn1
        1043.pbs-sn1
        1044.pbs-sn1
        
        [vagrant@pbs-sn1 ~]$ qstat
        
        Job id            Name             User              Time Use S Queue
        ----------------  ---------------- ----------------  -------- - -----
        1041.pbs-sn1      stageWait.sh     vagrant           00:00:00 R workq
        1042.pbs-sn1      cleanup.sh       vagrant                  0 H workq
        1043.pbs-sn1      computeJob.sh    vagrant                  0 H workq
        1044.pbs-sn1      stageNoWait.sh   vagrant                  0 H workq
        
        [vagrant@pbs-sn1 ~]$ sudo pkill -u postgres
        [vagrant@pbs-sn1 ~]$ qstat
        
        Connection refused
        qstat: cannot connect to server pbs-sn1 (errno=111)
        
        
        • Server log
          [vagrant@pbs-sn1 ~]$ sudo tail -15 /var/spool/pbs/server_logs/20200306
          
          03/06/2020 18:47:04;0100;Server@pbs-sn1;Job;1044.pbs-sn1;enqueuing into workq, state 1 hop 1
          03/06/2020 18:47:04;0008;Server@pbs-sn1;Job;1044.pbs-sn1;Job Queued at request of vagrant@pbs-sn1, owner = vagrant@pbs-sn1, job name = stageNoWait.sh, queue = workq
          03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 0 request received from vagrant@pbs-sn1, sock=16
          03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 49 request received from vagrant@pbs-sn1, sock=17
          03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 21 request received from vagrant@pbs-sn1, sock=16
          03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 19 request received from vagrant@pbs-sn1, sock=16
          03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 0 request received from vagrant@pbs-sn1, sock=17
          03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 49 request received from vagrant@pbs-sn1, sock=18
          03/06/2020 18:47:06;0100;Server@pbs-sn1;Req;;Type 21 request received from vagrant@pbs-sn1, sock=17
          03/06/2020 18:47:30;0001;Server@pbs-sn1;Svr;Server@pbs-sn1;job_save, Failed to save job 1041.pbs-sn1 Transaction begin failed: FATAL: terminating connection due to administrator command server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
          03/06/2020 18:47:30;0001;Server@pbs-sn1;Svr;Server@pbs-sn1;panic_stop_db, Panic shutdown of Server on database error. Please check PBS_HOME file system for no space condition.
          03/06/2020 18:47:30;0002;Server@pbs-sn1;Svr;Log;Log closed
        • Postgres log
          [vagrant@pbs-sn1 ~]$ sudo tail -7 /var/spool/pbs/datastore/pg_log/postgresql-Fri.log
          
          2020-03-06 14:08:01 UTC LOG: database system is ready to accept connections
          2020-03-06 14:08:01 UTC LOG: autovacuum launcher started
          2020-03-06 18:47:28 UTC LOG: autovacuum launcher shutting down
          2020-03-06 18:47:28 UTC FATAL: terminating connection due to administrator command
          2020-03-06 18:47:28 UTC LOG: received smart shutdown request
          2020-03-06 18:47:28 UTC LOG: shutting down
          2020-03-06 18:47:28 UTC LOG: database system is shut down
        • Accounting log (note the S record for 1041.pbs-sn1)
          [vagrant@pbs-sn1 ~]$ sudo cat /var/spool/pbs/server_priv/accounting/20200306
          
          03/06/2020 15:00:00;L;license;floating license hour:0 day:0 month:0 max:0
          03/06/2020 16:00:00;L;license;floating license hour:0 day:0 month:0 max:0
          03/06/2020 17:00:00;L;license;floating license hour:0 day:0 month:0 max:0
          03/06/2020 18:00:00;L;license;floating license hour:0 day:0 month:0 max:0
          03/06/2020 18:47:04;Q;1041.pbs-sn1;queue=workq
          03/06/2020 18:47:04;Q;1042.pbs-sn1;queue=workq
          03/06/2020 18:47:04;S;1041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583520424 start=1583520424 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
          03/06/2020 18:47:04;Q;1043.pbs-sn1;queue=workq
          03/06/2020 18:47:04;Q;1044.pbs-sn1;queue=workq
        The next day, after restarting the system and submitting a new workflow, the old jobs were still in the queue alongside the new jobs. Both the old and new jobs executed successfully, and the corresponding stderr and stdout files were created. Accounting records show start and end records for both the old and new jobs.
        [vagrant@pbs-sn1 ~]$ ./dataStagingWorkflow.sh
        
        2041.pbs-sn1
        2042.pbs-sn1
        2043.pbs-sn1
        2044.pbs-sn1
        
        [vagrant@pbs-sn1 ~]$ qstat
        
        Job id            Name             User              Time Use S Queue
        ----------------  ---------------- ----------------  -------- - -----
        1043.pbs-sn1      computeJob.sh    vagrant           00:00:00 R workq
        1044.pbs-sn1      stageNoWait.sh   vagrant                  0 H workq
        2041.pbs-sn1      stageWait.sh     vagrant           00:00:00 R workq
        2042.pbs-sn1      cleanup.sh       vagrant                  0 H workq
        2043.pbs-sn1      computeJob.sh    vagrant                  0 H workq
        2044.pbs-sn1      stageNoWait.sh   vagrant                  0 H workq
        
        [vagrant@pbs-sn1 ~]$ ls -lt
        
        -rw-------.  1 vagrant vagrant  267 Mar  7 13:49 stageNoWait.sh.o2044
        -rw-------.  1 vagrant vagrant    0 Mar  7 13:49 stageNoWait.sh.e2044
        -rw-------.  1 vagrant vagrant  132 Mar  7 13:48 computeJob.sh.o2043
        -rw-------.  1 vagrant vagrant    0 Mar  7 13:48 computeJob.sh.e2043
        -rw-------.  1 vagrant vagrant  258 Mar  7 13:48 stageWait.sh.o2041
        -rw-------.  1 vagrant vagrant    6 Mar  7 13:48 stageWait.sh.e2041
        -rw-------.  1 vagrant vagrant  267 Mar  7 13:48 stageNoWait.sh.o1044
        -rw-------.  1 vagrant vagrant    0 Mar  7 13:48 stageNoWait.sh.e1044
        -rw-------.  1 vagrant vagrant  132 Mar  7 13:48 computeJob.sh.o1043
        -rw-------.  1 vagrant vagrant    0 Mar  7 13:47 computeJob.sh.e1043
        -rw-------.  1 vagrant vagrant  377 Mar  6 19:36 stageWait.sh.e1041
        -rw-------.  1 vagrant vagrant  258 Mar  6 19:36 stageWait.sh.o1041
        
        [vagrant@pbs-sn1 ~]$ sudo cat /var/spool/pbs/server_priv/accounting/20200307
        
        03/07/2020 13:45:34;A;1042.pbs-sn1;Job deleted as result of dependency on job 1041.pbs-sn1
        03/07/2020 13:45:34;E;1041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583520424 start=1583520424 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=6337 end=1583588734 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=33676kb resources_used.ncpus=1 resources_used.vmem=178868kb resources_used.walltime=00:00:10 run_count=1
        03/07/2020 13:47:57;Q;2041.pbs-sn1;queue=workq
        03/07/2020 13:47:57;Q;2042.pbs-sn1;queue=workq
        03/07/2020 13:47:57;S;1043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588734 start=1583588877 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 resource_assigned.ncpus=1
        03/07/2020 13:47:57;S;2041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588877 start=1583588877 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
        03/07/2020 13:47:57;Q;2043.pbs-sn1;queue=workq
        03/07/2020 13:47:57;Q;2044.pbs-sn1;queue=workq
        03/07/2020 13:48:07;E;1043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588734 start=1583588877 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 session=5946 end=1583588887 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=3456kb resources_used.ncpus=1 resources_used.vmem=35892kb resources_used.walltime=00:00:10 run_count=1
        03/07/2020 13:48:07;S;1044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588887 start=1583588887 exec_host=pbs-cc2/1 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
        03/07/2020 13:48:10;E;1044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583520424 qtime=1583520424 etime=1583588887 start=1583588887 exec_host=pbs-cc2/1 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=6019 end=1583588890 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:01 run_count=1
        03/07/2020 13:48:49;A;2042.pbs-sn1;Job deleted as result of dependency on job 2041.pbs-sn1
        03/07/2020 13:48:49;E;2041.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588877 start=1583588877 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=5975 end=1583588929 Exit_status=0 resources_used.cpupercent=7 resources_used.cput=00:00:01 resources_used.mem=33728kb resources_used.ncpus=1 resources_used.vmem=178868kb resources_used.walltime=00:00:51 run_count=1
        03/07/2020 13:48:49;S;2043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588929 start=1583588929 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 resource_assigned.ncpus=1
        03/07/2020 13:49:01;E;2043.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=computeJob.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588929 start=1583588929 exec_host=pbs-cc1/0 exec_vnode=(pbs-cc1:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc1 session=5996 end=1583588941 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:10 run_count=1
        03/07/2020 13:49:01;S;2044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588941 start=1583588941 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 resource_assigned.ncpus=1
        03/07/2020 13:49:03;E;2044.pbs-sn1;user=vagrant group=vagrant project=_pbs_project_default jobname=stageNoWait.sh queue=workq ctime=1583588877 qtime=1583588877 etime=1583588941 start=1583588941 exec_host=pbs-cc2/0 exec_vnode=(pbs-cc2:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.nodes=pbs-cc2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:host=pbs-cc2 session=6085 end=1583588943 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:01 run_count=1
        
        
        
      3. What happens when the PBS postgres db goes down prior to qsub?
      4. [vagrant@pbs-sn1 ~]$ sudo pkill -u postgres
        [vagrant@pbs-sn1 ~]$ ./dataStagingWorkflow.sh
        
        qsub: Failed to save job/resv, refer server logs for details
        
        ...
        
        [vagrant@pbs-sn1 ~]$ sudo cat /var/spool/pbs/server_logs/20200305
        ...
        
        03/05/2020 21:20:09;0100;Server@pbs-sn1;Job;48.pbs-sn1;enqueuing into workq, state 1 hop 1
        03/05/2020 21:20:09;0001;Server@pbs-sn1;Svr;Server@pbs-sn1;job_save, Failed to save job 48.pbs-sn1 Transaction begin failed: no connection to the server
        03/05/2020 21:20:09;0100;Server@pbs-sn1;Job;48.pbs-sn1;dequeuing from workq, state 1
        03/05/2020 21:20:09;0001;Server@pbs-sn1;Job;48.pbs-sn1;job_purge, Removal of job from datastore failed
        03/05/2020 21:20:09;0080;Server@pbs-sn1;Req;req_reject;Reject reply code=15161, aux=0, type=5, from vagrant@pbs-sn1
        03/05/2020 21:20:09;0040;Server@pbs-sn1;Svr;pbs-sn1;Scheduler sent command 1
        03/05/2020 21:20:09;0040;Server@pbs-sn1;Svr;pbs-sn1;Scheduler sent command 0
        
        ...
        
      5. What happens when the PBS server goes down?
      6. What happens when a MOM goes down?
      7. What happens if the node running a data workflow script crashes?
      8. Ignorable test setup scribbles for Lisa's laptop
        • (Failure test deployment dir: /Users/childers/code/pbs/pbs-20200220/pbs-vagrant)
        • (ssh pbs-cc1 for xfer client machine on failure deployment only)
        • PBS db files can be found in /var/spool/pbs/datastore on pbs-sn1
        • Reset the pbs postgres datastore password:
          sudo ./pbs_ds_password
          
    3. What should the mechanism be, if any, to track the state of data staging workflow jobs?
      Issue open
      Notes:
      1. PBS Pro has a built-in state tracking mechanism using an internal postgres db. What data are stored in the internal PBS db?
      2. Important state for data staging workflows might include:
        • PBS job ids
        • qsub job dependencies
        • qsub job environment variable settings
        • Job return codes
        • Globus task ids
        • Workflow dependencies/structure
        • Job user hold setting
        • Other data?
      3. What is the current mechanism for tracking job state on Theta? What state is tracked?
    4. What is the impact of user job crashes on data staging workflows?
      Issue open
      Notes:
      1. In general we implement a "let it crash" design for the services under ALCF's control (including the PBS server, scheduler, MoMs, etc.). However, user jobs will never be fully reliable because users write them.
      2. Might we create an optional "let it crash" monitoring service for users' PBS jobs?
        1. Such a service might allow users to specify how/what to monitor
        2. Investigate the trigger-action paradigm (e.g., IFTTT, Tasker, Google Rules)
          1. Simple rule: 1 trigger and 1 action
          2. More complex rules: 1+ triggers and 1+ actions
          3. "state" trigger example: PBS job substate==54 (job being aborted by PBS server)
          4. "sensor" trigger example: "disk quota exceeded"
          5. Action example: send email to user
          6. What triggers/actions might be of interest to ALCF users?
        3. PBS qsub currently has a limited trigger mechanism bound to an email action:
          • Send mail when job is aborted: qsub -m a
          • Send mail when job begins execution: qsub -m b
          • Send mail when job terminates: qsub -m e
        4. PBS also has a server event trigger mechanism bound to a "hook" plugin system that launches (admin-privileged) scripts
          • Side note: ALCF (via Eric and Brian) is contributing a new server management event to PBS. This contribution enables site-specific actions to be defined for server management operations (e.g., adding a node, importing a hook script, etc.)
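        5. A minimal trigger->action sketch for the substate example above (hypothetical monitoring script, not an existing ALCF or PBS feature; assumes qstat -f reports a "substate" attribute and that local mail delivery works):
          #!/bin/bash
          # Trigger: PBS job substate 54 (job being aborted by the PBS server)
          # Action:  email the job owner
          jobid=$1
          substate=$(qstat -f "$jobid" 2>/dev/null | awk -F' = ' '/substate/ {print $2}')
          if [ "$substate" = "54" ]; then
              echo "Job $jobid entered substate 54 (being aborted)" | mail -s "PBS job $jobid aborting" "$USER"
          fi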
      3. Any post-crash resubmission logic must not re-execute staging/compute jobs that have successfully completed
        1. Can resubmission be made idempotent by caching PBS job IDs and Globus task IDs? (see the sketch below)
        2. Workflows more complex than V1 and V2 should be considered
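        3. Sketch of one possible idempotency guard (hypothetical cache file; it relies on the existing tid=$1 handling in the wrapper scripts):
          # Record the Globus task id the first time a stagein is submitted; on
          # resubmission, pass the cached id so the transfer is not requested twice
          cache=$HOME/.staging_task_id
          if [ -s "$cache" ]; then
              ./dataStagingWorkflow.sh "$(cat $cache)"
          else
              ./dataStagingWorkflow.sh
              # (the stagein job would need to write the new Globus task id to $cache)
          fi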
    5. Additional use scenarios that would be good to add to the demo
      Issue open
      Notes:
      1. A two-week preventive maintenance outage scenario
      2. A "walltime exceeded" scenario
      3. others?
    6. What about walltime? Can staging jobs run "forever"? Will the scheduler start a job whose walltime would cross a preventive maintenance reservation boundary? Find out more about the PBS walltime implementation.
      Issue open
      Notes:
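      1. A minimal sketch of requesting a two-week walltime for the V1 stagein job (standard qsub walltime syntax; whether the scheduler will start or hold such a job across a PM reservation is the open question):
        qsub -l walltime=336:00:00 -l nodes=pbs-cc2 \
             -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory \
             stageWait.sh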
    7. Figure out how to specify resources for staging jobs, perhaps via a resource group specification (e.g., "staging_resources"). For preventive maintenance purposes it may be useful to route all staging jobs to the same queue. Data staging jobs need to be excluded from sbank accounting.
      Issue open
      Notes:
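      1. One possible shape of the request (hypothetical "staging" queue and custom resource name; placeholders only, not an AIG recommendation):
        # Route staging jobs to a dedicated queue so they can be managed as a group
        qsub -q staging -l nodes=pbs-cc2 stageWait.sh
        # ...or tag them with a site-defined custom resource for routing/accounting
        qsub -l nodes=pbs-cc2 -l staging_resources=true stageWait.sh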
    8. In addition to stdout/stderr, it would be good to show the PBS logs as part of the demo. Check out /var/spool/pbs/ (on both the service and client nodes) to find interesting data. (Also consider: is it worth recreating the Cobalt log additions? If we are the only ones who care about them, maybe not all Cobalt log additions need to migrate to PBS.)
      Issue open
      Notes:
    9. How best to collate stderr and stdout for the whole workflow?
      Issue open
      Notes:
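      1. Post-hoc collation sketch (assumes the default <jobname>.o<seq>/<jobname>.e<seq> output files shown elsewhere in these notes; sequence numbers are illustrative):
        # Stitch the per-job stdout/stderr files from one workflow into a single log,
        # in job-sequence order
        for seq in 1041 1042 1043 1044; do
            for f in *.o$seq *.e$seq; do
                [ -e "$f" ] && { echo "===== $f ====="; cat "$f"; }
            done
        done > workflow-1041.log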
  2. Notes
    1. While executing a 'globus login' CLI command, a Globus consent page appears during the OAuth2 process. Will this always happen or is this an artifact of the demo deployment? Can consents be made sticky on the Globus side so users will only need to give consent once?
      Issue closed: yes, Globus consents are saved until explicitly rescinded by the user
      Notes:
    2. Are there any human-friendly names for Globus endpoint UUIDs?
      Issue closed
      Notes:
  3. Todos
    1. Investigate open issues as time permits
    2. Talk to Balsam folks about data staging workflows
      1. Balsam: Near Real-time Experimental Data Analysis on Supercomputers
      2. Are Globus transfers represented as tasks in the Balsam service?
      3. How does Balsam know when the Globus transfer finishes?
      4. How does Balsam behave when a Globus request submission fails? When the transfer itself fails?
      5. Does Balsam support long-lived transfers (2+ weeks)?
      6. What happens to Balsam tasks if the Balsam server node goes down for a planned/unplanned outage?
      7. Are Balsam tasks idempotent, or does the user need to guard against accidental resubmission of compute jobs?
      8. Are there plans to produce a Balsam launcher for PBS Pro?
    3. Show demo to operations folks
    4. Talk to Andrew
    5. Talk to Brian about JLSE
    6. Show to others beyond ops (catalysts, ?)
    7. When we get this all figured out, maybe make a video for users

Section C: Implementation Details

Used in Both V1 & V2

V1 & V2: Compute job script

[vagrant@pbs-sn1 ~]$ cat computeJob.sh

#!/bin/bash
date
echo "Hello from computeJob running on host:"
hostname
echo "Sleeping 10 seconds... zzz"
sleep 10
date
exit 0

V1 & V2: File stageout submission script w/no monitoring

[vagrant@pbs-sn1 ~]$ cat stageNoWait.sh

#!/bin/bash
date
echo "Hello from stageNoWait running on host:"
hostname

echo "Submitting 'globus transfer' request..."
tid=$(~/.local/bin/globus transfer --jq "task_id" --format UNIX --skip-activation-check \
   --label="$TRANSFER_LABEL" $SRC_DIR $DEST_DIR --recursive)

if [ -z "$tid" ]; then
    echo "Transfer request submission failed!"
    exit 1
fi
echo "Transfer request $tid submitted. To monitor progress, use the Globus service. Bye bye!"
date
exit 0

V1 & V2: Failure cleanup script

[vagrant@pbs-sn1 ~]$ cat cleanup.sh

#!/bin/bash
date
echo "Hello from cleanup running on host:"
hostname
echo "No data for you!"
date
exit 0

Used in V2 Only

V2: Wrapper script for data staging workflow

[vagrant@pbs-sn1 ~]$ cat v2DataStagingWorkflow.sh

#!/bin/bash

# Set up convenience vars for oft-used Globus endpoint UUIDs
GlobusTutorialEndpoint=ddb59aef-6d04-11e5-ba46-22000b92c6ec
LaptopEndpoint=61d8a676-447d-11ea-ab4d-0a7959ea6081
ThetaDTN=08925f04-569f-11e7-bef8-22000b9a448b

# Set up convenience vars for source/dest directories at each of the Globus endpoints
GlobusDirectory=/share/godata/
LaptopDirectory=/~/data/globus/xfer/
ThetaDirectory=/~/dmtestdata/

# Put the compute job in the queue with a user hold; if stagein succeeds the hold should be released
computeJob=$(qsub -l nodes=pbs-cc1 -h computeJob.sh)
if [ $? -ne 0 ]; then
    echo "Compute job qsub failed! Aborting..."
    exit 1
fi
echo $computeJob

# Put the cleanup job in the queue with a user hold; if stagein fails the hold should be released
cleanupJob=$(qsub -l nodes=pbs-cc1 -h cleanup.sh)
echo $cleanupJob

# If compute job succeeds, run stageout job
stageOut=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stageout request',SRC_DIR=$GlobusTutorialEndpoint:$GlobusDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory -W depend=afterok:$computeJob stageNoWait.sh)
echo $stageOut

# The first step in this data staging workflow is to run the stagein job and monitor progress
globusXferMonitor=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,ComputeJob=$computeJob,CleanupJob=$cleanupJob,StageoutJob=$stageOut,tid=$1 globusXferPlusMonitor.sh)
echo $globusXferMonitor

exit 0

V2: File stagein submission script w/initiation of serial monitoring

[vagrant@pbs-sn1 ~]$ cat globusXferPlusMonitor.sh

#!/bin/bash
date
echo "Hello from globusXferPlusMonitor running on host:"
hostname

if [ -z "$tid" ]; then
    echo "Submitting 'globus transfer' request..."
    tid=$(~/.local/bin/globus transfer --jq "task_id" --format UNIX --skip-activation-check \
          --label="$TRANSFER_LABEL" $SRC_DIR $DEST_DIR --recursive)

    if [ -z "$tid" ]; then
        echo "Transfer request submission failed!"
        qrls $CleanupJob
        qdel $ComputeJob
        qdel $StageoutJob
        date
        exit 1
    fi
fi

# Run a job to monitor the stagein
globusTaskMonitor=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,ComputeJob=$ComputeJob,CleanupJob=$CleanupJob,StageoutJob=$StageoutJob,MonitorSeconds="150",tid=$tid globusTaskMonitor.sh)
echo $globusTaskMonitor

date
echo "First globusTaskMonitor jobid is $globusTaskMonitor; Globus task id is $tid"
exit 0

V2: File stagein serial monitoring script

[vagrant@pbs-cc2 ~]$ cat globusTaskMonitor.sh

#!/bin/bash
date
echo "Hello from globusTaskMonitor running on host:"
hostname
echo "globusTaskMonitor jobid is $PBS_JOBID"
echo "jobid on hold is $ComputeJob"

if [ -z "$MonitorSeconds" ]; then
    MonitorSeconds=300
fi

LoopEnd=$((SECONDS+$MonitorSeconds))
echo "globusTaskMonitor will run for approximately $LoopEnd seconds or until Globus task $tid terminates."

while [ $SECONDS -lt $LoopEnd ]
do
    status=$(~/.local/bin/globus task show --jq "status" --format=UNIX $tid)
    if [ $status == "FAILED" ]; then
        echo "No longer waiting... Transfer failed!"
        `qrls $CleanupJob`
        `qdel $ComputeJob`
        `qdel $StageoutJob`
        date
        exit 1
    else
        if [ $status == "SUCCEEDED" ]; then
            echo "No longer waiting... Transfer succeeded!"
            `qrls $ComputeJob`
            `qdel $CleanupJob`
            date
            exit 0
        fi
    fi
    sleep 10
done

# The task has not terminated within the requested monitor interval; launch a new monitor and exit
globusTaskMonitor=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,ComputeJob=$ComputeJob,CleanupJob=$CleanupJob,StageoutJob=$StageoutJob,MonitorSeconds=$MonitorSeconds,tid=$tid globusTaskMonitor.sh)
echo "New monitor is jobid $globusTaskMonitor"

date
exit 0

Used in V1 Only

V1: Wrapper script for data staging workflow

[vagrant@pbs-sn1 ~]$ cat ./dataStagingWorkflow.sh

#!/bin/bash

# Set up convenience vars for oft-used Globus endpoint UUIDs
GlobusTutorialEndpoint=ddb59aef-6d04-11e5-ba46-22000b92c6ec
LaptopEndpoint=61d8a676-447d-11ea-ab4d-0a7959ea6081
ThetaDTN=08925f04-569f-11e7-bef8-22000b9a448b

# Set up convenience vars for source/dest directories at each of the Globus endpoints
GlobusDirectory=/share/godata/
LaptopDirectory=/~/data/globus/xfer/
ThetaDirectory=/~/dmtestdata/

# Run stagein job
stageIn=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stagein request',SRC_DIR=$ThetaDTN:$ThetaDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory,tid=$1 stageWait.sh)
echo $stageIn

# If stagein job fails, run clean-up job
executeFailedRequestCleanup=$(qsub -l nodes=pbs-cc1 -W depend=afternotok:$stageIn cleanup.sh)
echo $executeFailedRequestCleanup

# If stagein job succeeds, run real job
executeComputeJob=$(qsub -l nodes=pbs-cc1 -W depend=afterok:$stageIn computeJob.sh)
echo $executeComputeJob

# If both stagein & desired jobs succeed, run the stageout job
stageOut=$(qsub -l nodes=pbs-cc2 -v TRANSFER_LABEL='ALCF stageout request',SRC_DIR=$GlobusTutorialEndpoint:$GlobusDirectory,DEST_DIR=$LaptopEndpoint:$LaptopDirectory -W depend=afterok:$stageIn:$executeComputeJob stageNoWait.sh)
echo $stageOut

exit 0

V1: File stagein submission script w/long-lived monitoring

[vagrant@pbs-sn1 ~]$ cat stageWait.sh

#!/bin/bash
date
echo "Hello from stageWait running on host:"
hostname

if [ -z "$tid" ]; then
    echo "Submitting 'globus transfer' request..."
    tid=$(~/.local/bin/globus transfer --jq "task_id" --format UNIX --skip-activation-check \
          --label="$TRANSFER_LABEL" $SRC_DIR $DEST_DIR --recursive)

    if [ -z "$tid" ]; then
        echo "Transfer request submission failed!"
        date
        exit 1
    fi
fi

echo "Waiting for transfer $tid to finish..."
# Wait until the transfer request succeeds or fails
while :
do
    ~/.local/bin/globus task wait --polling-interval '7' -H $tid
    if [ $? -eq 0 ]; then
        status=$(~/.local/bin/globus task show --jq "status" --format=UNIX $tid)
        if [ $status == "FAILED" ]; then
            echo "No longer waiting... Transfer failed!"
            date
            exit 1
        fi
        if [ $status == "SUCCEEDED" ]; then
            echo "No longer waiting... Transfer succeeded!"
            date
            exit 0
        fi
    else
        status=$(~/.local/bin/globus task show --jq "status" --format=UNIX $tid)
        if [ $status == "FAILED" ]; then
            echo "No longer waiting... wait command returned an error!"
            date
            exit 1
        fi
    fi
done