
Update docker_wrapper.py

Merged James Kahn requested to merge feature_jk_docker_gpus into master
3 unresolved threads

Added GPU flag and updated the trigger to include docker create calls
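
For context, the flag being added here maps to Docker's --gpus option, which requires the NVIDIA Container Toolkit on the execute host. A minimal standalone sanity check (the CUDA image tag is only an example) would be:

  # Verify that Docker can pass GPUs through to a container at all:
  docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi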

Merge request reports

Merged by James Kahn 4 years ago (Sep 24, 2020 9:57am UTC)

Merge details

  • Changes merged into master with 67ea784c (commits were squashed).
  • Deleted the source branch.

Activity

  • Looks good to me. :thumbsup:

    • Author Developer

      Rene pointed out that the jobs produce this:

      06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Create_Process succeeded, pid=818813
      06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Process exited, pid=818813, status=0
      06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Output file: /var/lib/condor/execute/dir_818809/_condor_stdout
      06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Error file: /var/lib/condor/execute/dir_818809/_condor_stderr
      06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Runnning: /etc/condor/scripts/docker_wrapper.py start -a HTCJob738089_0_slot1_1_PID818809
      06/02/20 16:39:34 (pid:818809) (D_ALWAYS) unhandled job exit: pid=818813, status=0
      06/02/20 16:39:38 (pid:818809) (D_ALWAYS) Process exited, pid=818831, status=0
      06/02/20 16:39:48 (pid:818809) (D_ALWAYS) DockerProc::JobExit() container 'HTCJob738089_0_slot1_1_PID818809'
      06/02/20 16:39:49 (pid:818809) (D_ALWAYS) All jobs have exited... starter exiting
      06/02/20 16:39:49 (pid:818809) (D_ALWAYS) After chmod(), still can't remove "/var/lib/condor/execute/dir_818809" as directory owner, giving up!
      06/02/20 16:39:49 (pid:818809) (D_ALWAYS) **** condor_starter (condor_STARTER) pid 818809 EXITING WITH STATUS 0

      From what I can see, the unhandled job exit: part seems relatively common, but I have no idea what causes it. The line:

      (D_ALWAYS) After chmod(), still can't remove "/var/lib/condor/execute/dir_818809" as directory owner, giving up!

      appears to be due to the file copied in via xrdcp. If I manually remove it in my job, that error disappears and I get:

      06/02/20 16:48:33 (pid:820940) (D_ALWAYS) Create_Process succeeded, pid=820944
      06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Process exited, pid=820944, status=0
      06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Output file: /var/lib/condor/execute/dir_820940/_condor_stdout
      06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Error file: /var/lib/condor/execute/dir_820940/_condor_stderr
      06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Runnning: /etc/condor/scripts/docker_wrapper.py start -a HTCJob738090_0_slot1_1_PID820940
      06/02/20 16:48:34 (pid:820940) (D_ALWAYS) unhandled job exit: pid=820944, status=0
      06/02/20 16:48:38 (pid:820940) (D_ALWAYS) Process exited, pid=820964, status=0
      06/02/20 16:48:38 (pid:820940) (D_ALWAYS) DockerProc::JobExit() container 'HTCJob738090_0_slot1_1_PID820940'
      06/02/20 16:48:39 (pid:820940) (D_ALWAYS) All jobs have exited... starter exiting
      06/02/20 16:48:39 (pid:820940) (D_ALWAYS) **** condor_starter (condor_STARTER) pid 820940 EXITING WITH STATUS 0

      What's strange is that the file permissions look fine:

      # ls after xrdcp:
      total 1.1G
      -rw-r--r-- 1 nobody nobody 1.0G Jun  2 14:48 1GB.test
      -rw-r--r-- 1 nobody nobody  335 Jun  2 14:48 _condor_stderr
      -rw-r--r-- 1 nobody nobody  793 Jun  2 14:48 _condor_stdout
      -rwxr-xr-x 1 nobody nobody  854 Jun  2 14:48 condor_exec.exe
      -rwxr-xr-x 1 nobody nobody    0 Jun  2 14:48 docker_stderror
      drwx------ 2 nobody nobody    6 Jun  2 14:48 tmp
      drwx------ 3 nobody nobody   17 Jun  2 14:48 var

      The directory is:

      # ls -lhd:
      drwx------ 4 nobody nobody 211 Jun  2 14:48 .

      So I'm a little puzzled tbh. @mschnepf, got any clues as to what causes this?

    • Author Developer

      For completeness, the corresponding submit description and bash script are:

      ########################
      # Submit description file for test program
      # Follow instructions from https://wiki.ekp.kit.edu/bin/view/EkpMain/EKPCondorCluster
      ########################
      Executable    = test.sh
      #Universe      = vanilla
      Universe      = docker
      Output        = out.test
      Error         = err.test
      Log           = log.test 
      requirements	= Machine == "f03-001-151-e.gridka.de"
      requirements	= CloudSite	== "topas"
      #request_GPUs	= 2
      
      # For the ETP queue specifically
      +RemoteJob		= True
      ### 1 hour walltime
      +RequestWalltime = 3600
      # single CPU
      RequestCPUs = 1
      ### 4GB RAM
      RequestMemory = 2000
      
      ## select accounting group 
      ### belle, ams, 
      ### cms.top, cms.higgs, cms.production, cms.jet
      accounting_group = belle
      
      # Choose a GPU-appropriate image to run inside
      docker_image = kahnjms/slc7-condocker-pytorch
      
      Queue

      #!/bin/sh
      echo "Checking misc environment"
      id
      pwd
      hostname -f
      echo "ls -lhd:"
      ls -lhd
      
      # echo "=================================="
      #ls /usr/bin
      echo "=================================="
      # $(which nvidia-smi)
      # nvidia-smi
      
      echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
      
      echo "=================================="
      echo "Check Pytorch"
      #ls /opt/conda
      python3 -c "import torch; print(f'Torch available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}')"
      echo "=================================="
      
      echo "Checking xrdcp"
      which xrdcp
      echo "ls:"
      ls -lh
      echo "========="
      xrdcp root://ceph-node-a.etp.kit.edu:1094//jkahn/1GB.test .
      echo "ls after xrdcp:"
      ls -lh
      
      echo "Remove file"
      rm -f 1GB.test
      echo "ls after rm:"
      ls -lh
      echo "ls -lhd"
      ls -lhd
      
      echo "=================================="
      echo "Confirm inside container"
      cat /proc/1/cgroup
    • Author Developer

      Looked into it further. It seems to be due to the 1GB test file being copied back to the submission host/directory. I guess that's what is locking up the directory.

      I'll play with copying into tmp or another directory and try it out, and I'll have to make this clear in my instructions to users.
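
      A rough sketch of what that would look like in the job script (same xrootd endpoint as above; whether it actually avoids the cleanup error still needs confirming):

      # Stage the large input into /tmp instead of the job workdir so it does not
      # end up in the set of files transferred back to the submission host.
      xrdcp root://ceph-node-a.etp.kit.edu:1094//jkahn/1GB.test /tmp/
      # ... work on /tmp/1GB.test ...
      rm -f /tmp/1GB.test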

  • I think you should include something like transfer_output_files = *your output file* in the jdl. Otherwise HTCondor will try to transfer any newly created file in the working directory (excluding subdirectories) back to the submission host.
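
    For example (the output file name below is just a placeholder), something along these lines in the submit description:

    # Only transfer the named output back to the submission host; staged-in
    # files such as the 1GB test file then stay out of the output transfer.
    transfer_output_files = my_output.txt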

  • Author Developer

    Ok, the behaviour's still a little strange. Doing an xrdcp into the job workdir's tmp/ directory removes the error. Specifying transfer_output_files doesn't fix it (if the copied file is still directly in the job's workdir).

    Either way I'll put together the example and make all the caveats clear until I know how to resolve this.

  • It could be a problem with the user / group namespaces. @jkahn did you run the ls command inside the container or on bare metal?

    I also looked at the wrapper script from the guys at Nebraska (https://gist.github.com/jthiltges/cf45227c1f8e687d308481d11731562e); they added the --group-add argument to the wrapper.

  • Author Developer

    I used ls inside the container. I'll try it on bare metal as well to see if there's anything strange.

    Also, something I didn't clarify in my last comment: I copy a 1GB file via xrdcp and within the job touch another file, then copy back only the touched file (specified via transfer_output_files). Having the 1GB file in the workdir is what causes the error.

    Regarding --group-add, our wrapper calls that as well. The full command with the docker_wrapper translation is below (=> marks where the translation happens):

    Jun  2 19:20:32 f03-001-151-e docker_wrapper.py:
    "/etc/condor/scripts/docker_wrapper.py create --cpu-shares=100 --memory=2000m --cap-drop=all --hostname jkahn-738121.0-f03-001-151-e.gridka.de --name HTCJob738121_0_slot1_1_PID847484 --label=org.htcondorproject=True -e CUDA_VISIBLE_DEVICES=0 -e TEMP=/var/lib/condor/execute/dir_847484 -e OMP_NUM_THREADS=1 -e BATCH_SYSTEM=HTCondor -e TMPDIR=/var/lib/condor/execute/dir_847484 -e _CONDOR_JOB_PIDS= -e _CONDOR_AssignedGPUs=CUDA0 -e GPU_DEVICE_ORDINAL=0 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_847484 -e TMP=/var/lib/condor/execute/dir_847484 -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_847484/.job.ad -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_847484 -e _CONDOR_SLOT=slot1_1 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_847484/.chirp.config -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_847484/.machine.ad --volume /var/lib/condor/execute/dir_847484:/var/lib/condor/execute/dir_847484 --volume /var/lib/condor/execute/dir_847484/tmp/:/tmp --volume /var/lib/condor/execute/dir_847484/var/tmp/:/var/tmp --volume /cvmfs:/cvmfs:shared,ro --volume /etc/passwd:/etc/passwd --volume /etc/cvmfs/SITECONF:/etc/cvmfs/SITECONF:ro --volume /etc/xrootd/client.plugins.d/client-plugin-proxy.conf.bak:/etc/xrootd/client.plugins.d/client-plugin-proxy.conf --workdir /var/lib/condor/execute/dir_847484 --user 99:99 --group-add 99 kahnjms/slc7-condocker-pytorch ./condor_exe
    c.exe"
    =>
     "/usr/bin/docker create --name HTCJob738121_0_slot1_1_PID847484 --group-add 99 --hostname jkahn-738121.0-f03-001-151-e.gridka.de --label org.htcondorproject=True --volume /var/lib/condor/execute/dir_847484:/var/lib/condor/execute/dir_847484 --volume /var/lib/condor/execute/dir_847484/tmp/:/tmp --volume /var/lib/condor/execute/dir_847484/var/tmp/:/var/tmp --volume /cvmfs:/cvmfs:shared,ro --volume /etc/passwd:/etc/passwd --volume /etc/cvmfs/SITECONF:/etc/cvmfs/SITECONF:ro --volume /etc/xrootd/client.plugins.d/client-plugin-proxy.conf.bak:/etc/xrootd/client.plugins.d/client-plugin-proxy.conf --workdir /var/lib/condor/execute/dir_847484 --user 99:99 --env CUDA_VISIBLE_DEVICES=0 --env TEMP=/var/lib/condor/execute/dir_847484 --env OMP_NUM_THREADS=1 --env BATCH_SYSTEM=HTCondor --env TMPDIR=/var/lib/condor/execute/dir_847484 --env _CONDOR_JOB_PIDS= --env _CONDOR_AssignedGPUs=CUDA0 --env GPU_DEVICE_ORDINAL=0 --env _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* --env _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_847484 --env TMP=/var/lib/condor/execute/dir_847484 --env _CONDOR_JOB_AD=/var/lib/condor/execute/dir_847484/.job.ad --env _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_847484 --env _CONDOR_SLOT=slot1_1 --env _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_847484/.chirp.config --env _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_847484/.machine.ad --memory 2000m --cap-drop all --cpu-shares 100 --gpus all kahnjms/slc7-condocker-pytorch ./condor_exec.exe"
    Edited by James Kahn
  • Author Developer

    Same output on bare metal (user:group is also 99:99):

    # ls -lh /var/lib/condor/execute/dir_986178/
    total 1.1G
    -rw-r--r-- 1 nobody nobody 1.0G Jun  3 15:26 1GB.test
    -rwxr-xr-x 1 nobody nobody 1.3K Jun  3 15:24 condor_exec.exe
    -rw-r--r-- 1 nobody nobody  343 Jun  3 15:24 _condor_stderr
    -rw-r--r-- 1 nobody nobody 4.8K Jun  3 15:24 _condor_stdout
    -rwxr-xr-x 1 nobody nobody    0 Jun  3 15:24 docker_stderror
    -rw-r--r-- 1 nobody nobody    0 Jun  3 15:24 hello_world
    drwx------ 2 nobody nobody    6 Jun  3 15:24 tmp
    drwx------ 3 nobody nobody   17 Jun  3 15:24 var
  • James Kahn added 1 commit

  • James Kahn added 1 commit

    • 016a3449 - Update docker_wrapper.py, syntax whoopsie

    • Author Developer

      @mschnepf are we actually performing namespace remapping? I don't see the nobody user in /etc/subuid, nor is remapping specified in /etc/docker/daemon.json.

      Also, do we actually need the --group-add 99 flag, since we already specify --user 99:99? As I understand it, that makes the --group-add redundant.

      Edited by James Kahn
    • The ls output from bare metal looks fine. Who is the owner of the job dir dir_...? Docker does not use namespace remapping in our setup. However, without specifying the user or group, you are the root user / in the root group inside the container. Yes, adding the group ID to the user (--user 99:99) should replace the --group-add 99.

      Edited by Matthias Schnepf
    • Author Developer

      Ok, nice. The --group-add flag is added somewhere earlier though, not in the wrapper. I imagine it's harmless.

    • The --group-add is added by condor directly. From what I can tell, it's mainly intended to make not only the user's primary group but also the secondary groups available inside the container.
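
      A quick way to see the difference (rough illustration with an arbitrary small image and an arbitrary extra group; the exact id output depends on the image's /etc/passwd and /etc/group):

      # --user sets the uid and primary gid; --group-add only supplies supplementary groups
      docker run --rm --user 99:99 busybox id
      #   uid=99 gid=99 groups=99 (roughly)
      docker run --rm --user 99:99 --group-add 12345 busybox id
      #   uid=99 gid=99 groups=99,12345 (roughly)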

    • Author Developer

      Hey @mschnepf @rcaspart, do either of you recall the final status of this? There are new merge conflicts which I can resolve, and ideally merge before I bounce. As far as running jobs goes, the copy into /tmp is outlined in the instructions with clear warnings, so at least for now it's sufficient for users to run jobs (assuming they read the instructions).

    • Author Developer

      Tbh I think I answered my own question, so with a blessing from either of you I'll resolve and merge.

  • Hi, from my side I think this is good to go, though just as a brief word of warning: at the moment caching is disabled on the Tier-3 due to issues with accessing files at GridKa via the proxy (a known bug in dCache which is addressed in a newer version). We expect dCache to be updated at the next GridKa downtime (6th October) and the caching to be re-enabled shortly afterwards. In principle this does not have a huge impact on usability, other than the transfer of input files being slower and using the network connection to ETP every time (so consider it mainly a side note to keep in mind at this point).

  • Hi, @mhorzela and I made some changes to docker_wrapper.py in the master branch, which is what caused the merge conflicts. However, your changes look good.

  • James Kahn added 10 commits

  • Author Developer

    Conflicts resolved, I'll merge tomorrow so we're around to make sure nothing breaks.

  • James Kahn approved this merge request

  • merged

  • James Kahn mentioned in commit 67ea784c
