Update docker_wrapper.py
Added a GPU flag and updated the trigger to also cover docker create calls.
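The diff itself isn't reproduced in this thread, so purely as an illustration, a change along these lines might look like the sketch below. The function name and the GPU detection via the `_CONDOR_AssignedGPUs` argument are assumptions for the example, not the actual docker_wrapper.py code:

```python
#!/usr/bin/env python3
# Rough sketch only -- not the actual docker_wrapper.py. Illustrates adding a
# GPU flag and extending the trigger from 'docker run' to 'docker create'.
import os
import sys

DOCKER = "/usr/bin/docker"  # real docker binary the wrapper re-execs


def inject_gpu_flag(args):
    """Insert '--gpus all' right after the subcommand when the slot has GPUs
    assigned (HTCondor passes them along as '-e _CONDOR_AssignedGPUs=...')."""
    if not args or args[0] not in ("run", "create"):
        return args
    if any("_CONDOR_AssignedGPUs=" in a for a in args):
        # Options must precede the image name, so insert after the subcommand.
        return [args[0], "--gpus", "all"] + args[1:]
    return args


if __name__ == "__main__":
    os.execv(DOCKER, [DOCKER] + inject_gpu_flag(sys.argv[1:]))
```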
Activity
Rene pointed out that the jobs produce this:
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Create_Process succeeded, pid=818813
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Process exited, pid=818813, status=0
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Output file: /var/lib/condor/execute/dir_818809/_condor_stdout
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Error file: /var/lib/condor/execute/dir_818809/_condor_stderr
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Runnning: /etc/condor/scripts/docker_wrapper.py start -a HTCJob738089_0_slot1_1_PID818809
06/02/20 16:39:34 (pid:818809) (D_ALWAYS) unhandled job exit: pid=818813, status=0
06/02/20 16:39:38 (pid:818809) (D_ALWAYS) Process exited, pid=818831, status=0
06/02/20 16:39:48 (pid:818809) (D_ALWAYS) DockerProc::JobExit() container 'HTCJob738089_0_slot1_1_PID818809'
06/02/20 16:39:49 (pid:818809) (D_ALWAYS) All jobs have exited... starter exiting
06/02/20 16:39:49 (pid:818809) (D_ALWAYS) After chmod(), still can't remove "/var/lib/condor/execute/dir_818809" as directory owner, giving up!
06/02/20 16:39:49 (pid:818809) (D_ALWAYS) **** condor_starter (condor_STARTER) pid 818809 EXITING WITH STATUS 0
From what I can see the `unhandled job exit:` part seems relatively common, but I have no idea what causes it. The line

(D_ALWAYS) After chmod(), still can't remove "/var/lib/condor/execute/dir_818809" as directory owner, giving up!

appears to be due to the `xrdcp`-copied file. If I manually remove it in my job, that error disappears and I get:

06/02/20 16:48:33 (pid:820940) (D_ALWAYS) Create_Process succeeded, pid=820944
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Process exited, pid=820944, status=0
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Output file: /var/lib/condor/execute/dir_820940/_condor_stdout
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Error file: /var/lib/condor/execute/dir_820940/_condor_stderr
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Runnning: /etc/condor/scripts/docker_wrapper.py start -a HTCJob738090_0_slot1_1_PID820940
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) unhandled job exit: pid=820944, status=0
06/02/20 16:48:38 (pid:820940) (D_ALWAYS) Process exited, pid=820964, status=0
06/02/20 16:48:38 (pid:820940) (D_ALWAYS) DockerProc::JobExit() container 'HTCJob738090_0_slot1_1_PID820940'
06/02/20 16:48:39 (pid:820940) (D_ALWAYS) All jobs have exited... starter exiting
06/02/20 16:48:39 (pid:820940) (D_ALWAYS) **** condor_starter (condor_STARTER) pid 820940 EXITING WITH STATUS 0
What's strange is the file permissions look fine:
# ls after xrdcp:
total 1.1G
-rw-r--r-- 1 nobody nobody 1.0G Jun 2 14:48 1GB.test
-rw-r--r-- 1 nobody nobody 335 Jun 2 14:48 _condor_stderr
-rw-r--r-- 1 nobody nobody 793 Jun 2 14:48 _condor_stdout
-rwxr-xr-x 1 nobody nobody 854 Jun 2 14:48 condor_exec.exe
-rwxr-xr-x 1 nobody nobody 0 Jun 2 14:48 docker_stderror
drwx------ 2 nobody nobody 6 Jun 2 14:48 tmp
drwx------ 3 nobody nobody 17 Jun 2 14:48 var
The directory is:
# ls -lhd: drwx------ 4 nobody nobody 211 Jun 2 14:48 .
So I'm a little puzzled tbh. @mschnepf, got any clues as to what causes this?
For completeness the corresponding submission and bash script are:
########################
# Submit description file for test program
# Follow instructions from https://wiki.ekp.kit.edu/bin/view/EkpMain/EKPCondorCluster
########################

Executable = test.sh
#Universe = vanilla
Universe = docker
Output = out.test
Error = err.test
Log = log.test

requirements = Machine == "f03-001-151-e.gridka.de"
requirements = CloudSite == "topas"

#request_GPUs = 2

# For the ETP queue specifically
+RemoteJob = True

### 1 hour walltime
+RequestWalltime = 3600
# single CPU
RequestCPUs = 1
### 4GB RAM
RequestMemory = 2000

## select accounting group
### belle, ams,
### cms.top, cms.higgs, cms.production, cms.jet
accounting_group = belle

# Choose a GPU-appropriate image to run inside
docker_image = kahnjms/slc7-condocker-pytorch

Queue
#!/bin/sh

echo "Checking misc environment"
id
pwd
hostname -f
echo "ls -lhd:"
ls -lhd
# echo "=================================="
#ls /usr/bin
echo "=================================="
# $(which nvidia-smi)
# nvidia-smi
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
echo "=================================="
echo "Check Pytorch"
#ls /opt/conda
python3 -c "import torch; print(f'Torch available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}')"
echo "=================================="
echo "Checking xrdcp"
which xrdcp
echo "ls:"
ls -lh
echo "========="
xrdcp root://ceph-node-a.etp.kit.edu:1094//jkahn/1GB.test .
echo "ls after xrdcp:"
ls -lh
echo "Remove file"
rm -f 1GB.test
echo "ls after rm:"
ls -lh
echo "ls -lhd"
ls -lhd
echo "=================================="
echo "Confirm inside container"
cat /proc/1/cgroup
Looked into it further. It seems to be due to the 1GB test file being copied back to the submission host/directory, so I guess that is what's locking up the directory.
I'll play with copying into tmp or another directory and try it out, and I'll have to make this clear in my instructions to users.
OK, the behaviour's still a little strange. Doing an `xrdcp` into the job's workdir `tmp/` directory removes the error. Specifying `transfer_output_files` doesn't fix it (if the copied file is still directly in the job's workdir).

Either way I'll put together the example and make all the caveats clear until I know how to resolve this.
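For illustration, a minimal sketch of that workaround as a job payload in Python. The source URL is just the 1GB test file from the script above; the hard-coded `/tmp` relies on the wrapper bind-mounting the scratch dir's `tmp/` there (visible in the create command quoted later in this thread):

```python
#!/usr/bin/env python3
# Sketch of the workaround: stage the large input into /tmp (which the
# wrapper bind-mounts from <scratch_dir>/tmp on the host) instead of the
# job's working directory, so the starter can remove the scratch dir later.
import os
import subprocess

# Example source only -- the 1GB test file used in this thread.
SOURCE = "root://ceph-node-a.etp.kit.edu:1094//jkahn/1GB.test"

# Note: don't use tempfile.gettempdir() here -- the starter sets TMPDIR to
# the scratch dir itself, which is exactly where the file must not live.
staging_dir = "/tmp"
target = os.path.join(staging_dir, os.path.basename(SOURCE))

subprocess.run(["xrdcp", SOURCE, target], check=True)
print("staged input at", target)

# ... run the actual workload on `target` here ...

os.remove(target)  # tidy up before the job exits
```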
It could be a problem with the user / group namespaces. @jkahn did you do the `ls` command inside the container or on bare metal?

I also looked at the wrapper script from the guys at Nebraska: https://gist.github.com/jthiltges/cf45227c1f8e687d308481d11731562e They added the `--group-add` argument to the wrapper.

I used `ls` inside the container. I'll try it out on bare metal as well to see if there's anything strange.

Also, I didn't clarify in my last comment: I copy a 1GB file via `xrdcp` and touch another file within the job, then copy back only the touched file (specified via `transfer_output_files`). Having the 1GB file in the workdir is what causes the error.

Regarding `--group-add`, our wrapper calls that as well. The full command with the `docker_wrapper` translation is (`=>` shows where the translation happens):

Jun 2 19:20:32 f03-001-151-e docker_wrapper.py: "/etc/condor/scripts/docker_wrapper.py create --cpu-shares=100 --memory=2000m --cap-drop=all --hostname jkahn-738121.0-f03-001-151-e.gridka.de --name HTCJob738121_0_slot1_1_PID847484 --label=org.htcondorproject=True -e CUDA_VISIBLE_DEVICES=0 -e TEMP=/var/lib/condor/execute/dir_847484 -e OMP_NUM_THREADS=1 -e BATCH_SYSTEM=HTCondor -e TMPDIR=/var/lib/condor/execute/dir_847484 -e _CONDOR_JOB_PIDS= -e _CONDOR_AssignedGPUs=CUDA0 -e GPU_DEVICE_ORDINAL=0 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_847484 -e TMP=/var/lib/condor/execute/dir_847484 -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_847484/.job.ad -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_847484 -e _CONDOR_SLOT=slot1_1 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_847484/.chirp.config -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_847484/.machine.ad --volume /var/lib/condor/execute/dir_847484:/var/lib/condor/execute/dir_847484 --volume /var/lib/condor/execute/dir_847484/tmp/:/tmp --volume /var/lib/condor/execute/dir_847484/var/tmp/:/var/tmp --volume /cvmfs:/cvmfs:shared,ro --volume /etc/passwd:/etc/passwd --volume /etc/cvmfs/SITECONF:/etc/cvmfs/SITECONF:ro --volume /etc/xrootd/client.plugins.d/client-plugin-proxy.conf.bak:/etc/xrootd/client.plugins.d/client-plugin-proxy.conf --workdir /var/lib/condor/execute/dir_847484 --user 99:99 --group-add 99 kahnjms/slc7-condocker-pytorch ./condor_exec.exe"

=>

"/usr/bin/docker create --name HTCJob738121_0_slot1_1_PID847484 --group-add 99 --hostname jkahn-738121.0-f03-001-151-e.gridka.de --label org.htcondorproject=True --volume /var/lib/condor/execute/dir_847484:/var/lib/condor/execute/dir_847484 --volume /var/lib/condor/execute/dir_847484/tmp/:/tmp --volume /var/lib/condor/execute/dir_847484/var/tmp/:/var/tmp --volume /cvmfs:/cvmfs:shared,ro --volume /etc/passwd:/etc/passwd --volume /etc/cvmfs/SITECONF:/etc/cvmfs/SITECONF:ro --volume /etc/xrootd/client.plugins.d/client-plugin-proxy.conf.bak:/etc/xrootd/client.plugins.d/client-plugin-proxy.conf --workdir /var/lib/condor/execute/dir_847484 --user 99:99 --env CUDA_VISIBLE_DEVICES=0 --env TEMP=/var/lib/condor/execute/dir_847484 --env OMP_NUM_THREADS=1 --env BATCH_SYSTEM=HTCondor --env TMPDIR=/var/lib/condor/execute/dir_847484 --env _CONDOR_JOB_PIDS= --env _CONDOR_AssignedGPUs=CUDA0 --env GPU_DEVICE_ORDINAL=0 --env _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* --env _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_847484 --env TMP=/var/lib/condor/execute/dir_847484 --env _CONDOR_JOB_AD=/var/lib/condor/execute/dir_847484/.job.ad --env _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_847484 --env _CONDOR_SLOT=slot1_1 --env _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_847484/.chirp.config --env _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_847484/.machine.ad --memory 2000m --cap-drop all --cpu-shares 100 --gpus all kahnjms/slc7-condocker-pytorch ./condor_exec.exe"
Same output on bare metal (user:group are also 99:99):

# ls -lh /var/lib/condor/execute/dir_986178/
total 1.1G
-rw-r--r-- 1 nobody nobody 1.0G Jun 3 15:26 1GB.test
-rwxr-xr-x 1 nobody nobody 1.3K Jun 3 15:24 condor_exec.exe
-rw-r--r-- 1 nobody nobody 343 Jun 3 15:24 _condor_stderr
-rw-r--r-- 1 nobody nobody 4.8K Jun 3 15:24 _condor_stdout
-rwxr-xr-x 1 nobody nobody 0 Jun 3 15:24 docker_stderror
-rw-r--r-- 1 nobody nobody 0 Jun 3 15:24 hello_world
drwx------ 2 nobody nobody 6 Jun 3 15:24 tmp
drwx------ 3 nobody nobody 17 Jun 3 15:24 var
@mschnepf are we actually performing namespace remapping? I don't see the `nobody` user in `/etc/subuid`, nor is remapping specified in `/etc/docker/daemon.json`.

Also, do we actually need the `--group-add 99` flag since we already specify `--user 99:99`? As I understand it, that makes the `--group-add` useless.

The `ls` from bare metal looks fine. Who is the owner of the job dir `dir_...`? Docker does not use namespace remapping in our setup. However, without specifying the user or group inside the container you are the root user / in the root group. Yes, adding the group ID to the user (`--user 99:99`) should replace the `--group-add 99`.
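For what it's worth, that is easy to check directly. A small sketch (assuming a local docker daemon and the stock `busybox` image; not part of docker_wrapper.py) comparing the `id` output with and without the extra flag:

```python
#!/usr/bin/env python3
# Quick check that '--user 99:99' already sets the primary group to 99, so an
# extra '--group-add 99' only repeats it in the supplementary groups.
import subprocess

for extra in ([], ["--group-add", "99"]):
    cmd = ["docker", "run", "--rm", "--user", "99:99", *extra, "busybox", "id"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(" ".join(cmd))
    print("  ->", result.stdout.strip())
```

Both runs report `gid=99`, which is Matthias's point about the group flag being redundant.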
Hey @mschnepf @rcaspart, do either of you recall the final status of this? There are new merge conflicts I can resolve and ideally merge before I bounce. As far as running jobs goes, the copy into `/tmp` is outlined in the instructions with clear warnings, so at least for now it's sufficient for users to run jobs (assuming they read the instructions).
Hi, from my side I think this is good to go. Just a brief word of warning, though: at the moment caching is disabled on the Tier-3 due to issues with accessing files at GridKa via the proxy (a known bug in dCache which is addressed in a newer version). We expect dCache to be updated at the next GridKa downtime (6th October) and caching to be re-enabled shortly afterwards. In principle this does not have a huge impact on usability for the user, other than input file transfers being slower and using the network connection to ETP every time, so consider it mainly a side note to keep in mind at this point.
Hi, @mhorzela and I made some changes to docker_wrapper.py in the master branch, which is what causes these merge conflicts. However, your changes look good.
added 10 commits
- 016a3449...5075a7aa - 9 commits from branch `master`
- 4404f1ae - Merge branch 'master' into 'feature_jk_docker_gpus'
mentioned in commit 67ea784c