Update docker_wrapper.py
Added a GPU flag and updated the trigger to also cover docker create calls.
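The diff itself isn't reproduced in this thread, so purely as an illustration, a change along these lines might look like the sketch below. The function name and the GPU detection via the `_CONDOR_AssignedGPUs` argument are assumptions for the example, not the actual docker_wrapper.py code:

```python
#!/usr/bin/env python3
# Rough sketch only -- not the actual docker_wrapper.py. Illustrates adding a
# GPU flag and extending the trigger from 'docker run' to 'docker create'.
import os
import sys

DOCKER = "/usr/bin/docker"  # real docker binary the wrapper re-execs


def inject_gpu_flag(args):
    """Insert '--gpus all' right after the subcommand when the slot has GPUs
    assigned (HTCondor passes them along as '-e _CONDOR_AssignedGPUs=...')."""
    if not args or args[0] not in ("run", "create"):
        return args
    if any("_CONDOR_AssignedGPUs=" in a for a in args):
        # Options must precede the image name, so insert after the subcommand.
        return [args[0], "--gpus", "all"] + args[1:]
    return args


if __name__ == "__main__":
    os.execv(DOCKER, [DOCKER] + inject_gpu_flag(sys.argv[1:]))
```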
Activity
Rene pointed out that the jobs produce this:
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Create_Process succeeded, pid=818813
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Process exited, pid=818813, status=0
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Output file: /var/lib/condor/execute/dir_818809/_condor_stdout
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Error file: /var/lib/condor/execute/dir_818809/_condor_stderr
06/02/20 16:39:33 (pid:818809) (D_ALWAYS) Runnning: /etc/condor/scripts/docker_wrapper.py start -a HTCJob738089_0_slot1_1_PID818809
06/02/20 16:39:34 (pid:818809) (D_ALWAYS) unhandled job exit: pid=818813, status=0
06/02/20 16:39:38 (pid:818809) (D_ALWAYS) Process exited, pid=818831, status=0
06/02/20 16:39:48 (pid:818809) (D_ALWAYS) DockerProc::JobExit() container 'HTCJob738089_0_slot1_1_PID818809'
06/02/20 16:39:49 (pid:818809) (D_ALWAYS) All jobs have exited... starter exiting
06/02/20 16:39:49 (pid:818809) (D_ALWAYS) After chmod(), still can't remove "/var/lib/condor/execute/dir_818809" as directory owner, giving up!
06/02/20 16:39:49 (pid:818809) (D_ALWAYS) **** condor_starter (condor_STARTER) pid 818809 EXITING WITH STATUS 0
From what I can see the `unhandled job exit:` part seems relatively common, but I have no idea what causes it. The line

(D_ALWAYS) After chmod(), still can't remove "/var/lib/condor/execute/dir_818809" as directory owner, giving up!

appears to be due to the `xrdcp`-copied file. If I manually remove it in my job, that error disappears and I get:

06/02/20 16:48:33 (pid:820940) (D_ALWAYS) Create_Process succeeded, pid=820944
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Process exited, pid=820944, status=0
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Output file: /var/lib/condor/execute/dir_820940/_condor_stdout
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Error file: /var/lib/condor/execute/dir_820940/_condor_stderr
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) Runnning: /etc/condor/scripts/docker_wrapper.py start -a HTCJob738090_0_slot1_1_PID820940
06/02/20 16:48:34 (pid:820940) (D_ALWAYS) unhandled job exit: pid=820944, status=0
06/02/20 16:48:38 (pid:820940) (D_ALWAYS) Process exited, pid=820964, status=0
06/02/20 16:48:38 (pid:820940) (D_ALWAYS) DockerProc::JobExit() container 'HTCJob738090_0_slot1_1_PID820940'
06/02/20 16:48:39 (pid:820940) (D_ALWAYS) All jobs have exited... starter exiting
06/02/20 16:48:39 (pid:820940) (D_ALWAYS) **** condor_starter (condor_STARTER) pid 820940 EXITING WITH STATUS 0
What's strange is the file permissions look fine:
# ls after xrdcp:
total 1.1G
-rw-r--r-- 1 nobody nobody 1.0G Jun 2 14:48 1GB.test
-rw-r--r-- 1 nobody nobody 335 Jun 2 14:48 _condor_stderr
-rw-r--r-- 1 nobody nobody 793 Jun 2 14:48 _condor_stdout
-rwxr-xr-x 1 nobody nobody 854 Jun 2 14:48 condor_exec.exe
-rwxr-xr-x 1 nobody nobody 0 Jun 2 14:48 docker_stderror
drwx------ 2 nobody nobody 6 Jun 2 14:48 tmp
drwx------ 3 nobody nobody 17 Jun 2 14:48 var
The directory is:
# ls -lhd: drwx------ 4 nobody nobody 211 Jun 2 14:48 .
So I'm a little puzzled tbh. @mschnepf, got any clues as to what causes this?
For completeness the corresponding submission and bash script are:
########################
# Submit description file for test program
# Follow instructions from https://wiki.ekp.kit.edu/bin/view/EkpMain/EKPCondorCluster
########################

Executable = test.sh
#Universe = vanilla
Universe = docker
Output = out.test
Error = err.test
Log = log.test

requirements = Machine == "f03-001-151-e.gridka.de"
requirements = CloudSite == "topas"

#request_GPUs = 2

# For the ETP queue specifically
+RemoteJob = True

### 1 hour walltime
+RequestWalltime = 3600
# single CPU
RequestCPUs = 1
### 4GB RAM
RequestMemory = 2000

## select accounting group
### belle, ams,
### cms.top, cms.higgs, cms.production, cms.jet
accounting_group = belle

# Choose a GPU-appropriate image to run inside
docker_image = kahnjms/slc7-condocker-pytorch

Queue
#!/bin/sh

echo "Checking misc environment"
id
pwd
hostname -f
echo "ls -lhd:"
ls -lhd
# echo "=================================="
#ls /usr/bin
echo "=================================="
# $(which nvidia-smi)
# nvidia-smi
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
echo "=================================="
echo "Check Pytorch"
#ls /opt/conda
python3 -c "import torch; print(f'Torch available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}')"
echo "=================================="
echo "Checking xrdcp"
which xrdcp
echo "ls:"
ls -lh
echo "========="
xrdcp root://ceph-node-a.etp.kit.edu:1094//jkahn/1GB.test .
echo "ls after xrdcp:"
ls -lh
echo "Remove file"
rm -f 1GB.test
echo "ls after rm:"
ls -lh
echo "ls -lhd"
ls -lhd
echo "=================================="
echo "Confirm inside container"
cat /proc/1/cgroup
Looked into it further. It seems to be due to the 1GB test file being copied back to the submission host/directory, so I guess that is what's locking up the directory.
I'll play with copying into tmp or another directory and try it out, and I'll have to make this clear in my instructions to users.
OK, the behaviour's still a little strange. Doing an `xrdcp` into the job's workdir `tmp/` directory removes the error. Specifying `transfer_output_files` doesn't fix it (if the copied file is still directly in the job's workdir).

Either way I'll put together the example and make all the caveats clear until I know how to resolve this.
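For illustration, a minimal sketch of that workaround as a job payload in Python. The source URL is just the 1GB test file from the script above; the hard-coded `/tmp` relies on the wrapper bind-mounting the scratch dir's `tmp/` there (visible in the create command quoted later in this thread):

```python
#!/usr/bin/env python3
# Sketch of the workaround: stage the large input into /tmp (which the
# wrapper bind-mounts from <scratch_dir>/tmp on the host) instead of the
# job's working directory, so the starter can remove the scratch dir later.
import os
import subprocess

# Example source only -- the 1GB test file used in this thread.
SOURCE = "root://ceph-node-a.etp.kit.edu:1094//jkahn/1GB.test"

# Note: don't use tempfile.gettempdir() here -- the starter sets TMPDIR to
# the scratch dir itself, which is exactly where the file must not live.
staging_dir = "/tmp"
target = os.path.join(staging_dir, os.path.basename(SOURCE))

subprocess.run(["xrdcp", SOURCE, target], check=True)
print("staged input at", target)

# ... run the actual workload on `target` here ...

os.remove(target)  # tidy up before the job exits
```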
It could be a problem with the user / group namespaces. @jkahn did you do the `ls` command inside the container or on bare metal?

I also looked at the wrapper script from the guys at Nebraska: https://gist.github.com/jthiltges/cf45227c1f8e687d308481d11731562e They added the `--group-add` argument to the wrapper.

I used `ls` inside the container. I'll try it out on bare metal as well to see if there's anything strange.

Also, I didn't clarify in my last comment: I copy a 1GB file via `xrdcp` and touch another file within the job, then copy back only the touched file (specified via `transfer_output_files`). Having the 1GB file in the workdir is what causes the error.

Regarding `--group-add`, our wrapper calls that as well. The full command with the `docker_wrapper` translation is (`=>` shows where the translation happens):

Jun 2 19:20:32 f03-001-151-e docker_wrapper.py: "/etc/condor/scripts/docker_wrapper.py create --cpu-shares=100 --memory=2000m --cap-drop=all --hostname jkahn-738121.0-f03-001-151-e.gridka.de --name HTCJob738121_0_slot1_1_PID847484 --label=org.htcondorproject=True -e CUDA_VISIBLE_DEVICES=0 -e TEMP=/var/lib/condor/execute/dir_847484 -e OMP_NUM_THREADS=1 -e BATCH_SYSTEM=HTCondor -e TMPDIR=/var/lib/condor/execute/dir_847484 -e _CONDOR_JOB_PIDS= -e _CONDOR_AssignedGPUs=CUDA0 -e GPU_DEVICE_ORDINAL=0 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_847484 -e TMP=/var/lib/condor/execute/dir_847484 -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_847484/.job.ad -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_847484 -e _CONDOR_SLOT=slot1_1 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_847484/.chirp.config -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_847484/.machine.ad --volume /var/lib/condor/execute/dir_847484:/var/lib/condor/execute/dir_847484 --volume /var/lib/condor/execute/dir_847484/tmp/:/tmp --volume /var/lib/condor/execute/dir_847484/var/tmp/:/var/tmp --volume /cvmfs:/cvmfs:shared,ro --volume /etc/passwd:/etc/passwd --volume /etc/cvmfs/SITECONF:/etc/cvmfs/SITECONF:ro --volume /etc/xrootd/client.plugins.d/client-plugin-proxy.conf.bak:/etc/xrootd/client.plugins.d/client-plugin-proxy.conf --workdir /var/lib/condor/execute/dir_847484 --user 99:99 --group-add 99 kahnjms/slc7-condocker-pytorch ./condor_exec.exe"

=>

"/usr/bin/docker create --name HTCJob738121_0_slot1_1_PID847484 --group-add 99 --hostname jkahn-738121.0-f03-001-151-e.gridka.de --label org.htcondorproject=True --volume /var/lib/condor/execute/dir_847484:/var/lib/condor/execute/dir_847484 --volume /var/lib/condor/execute/dir_847484/tmp/:/tmp --volume /var/lib/condor/execute/dir_847484/var/tmp/:/var/tmp --volume /cvmfs:/cvmfs:shared,ro --volume /etc/passwd:/etc/passwd --volume /etc/cvmfs/SITECONF:/etc/cvmfs/SITECONF:ro --volume /etc/xrootd/client.plugins.d/client-plugin-proxy.conf.bak:/etc/xrootd/client.plugins.d/client-plugin-proxy.conf --workdir /var/lib/condor/execute/dir_847484 --user 99:99 --env CUDA_VISIBLE_DEVICES=0 --env TEMP=/var/lib/condor/execute/dir_847484 --env OMP_NUM_THREADS=1 --env BATCH_SYSTEM=HTCondor --env TMPDIR=/var/lib/condor/execute/dir_847484 --env _CONDOR_JOB_PIDS= --env _CONDOR_AssignedGPUs=CUDA0 --env GPU_DEVICE_ORDINAL=0 --env _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* --env _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_847484 --env TMP=/var/lib/condor/execute/dir_847484 --env _CONDOR_JOB_AD=/var/lib/condor/execute/dir_847484/.job.ad --env _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_847484 --env _CONDOR_SLOT=slot1_1 --env _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_847484/.chirp.config --env _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_847484/.machine.ad --memory 2000m --cap-drop all --cpu-shares 100 --gpus all kahnjms/slc7-condocker-pytorch ./condor_exec.exe"
Same output on bare metal (user:group are also 99:99):

# ls -lh /var/lib/condor/execute/dir_986178/
total 1.1G
-rw-r--r-- 1 nobody nobody 1.0G Jun 3 15:26 1GB.test
-rwxr-xr-x 1 nobody nobody 1.3K Jun 3 15:24 condor_exec.exe
-rw-r--r-- 1 nobody nobody 343 Jun 3 15:24 _condor_stderr
-rw-r--r-- 1 nobody nobody 4.8K Jun 3 15:24 _condor_stdout
-rwxr-xr-x 1 nobody nobody 0 Jun 3 15:24 docker_stderror
-rw-r--r-- 1 nobody nobody 0 Jun 3 15:24 hello_world
drwx------ 2 nobody nobody 6 Jun 3 15:24 tmp
drwx------ 3 nobody nobody 17 Jun 3 15:24 var
@mschnepf are we actually performing namespace remapping? I don't see the `nobody` user in `/etc/subuid`, nor is remapping specified in `/etc/docker/daemon.json`.

Also, do we actually need the `--group-add 99` flag since we already specify `--user 99:99`? As I understand it, that makes the `--group-add` useless.

The `ls` from bare metal looks fine. Who is the owner of the job dir `dir_...`? Docker does not use namespace remapping in our setup. However, without specifying the user or group inside the container you are the root user / in the root group. Yes, adding the group ID to the user (`--user 99:99`) should replace the `--group-add 99`.
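For what it's worth, that is easy to check directly. A small sketch (assuming a local docker daemon and the stock `busybox` image; not part of docker_wrapper.py) comparing the `id` output with and without the extra flag:

```python
#!/usr/bin/env python3
# Quick check that '--user 99:99' already sets the primary group to 99, so an
# extra '--group-add 99' only repeats it in the supplementary groups.
import subprocess

for extra in ([], ["--group-add", "99"]):
    cmd = ["docker", "run", "--rm", "--user", "99:99", *extra, "busybox", "id"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(" ".join(cmd))
    print("  ->", result.stdout.strip())
```

Both runs report `gid=99`, which is Matthias's point about the group flag being redundant.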
Hey @mschnepf @rcaspart, do either of you recall the final status of this? There are new merge conflicts I can resolve and ideally merge before I bounce. As far as running jobs goes, the copy into `/tmp` is outlined in the instructions with clear warnings, so at least for now it's sufficient for users to run jobs (assuming they read the instructions).
Hi, from my side I think this is good to go. Just a brief word of warning, though: at the moment caching is disabled on the Tier-3 due to issues with accessing files at GridKa via the proxy (a known bug in dCache which is addressed in a newer version). We expect dCache to be updated at the next GridKa downtime (6th October) and caching to be re-enabled shortly afterwards. In principle this does not have a huge impact on usability for the user, other than input file transfers being slower and using the network connection to ETP every time, so consider it mainly a side note to keep in mind at this point.
Hi, @mhorzela and I made some changes to docker_wrapper.py in the master branch, which is what causes these merge conflicts. However, your changes look good.
added 10 commits
- 016a3449...5075a7aa - 9 commits from branch `master`
- 4404f1ae - Merge branch 'master' into 'feature_jk_docker_gpus'
mentioned in commit 67ea784c