Where is it supposed to be? On the host or inside the container?
Currently, the command is not found on the host. Inside the container, without the nvidia.runtime
config option set, it is not found either. And enabling nvidia.runtime
prevents the container from starting…
[2 minutes later]
Ok this is what I found:
me@host:~$ locate nvidia-container
/snap/lxd/6864/bin/nvidia-container-cli
/snap/lxd/6882/bin/nvidia-container-cli
/snap/lxd/6960/bin/nvidia-container-cli
me@host:~$ /snap/lxd/6960/bin/nvidia-container-cli info
basename: missing operand
Try 'basename --help' for more information.
/snap/lxd/6960/bin/nvidia-container-cli: 8: exec: /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli: not found
Looking at /snap/lxd/6960/bin/nvidia-container-cli reveals that it is a wrapper:
#!/bin/sh
# Set environment to run nvidia-container-cli from the host system
export SNAP_CURRENT="$(realpath "${SNAP}/..")/current"
export ARCH="$(basename $(readlink -f ${SNAP_CURRENT}/lib/*-linux-gnu/))"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}:/var/lib/snapd/hostfs/usr/lib/${ARCH}"
exec /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli -r /var/lib/snapd/hostfs/ "$@"
So I tried to debug it a bit:
First of all, I don’t know where the SNAP env var should be coming from. This environment variable is not set on the host, nor in the container.
Without it, the wrapper script fails:
The first variable (SNAP_CURRENT) will not be set correctly:
me@host:~$ export SNAP_CURRENT="$(realpath "${SNAP}/..")/current"
me@host:~$ echo $SNAP_CURRENT
//current
Which cascades into:
me@host:~$ export ARCH="$(basename $(readlink -f ${SNAP_CURRENT}/lib/*-linux-gnu/))"
zsh: no matches found: //current/lib/*-linux-gnu/
basename: missing operand
Try 'basename --help' for more information.
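To spell out where //current comes from: with SNAP empty, "${SNAP}/.." expands to "/..", realpath resolves that to "/", and appending "/current" gives "//current". Here is a small sketch that reproduces it. Note that snap_current is a hypothetical helper I made up to stand in for the wrapper's first line; it is not part of the snap.

```shell
# snap_current is a hypothetical helper mirroring the wrapper's first line;
# its argument stands in for the value of $SNAP.
snap_current() {
    printf '%s/current\n' "$(realpath "${1}/..")"
}

# With an empty value, "/.." resolves to "/", so the result has a double slash.
snap_current ""
```

With an empty argument this prints //current, the same value as in the transcript above.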
So I hypothesized that the SNAP env var was meant to be set to something like this:
me@host:~$ export SNAP="/snap/lxd/6960"
Then:
me@host:~$ export SNAP_CURRENT="$(realpath "${SNAP}/..")/current"
me@host:~$ echo $SNAP_CURRENT
/snap/lxd/current
me@host:~$ export ARCH="$(basename $(readlink -f ${SNAP_CURRENT}/lib/*-linux-gnu/))"
me@host:~$ echo $ARCH
x86_64-linux-gnu
me@host:~$ export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}:/var/lib/snapd/hostfs/usr/lib/${ARCH}"
me@host:~$ echo $LD_LIBRARY_PATH
/home/redacted/bin/gurobi751/linux64/lib:/var/lib/snapd/hostfs/usr/lib/x86_64-linux-gnu
So it looks like it should work now:
me@host:~$ /snap/lxd/6960/bin/nvidia-container-cli info
/snap/lxd/6960/bin/nvidia-container-cli: 8: exec: /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli: not found
Still not working (but the basename: missing operand error is gone).
me@host:~$ ls -l /var/lib/snapd/hostfs
total 0
Rhaaa, I should be inside the snap. Ok then:
sudo nsenter -t $(pgrep daemon.start) -m bash
And now let’s try it again:
me@lxd-snap:~$ ls -l /var/lib/snapd/hostfs
total 116
drwxr-xr-x 2 root root 4096 Apr 27 09:41 bin
drwxr-xr-x 3 root root 4096 May 9 22:23 boot
...
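The difference between the two listings looks like a mount namespace effect: from the host shell /var/lib/snapd/hostfs is empty, while in the namespace entered with nsenter it shows the host's root. A minimal sketch for checking whether the current shell shares a mount namespace with another process follows. same_mnt_ns is a hypothetical helper, and reading another user's /proc/PID/ns/mnt usually needs root.

```shell
# same_mnt_ns PID: succeed if the caller and PID are in the same mount namespace.
# Hypothetical helper; it compares the namespace links that Linux exposes in /proc.
same_mnt_ns() {
    [ "$(readlink /proc/self/ns/mnt)" = "$(readlink "/proc/$1/ns/mnt")" ]
}

# A shell always shares a mount namespace with itself.
if same_mnt_ns "$$"; then
    echo "same mount namespace"
else
    echo "different mount namespace"
fi
```

If the explanation is right, calling it as root with the PID of the snap's daemon should fail from the host shell and succeed from the shell started by nsenter.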
The SNAP env var is still not set by default, so I do it manually:
me@lxd-snap:~$ export SNAP="/snap/lxd/6960"
Moment of truth:
me@lxd-snap:~$ /snap/lxd/6960/bin/nvidia-container-cli info
/snap/lxd/6960/bin/nvidia-container-cli: 8: exec: /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli: not found
Still not!?
me@lxd-snap:~$ ls /var/lib/snapd/hostfs/usr/bin/nvidia-*
/var/lib/snapd/hostfs/usr/bin/nvidia-bug-report.sh /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced
/var/lib/snapd/hostfs/usr/bin/nvidia-cuda-mps-control /var/lib/snapd/hostfs/usr/bin/nvidia-settings
/var/lib/snapd/hostfs/usr/bin/nvidia-cuda-mps-server /var/lib/snapd/hostfs/usr/bin/nvidia-smi
/var/lib/snapd/hostfs/usr/bin/nvidia-debugdump /var/lib/snapd/hostfs/usr/bin/nvidia-xconfig
/var/lib/snapd/hostfs/usr/bin/nvidia-detector
So now I reckon that I’m missing nvidia-container-cli on the host under /usr/bin…
Why is it not there?
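For anyone retracing this, the two failure modes I hit (SNAP not set, and the host binary missing from the hostfs mount) can be checked up front. This is only a sketch under my assumptions: check_wrapper_env is a hypothetical helper, not something shipped in the snap, and the paths are the ones from the wrapper above.

```shell
#!/bin/sh
# check_wrapper_env SNAP_VALUE HOSTFS_ROOT
# Hypothetical preflight check for the two failures seen in this thread.
check_wrapper_env() {
    snap_dir="$1"    # value the wrapper would read from $SNAP
    hostfs="$2"      # where the host filesystem is mounted for the snap

    if [ -z "$snap_dir" ]; then
        echo "SNAP is not set" >&2
        return 1
    fi
    if [ ! -x "${hostfs}/usr/bin/nvidia-container-cli" ]; then
        echo "${hostfs}/usr/bin/nvidia-container-cli not found" >&2
        return 2
    fi
    return 0
}

# Report the result without aborting the calling shell.
check_wrapper_env "${SNAP:-}" /var/lib/snapd/hostfs || echo "preflight failed with status $?"
```

Status 1 corresponds to the basename: missing operand symptom, and status 2 to the exec: ... not found error.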