Bump base CUDA image to ubuntu24.04 #1166

Merged
merged 23 commits into from
Dec 4, 2024
d602ff3
Test docker hub ubuntu24.04
DwarKapex Nov 21, 2024
7a93390
Adopt build for ubuntu-24.04
DwarKapex Nov 22, 2024
3f4efa5
Fix build for pax, t5x, gemma
DwarKapex Nov 22, 2024
b2eab65
Use master branch of TF-text
DwarKapex Nov 22, 2024
71ad68b
Fix gemma TF-text urls
DwarKapex Nov 22, 2024
0b452c4
Fix T5x build
DwarKapex Nov 25, 2024
62e7ed7
Address comments
DwarKapex Nov 26, 2024
beb4f82
Fix gemma build
DwarKapex Nov 27, 2024
3c2ec97
Clone airio
DwarKapex Nov 27, 2024
d279373
Merge remote-tracking branch 'origin/main' into vkozlov/move-to-ubunt…
DwarKapex Nov 27, 2024
173ddc5
Update maxtext docker
DwarKapex Nov 27, 2024
92996e3
Uninstall several packages and add PIP_BREAK_SYSTEM_PACKAGES=1 env var
DwarKapex Dec 2, 2024
8993deb
Uninstall several packages and add PIP_BREAK_SYSTEM_PACKAGES=1 env var
DwarKapex Dec 2, 2024
8c10287
Edit remove packages list
DwarKapex Dec 2, 2024
c75c825
Edit remove packages list
DwarKapex Dec 3, 2024
8468c9f
Edit remove packages list
DwarKapex Dec 3, 2024
008b3fc
[skip ci] Resurrect amd64/arm64 dockerfiles
DwarKapex Dec 3, 2024
d633578
[skip ci] Resurrect amd64/arm64 dockerfiles: fix whitespace error
DwarKapex Dec 3, 2024
81b50cc
[skip ci] Resurrect amd64/arm64 dockerfiles: fix whitespace error
DwarKapex Dec 3, 2024
14c52be
Merge branch 'main' into vkozlov/move-to-ubuntu24.04
DwarKapex Dec 3, 2024
96c16a9
Add comment for pip install pip-23.3.1
DwarKapex Dec 3, 2024
8461c7a
Merge branch 'vkozlov/move-to-ubuntu24.04' of github.com:NVIDIA/JAX-T…
DwarKapex Dec 3, 2024
2c1ee0d
remove arch-specific Dockerfiles and add pointer to utopian versions
yhtang Dec 4, 2024
31 changes: 26 additions & 5 deletions .github/container/Dockerfile.base
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:1-labs
ARG BASE_IMAGE=nvidia/cuda:12.6.2-devel-ubuntu22.04
ARG BASE_IMAGE=nvidia/cuda:12.6.2-devel-ubuntu24.04
ARG GIT_USER_NAME="JAX Toolbox"
ARG GIT_USER_EMAIL=jax@nvidia.com
ARG CLANG_VERSION=18
@@ -60,7 +60,8 @@ apt_packages=(
wget
jq
# llvm.sh
lsb-release software-properties-common
lsb-release
software-properties-common
# GCP autoconfig
pciutils hwloc bind9-host
)
@@ -74,8 +75,6 @@ apt-get install -y ${apt_packages[@]}

# Install LLVM/Clang
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" -- ${CLANG_VERSION}
apt-get remove -y software-properties-common lsb-release
apt-get autoremove -y # removes python3-blinker which conflicts with pip-compile in JAX

# Make sure that clang and clang++ point to the new version. This list is based
# on the symlinks installed by the `clang` (as opposed to `clang-14`) and `lld`
@@ -106,6 +105,21 @@ EOL

apt-get clean
rm -rf /var/lib/apt/lists/*

# Several Python packages (listed below) are installed by the OS package manager
# (the `apt-get install` run above) and cannot be uninstalled with pip (in the
# pip-finalize.sh script) during JAX installation. Remove them in advance to
# avoid JAX installation issues.
remove_packages=(
python3-gi
software-properties-common
lsb-release
python3-yaml
python3-pygments
)

apt-get remove -y ${remove_packages[@]}
apt-get autoremove -y # removes python3-blinker which conflicts with pip-compile in JAX
EOF

RUN <<"EOF" bash -ex
@@ -129,7 +143,14 @@ git apply </opt/pip/pip-vcs-equivalency.patch
git add -u
git commit -m 'Adds JAX_TOOLBOX_VCS_EQUIVALENCY as a trigger to treat all github VCS installs for a package as equivalent. The spec of the last encountered version will be used'
EOF
RUN pip install --upgrade --no-cache-dir -e /opt/pip pip-tools && rm -rf ~/.cache/*

# Install all Python packages system-wide.
ENV PIP_BREAK_SYSTEM_PACKAGES=1
# The extra `--ignore-installed` flag below is needed because, after switching to
# version 23.3.1 (from /opt/pip), `pip` tries to uninstall itself (the default pip-24.0)
# and fails, since pip-24.0 was installed by the system tool `apt` rather than by `python`.
# We therefore keep both pip-24.0 and pip-23.3.1 on the system, but use 23.3.1 with the
# equivalency patch (see above).
RUN pip install --upgrade --ignore-installed --no-cache-dir -e /opt/pip pip-tools && rm -rf ~/.cache/*

###############################################################################
## Install TCPx
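The package-removal step in the hunk above can be tried as a dry run outside the image. The following is a minimal sketch, not the PR's code: the `DRY_RUN` switch and the `remove_cmd` array are names introduced here for illustration, and the quoting of the array expansion is a suggested hardening rather than what the Dockerfile does.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same package list as in the Dockerfile.base hunk above.
remove_packages=(
    python3-gi
    software-properties-common
    lsb-release
    python3-yaml
    python3-pygments
)

# Build the command as an array. Quoting "${remove_packages[@]}" keeps each
# package name a single word, with no word splitting or glob expansion.
remove_cmd=(apt-get remove -y "${remove_packages[@]}")

# DRY_RUN is a hypothetical switch for this sketch: by default the command
# is only printed, so the script is safe to run on any machine.
DRY_RUN="${DRY_RUN:-1}"
if [[ "${DRY_RUN}" == "1" ]]; then
    printf '%s\n' "${remove_cmd[*]}"
else
    "${remove_cmd[@]}"
fi
```

Running it with `DRY_RUN=0` inside a container would execute the removal; the Dockerfile follows it with `apt-get autoremove -y`.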
1 change: 0 additions & 1 deletion .github/container/Dockerfile.jax
@@ -36,7 +36,6 @@ RUN --mount=type=ssh \
--mount=type=secret,id=SSH_KNOWN_HOSTS,target=/root/.ssh/known_hosts \
<<"EOF" bash -ex
git-clone.sh ${URLREF_JAX} ${SRC_PATH_JAX}
sed 's/^numpy.*/numpy<2.0.0/' ${SRC_PATH_JAX}/build/requirements.in
git-clone.sh ${URLREF_XLA} ${SRC_PATH_XLA}
EOF

@@ -2,7 +2,7 @@

ARG BASE_IMAGE=ghcr.io/nvidia/jax-mealkit:jax
ARG URLREF_MAXTEXT=https://github.com/google/maxtext.git#main
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#v2.13.0
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG SRC_PATH_MAXTEXT=/opt/maxtext
ARG SRC_PATH_TFTEXT=/opt/tensorflow-text

@@ -17,18 +17,20 @@ FROM ${BASE_IMAGE} as wheel-builder
# build tensorflow-text from source
#------------------------------------------------------------------------------

# Remove TFTEXT build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as tftext-builder
ARG URLREF_TFTEXT
ARG SRC_PATH_TFTEXT

RUN pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0
RUN git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
RUN <<"EOF" bash -exu -o pipefail
pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.13.0
git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
cd ${SRC_PATH_TFTEXT}

# The tftext build script queries GitHub, but these requests are sometimes
# throttled by GH, resulting in a corrupted uri for tensorflow in WORKSPACE.
# A workaround (needs to be updated when the tensorflow version changes):
sed -i "s/# Update TF dependency to installed tensorflow/commit_sha=1cb1a030a62b169d90d34c747ab9b09f332bf905/" oss_scripts/prepare_tf_dep.sh
sed -i "s/# Update TF dependency to installed tensorflow./commit_slug=6550e4bd80223cdb8be6c3afd1f81e86a4d433c3/" oss_scripts/prepare_tf_dep.sh

# Newer versions of LLVM make lld's --undefined-version check strict
# by default (https://reviews.llvm.org/D135402), but the tftext build seems to
@@ -38,14 +40,13 @@ echo "write_to_bazelrc \"build --linkopt='-Wl,--undefined-version'\"" >> oss_scr
./oss_scripts/run_build.sh
EOF


###############################################################################
## Download source and add auxiliary scripts
###############################################################################

FROM ${BASE_IMAGE} as mealkit
ARG URLREF_MAXTEXT
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#v2.13.0
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG SRC_PATH_MAXTEXT
ARG SRC_PATH_TFTEXT=/opt/tensorflow-text

@@ -56,6 +57,17 @@ RUN echo "tensorflow-text @ file://$(ls /opt/tensorflow_text*.whl)" >> /opt/pip-
RUN <<"EOF" bash -ex
git-clone.sh ${URLREF_MAXTEXT} ${SRC_PATH_MAXTEXT}
echo "-r ${SRC_PATH_MAXTEXT}/requirements.txt" >> /opt/pip-tools.d/requirements-maxtext.in

# Specify some version constraints to speed up the build and
# keep pip from downloading and checking all available versions of packages
for pattern in \
"s|absl-py|absl-py>=2.1.0|g" \
"s|protobuf==3.20.3|protobuf>=3.19.0|g" \
"s|tensorflow-datasets|tensorflow-datasets>=4.8.0|g" \
; do
sed -i "${pattern}" ${SRC_PATH_MAXTEXT}/requirements.txt;
done
echo "tensorflow-metadata>=1.15.0" >> ${SRC_PATH_MAXTEXT}/requirements.txt
EOF

###############################################################################
@@ -73,3 +85,6 @@ FROM mealkit as final
RUN pip-finalize.sh

WORKDIR ${SRC_PATH_MAXTEXT}

# When tftext and lingvo wheels are published on pypi.org, revert this
# Dockerfile to 5c4b687b918e6569bca43758c346ad8e67460154
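The requirement rewrites added to the `mealkit` stage above can be checked outside Docker. This is a minimal sketch assuming GNU sed; the temporary file and its three entries are made up for illustration and only stand in for `${SRC_PATH_MAXTEXT}/requirements.txt`.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the MaxText requirements.txt.
req_file="$(mktemp)"
printf '%s\n' 'absl-py' 'protobuf==3.20.3' 'tensorflow-datasets' > "${req_file}"

# Same substitutions as in the Dockerfile hunk above.
for pattern in \
    "s|absl-py|absl-py>=2.1.0|g" \
    "s|protobuf==3.20.3|protobuf>=3.19.0|g" \
    "s|tensorflow-datasets|tensorflow-datasets>=4.8.0|g" \
; do
    sed -i "${pattern}" "${req_file}"
done
echo "tensorflow-metadata>=1.15.0" >> "${req_file}"

# Capture the result, clean up, and show the rewritten constraints.
result="$(cat "${req_file}")"
rm -f "${req_file}"
printf '%s\n' "${result}"
```

Note that the patterns are not anchored, so applying the loop twice would append the constraint twice (for example `absl-py>=2.1.0>=2.1.0`); in the Dockerfile this is not a concern because the loop runs once on a fresh clone.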
34 changes: 0 additions & 34 deletions .github/container/Dockerfile.maxtext.amd64

This file was deleted.

@@ -3,7 +3,7 @@
ARG BASE_IMAGE=ghcr.io/nvidia/jax-mealkit:jax
ARG URLREF_PAXML=https://github.com/google/paxml.git#main
ARG URLREF_PRAXIS=https://github.com/google/praxis.git#main
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#v2.13.0
ARG URLREF_TFTEXT=https://github.com/tensorflow/text.git#master
ARG URLREF_LINGVO=https://github.com/tensorflow/lingvo.git#master
ARG SRC_PATH_PAXML=/opt/paxml
ARG SRC_PATH_PRAXIS=/opt/praxis
@@ -21,18 +21,19 @@ FROM ${BASE_IMAGE} as wheel-builder
# build tensorflow-text from source
#------------------------------------------------------------------------------

# Remove TFTEXT build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as tftext-builder
ARG URLREF_TFTEXT
ARG SRC_PATH_TFTEXT
RUN <<"EOF" bash -exu -o pipefail
pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.13.0
pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0
git-clone.sh ${URLREF_TFTEXT} ${SRC_PATH_TFTEXT}
cd ${SRC_PATH_TFTEXT}

# The tftext build script queries GitHub, but these requests are sometimes
# throttled by GH, resulting in a corrupted uri for tensorflow in WORKSPACE.
# A workaround (needs to be updated when the tensorflow version changes):
sed -i "s/# Update TF dependency to installed tensorflow/commit_sha=1cb1a030a62b169d90d34c747ab9b09f332bf905/" oss_scripts/prepare_tf_dep.sh
sed -i "s/# Update TF dependency to installed tensorflow./commit_slug=6550e4bd80223cdb8be6c3afd1f81e86a4d433c3/" oss_scripts/prepare_tf_dep.sh

# Newer versions of LLVM make lld's --undefined-version check strict
# by default (https://reviews.llvm.org/D135402), but the tftext build seems to
@@ -46,6 +47,7 @@ EOF
# build lingvo
#------------------------------------------------------------------------------

# Remove Lingvo build from source when it has py-3.12 wheels for x86/arm64
FROM wheel-builder as lingvo-builder
ARG URLREF_LINGVO
ARG SRC_PATH_TFTEXT
@@ -55,15 +57,16 @@ ARG SRC_PATH_LINGVO
COPY --from=tftext-builder /opt/manifest.d/git-clone.yaml /opt/manifest.d/git-clone.yaml
COPY --from=tftext-builder ${SRC_PATH_TFTEXT}/tensorflow_text*.whl /opt/

RUN <<"EOF" bash -exu -o pipefail
git-clone.sh ${URLREF_LINGVO} ${SRC_PATH_LINGVO}
EOF

ENV USE_BAZEL_VERSION=7.1.2

# build lingvo
RUN <<"EOF" bash -exu -o pipefail
git-clone.sh ${URLREF_LINGVO} ${SRC_PATH_LINGVO}
pushd ${SRC_PATH_LINGVO}

CPU_ARCH="$(dpkg --print-architecture)"
if [[ "${CPU_ARCH}" == "arm64" ]]; then

# Use aarch distribution of protobufs
patch -p1 <<"EOFINNER"
diff --git a/lingvo/repo.bzl b/lingvo/repo.bzl
@@ -84,13 +87,34 @@ index ce65822d2..d9c0277aa 100644
def icu():
EOFINNER

pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.13.0 /opt/tensorflow_text*.whl
sed -i 's/tensorflow=/#tensorflow=/' docker/dev.requirements.txt
sed -i 's/tensorflow-text=/#tensorflow-text=/' docker/dev.requirements.txt
sed -i 's/dataclasses=/#dataclasses=/' docker/dev.requirements.txt
fi

pip install tensorflow_datasets==4.9.2 auditwheel tensorflow==2.18.0 /opt/tensorflow_text*.whl
for pattern in \
"s|tensorflow=|#tensorflow=|g" \
"s|tensorflow-text=|#tensorflow-text=|g" \
"s|dataclasses=|#dataclasses=|g" \
"s|==.*||g" \
; do
sed -i "${pattern}" ${SRC_PATH_LINGVO}/docker/dev.requirements.txt
done
# Lingvo supports only Python < 3.12, so we patch its dependencies
# to be able to build for Python 3.12
for pattern in \
"s|tensorflow-text~=2.13.0|tensorflow-text~=2.18.0|g" \
"s|tensorflow~=2.13.0|tensorflow~=2.18.0|g" \
"s|python_requires='>=3.8,<3.11'|python_requires='>=3.8,<3.13'|" \
; do
sed -i "${pattern}" ${SRC_PATH_LINGVO}/pip_package/setup.py;
done
pip install -r docker/dev.requirements.txt

# Some tests are flaky right now, so we skip running the tests.
BUILD_ARCH="x86_64"
if [[ "$CPU_ARCH" == "arm64" ]]; then
BUILD_ARCH="aarch64";
fi
sed -i 's/manylinux2014_x86_64/manylinux_2_38_'"${BUILD_ARCH}"'/' pip_package/build.sh
SKIP_TESTS=1 PYTHON_MINOR_VERSION=$(python --version | cut -d ' ' -f 2 | cut -d '.' -f 2) pip_package/build.sh
EOF
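The architecture and Python-version handling at the end of the lingvo build step can be isolated into two small helpers. This is a sketch under assumptions: `wheel_arch` and `python_minor` are hypothetical function names, and the inputs are passed as strings so the example runs without `dpkg` or a particular Python installed.

```shell
#!/usr/bin/env bash
set -eu

# Map a Debian architecture name (as printed by `dpkg --print-architecture`)
# to the machine name used in the wheel platform tag. Like the Dockerfile,
# anything other than arm64 falls back to x86_64.
wheel_arch() {
    case "$1" in
        arm64) echo "aarch64" ;;
        *)     echo "x86_64" ;;
    esac
}

# Extract the minor version from a `python --version` style string,
# using the same cut pipeline as the Dockerfile.
python_minor() {
    echo "$1" | cut -d ' ' -f 2 | cut -d '.' -f 2
}

# Illustrative inputs; in the image they would come from
# `dpkg --print-architecture` and `python --version`.
echo "manylinux_2_38_$(wheel_arch arm64)"
echo "manylinux_2_38_$(wheel_arch amd64)"
python_minor "Python 3.12.3"
```

Passing the inputs as arguments keeps the helpers testable; the Dockerfile inlines the same logic because it only runs once per build.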

@@ -108,15 +132,14 @@ ARG SRC_PATH_TFTEXT

# Preserve version information of tensorflow-text and lingvo
COPY --from=lingvo-builder /opt/manifest.d/git-clone.yaml /opt/manifest.d/git-clone.yaml
COPY --from=lingvo-builder /tmp/lingvo/dist/lingvo*linux_aarch64.whl /opt/
COPY --from=lingvo-builder /tmp/lingvo/dist/lingvo*-linux*.whl /opt/
RUN echo "lingvo @ file://$(ls /opt/lingvo*.whl)" >> /opt/pip-tools.d/requirements-paxml.in

COPY --from=tftext-builder ${SRC_PATH_TFTEXT}/tensorflow_text*.whl /opt/
RUN echo "tensorflow-text @ file://$(ls /opt/tensorflow_text*.whl)" >> /opt/pip-tools.d/requirements-paxml.in

# paxml + praxis
RUN <<"EOF" bash -ex
echo "tensorflow==2.13.0" >> /opt/pip-tools.d/requirements-paxml.in
echo "tensorflow_datasets==4.9.2" >> /opt/pip-tools.d/requirements-paxml.in
echo "auditwheel" >> /opt/pip-tools.d/requirements-paxml.in

@@ -131,11 +154,14 @@ for src in ${SRC_PATH_PAXML} ${SRC_PATH_PRAXIS}; do
for pattern in \
"s| @ git+https://github.com/google/flax||g" \
"s| @ git+https://github.com/google/jax||g" \
"s| @ git+https://github.com/google/fiddle||g" \
"s|^tensorflow|#tensorflow|" \
"s|^lingvo|#lingvo|" \
"s|^scikit-learn|#scikit-learn|" \
"s|^protobuf|#protobuf|" \
"s|^numpy|#numpy|" \
"s|^orbax-checkpoint|#orbax-checkpoint|" \
"s| @ git+https://github.com/google/CommonLoopUtils||g" \
; do
sed -i "${pattern}" */pip_package/requirements.txt requirements.in
done
@@ -148,6 +174,7 @@ for src in ${SRC_PATH_PAXML} ${SRC_PATH_PRAXIS}; do
fi
popd
done
sed -i 's/pysimdjson==[0-9.]*/pysimdjson/' ${SRC_PATH_PAXML}/setup.py
EOF

ADD test-pax.sh /usr/local/bin
@@ -159,3 +186,6 @@ ADD test-pax.sh /usr/local/bin
FROM mealkit as final

RUN pip-finalize.sh

# When tftext and lingvo wheels are published on pypi.org, revert this
# Dockerfile to 5c4b687b918e6569bca43758c346ad8e67460154
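The paxml/praxis requirement patching uses two kinds of sed patterns: one strips a ` @ git+https://...` VCS suffix so that only the package name remains, the other comments out a line so the pin is dropped. The sketch below applies a subset of those patterns to a made-up requirements file; it assumes GNU sed, and the file contents are illustrative rather than copied from either repository.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical requirements file; the real ones live in the paxml and
# praxis repositories.
req_file="$(mktemp)"
printf '%s\n' \
    'flax @ git+https://github.com/google/flax' \
    'tensorflow~=2.9.2' \
    'numpy' \
    'optax' > "${req_file}"

# A subset of the patterns from the Dockerfile loop above.
for pattern in \
    "s| @ git+https://github.com/google/flax||g" \
    "s|^tensorflow|#tensorflow|" \
    "s|^numpy|#numpy|" \
; do
    sed -i "${pattern}" "${req_file}"
done

# Capture the result, clean up, and show the patched requirements.
patched="$(cat "${req_file}")"
rm -f "${req_file}"
printf '%s\n' "${patched}"
```

Lines that match no pattern, such as `optax` here, pass through unchanged, which is why the loop can safely be applied to several requirement files at once.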
53 changes: 0 additions & 53 deletions .github/container/Dockerfile.pax.amd64

This file was deleted.
