DKube Developer’s Guide

This section provides instructions on how to develop code that will integrate with the DKube system.

File Paths

For IDE & Run jobs, DKube provides a method to access the files in code, data, and model repositories without needing to know the exact folder within the DKube storage hierarchy. The repos are available in the following paths:

Repo Type   Path
Code        Fixed path: /mnt/dkube/workspace
Dataset     Mount path as described at Mount Path
Model       Mount path as described at Mount Path

The Dataset & Model repos are available at the following paths in addition to the user-configured mount paths:

Repo Type   Path
Dataset     /mnt/dkube/datasets/<user name>/<dataset name>
Model       /mnt/dkube/models/<user name>/<model name>

In the case of AWS (Amazon S3) and Redshift (Amazon Redshift), the mount paths also include the metadata files with the endpoint configuration.
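
For example, a program can read a Dataset directly from its fixed path. The sketch below is a minimal illustration; the user name and dataset name are hypothetical.

import os

# Hypothetical owner "alice" and dataset "mnist"; substitute your own names.
dataset_dir = "/mnt/dkube/datasets/alice/mnist"
for name in os.listdir(dataset_dir):
    print(os.path.join(dataset_dir, name))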

Configuration File

A configuration file can be uploaded into DKube for an IDE or Run (see Configuration File). The configuration file can be accessed from the code at the following location:

/mnt/dkube/config/<config file name>
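
For example, if the uploaded file is JSON it can be loaded as shown below. This is a minimal sketch; the file name config.json is hypothetical, and the DKUBE_JOB_CONFIG_FILE variable described under General Variables also identifies the file specified for the Job.

import json

# Hypothetical configuration file name; substitute the name of the uploaded file.
config_path = "/mnt/dkube/config/config.json"
with open(config_path) as f:
    config = json.load(f)   # assumes the uploaded configuration is JSON
print(config)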

Home Directory

DKube maintains a home directory for each user, at the location:

/home/<user name>

Files for all user-owned resources are created in this area, including metadata for Runs, IDEs, & Inferences. These can be accessed by an IDE.

The following folders are created within the home directory.

Workspace

Contains folders for each Code Repo owned by the user. These can be updated from the source git repo, edited, and committed back to the git repo.

Dataset

Contains folders for each Dataset Repo owned by the user. Each Dataset folder contains subdirectories for each version with the dataset files for the version.

Model

Contains folders for each Model Repo owned by the user. Each Model directory contains subdirectories for each version with the model files for the version.

Notebook

Contains metadata for user IDE instances

Training

Contains metadata for user Training Run instances

Preprocessing

Contains metadata for user Preprocessing Run instances

Inference

Contains metadata for user Inference instances
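
For example, these folders can be listed from a notebook or terminal inside an IDE. The sketch below builds the path from the DKUBE_USER_LOGIN_NAME variable described under General Variables.

import os

# List the per-user folders described above.
home = "/home/%s" % os.getenv("DKUBE_USER_LOGIN_NAME")
for entry in sorted(os.listdir(home)):
    print(entry)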

Amazon S3

DKube has native support for Amazon S3. In order to use S3 within DKube, a Repo must first be created. This is described at S3. This section describes how to access the data and integrate it into your program. The mount path for the S3 Dataset repo contains the config.json & credentials files.

config.json

{
  "Bucket": "<bucket name>",
  "Prefix": "<prefix>",
  "Endpoint": "<endpoint>"
}

credentials

[default]
aws_access_key_id = xxxxxxx
aws_secret_access_key = xxxxxx

In addition, the path /etc/dkube/.aws contains the metadata and credentials for all of the S3 Datasets owned by the user.

/etc/dkube/.aws/config

[default]
bucket = <bucket name 1>
prefix = <prefix 1>
[dataset-2]
bucket = <bucket name 2>
prefix = <prefix 2>
[dataset-3]
bucket = <bucket name 3>
prefix = <prefix 3>

/etc/dkube/.aws/credentials

[default]
aws_access_key_id = xxxxxxx
aws_secret_access_key = xxxxxxxxx
[dataset-2]
aws_access_key_id = xxxxxxx
aws_secret_access_key = xxxxxxxxx
[dataset-3]
aws_access_key_id = xxxxxxx
aws_secret_access_key = xxxxxxxxx
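
These files can be parsed directly by the program. Below is a minimal sketch, assuming boto3 is available in the image and the Dataset was mounted at the hypothetical mount path /opt/dkube/input, that lists the objects in the configured bucket.

import json
import os
import configparser

import boto3  # assumption: boto3 is installed in the image

# Hypothetical mount path chosen when the IDE/Run was created; the config.json
# and credentials files described above are available there.
mount = "/opt/dkube/input"

with open(os.path.join(mount, "config.json")) as f:
    cfg = json.load(f)

creds = configparser.ConfigParser()
creds.read(os.path.join(mount, "credentials"))

s3 = boto3.client(
    "s3",
    endpoint_url=cfg.get("Endpoint") or None,
    aws_access_key_id=creds["default"]["aws_access_key_id"],
    aws_secret_access_key=creds["default"]["aws_secret_access_key"],
)
resp = s3.list_objects_v2(Bucket=cfg["Bucket"], Prefix=cfg.get("Prefix", ""))
for obj in resp.get("Contents", []):
    print(obj["Key"])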

Amazon Redshift

DKube has native support for Amazon Redshift. In order to use Redshift within DKube, a Repo must first be created. This is described at Redshift. This section describes how to access the data and integrate it into your program. Redshift-specific environment variables are listed at Redshift Variables. Redshift can be accessed with or without an API server.

Redshift Access Configuration

Redshift Access with an API Server

In order to configure the API server to fetch the metadata, a kubernetes config map is configured with the following information:

echo " apiVersion: v1 kind: ConfigMap metadata: name: redshift-access-info namespace: dkube data: token: $TOKEN endpoint: $ENDPOINT " | kubectl apply -f -

Variable   Description
TOKEN      Security token for the API server
ENDPOINT   URL for the API server

DKube fetches the list of databases available and associated configuration information such as endpoints and availability region. Additionally, DKube fetches the schemas of the databases from the API server.

Redshift Access without an API Server

By default, DKube uses the following query to fetch the Redshift schemas and show them as versions in the DKube UI when creating a Dataset.

select * from PG_NAMESPACE;

Accessing the Redshift Data from the Program

Redshift data can be accessed from any Notebook or Run.

The metadata to access the Redshift data for the current job is provided at the mount path (see Mount Path) specified when the Job is created.

redshift.json

{
  "rs_name": "<name>",
  "rs_endpoint": "<endpoint>",
  "rs_database": "<database-name>",
  "rs_db_schema": "<schema-name>",
  "rs_user": "<user-name>"
}

Metadata for all of the selected Redshift datasets for the User is available at /etc/dkube/redshift.json for the Job.

[
   {
     "rs_name": "<name 1>",
     "rs_endpoint": "<endpoint 1>",
     "rs_database": "<database-name 1>",
     "rs_db_schema": "<schema-name 1>",
     "rs_user": "<user 1>"
   },
   {
     "rs_name": "<name 2>",
     "rs_endpoint": "<endpoint 2>",
     "rs_database": "<database-name 2>",
     "rs_db_schema": "<schema-name 2>",
     "rs_user": "<user 2>"
   }
]

Redshift Password

The password for the Redshift data is stored encrypted within DKube. The code segment below can be used to retrieve the decrypted password.

import os, requests, json

def rs_fetch_datasets():
    user = os.getenv("DKUBE_USER_LOGIN_NAME")
    url = "http://dkube-controller-master.dkube:5000/dkube/v2/controller/users/%s/datums/class/dataset/datum/%s"
    headers = {"authorization": "Bearer " + os.getenv("DKUBE_USER_ACCESS_TOKEN")}
    datasets = []
    for ds in json.load(open('/etc/dkube/redshift.json')):
        if ds.get('rs_owner', '') != user:
            continue
        resp = requests.get(url % (user, ds.get('rs_name')), headers=headers).json()
        ds['rs_password'] = resp['data']['datum']['redshift']['password']
        datasets.append(ds)
    return datasets

This will return the datasets in the following format:

[ { "rs_name": "name1", "rs_endpoint": "https://xx.xxx.xxx.xx:yyyy", "rs_database": "dkube", "rs_db_schema": "pg_catalog", "rs_user": "user", "rs_owner": "owner", "rs_password": "*****" }, .... ]

Mount Path

The mount path provides a way for the project code to access the repositories. This section describes the steps needed to enable this access.

Before accessing a dataset, featureset, or model from the code, it needs to be created within DKube, as described at Add a Dataset and Add a Model. This will enable DKube to access the entity. The following image shows a Dataset detail screen for a GitHub dataset that has been uploaded to the DKube storage. It shows the actual folder where the dataset resides.

_images/Developer_Model_Details.png

Through the mount path, DKube allows the Project code to access the Dataset, FeatureSet, or Model without needing to know the exact folder structure. When creating an IDE or Run, the mount path field should be filled in to correspond to the Project code.

_images/Mount_Point_Diagram_R22.png

The following guidelines ensure that the mount path operates properly:

  • Preface the mount path with /opt/dkube/. For example, the input mount path for an input Dataset can be “/opt/dkube/input”, as shown in the sketch after this list

  • The code should use the same paths for input and output that were configured for the IDE or Run
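
The following is a minimal sketch, assuming an IDE or Run created with an input Dataset mounted at /opt/dkube/input and an output Model mounted at /opt/dkube/model (both hypothetical mount path values). The code reads and writes through those same paths.

import os

INPUT_DIR = "/opt/dkube/input"    # must match the Dataset mount path field
OUTPUT_DIR = "/opt/dkube/model"   # must match the Model mount path field

# Read the training data from the input mount.
print(os.listdir(INPUT_DIR))

# Write artifacts to the output mount.
with open(os.path.join(OUTPUT_DIR, "model.txt"), "w") as f:
    f.write("trained model artifact")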

Environment Variables

This section describes the environment variables that allow the program code to access DKube-specific information. These are accessed from the program code through calls such as:

EPOCHS = int(os.getenv('EPOCHS', 1))

General Variables

Name                      Description
DKUBE_URL                 API Server REST endpoint
DKUBE_USER_LOGIN_NAME     Login user name
DKUBE_USER_ACCESS_TOKEN   JWT token for DKube access
DKUBE_JOB_CONFIG_FILE     Configuration file specified when creating a Job (Configuration File)
DKUBE_USER_STORE          Mount path for user-owned resources
DKUBE_DATA_BASE_PATH      Mount path for resources configured for an IDE/Run
DKUBE_NB_ARGS             JupyterLab command line arguments containing the auth token, base URL, and home dir, used in the entrypoint for JupyterLab
KF_PIPELINES_ENDPOINT     REST API endpoint for pipelines to authenticate pipeline requests. If not set, pipelines are created without authentication
DKUBE_JOB_CLASS           Type of Job (training, preprocessing, custom, notebook, rstudio, inference, tensorboard)
DKUBE_JOB_ID              Unique Job ID
DKUBE_JOB_UUID            Unique Job UUID

Variables Passed to Jobs

The user can provide program variables when creating an IDE or Run, as described at Configuration File. These variables are available to the program based on the variable name. Some examples of these are shown here.

Name        Description
STEPS       Number of training steps
BATCHSIZE   Batch size for training
EPOCHS      Number of training epochs

Repo Variables

Name                      Description

S3
S3_BUCKET                 Storage bucket
S3_ENDPOINT               URL of server
S3_VERIFY_SSL             Verify SSL in S3 Bucket
S3_REQUEST_TIMEOUT_MSEC   Request timeout for TensorFlow to storage connection in milliseconds
S3_CONNECT_TIMEOUT_MSEC   Connection timeout for TensorFlow to storage connection in milliseconds
S3_USE_HTTPS              Use https (1) or http (0)

AWS
AWS_ACCESS_KEY_ID         Access key
AWS_SECRET_ACCESS_KEY     Secret key
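
As a sketch, these variables can be used to build an S3 client directly in the program. boto3 is an assumption and not part of DKube, and a 1/0 value format is assumed for S3_VERIFY_SSL.

import os

import boto3  # assumption: boto3 is installed in the image

s3 = boto3.client(
    "s3",
    endpoint_url=os.getenv("S3_ENDPOINT") or None,
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    verify=os.getenv("S3_VERIFY_SSL", "1") == "1",   # assumed 1/0 value
)
resp = s3.list_objects_v2(Bucket=os.environ["S3_BUCKET"])
for obj in resp.get("Contents", []):
    print(obj["Key"])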

Redshift Variables

Name                               Description
DKUBE_DATASET_REDSHIFT_CONFIG      Redshift dataset metadata for user-owned Redshift datasets
DKUBE_DATASET_REDSHIFT_DB_SCHEMA   Schema
DKUBE_DATASET_REDSHIFT_ENDPOINT    Dataset URL
DKUBE_DATASET_REDSHIFT_DATABASE    Database name
DKUBE_DATASET_NAME                 Dataset name
DKUBE_DATASET_REDSHIFT_USER        User name
DKUBE_DATASET_REDSHIFT_CERT        SSL certificate

Hyperparameter Tuning Variables

Name                            Description
DKUBE_JOB_HP_TUNING_INFO_FILE   Configuration file specified when creating a Run (Hyperparameter Tuning)
PARENT_ID                       Unique identifier (uuid)
OBJECTIVE_METRIC_NAME           Objective metric
TRIAL                           Count of trial runs

DKube SDK

One Convergence provides an SDK to allow direct access to DKube actions. In order to make use of this, the SDK needs to be called at the start of the code. An SDK guide is available at:

https://dkube.io/dkube-sdk2.2/index.html
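
A minimal sketch is shown below, assuming the dkube Python package from the guide above is installed; the class and parameter names (DkubeApi, URL, token) should be verified against the SDK documentation.

import os

from dkube.sdk import DkubeApi  # assumption: package and class names per the SDK guide

# Create an authenticated SDK client using the environment variables that DKube
# injects into every IDE/Run (see General Variables).
api = DkubeApi(
    URL=os.getenv("DKUBE_URL"),
    token=os.getenv("DKUBE_USER_ACCESS_TOKEN"),
)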

Building a Custom Docker Image

If the standard DKube Docker image does not provide the packages that are necessary for your code to execute, you can create a custom Docker image and use this for IDEs and Runs. This section describes the process to build a custom Docker image. The process for using a custom image is explained at Custom Containers.

Getting the Base Image

In order to create a custom image for DKube, you should start with the standard DKube image for the framework and version, and add the packages that you need. The standard images for DKube are:

Framework      Version   CPU/GPU   Image
TensorFlow     1.14      CPU       ocdr/d3-datascience-tf-cpu:v1.14
TensorFlow     1.14      GPU       ocdr/d3-datascience-tf-gpu:v1.14
TensorFlow     2.0       CPU       ocdr/dkube-datascience-tf-cpu:v2.0.0
TensorFlow     2.0       GPU       ocdr/dkube-datascience-tf-gpu:v2.0.0
PyTorch        1.6       CPU       ocdr/d3-datascience-pytorch-cpu:v1.6
PyTorch        1.6       GPU       ocdr/d3-datascience-pytorch-gpu:v1.6
Scikit Learn   0.23.2    CPU       ocdr/d3-datascience-sklearn:v0.23.2

Adding Your Packages

In order to add your packages to the standard DKube image, you create a Dockerfile with the packages included. The Dockerfile commands are:

FROM <Base Image>
RUN pip install <Package>

Building the Docker Image

The new image can be built with the following command:

docker build -t <username>/<image:version> -f <Dockerfile Name> .

Pushing the Image to Docker Hub

In order to push the image, log in to Docker Hub and run the following command:

docker push <username>/<image:version>

Using the Custom Image within DKube

When starting a Run or IDE, select a Custom Container and use the name of the image that was saved in the previous step. The form of the image will be:

docker.io/<username>/<image:version>