Provisioning compute
In this section we will configure Karpenter to provision Inferentia and Trainium EC2 instances. Karpenter detects pending Pods that require an inf2 or trn1 instance and launches an instance of the required type so that the Pods can be scheduled.
You can learn more about Karpenter in the Karpenter module that's provided in this workshop.
Karpenter has been installed in our EKS cluster, and runs as a Deployment:
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
...
karpenter   2/2     2            2           11m
Karpenter requires a NodePool to provision nodes, and the NodePool references an EC2NodeClass that holds the AWS-specific node settings. These are the NodePool and EC2NodeClass that we will create:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: aiml
spec:
  template:
    metadata:
      labels:
        instanceType: "neuron"
        provisionerType: "karpenter"
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - inf2
            - trn1
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: aiml
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: aiml
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        volumeSize: 100Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
  role: ${KARPENTER_NODE_ROLE}
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    # Allow up to 10 concurrent downloads and unpacks per image in the SOCI snapshotter
    sed -i "s/^max_concurrent_downloads_per_image = .*$/max_concurrent_downloads_per_image = 10/" /etc/soci-snapshotter-grpc/config.toml
    sed -i "s/^max_concurrent_unpacks_per_image = .*$/max_concurrent_unpacks_per_image = 10/" /etc/soci-snapshotter-grpc/config.toml

    --//
    Content-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      featureGates:
        FastImagePull: true

    --//--
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${EKS_CLUSTER_NAME}
  tags:
    app.kubernetes.io/created-by: eks-workshop
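The requirements block under spec.template.spec determines which instances this NodePool is allowed to provision.
Here the karpenter.k8s.aws/instance-family requirement restricts the NodePool to inf2 and trn1 instances, and the karpenter.sh/capacity-type requirement restricts it to On-Demand capacity.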
Apply the NodePool and EC2NodeClass manifest, substituting values for the ${KARPENTER_NODE_ROLE} and ${EKS_CLUSTER_NAME} placeholders:
The NodePool is now ready to provision nodes for our training and inference Pods.
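To illustrate how a workload could target this NodePool, here is a minimal sketch of a Pod that would cause Karpenter to provision a node. It is not part of the workshop manifests. The Pod name and image are hypothetical, and the aws.amazon.com/neuron resource assumes that the Neuron device plugin is running in the cluster. The nodeSelector matches the instanceType: neuron label that the NodePool template applies to its nodes.

```yaml
# Illustrative only: names and image below are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: neuron-example                       # hypothetical Pod name
spec:
  nodeSelector:
    instanceType: neuron                     # label set in the aiml NodePool template
  containers:
    - name: app
      image: "example.com/neuron-app:latest" # hypothetical image, replace with your own
      resources:
        limits:
          aws.amazon.com/neuron: 1           # assumes the Neuron device plugin advertises this resource
```

While no matching node exists, a Pod like this stays Pending, which is the signal Karpenter acts on to launch an inf2 or trn1 instance from the aiml NodePool.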