I had been testing Airflow locally with Docker and decided to move the work onto GKE, for the following reasons:
- Cloud services are what most teams mainly use these days.
- Kubernetes is now the usual choice for efficient service and resource management, so I wanted to learn how to run a service on top of it.
I did get the work done, but I felt it requires quite a bit of prior knowledge:
- How to use Kubernetes and its related tools (helm)
- The concepts and structure behind them
- How to write the YAML files for Kubernetes and Helm charts
- The commands
- Getting familiar with the gcloud commands
- The settings involved in creating a cluster
Once you absorb what you pick up while working like this, using the tool in a project becomes less of a problem. The trap is that the knowledge stays fragmented, so you can end up believing you know the tool while not really understanding it as a whole.
I was recently bitten by exactly this, so these days I try to write up what I learn in Notion, and also look up and organize each tool's concepts and overall usage.
I referenced the following for this work.
This post covers the following:
- Creating a Kubernetes cluster on GKE
- Configuring and deploying Airflow with helm and a values.yaml
- Exposing the Airflow webserver on GKE through a GCP load balancer
In addition, I modified the git-sync settings to hook up the DAGs from my own Git repo.
At first I tried to do this on my-first-cluster, the cluster the GKE tutorial provides, but its nodes did not have enough resources, so I created a new cluster as in the tutorial above and redid the work.
Going through this, there is clearly a lot left to study, such as what cluster regions and zones mean.
Next up:
- I am still not used to reading the Helm chart, so I am in the middle of wiring the worker logs to Cloud Storage. I added the REMOTE-related env vars, but something is off and it does not work yet (a sketch of what I am attempting is right after this list).
- DAG tests
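For reference, the env vars I mean are Airflow's remote-logging settings, which the chart can inject through its top-level env list. This is only a sketch of what I am attempting, not a working setup: the bucket name and connection id are placeholders, and the workers also need GCP credentials (Workload Identity or a mounted service-account key) before logs will actually land in the bucket.

# values.yaml (sketch) - send task logs to GCS; bucket and connection id are placeholders
env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: "gs://my-airflow-logs/logs"
  - name: AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID
    value: "google_cloud_default"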
Add the repo & verify
helm repo add apache-airflow https://airflow.apache.org
"apache-airflow" has been added to your repositories
helm repo list
NAME URL
stable https://charts.helm.sh/stable
local http://127.0.0.1:8879/charts
apache-airflow https://airflow.apache.org
Install Helm 3
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
bash ./get_helm.sh
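To confirm the installation:

helm version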
https://airflow.apache.org/docs
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --debug
kubectl port-forward svc/airflow-webserver 7070:8080 --namespace airflow
Forwarding from 127.0.0.1:7070 -> 8080
Forwarding from [::1]:7070 -> 8080
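With the port-forward running, the web UI should be reachable in a browser at http://localhost:7070.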
When I did this on the three-node cluster provided by the GKE tutorial, the port-forward to the Airflow webserver kept failing. Looking at the Workloads page, there was a warning that the cluster was short on resources. The pod status looked like this:
kubectl get pod --namespace airflow
NAME READY STATUS RESTARTS AGE
airflow-flower-5d59bf75fc-kfjfk 0/1 CrashLoopBackOff 6 (53s ago) 8m54s
airflow-postgresql-0 1/1 Running 0 66s
airflow-redis-0 1/1 Running 0 65s
airflow-scheduler-c7647fff-trj2n 2/2 Running 1 (26m ago) 64m
airflow-statsd-7586f9998-mpkcz 1/1 Running 0 8m53s
airflow-triggerer-799fbf6779-6m9sn 0/1 Init:0/1 0 8m54s
airflow-webserver-85fb5d6b76-4s8jf 0/1 Error 1 76m
airflow-webserver-85fb5d6b76-54f97 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-6wns7 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-7mkkp 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-8wfdt 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-96454 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-f7k8k 0/1 ContainerStatusUnknown 1 63m
airflow-webserver-85fb5d6b76-gl869 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-hvmk5 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-kt858 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-tffrv 0/1 Evicted 0 63m
airflow-webserver-85fb5d6b76-wmv8j 0/1 CrashLoopBackOff 12 (4m5s ago) 58m
airflow-worker-0 0/2 Init:0/1 0 67s
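Most of the webserver pods are Evicted, which is what the kubelet does when a node runs low on memory or disk. The exact reason can be read from the pod events, e.g. for one of the evicted pods above:

kubectl describe pod airflow-webserver-85fb5d6b76-54f97 --namespace airflow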
Deployment and service status:
kubectl get deployment,svc --namespace airflow
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/airflow-flower 1/1 1 1 10h
deployment.apps/airflow-scheduler 1/1 1 1 10h
deployment.apps/airflow-statsd 1/1 1 1 10h
deployment.apps/airflow-triggerer 1/1 1 1 10h
deployment.apps/airflow-webserver 0/1 1 0 10h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/airflow-flower ClusterIP 10.108.11.47 <none> 5555/TCP 10h
service/airflow-postgresql ClusterIP 10.108.9.229 <none> 5432/TCP 10h
service/airflow-postgresql-headless ClusterIP None <none> 5432/TCP 10h
service/airflow-redis ClusterIP 10.108.10.23 <none> 6379/TCP 10h
service/airflow-statsd ClusterIP 10.108.1.50 <none> 9125/UDP,9102/TCP 10h
service/airflow-webserver ClusterIP 10.108.13.166 <none> 8080/TCP 10h
service/airflow-worker ClusterIP None <none> 8793/TCP 10h
So I created a new cluster. I set the node count to 1, yet it still came up with 3 nodes; with --region, --num-nodes is applied per zone, so a regional cluster spread across 3 zones ends up with 3 nodes in total.
gcloud container clusters create airflow-cluster \
> --machine-type n1-standard-4 \
> --num-nodes 1 \
> --region "us-central1"
Default change: VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500. To create advanced routes based clusters, please pass the `--no-enable-ip-alias` flag
Note: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
Creating cluster airflow-cluster in us-central1...done.
Created [https://container.googleapis.com/v1/projects/elt-pipeline/zones/us-central1/clusters/airflow-cluster].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/airflow-cluster?project=elt-pipeline
kubeconfig entry generated for airflow-cluster.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
airflow-cluster us-central1 1.21.6-gke.1500 34.66.248.71 n1-standard-4 1.21.6-gke.1500 3 RUNNING
Connect to the Kubernetes cluster on GKE
gcloud container clusters get-credentials airflow-cluster --region "us-central1"
Fetching cluster endpoint and auth data.
kubeconfig entry generated for airflow-cluster.
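A quick sanity check that kubectl now points at the new cluster, and a look at the actual node count:

kubectl get nodes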
Create a namespace
kubectl create namespace airflow
namespace/airflow created
Install Airflow and verify
The airflow-flower and airflow-redis services are there for the CeleryExecutor; later I will switch this to the LocalExecutor.
helm upgrade --install airflow apache-airflow/airflow -n airflow --debug
kubectl get pod --namespace airflow
NAME READY STATUS RESTARTS AGE
airflow-flower-5d59bf75fc-m5vdb 1/1 Running 0 2m50s
airflow-postgresql-0 1/1 Running 0 2m50s
airflow-redis-0 1/1 Running 0 2m50s
airflow-scheduler-c7647fff-jkcrz 2/2 Running 0 2m50s
airflow-statsd-7586f9998-stmpc 1/1 Running 0 2m51s
airflow-triggerer-799fbf6779-ps267 1/1 Running 0 2m51s
airflow-webserver-7b4477d47c-kzfkz 1/1 Running 0 74s
airflow-worker-0 2/2 Running 0 65s
kubectl get deployment,svc --namespace airflow
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/airflow-flower 1/1 1 1 5m1s
deployment.apps/airflow-scheduler 1/1 1 1 5m1s
deployment.apps/airflow-statsd 1/1 1 1 5m1s
deployment.apps/airflow-triggerer 1/1 1 1 5m1s
deployment.apps/airflow-webserver 1/1 1 1 5m1s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/airflow-flower ClusterIP 10.48.14.186 <none> 5555/TCP 5m1s
service/airflow-postgresql ClusterIP 10.48.2.227 <none> 5432/TCP 5m1s
service/airflow-postgresql-headless ClusterIP None <none> 5432/TCP 5m1s
service/airflow-redis ClusterIP 10.48.8.106 <none> 6379/TCP 5m1s
service/airflow-statsd ClusterIP 10.48.3.22 <none> 9125/UDP,9102/TCP 5m1s
service/airflow-webserver ClusterIP 10.48.6.184 <none> 8080/TCP 5m1s
service/airflow-worker ClusterIP None <none> 8793/TCP 5m1s
The webserver came up successfully and I logged in.
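For reference, unless it is overridden, the chart creates a default admin account via webserver.defaultUser in values.yaml, with username admin and password admin.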
The Airflow deployment follows the settings in the Helm chart's values.yaml, so to change the configuration I make a local copy of values.yaml.
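One way to get that local copy is to dump the chart's default values (the output file name is just my choice):

helm show values apache-airflow/airflow > values.yaml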
Settings to change (see the snippet after this list):
- Change the CeleryExecutor to the LocalExecutor
- Change the webserver service from ClusterIP to LoadBalancer
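In the copied values.yaml these two changes roughly correspond to the keys below; this is a sketch based on the chart version I used, so double-check the key names in your own copy.

# top-level executor setting (default is "CeleryExecutor")
executor: "LocalExecutor"

webserver:
  service:
    # expose the webserver through a GCP load balancer (default is ClusterIP)
    type: LoadBalancer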
As a side note, while looking through values.yaml to figure out how to hook up DAGs, I came across the git-sync settings.
# Git sync
dags:
  persistence:
    # Enable persistent volume for storing dags
    enabled: false
    # Volume size for dags
    size: 1Gi
    # If using a custom storageClass, pass name here
    storageClassName:
    # access mode of the persistent volume
    accessMode: ReadWriteOnce
    ## the name of an existing PVC to use
    existingClaim:
  gitSync:
    enabled: false
    # git repo clone url
    # ssh examples ssh://git@github.com/apache/airflow.git
    # git@github.com:apache/airflow.git
    # https example: https://github.com/apache/airflow.git
    repo: https://github.com/apache/airflow.git
    branch: v2-2-stable
    rev: HEAD
    depth: 1
    # the number of consecutive failures allowed before aborting
    maxFailures: 0
    # subpath within the repo where dags are located
    # should be "" if dags are at repo root
    subPath: "tests/dags"
    # if your repo needs a user name password
    # you can load them to a k8s secret like the one below
    #   ---
    #   apiVersion: v1
    #   kind: Secret
    #   metadata:
    #     name: git-credentials
    #   data:
    #     GIT_SYNC_USERNAME: <base64_encoded_git_username>
    #     GIT_SYNC_PASSWORD: <base64_encoded_git_password>
    # and specify the name of the secret below
    #
    # credentialsSecret: git-credentials
    #
    #
    # If you are using an ssh clone url, you can load
    # the ssh private key to a k8s secret like the one below
    #   ---
    #   apiVersion: v1
    #   kind: Secret
    #   metadata:
    #     name: airflow-ssh-secret
    #   data:
    #     # key needs to be gitSshKey
    #     gitSshKey: <base64_encoded_data>
    # and specify the name of the secret below
    # sshKeySecret: airflow-ssh-secret
    #
    # If you are using an ssh private key, you can additionally
    # specify the content of your known_hosts file, example:
    #
    # knownHosts: |
    #    <host1>,<ip1> <public_key>
    #    <host2>,<ip2> <public_key>
    # interval between git sync attempts in seconds
    wait: 60
    containerName: git-sync
    uid: 65533
Running upgrade --install with the modified values.yaml, the logs confirmed that only the changed parts were redeployed.
helm upgrade --install airflow apache-airflow/airflow --namespace airflow -f values.yaml --debug
history.go:56: [debug] getting history for release airflow
upgrade.go:142: [debug] preparing upgrade for airflow
upgrade.go:150: [debug] performing update for airflow
upgrade.go:322: [debug] creating upgraded release for airflow
client.go:218: [debug] checking 24 resources for changes
client.go:501: [debug] Looks like there are no changes for ServiceAccount "airflow-create-user-job"
client.go:501: [debug] Looks like there are no changes for ServiceAccount "airflow-migrate-database-job"
client.go:501: [debug] Looks like there are no changes for ServiceAccount "airflow-scheduler"
client.go:501: [debug] Looks like there are no changes for ServiceAccount "airflow-statsd"
client.go:501: [debug] Looks like there are no changes for ServiceAccount "airflow-triggerer"
client.go:501: [debug] Looks like there are no changes for ServiceAccount "airflow-webserver"
client.go:501: [debug] Looks like there are no changes for Secret "airflow-postgresql"
client.go:501: [debug] Looks like there are no changes for Secret "airflow-airflow-metadata"
client.go:510: [debug] Patch Secret "airflow-webserver-secret-key" in namespace airflow
client.go:510: [debug] Patch ConfigMap "airflow-airflow-config" in namespace airflow
client.go:501: [debug] Looks like there are no changes for Role "airflow-pod-launcher-role"
client.go:501: [debug] Looks like there are no changes for Role "airflow-pod-log-reader-role"
client.go:510: [debug] Patch RoleBinding "airflow-pod-launcher-rolebinding" in namespace airflow
client.go:501: [debug] Looks like there are no changes for RoleBinding "airflow-pod-log-reader-rolebinding"
client.go:501: [debug] Looks like there are no changes for Service "airflow-postgresql-headless"
client.go:501: [debug] Looks like there are no changes for Service "airflow-postgresql"
client.go:239: [debug] Created a new Service called "airflow-scheduler" in airflow
client.go:501: [debug] Looks like there are no changes for Service "airflow-statsd"
client.go:510: [debug] Patch Service "airflow-webserver" in namespace airflow
client.go:510: [debug] Patch Deployment "airflow-statsd" in namespace airflow
client.go:510: [debug] Patch Deployment "airflow-triggerer" in namespace airflow
client.go:510: [debug] Patch Deployment "airflow-webserver" in namespace airflow
client.go:510: [debug] Patch StatefulSet "airflow-postgresql" in namespace airflow
client.go:239: [debug] Created a new StatefulSet called "airflow-scheduler" in airflow
client.go:267: [debug] Deleting ServiceAccount "airflow-flower" in namespace airflow...
client.go:267: [debug] Deleting ServiceAccount "airflow-redis" in namespace airflow...
client.go:267: [debug] Deleting ServiceAccount "airflow-worker" in namespace airflow...
client.go:267: [debug] Deleting Secret "airflow-airflow-result-backend" in namespace airflow...
client.go:267: [debug] Deleting Service "airflow-flower" in namespace airflow...
client.go:267: [debug] Deleting Service "airflow-redis" in namespace airflow...
client.go:267: [debug] Deleting Service "airflow-worker" in namespace airflow...
client.go:267: [debug] Deleting Deployment "airflow-flower" in namespace airflow...
client.go:267: [debug] Deleting Deployment "airflow-scheduler" in namespace airflow...
client.go:267: [debug] Deleting StatefulSet "airflow-redis" in namespace airflow...
client.go:267: [debug] Deleting StatefulSet "airflow-worker" in namespace airflow...
client.go:299: [debug] Starting delete for "airflow-run-airflow-migrations" Job
client.go:328: [debug] jobs.batch "airflow-run-airflow-migrations" not found
client.go:128: [debug] creating 1 resource(s)
client.go:529: [debug] Watching for changes to Job airflow-run-airflow-migrations with timeout of 5m0s
client.go:557: [debug] Add/Modify event for airflow-run-airflow-migrations: ADDED
client.go:596: [debug] airflow-run-airflow-migrations: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:557: [debug] Add/Modify event for airflow-run-airflow-migrations: MODIFIED
client.go:299: [debug] Starting delete for "airflow-create-user" Job
client.go:328: [debug] jobs.batch "airflow-create-user" not found
client.go:128: [debug] creating 1 resource(s)
client.go:529: [debug] Watching for changes to Job airflow-create-user with timeout of 5m0s
client.go:557: [debug] Add/Modify event for airflow-create-user: ADDED
client.go:596: [debug] airflow-create-user: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
# ... (output truncated)
You can get Fernet Key value by running the following:
echo Fernet Key: $(kubectl get secret --namespace airflow airflow-fernet-key -o jsonpath="{.data.fernet-key}" | base64 --decode)
###########################################################
# WARNING: You should set a static webserver secret key #
###########################################################
You are using a dynamically generated webserver secret key, which can lead to
unnecessary restarts of your Airflow components.
Information on how to set a static webserver secret key can be found here:
https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key
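Following up on that warning: as I understand the production guide, the restarts stop once you pin a static key, either directly via the top-level webserverSecretKey value or via a Kubernetes Secret referenced by webserverSecretKeySecretName. A minimal sketch with a made-up key:

# generate a value once, e.g. python3 -c 'import secrets; print(secrets.token_hex(16))'
# then pin it in values.yaml so the components stop restarting
webserverSecretKey: "0123456789abcdef0123456789abcdef"  # example value, do not reuse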
Because the LocalExecutor needs neither the Redis broker nor the Flower UI, those two pods were removed and an airflow-scheduler service was created instead.
I then pointed the git-sync section at my own GitHub repo and updated the pods again.
The modified part:
# Git sync
dags:
  persistence:
    # Enable persistent volume for storing dags
    enabled: true
    # Volume size for dags
    size: 1Gi
    # If using a custom storageClass, pass name here
    storageClassName:
    # access mode of the persistent volume
    accessMode: ReadWriteOnce
    ## the name of an existing PVC to use
    existingClaim:
  gitSync:
    enabled: true
    repo: https://github.com/ymmu/airflow_test.git
    # repo: https://github.com/apache/airflow.git
    branch: dag_test
    # branch: v2-2-stable
Update
helm upgrade --install airflow apache-airflow/airflow --namespace airflow -f values.yaml --debug
# Checking the patches
client.go:510: [debug] Patch Secret "airflow-webserver-secret-key" in namespace airflow
client.go:510: [debug] Patch ConfigMap "airflow-airflow-config" in namespace airflow
client.go:239: [debug] Created a new PersistentVolumeClaim called "airflow-dags" in airflow
client.go:510: [debug] Patch Deployment "airflow-statsd" in namespace airflow
client.go:510: [debug] Patch Deployment "airflow-triggerer" in namespace airflow
client.go:510: [debug] Patch Deployment "airflow-webserver" in namespace airflow
client.go:510: [debug] Patch StatefulSet "airflow-postgresql" in namespace airflow
client.go:510: [debug] Patch StatefulSet "airflow-scheduler" in namespace airflow
My DAG files now show up in the Airflow UI.
I also set persistence to true, and I could confirm that a DAG volume was created in the cluster's storage.
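The claim itself can also be checked from the command line:

kubectl get pvc --namespace airflow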
There is a similar setting for logs as well; if logs are written to storage like that, I should also look into ways to load or view those records with another tool such as SQL.
To check that git sync actually works, I added one more file and looked at the UI again; it synced fine. The yaml file has the sync interval set to 60 seconds.