Awx-operator: AWX Operator Fails to Install AWX Containers/Instance

Created on 5 May 2021 · 13 Comments · Source: ansible/awx-operator

ISSUE TYPE

AWX Operator fails to perform installation.

SUMMARY

I had an instance of AWX 17.0.1 running, and I don't care whether its data persists.

Performed data migration following Data Migration instructions.

Performed install of AWX Operator following INSTALL.md

minikube v1.18.1 was installed at the same time, following the links in INSTALL.md.

ENVIRONMENT

AWX version: 19.1.0
Operator version: 0.9.0
Kubernetes version: 1.20.2
AWX install method: operator

STEPS TO REPRODUCE

Follow INSTALL.md

EXPECTED RESULTS

Expected to see pods/AWX instance

ACTUAL RESULTS

minikube kubectl apply -- -f myawx.yml

After 30 minutes only the operator pod is running; tailing its logs shows a looping error.

ADDITIONAL INFORMATION

xxx@yyy:~$ minikube kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
awx-operator-5595d6fc57-hdj9d   1/1     Running   0          29m

xxx@yyy:~$ minikube version
minikube version: v1.18.1

AWX-OPERATOR LOGS

{
  "level": "error",
  "ts": 1620223136.6924627,
  "logger": "logging_event_handler",
  "msg": "",
  "name": "custom.name.awx", 
  "namespace": "default",
  "gvk": "awx.ansible.com/v1beta1,Kind=AWX",
  "event_type": "runner_on_failed",
  "job": "2601737961087659062",
  "EventData.Task": "Create Database if no database is specified",
  "EventData.TaskArgs": "",
  "EventData.FailedTaskPath": "/opt/ansible/roles/installer/tasks/database_configuration.yml:68",
  "error": "[playbook task failed]",
  "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\tpkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/operator-framework/operator-sdk/pkg/ansible/events.loggingEventHandler.Handle\n\tsrc/github.com/operator-framework/operator-sdk/pkg/ansible/events/log_events.go:87"
}

needs_info worked_for_me

Source

bandwiches

Most helpful comment

Here is the snippet of the error I'm seeing, which I believe is exactly like @bandwiches' error.

{"level":"error","ts":1620310325.2259731,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"awx-controller","request":"default/awx","error":"event runner on failed","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tpkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\tpkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\tpkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90"}
{"level":"info","ts":1620310327.3738635,"logger":"logging_event_handler","msg":"[playbook task]","name":"awx","namespace":"default","gvk":"awx.ansible.com/v1beta1, Kind=AWX","event_type":"playbook_on_task_start","job":"261049867304784443","EventData.Name":"installer : Patching labels to AWX kind"}

exodusprime1337 on 6 May 2021

👍2

All 13 comments

Blew the entire thing away and restarted fresh. Service pods are stuck 0/4 pending. It's been an additional 45 minutes now.

This is the exact task that fails over and over, with no real output or log.

TASK [installer : Apply deployment resources] **********************************
task path: /opt/ansible/roles/installer/tasks/resources_configuration.yml:34

Output

{
    "level":"error",
    "ts":1620229749.932002,
    "logger":"controller-runtime.controller",
    "msg":"Reconciler error",
    "controller":"awx-controller",
    "request":"default/awx",
    "error":"event runner on failed",
    "stacktrace": 
        "github.com/go-logr/zapr.(*zapLogger).Error
            pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
        sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
            pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258
        sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
            pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232
        sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
            pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211
        k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
            pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
        k8s.io/apimachinery/pkg/util/wait.BackoffUntil
            pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
        k8s.io/apimachinery/pkg/util/wait.JitterUntil
            pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
        k8s.io/apimachinery/pkg/util/wait.Until
            pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90"
}

Logs
awx-web: stalls after the image downloads, with the following two lines

[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s

redis: stalls after the image downloads
awx-task: nothing
awx-ee: nothing

bandwiches on 5 May 2021

@bandwiches we will need more information to understand what is going on.
Please send us the following:

kubectl get awx -o yaml awx

kubectl describe deployment awx

kubectl describe statefulset awx-postgres

kubectl get pods 

kubectl get events
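
The commands above can be wrapped in a small script that saves each output to its own file for attaching to the issue. This is only a sketch: the `collect` helper and the output file names are illustrative assumptions, and on minikube each `kubectl` call would presumably be run as `minikube kubectl --` instead.

```shell
#!/usr/bin/env bash
# Save the output of each diagnostic command to its own text file.
# collect is a hypothetical helper; the file names are illustrative.
collect() {
  local out="$1"
  shift
  # Capture stdout and stderr so error messages end up in the file as well.
  "$@" > "$out" 2>&1 || echo "warning: '$*' exited with an error" >&2
}

collect get_awx.txt kubectl get awx -o yaml awx
collect describe_deployment_awx.txt kubectl describe deployment awx
collect describe_statefulset.txt kubectl describe statefulset awx-postgres
collect get_pods.txt kubectl get pods
collect get_events.txt kubectl get events
```

A failing command does not stop the script; its error text is written to the corresponding file, which is often useful information in itself.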

Thanks!

tchellomello on 6 May 2021

I'm seeing the same thing in minikube. Here is the output from mine. I saw the same error in Kubernetes on CentOS 7 as well. Fresh install with all the latest binaries.

describe_awx.txt
describe_stateful.txt
events.txt
get_awx.txt
pods.txt

exodusprime1337 on 6 May 2021

For the sake of clarity, I feel I should state that I'm using minikube since it is recommended by the AWX install guide.

get_awx.txt
describe_deployment_awx.txt
describe_statefulset.txt
get_pods.txt
get_events.txt

(Edit) I see a CPU warning (insufficient CPU) for the AWX pod. I have to say, this is a dedicated VM with 2 CPUs and 2 GB RAM. This VM has had no issues running AWX v15 and v17. The new install method introduced in v19 suddenly complains about resources? It's understandable that this could change from version to version, but it would be nice to know the minimum system requirements now that it's an issue.

bandwiches on 6 May 2021

Here is the snippet of the error I'm seeing, which I believe is exactly like @bandwiches' error.

exodusprime1337 on 6 May 2021

👍2

@exodusprime1337

Spot on.

bandwiches on 6 May 2021

👍1

I have the exact same error, but on a bare-metal kubernetes cluster:
AWX version: 19.1.0
Operator version: 0.9.0
Kubernetes version: v1.21.0 with containerd 1.4.4
AWX install method: operator

abcqwertz on 7 May 2021

@bandwiches In your case, the issue looks related to CPU, as you mentioned:

NAME                            READY   STATUS    RESTARTS   AGE
awx-5b58db49c-9gslf             0/4     Pending   0          7m3s
awx-operator-5595d6fc57-92txg   1/1     Running   0          10m
awx-postgres-0                  1/1     Running   0          7m14s

LAST SEEN   TYPE      REASON                    OBJECT                                          MESSAGE
87s         Warning   FailedScheduling          pod/awx-5b58db49c-9gslf                         0/1 nodes are available: 1 Insufficient cpu.

Looking at your deployment, we can see it's using the default resource requests:

   awx-web:
    Image:      quay.io/ansible/awx:19.1.0
    Port:       8052/TCP
    Host Port:  0/TCP
    Requests:
      cpu:     1
      memory:  2Gi

....

   awx-task:
    Image:      quay.io/ansible/awx:19.1.0
    Port:       <none>
    Host Port:  <none>
    Args:
      /usr/bin/launch_awx_task.sh
    Requests:
      cpu:     500m
      memory:  1Gi

....

Please note that the suggested values (memory and CPU) are still the same (see https://github.com/ansible/awx-operator/pull/93/files) and you can override them to fit your needs. That should do the job for you. Please let us know.
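
As a rough sketch of such an override, the AWX custom resource could carry smaller requests. Everything here is an assumption to verify against the operator README for your version: the `tower_web_resource_requirements` and `tower_task_resource_requirements` keys are assumed to match the operator's variable names around 0.9.0 (later releases are understood to have dropped the `tower_` prefix), and the file name `myawx-small.yml` and the request values are purely illustrative, not recommended sizes.

```shell
#!/usr/bin/env bash
# Write an AWX custom resource with reduced resource requests.
# Assumptions: the tower_*_resource_requirements key names (check the operator
# README for your version) and the example values, which are not tuned.
cat > myawx-small.yml <<'EOF'
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  tower_web_resource_requirements:
    requests:
      cpu: 500m
      memory: 1Gi
  tower_task_resource_requirements:
    requests:
      cpu: 250m
      memory: 512Mi
EOF

echo "wrote myawx-small.yml"
```

The file can then be applied the same way as the original, for example with `minikube kubectl apply -- -f myawx-small.yml`. Lower requests only let the pod be scheduled; they do not reduce what AWX actually consumes once it is running.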

tchellomello on 11 May 2021

Same thing here @exodusprime1337

LAST SEEN   TYPE      REASON                    OBJECT                                          MESSAGE
2s          Warning   FailedScheduling          pod/awx-5b58db49c-bfwnt                         0/1 nodes are available: 1 Insufficient memory.
21m         Normal    SuccessfulCreate          replicaset/awx-5b58db49c                        Created pod: awx-5b58db49c-bfwnt

   awx-web:
    Image:      quay.io/ansible/awx:19.1.0
    Port:       8052/TCP
    Host Port:  0/TCP
    Requests:
      cpu:     1
      memory:  2Gi

    Requests:
      cpu:     500m
      memory:  1Gi

If you run kubectl get nodes <NODE_NAME> -o yaml, you should see the amount of memory available on your node:

  allocatable:
    cpu: 7800m
    ephemeral-storage: "222240964241"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 31547268Ki
    pods: "250"
  capacity:
    cpu: "8"
    ephemeral-storage: 235495Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32173956Ki
    pods: "250"

> kubectl top nodes
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
p70      763m         9%     12685Mi         41%
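
To compare the requests with what a node reports, the `Ki` figure under `allocatable` can be converted to Mi and checked against the sum of the container requests. The sketch below is only an approximation: `ki_to_mi` and `memory_fits` are hypothetical helpers, only the awx-web and awx-task requests are counted, and the scheduler also subtracts the requests of every other pod already on the node.

```shell
#!/usr/bin/env bash
# Convert a Kubernetes "Ki" memory quantity (as printed under allocatable) to whole Mi.
ki_to_mi() {
  local ki="${1%Ki}"       # strip the Ki suffix
  echo $(( ki / 1024 ))    # integer division, rounds down
}

# Report whether a request (in Mi) fits into an allocatable value (in Ki).
# Requests of other pods on the node are ignored here.
memory_fits() {
  local request_mi="$1" allocatable_mi
  allocatable_mi="$(ki_to_mi "$2")"
  if [ "$request_mi" -le "$allocatable_mi" ]; then
    echo "fits"
  else
    echo "does not fit"
  fi
}

# awx-web (2Gi) + awx-task (1Gi) = 3072Mi, checked against the node shown above
memory_fits 3072 31547268Ki
# and against a hypothetical node with exactly 2Gi allocatable
memory_fits 3072 2097152Ki
```

With the values above, the first check prints `fits` and the second prints `does not fit`.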

tchellomello on 11 May 2021

@tchellomello thanks for the update there. I follow you, but I have a serious concern about the AWX install tutorial, since it gives a bare-minimum config and that leads to this result. Perhaps there should be more cross-communication between the two projects to ensure that the minimal config is actually the bare minimum? These settings are never mentioned in the install doc.

Edit -
Per your link, I noticed both of these.
awx_v1beta1_molecule.yml (cpu: 500m, memory: 128M // cpu: 500m, memory: 128M)
installer/defaults/main.yml (cpu: 1000m, memory: 1Gi // cpu: 500m, memory: 2Gi)

One issue is that the AWX INSTALL.md makes no mention of minimum requirements, which makes the transition from v17 to v19 even harder, since what worked before may no longer work as the "default". While I understand requirements may change, it would also be nice to know when the minimum requirements or defaults have changed.

bandwiches on 11 May 2021

@bandwiches I hear you. I agree that the documentation has lots of room to improve, and if you see any place that could use some enhancement, please do not hesitate to submit a PR.

Regarding https://github.com/ansible/awx-operator/blob/devel/deploy/crds/awx_v1beta1_molecule.yaml: that file is used in the molecule tests here -> https://github.com/ansible/awx-operator/blob/devel/molecule/test-local/converge.yml#L31, so it is a totally different scenario and does not necessarily need to be consistent, because for this test we don't need to allocate that amount of memory and CPU.

tchellomello on 11 May 2021

I would love to, except I think the awx repo is outpacing awx-operator and making the inconsistencies impossible to fix.

In regards to your response about system settings - understood and that's fair, no qualms about that.

I ran into another issue once I had resolved the resource issue, and I feel it's still appropriate here. The awx-service was not externally reachable by default (regardless of Ingress or NodePort). The issue was actually that iptables had no rule allowing the destination port for the service.

minikube service awx-service --url returns the IP:PORT, but that PORT is never allowed through iptables. Adding a rule to the DOCKER chain that matches that destination port and jumps to ACCEPT fixed this.
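
The exact command is not given in this thread, so the following is only a guess at what such a rule could look like. The `accept_rule` helper is hypothetical and merely prints the command for review, the use of `-I` (insert at the top of the chain) is an assumption, and `30080` is a placeholder for the port printed by `minikube service awx-service --url`.

```shell
#!/usr/bin/env bash
# Build (but do not run) an iptables rule that accepts TCP traffic
# to the given destination port in the DOCKER chain.
# accept_rule is a hypothetical helper; -I is an assumed choice.
accept_rule() {
  local port="$1"
  echo "iptables -I DOCKER -p tcp --dport ${port} -j ACCEPT"
}

# 30080 is a placeholder; substitute the port reported for awx-service.
accept_rule 30080
```

After checking the printed command, it would be run as root (for example via `sudo`). Rules added this way are not saved across reboots unless they are persisted separately.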

Second issue - minikube service IP. I don't see anywhere that this is configurable, although I'll admit that I may be overlooking it given how many different repos I've had to visit today. This presents two problems: (1) we're now required to route to the host first to reach the underlying subnet, and (2) there's no consideration for organizational overlap if that subnet is already in use. I believe the default underlying network is 192.168.49.0/24, which is huge for a bridge/transit network and increases the risk of overlap.

bandwiches on 11 May 2021

Hi bandwiches,
Many thanks for your hint. I had the same issue deploying Ansible AWX on a k3s cluster in a VM, and had no idea what was happening or how to troubleshoot it. Following your post, I increased the memory and CPU cores of my AWX VM host, and the problem was fixed.