Terraform-aws-github-runner: Scale Up lambda reports no errors but does not spin up a new runner

Created on 1 Mar 2021  ·  15 Comments  ·  Source: philips-labs/terraform-aws-github-runner

The scale-up lambda logs its invocation to CloudWatch with nothing abnormal in the output - at least nothing that's obviously an error - but no new instances are created and the jobs remain queued. Because there is no error, I'm a little stuck on where to look next.

START RequestId: b6d27abc-24a7-5f67-a7a9-220b3a8f2e0a Version: $LATEST
{
  Records: [
    {
      messageId: 'c5118c89-b1db-4a81-9fd1-c3211020f447',
      receiptHandle: 'AQEBVpllIHtC29mzlvsdPt7y3HfIZHfGThi4dwb2ecHzqupGCRBtFBVFWNa9KKd7M3VwcyiVf6/uqKh/czW305hG9gkqvsnnDj1sdUIqXdzky6+z8ZJnylM/ekUA1bmv7bJna0K5Gbkr+2p1o5UcRoaZnr1EfijnlxabX2ft2JyxNvhVEjVJGEhJMOwIJmXnzlelKAqGh0gz+jde1hecenob2hS9aKEf+8pk6kJViSC0jZvb9S1hcBfHoNTsmP5z45+WzeyTeFDmcO3QmAeIsl4cj4fCwimpQvV1OyE8oBZ5QjE=',
      body: '{     "id": 2005872726,     "repositoryName": "redacted",     "repositoryOwner": "redacted",     "eventType": "check_run",     "installationId": 15044875 }',
      attributes: {
        ApproximateReceiveCount: '1',
        SentTimestamp: '1614617562674',
        SequenceNumber: '18860086169754095872',
        MessageGroupId: '2005872726',
        SenderId: 'AROAYDZX6OHXHIADI55JV:gh-ci-webhook',
        MessageDeduplicationId: '47a99738074ab0818b7881eee096ec21a5b82226764304d9ab69d90ff39ea349',
        ApproximateFirstReceiveTimestamp: '1614617592695'
      },
      messageAttributes: {},
      md5OfBody: 'd5e6cdc10ecd1a37128c56a1ed6bb90f',
      eventSource: 'aws:sqs',
      eventSourceARN: 'arn:aws:sqs:eu-west-1:redacted:gh-ci-queued-builds.fifo',
      awsRegion: 'eu-west-1'
    }
  ]
}

Anyone have any ideas?

All 15 comments

I am seeing the same on my end and suspect it is related to the recent degraded-performance incident for GitHub Actions.

When trying to filter the list of queued workflows on our repo, we got the following error and an empty list, even though there clearly are queued workflows:
We are having problems searching workflow runs. The results may not be complete.

I think the lambda relies on this call to return queued workflows before it spins up an instance.
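For illustration, a minimal sketch of that kind of check with @octokit/rest - this is my guess at roughly what the lambda depends on, not its actual code, and owner, repo and token are placeholders:

// Sketch: count queued workflow runs for a repo, roughly the check the
// scale-up lambda appears to rely on. Owner/repo/token are placeholders.
import { Octokit } from '@octokit/rest';

async function countQueuedRuns(owner: string, repo: string, token: string): Promise<number> {
  const octokit = new Octokit({ auth: token });
  // GET /repos/{owner}/{repo}/actions/runs?status=queued
  const { data } = await octokit.actions.listWorkflowRunsForRepo({
    owner,
    repo,
    status: 'queued',
  });
  console.log(`Repo ${repo} has ${data.total_count} queued workflow runs`);
  return data.total_count;
}

If that endpoint returns an empty list during a GitHub incident, the lambda would see 0 queued runs and have no reason to scale up.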

Seeing exactly the same thing, FWIW.

I was trying to figure out if there's an easy way to manually force a scale-up. It seems like the idle config is only checked during scale-downs? I'm unfamiliar with the code, so I may have missed something.

I spent a bit of time on a similar problem and found that tags required on my EC2 instances by policy were causing it to fail. I was able to find this by looking at CloudTrail API errors.
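If it helps anyone dig the same way, here is a rough sketch of pulling recent RunInstances errors out of CloudTrail with the AWS SDK for JavaScript v2; the region is an assumption and the error codes you see will depend on your account's policies:

// Sketch: look for recent RunInstances calls in CloudTrail and print any
// error codes (e.g. a tag-policy violation). Region is an assumption.
import * as AWS from 'aws-sdk';

async function findRunInstancesErrors(): Promise<void> {
  const cloudtrail = new AWS.CloudTrail({ region: 'eu-west-1' });
  const result = await cloudtrail
    .lookupEvents({
      LookupAttributes: [{ AttributeKey: 'EventName', AttributeValue: 'RunInstances' }],
      MaxResults: 50,
    })
    .promise();
  for (const event of result.Events ?? []) {
    const detail = JSON.parse(event.CloudTrailEvent ?? '{}');
    if (detail.errorCode) {
      console.log(event.EventTime, detail.errorCode, detail.errorMessage);
    }
  }
}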

Thanks for the responses thus far, everyone.

@rlove I can't find anything in CloudTrail to suggest the scale-up lambda is doing anything at all, error or otherwise.
@samgiles Yes, this was something I was looking into as well; I couldn't (in limited time, admittedly) craft a test event that would force the scale-up lambda into action - a rough sketch of what I was attempting is below.
@eky5006 That would make sense, but I'm still seeing the same problem, and according to https://www.githubstatus.com/incidents/xn0sd2x4nd7f the issue is resolved. Is it any better at your end?
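For reference, this is the kind of thing I was attempting: wrap a body like the one in my original log in an SQS-style Records envelope and invoke the deployed function directly. The function name and body values are placeholders, and I haven't verified that the handler accepts an event crafted like this:

// Sketch: invoke the deployed scale-up lambda with a hand-built SQS-style
// event. FunctionName and the body fields are placeholders based on the
// log output above; adjust to your own deployment.
import * as AWS from 'aws-sdk';

const body = JSON.stringify({
  id: 2005872726,
  repositoryName: 'my-repo',   // placeholder
  repositoryOwner: 'my-org',   // placeholder
  eventType: 'check_run',
  installationId: 15044875,
});

const testEvent = {
  Records: [
    {
      messageId: '00000000-0000-0000-0000-000000000000',
      body,
      attributes: {},
      messageAttributes: {},
      eventSource: 'aws:sqs',
      awsRegion: 'eu-west-1',
    },
  ],
};

async function invokeScaleUp(): Promise<void> {
  const lambda = new AWS.Lambda({ region: 'eu-west-1' });
  const result = await lambda
    .invoke({
      FunctionName: 'gh-ci-scale-up',   // placeholder
      Payload: JSON.stringify(testEvent),
    })
    .promise();
  console.log(result.StatusCode, result.Payload?.toString());
}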

I have the same problem.
INFO Repo <repo name> has 0 queued workflow runs, even though there are queued jobs, and this API https://docs.github.com/en/rest/reference/actions#list-workflow-runs-for-a-repository returns queued workflows properly.
It started happening yesterday and still doesn't work.

INFO Repo < repo name > has 0 queued workflow runs

@bartoszjedrzejewski Where are you seeing that output?

@rjcoupe in the scale-up CloudWatch logs. What version are you on? I think it is because I'm on 0.8.1; I'm trying to update right now. My colleague doesn't have this problem, and he is on 0.10.

I had the same issue: the outage left some lingering registered runners. I de-registered them from my GitHub organization and now the runners are scaling up as expected.

Hope this helps somebody.
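If anyone wants to script the clean-up, here is a rough sketch of listing and removing offline runners at the organization level with @octokit/rest; the org and token are placeholders, and you should obviously only delete runners you recognise as stale:

// Sketch: remove self-hosted runners that GitHub reports as offline.
// Org name and token are placeholders; double-check before deleting.
import { Octokit } from '@octokit/rest';

async function removeOfflineRunners(org: string, token: string): Promise<void> {
  const octokit = new Octokit({ auth: token });
  const { data } = await octokit.actions.listSelfHostedRunnersForOrg({ org });
  for (const runner of data.runners) {
    if (runner.status === 'offline') {
      console.log(`Removing stale runner ${runner.name} (${runner.id})`);
      await octokit.actions.deleteSelfHostedRunnerFromOrg({ org, runner_id: runner.id });
    }
  }
}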

Updating lambdas from 0.8.1 to 0.11.0 fixed my problem.

Hi, we had the same issue yesterday, and upgrading the lambdas from 0.8.1 to 0.10.0 also solved it.

I was on v0.10.0 so didn't hold much hope, but v0.11.0 does appear to fix the problem. Bizarre!

@gertjanmaas any ideas? It looks like it's related to the outage yesterday.

Could be related to the outage yesterday. In our case certain repositories didn't send an event to the webhook, which caused jobs to stay queued while no instance was created, but it could have affected any of the APIs we use.

The outage has been fixed, so if that was the cause this should be resolved.

Nope, it's happening again as of this morning, with no changes made to the AWS resources. It seems the correct behaviour earlier was a fluke.

Just learned that we have seen problems off and on with all of GitHub Actions today, not just the dynamic self-hosted runners. I think there are stability issues on GitHub's side.

