Terraform-aws-github-runner: Scale Up lambda reports no errors but does not spin up a new runner

Created on 1 Mar 2021  ·  15 Comments  ·  Source: philips-labs/terraform-aws-github-runner

The scale-up lambda logs its invocation to CloudWatch with nothing abnormal in the output - at least nothing that's obviously an error - but no new instances are created and the jobs remain queued. Because there is no error, I'm a little stuck on where to look next.

START RequestId: b6d27abc-24a7-5f67-a7a9-220b3a8f2e0a Version: $LATEST
{
  Records: [
    {
      messageId: 'c5118c89-b1db-4a81-9fd1-c3211020f447',
      receiptHandle: 'AQEBVpllIHtC29mzlvsdPt7y3HfIZHfGThi4dwb2ecHzqupGCRBtFBVFWNa9KKd7M3VwcyiVf6/uqKh/czW305hG9gkqvsnnDj1sdUIqXdzky6+z8ZJnylM/ekUA1bmv7bJna0K5Gbkr+2p1o5UcRoaZnr1EfijnlxabX2ft2JyxNvhVEjVJGEhJMOwIJmXnzlelKAqGh0gz+jde1hecenob2hS9aKEf+8pk6kJViSC0jZvb9S1hcBfHoNTsmP5z45+WzeyTeFDmcO3QmAeIsl4cj4fCwimpQvV1OyE8oBZ5QjE=',
      body: '{     "id": 2005872726,     "repositoryName": "redacted",     "repositoryOwner": "redacted",     "eventType": "check_run",     "installationId": 15044875 }',
      attributes: {
        ApproximateReceiveCount: '1',
        SentTimestamp: '1614617562674',
        SequenceNumber: '18860086169754095872',
        MessageGroupId: '2005872726',
        SenderId: 'AROAYDZX6OHXHIADI55JV:gh-ci-webhook',
        MessageDeduplicationId: '47a99738074ab0818b7881eee096ec21a5b82226764304d9ab69d90ff39ea349',
        ApproximateFirstReceiveTimestamp: '1614617592695'
      },
      messageAttributes: {},
      md5OfBody: 'd5e6cdc10ecd1a37128c56a1ed6bb90f',
      eventSource: 'aws:sqs',
      eventSourceARN: 'arn:aws:sqs:eu-west-1:redacted:gh-ci-queued-builds.fifo',
      awsRegion: 'eu-west-1'
    }
  ]
}

Anyone have any ideas?

All 15 comments

I am seeing the same on my end and suspect it is related to the recent degraded-performance incident for GitHub Actions.

When trying to filter the list of queued workflows on our repo, we got the following error and an empty list, even though there clearly are queued workflows:
We are having problems searching workflow runs. The results may not be complete.

I think the lambda relies on this call to return queued workflows before it spins up an instance.
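For illustration, a minimal sketch of that kind of check with @octokit/rest - this is my guess at roughly what the lambda depends on, not its actual code, and owner, repo and token are placeholders:

// Sketch: count queued workflow runs for a repo, roughly the check the
// scale-up lambda appears to rely on. Owner/repo/token are placeholders.
import { Octokit } from '@octokit/rest';

async function countQueuedRuns(owner: string, repo: string, token: string): Promise<number> {
  const octokit = new Octokit({ auth: token });
  // GET /repos/{owner}/{repo}/actions/runs?status=queued
  const { data } = await octokit.actions.listWorkflowRunsForRepo({
    owner,
    repo,
    status: 'queued',
  });
  console.log(`Repo ${repo} has ${data.total_count} queued workflow runs`);
  return data.total_count;
}

If that endpoint returns an empty list during a GitHub incident, the lambda would see 0 queued runs and have no reason to scale up.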

Seeing exactly the same thing, FWIW.

I was trying to figure out if there's an easy way to manually force a scale-up. It seems like the idle config is only checked during scale-downs? I'm unfamiliar with the code, so I may have missed something.

I spent a bit of time on a similar problem and found that tags required on my EC2 instances by policy were causing it to fail. I was able to find this by looking at CloudTrail API errors.
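If it helps anyone dig the same way, here is a rough sketch of pulling recent RunInstances errors out of CloudTrail with the AWS SDK for JavaScript v2; the region is an assumption and the error codes you see will depend on your account's policies:

// Sketch: look for recent RunInstances calls in CloudTrail and print any
// error codes (e.g. a tag-policy violation). Region is an assumption.
import * as AWS from 'aws-sdk';

async function findRunInstancesErrors(): Promise<void> {
  const cloudtrail = new AWS.CloudTrail({ region: 'eu-west-1' });
  const result = await cloudtrail
    .lookupEvents({
      LookupAttributes: [{ AttributeKey: 'EventName', AttributeValue: 'RunInstances' }],
      MaxResults: 50,
    })
    .promise();
  for (const event of result.Events ?? []) {
    const detail = JSON.parse(event.CloudTrailEvent ?? '{}');
    if (detail.errorCode) {
      console.log(event.EventTime, detail.errorCode, detail.errorMessage);
    }
  }
}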

Thanks for the responses thus far, everyone.

@rlove I can't find anything in CloudTrail to suggest the scale-up lambda is doing anything at all, error or otherwise.
@samgiles Yes, this was something I was looking into as well; I couldn't (in limited time, admittedly) craft a test event that would force the scale-up lambda into action - a rough sketch of what I was attempting is below.
@eky5006 That would make sense, but I'm still seeing the same problem, and according to https://www.githubstatus.com/incidents/xn0sd2x4nd7f the issue is resolved. Is it any better at your end?
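For reference, this is the kind of thing I was attempting: wrap a body like the one in my original log in an SQS-style Records envelope and invoke the deployed function directly. The function name and body values are placeholders, and I haven't verified that the handler accepts an event crafted like this:

// Sketch: invoke the deployed scale-up lambda with a hand-built SQS-style
// event. FunctionName and the body fields are placeholders based on the
// log output above; adjust to your own deployment.
import * as AWS from 'aws-sdk';

const body = JSON.stringify({
  id: 2005872726,
  repositoryName: 'my-repo',   // placeholder
  repositoryOwner: 'my-org',   // placeholder
  eventType: 'check_run',
  installationId: 15044875,
});

const testEvent = {
  Records: [
    {
      messageId: '00000000-0000-0000-0000-000000000000',
      body,
      attributes: {},
      messageAttributes: {},
      eventSource: 'aws:sqs',
      awsRegion: 'eu-west-1',
    },
  ],
};

async function invokeScaleUp(): Promise<void> {
  const lambda = new AWS.Lambda({ region: 'eu-west-1' });
  const result = await lambda
    .invoke({
      FunctionName: 'gh-ci-scale-up',   // placeholder
      Payload: JSON.stringify(testEvent),
    })
    .promise();
  console.log(result.StatusCode, result.Payload?.toString());
}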

I have the same problem.
INFO Repo <repo name> has 0 queued workflow runs, even though there are queued jobs, and this API https://docs.github.com/en/rest/reference/actions#list-workflow-runs-for-a-repository returns queued workflows properly.
It started happening yesterday and still doesn't work.

INFO Repo < repo name > has 0 queued workflow runs

@bartoszjedrzejewski Where are you seeing that output?

@rjcoupe in the scale-up CloudWatch logs. What version are you on? I think it is because I'm on 0.8.1; I'm trying to update right now. My colleague doesn't have this problem, and he is on 0.10.

I had the same issue: the outage left some lingering registered runners. I de-registered them from my GitHub organization and now the runners are scaling up as expected.

Hope this helps somebody.
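If anyone wants to script the clean-up, here is a rough sketch of listing and removing offline runners at the organization level with @octokit/rest; the org and token are placeholders, and you should obviously only delete runners you recognise as stale:

// Sketch: remove self-hosted runners that GitHub reports as offline.
// Org name and token are placeholders; double-check before deleting.
import { Octokit } from '@octokit/rest';

async function removeOfflineRunners(org: string, token: string): Promise<void> {
  const octokit = new Octokit({ auth: token });
  const { data } = await octokit.actions.listSelfHostedRunnersForOrg({ org });
  for (const runner of data.runners) {
    if (runner.status === 'offline') {
      console.log(`Removing stale runner ${runner.name} (${runner.id})`);
      await octokit.actions.deleteSelfHostedRunnerFromOrg({ org, runner_id: runner.id });
    }
  }
}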

Updating lambdas from 0.8.1 to 0.11.0 fixed my problem.

Hi, we had the same issue yesterday, and upgrading the lambdas from 0.8.1 to 0.10.0 also solved it.

I was on v0.10.0 so didn't hold much hope, but v0.11.0 does appear to fix the problem. Bizarre!

@gertjanmaas any ideas? It looks like it's related to the outage yesterday.

Could be related to the outage yesterday. In our case certain repositories didn't send an event to the webhook, which caused jobs to stay queued while no instance was created, but it could have affected any of the APIs we use.

The outage has been fixed, so if that was the cause this should be resolved.

Nope, it's happening again as of this morning, with no changes made to the AWS resources. It seems the correct behaviour earlier was a fluke.

Just learned that we have seen problems off and on with all of GitHub Actions today, not just the dynamic self-hosted runners. I think there are stability issues on GitHub's side.

