Mongoose: Don't reconnect after disconnection

Created on 26 Oct 2016  ·  47 Comments  ·  Source: Automattic/mongoose

After upgrading from 4.4.11 to >=4.6.1 we get random disconnections and mongoose never reconnects.

mongoose version: 4.6.5
mongodb version: 3.2.9

This is the way I create the connections:

    var uri = 'mongodb://USERNAME:PASSWORD@host1:port1,host2:port2/database?authSource=admin';

    var server = 'my_db'; // hypothetical label; the original snippet logs `server` below without defining it

    var options = {};
    options.server = {
      auto_reconnect: true,
      poolSize: 5,
      socketOptions: { keepAlive: 1, connectTimeoutMS: 30000 },
      reconnectTries: 3000
    };

    options.replset = {
      auto_reconnect: true,
      poolSize: 5,
      socketOptions: { keepAlive: 1, connectTimeoutMS: 30000 },
      reconnectTries: 3000
    };

    var db = mongoose.createConnection(uri, options);

    mongoose.connection.on('error', function(err) {
      console.log('MONGODB ERROR MONGOOSE LEVEL ' + server, err);
    });

    db.on('connecting', function() {
      console.info('MONGODB ' + server + ' connecting.');
    });

    db.on('error', function(err) {
      console.log('MONGODB ERROR ' + server, err);
    });

    db.on('close', function(err) {
      console.log('MONGODB CLOSE ' + server, err);
    });

    db.on('connected', function() {
      console.info('MONGODB ' + server + ' connected successfully.');
    });

    db.once('open', function callback() {
      console.info('MONGODB ' + server + ' opened successfully.');
    });

    db.on('reconnected', function() {
      console.info('MONGODB ' + server + ' reconnected.');
    });

    db.on('timeout', function() {
      console.info('MONGODB ' + server + ' timeout.');
    });

    db.on('disconnected', function() {
      console.info('MONGODB ' + server + ' disconnected');
    });

This is the sequence of events I get:

    pid:3429 MONGODB geo_uri connected successfully.
    pid:3429 MONGODB geo_uri opened successfully.
    pid:3429 MONGODB dashboards_db connected successfully.
    pid:3429 MONGODB dashboards_db opened successfully.
    pid:3429 MONGODB tweet_analytics_db connected successfully.
    pid:3429 MONGODB tweet_analytics_db opened successfully.
    pid:3429 MONGODB fullcontact_enrichment_db disconnected
    pid:3429 MONGODB ERROR fullcontact_enrichment_db { [MongoError: no valid replicaset members found]
      name: 'MongoError',
      message: 'no valid replicaset members found' }
    pid:3429 MONGODB uri connected successfully.
    pid:3429 MONGODB uri opened successfully.
    pid:3429 MONGODB sync_reports_db connected successfully.
    pid:3429 MONGODB sync_reports_db opened successfully.
    pid:3429 MONGODB uri disconnected
    pid:3429 MONGODB CLOSE uri

Then none of the disconnected databases reconnect. Due to the random nature of the problem, I won't be able to provide code to reproduce it. I suspect something is saturated at the node.js level or similar. Regardless of the cause of the disconnection, is there anything I can do to try to reconnect myself?

All 47 comments

+1, had 2 outages in the last 5 days related to this issue. My current workaround is to make my process explicitly crash (process.exit(0)) when the disconnected event is emitted. Then the process is restarted and the connection is properly opened again.
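
For reference, a minimal sketch of that crash-on-disconnect workaround; it assumes a process manager (pm2, forever, systemd, etc.) restarts the process:

    // Exit on 'disconnected' so the process manager restarts the app
    // with a fresh mongodb connection.
    mongoose.connection.on('disconnected', function() {
      console.error('mongoose disconnected; exiting so the process manager restarts us');
      process.exit(0);
    });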

My guess is that mongoose tries to reconnect but fails within the reconnectTries * reconnectInterval window, and the error event is not emitted due to this regression (https://github.com/Automattic/mongoose/pull/4653). What I don't know is why mongoose/mongo gets randomly disconnected; I didn't have this behavior before.
What mongo hosting are you running? mLab?

We host it ourselves in AWS. Unfortunately process.exit(0) is not an option for us.

I have applied the changes from #4653 and got the same behavior. Disconnection after a couple of hours:

    2016-10-27T11:26:42 pid:5276 MONGODB sync_reports_db connected successfully.
    2016-10-27T11:26:42 pid:5276 MONGODB sync_reports_db opened successfully.
    .... 2 hours later
    2016-10-27T13:45:45 pid:5276 MONGODB sync_reports_db disconnected
    2016-10-27T13:45:45 pid:5276 MONGODB CLOSE sync_reports_db

No error event emitted? (There should be one, 30 seconds after the disconnected one.)

No, have a look at the code in the issue description. Unless I'm doing something wrong, there is an event handler in place for the error event. In fact, the process is still running and mongoose hasn't triggered any other event.

We've been having the same problems in the last few days since 4.6.5: random disconnects which cause the node process to get stuck. But no error event. Reverting to 4.5.3 works.

@loris is this related to https://github.com/Automattic/mongoose/commit/f7ebee0c992c45cdb27ba7f0675556b980cddaad in 4.6.6?

@mck Yes, https://github.com/Automattic/mongoose/commit/f7ebee0c992c45cdb27ba7f0675556b980cddaad fixes the error event not being emitted when the mongodb connection retry mechanism fails. However, I have no idea why random disconnects are happening in the first place. Any idea @vkarpov15?

fwiw, it happened for us 40-50% of the time if we tried to do a save/update operation (writing about 650kb)

Yeah, I don't really have many good ideas. You can try calling close() on the connection and then calling connect() yourself again. @loris do you have a similar experience, where a heavy save/update seems to cause this?
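
A minimal sketch of that close-and-reconnect suggestion; the 5-second delay and the reuse of the original uri/options are assumptions, not part of the advice:

    var RECONNECT_DELAY_MS = 5000; // illustrative value

    mongoose.connection.on('disconnected', function() {
      mongoose.connection.close(function() {
        setTimeout(function() {
          // same uri/options as the initial connect call
          mongoose.connect(uri, options);
        }, RECONNECT_DELAY_MS);
      });
    });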

We are facing this as well. On one of our services we're getting an error event (connection N to ds0XXXXX-a0.mongolab.com:XXXXX timed out), followed by a disconnected event, and the connection never reestablishes. On another service we get a disconnected event after a heavy request on the DB that is removing 2M records, and then it's unable to reconnect (mongoose 4.6.6, DB version 3.0.12).

It just happened to us again a few minutes ago, running mongodb on mLab too (not sure it is related tho). We ran a heavy query and it timed out (unhandledRejection { MongoError: connection 0 to ds****-a0.mongolab.com:**** timed out }). The disconnected event was properly emitted, but nothing after: no error, no reconnected, etc. The web server kept running and receiving HTTP requests, but they all timed out until we restarted it, because mongoose queries kept buffering and never returned.

Our mongoose setup (4.6.5 on node 7.0.0):

    const mongoConnectOpts = { reconnectTries: 10, reconnectInterval: 500, socketOptions: { keepAlive: 300000, connectTimeoutMS: 30000 } };
    mongoose.connect(process.env.MONGODB_URI, { server: mongoConnectOpts, replset: mongoConnectOpts });
    mongoose.connection.on('error', err => {
      console.log({ event: 'mongoose:error', ...err });
      process.exit(0);
    });
    mongoose.connection.on('connected', () => console.log({ event: 'mongoose:connected' }));
    mongoose.connection.on('disconnected', () => console.log({ event: 'mongoose:disconnected' }));
    mongoose.connection.on('reconnected', () => console.log({ event: 'mongoose:reconnected' }));

A temporary workaround would be to process.exit(0) on the disconnected event too, to force the web server to restart and set up a new mongodb connection. Any ideas?

Yeah, I've heard of this sort of thing happening with mlab before. TBH in this case I'd just crash the server and restart; really slow, spotty networking tends to cause trouble for the mongodb driver, and I'd imagine debugging this would involve coordinating with mlab.

I'm not sure what is wrong now (mongoose, mongo driver, mlab or heroku), but for the past few days, running a web request which executes a heavy mongoose query (taking more than 30s to respond) will trigger a heroku request timeout (a mechanism in heroku which times out any web request taking more than 30s). Once that web request has timed out, any following web request on that server which requires a mongoose query will time out too. The real problem is that ZERO events are triggered by mongoose (no error, close, disconnected, etc.), so we have no way to detect the disconnection and restart the server. Here is how we set up mongoose:

    mongoose.Promise = global.Promise;
    mongoose.set('debug', process.env.NODE_ENV === 'development');
    const mongoConnectOpts = { reconnectTries: 10, reconnectInterval: 500, socketOptions: { keepAlive: 300000, connectTimeoutMS: 30000 } };
    mongoose.connect(process.env.MONGODB_URI, { server: mongoConnectOpts, replset: mongoConnectOpts });
    mongoose.connection.on('error', err => {
      logfmt.log({ event: 'mongoose:error', ...err });
      process.exit(0);
    });
    mongoose.connection.on('connected', () => logfmt.log({ event: 'mongoose:connected', uri: _.truncate(process.env.MONGODB_URI) }));
    mongoose.connection.on('disconnected', () => {
      logfmt.log({ event: 'mongoose:disconnected' });
      process.exit(0);
    });
    mongoose.connection.on('close', () => logfmt.log({ event: 'mongoose:close' }));
    mongoose.connection.on('reconnected', () => logfmt.log({ event: 'mongoose:reconnected' }));

    // Setup Redis cache (Default cache TTL: 60 seconds)
    cachegoose(mongoose, { engine: 'redis', client: redisCache }, process.env.NODE_ENV === 'development');

@loris can you see if this still happens if you disable cachegoose?

@vkarpov15 Will try to check this out, but it's not easy since the bug only happens in production and we can't afford to run production with the cache disabled.
@aartiles @mck- @lushchick do you guys use mongoose cache plugin?

It's possible this might be related to https://github.com/christkv/mongodb-core/issues/148. I have been running into a similar issue when a member of a replicaset becomes unavailable.

We don't use mongoose cache plugin.

I am still looking into it; what I can tell you so far:

  • I was able to reproduce the issue on my local machine: I start a loop running a fast query every second and logging results; some delay later, I run a heavy query (like an unindexed query on a million-row collection). All queries (the fast ones too) get locked/timed out, and no mongoose event is fired.
  • The issue appeared in 4.6.5; downgrading to 4.6.4 fixes it (the fast queries still get executed while the heavy query is running)
  • It looks like it has nothing to do with:

    • the mongoose cache plugin

    • running a replica set or not

I did more digging and the issue came with upgrading to [email protected]; the buggy change there is https://github.com/mongodb/node-mongodb-native/pull/1418
It looks like they fixed a typo, but that eventName (with the typo) is being used by some deps.

@loris Do you know where the remaining references with the wrong event name are? I can make a PR to fix it but I'm not able to find any.

@jakesjews Can't find any references either, so I might be wrong about the cause; I might have checked out mismatched versions on my machine. I will get back to you.

Having a similar issue: a failure to reconnect after a disconnect. This manifests in the client by hanging on any mongoose db operation (timing out via operation buffering).
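
As an aside on that buffering behavior, a hedged sketch of making operations fail fast instead of hanging, using mongoose 4.x option placement (a mitigation for the symptom, not a fix from this thread):

    var mongoose = require('mongoose');

    // Disable mongoose-level command buffering per schema...
    var itemSchema = new mongoose.Schema({ name: String }, { bufferCommands: false });

    // ...and driver-level buffering, so operations error out immediately
    // while the connection is down instead of queueing forever.
    mongoose.connect(process.env.MONGODB_URI, { db: { bufferMaxEntries: 0 } });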

@jakesjews my issue is unique to replica sets as well and does not occur with a single node connection. Looking more into the core replset.js.

Enabled debug logging within node-mongodb-native to see if the HA reconnection was working as expected; it appears to be:

    attemptReconnect for replset with id successful resuming topologyMonitor 1

Despite the node-mongodb-native driver claiming it successfully performed its attemptReconnect, mongoose will never emit a connected or reconnected event like it does when a single node non-replset reconnects.

As @loris mentioned, process.exit(0) -> restart of service would work (in my case) since the issue is directly related to reconnection to a replica-set but again not ideal.

[email protected]
[email protected]

@mck- Found the same as you, downgrading to 4.5.3 fixed the reconnect issue for my replicaset connection.

I can upgrade all the way to 4.5.10 before it starts failing at 4.6.0 where the replicaset reconnect issue occurs for the first time. Current guess is that it could be related to the upgrade to mongodb driver 2.2.9.

@loris could you provide your test case so that we can try it?

I'm still looking at it; I was wrong previously about the typo-fix commit. It looks like the culprit is https://github.com/christkv/mongodb-core/pull/146/commits/09caa9d1e5423acd2f8f154f7b7430028e77e57f
Providing a test case is a bit complicated since it only happens this way (see the sketch after this list):

  • mongoose 4.6.8, connecting to my localhost mongodb (3.2) with default settings
  • 2 express routes, one running a long-running mongoose query (several seconds), one running a fast-running mongoose query (the issue doesn't occur when running the mongoose queries directly in node, with a setInterval/setTimeout test case for instance, so my guess is that it is related to how the connection pool is handled)
  • If I execute the long-running express route, then try to hit the fast-running one, the latter will keep running without returning
  • Setting poolSize to 50 instead of the default fixes the issue
  • Checking out the previous commit of mongodb-core also fixes the issue with the default pool size (the fast-running express route returns in a few ms while the long-running route is processing)
  • So I guess https://github.com/christkv/mongodb-core/pull/146/commits/09caa9d1e5423acd2f8f154f7b7430028e77e57f changed something in how a single long-running mongoose query can use every available connection in a pool
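
A hedged sketch of that two-route Express repro; the Item model, route paths, and regex query are illustrative assumptions:

    const express = require('express');
    const mongoose = require('mongoose');

    mongoose.connect('mongodb://localhost/test'); // default poolSize of 5

    const Item = mongoose.model('Item', new mongoose.Schema({ name: String }));
    const app = express();

    // Long-running route: an unindexed regex scan over a large
    // collection, taking several seconds.
    app.get('/slow', (req, res) => {
      Item.find({ name: /needle/ }).exec((err, docs) => {
        res.json({ count: err ? -1 : docs.length });
      });
    });

    // Fast route: normally returns in a few ms, but with the buggy
    // mongodb-core it hangs while /slow is in flight.
    app.get('/fast', (req, res) => {
      Item.findOne({}).exec((err, doc) => res.json(doc || {}));
    });

    app.listen(3000);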

A fix just landed in mongo-core which might fix this issue.

@loris yeah, increasing the pool size will help. Looks like you're running into the slow train problem, where the mongodb server can't handle more than poolSize requests simultaneously. Just don't increase it too much, or you'll start seeing performance issues with WiredTiger.
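
For reference, a sketch of raising the pool size with mongoose 4.x options (50 is the value tested above; treat it as an illustration, not a recommendation):

    var opts = {
      server: { poolSize: 50 },
      replset: { poolSize: 50 }
    };
    mongoose.connect(process.env.MONGODB_URI, opts);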

Hi,
any update on this? I'm seeing the same issue: auto_reconnect not working when using a replica set. I just tried mongoose 4.7.0 with mongodb 2.2.11 and it's still not reconnecting. I'm using mongod version 3.2.10.

I'm running everything on a single host (laptop) with three instances of mongod running on different ports with different database directories. Doesn't seem like that should be an issue, but I'm new to mongo/mongoose/node/javascript. My node app with mongoose is also running on the same host.

I can reproduce this by simply shutting down all the mongod processes
(launchctl stop mongod01; launchctl stop mongod02; launchctl stop mongod03),
waiting for the connection closed message, then restarting (replace "stop" with "start" in the launchctl commands). My app never reconnects to mongo.

If I do the same test with a single instance of mongod not configured as a replica set, mongoose reconnects fine.

glad to provide logs or try patches if that's helpful.

After doing some digging I think I might have found one source of the problem. It looks like when autoReconnect is true, the connection buffer isn't supposed to become active on connection close events (https://github.com/Automattic/mongoose/blob/master/lib/drivers/node-mongodb-native/connection.js#L153). However, there is no longer an autoReconnect property being set inside the replset class in mongodb-native (https://github.com/mongodb/node-mongodb-native/blob/2.2/lib/replset.js), so any close event from one of the nodes causes the buffer to become permanently enabled. I've had some luck with the commit https://github.com/eflexsystems/mongoose/commit/5ac12727f34b41791f94643b66c8cc88aff4d66a, but I want to give it some more time to see if it causes any other issues before I make a pull request.

@joeldodson you are describing the same issue that I experienced. Just a heads up: releases >= 4.6.0 seem to contain the issue. I would try 4.5.10 in the meanwhile; it has been reconnecting to a replset and a single connection just fine for me.

Thanks @jakesjews and @kog13

I tried 4.5.10 and mongoose does reconnect after restarting the replica set. However, db.readyState doesn't seem to be cleared after all instances in the replica set are stopped. We're checking that to determine whether to reject any requests (so they don't queue up waiting for the app to reconnect to the DB). Also, I didn't get the disconnected or connection closed notifications with 4.5.10.

I already have logic that sits in a loop with a 5-second timer to try to connect if the DB is not available when the app starts. We've tried calling that in the db.on('close', function(){...}) handler and it seems to work fine. My concern, though, is whether explicitly trying to connect will conflict with any reconnect logic under the hood. Since reconnect doesn't seem to be happening for replica sets, I'm thinking it's okay. Also, we set auto_reconnect to false in the connection options for both the server and the replset.
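
A hedged sketch of that retry loop; the connectWithRetry name is a hypothetical helper, and it assumes the same uri/options used at startup:

    // Retry the initial connect every 5 seconds until it succeeds, and
    // run the same logic whenever the connection closes.
    function connectWithRetry() {
      mongoose.connect(uri, options, function(err) {
        if (err) {
          console.error('mongodb connect failed, retrying in 5s', err);
          setTimeout(connectWithRetry, 5000);
        }
      });
    }

    connectWithRetry();
    mongoose.connection.on('close', connectWithRetry);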

@jakesjews - I tried the patch you mentioned above but still didn't get the reconnect. Maybe I missed something but it looked like that patch was more about ensuring the close event is generated and the readyState is updated.

I'm glad to try any more patches for auto_reconnect if anyone has any. I'll keep digging around as well.

thanks.

@joeldodson in addition to the patch above, you'll also need to depend on the latest release of mongo-core, which has fixes for making sure the replica set monitor connection stays alive. If you try out my fork it should have it already.

I think I am in a pretty good place with regards to timeouts and failover now with mongoose. If anyone else wants to try it out you will want to use my fork of mongoose in addition to setting socketTimeout to 0 in the socketOptions.

The reason for setting socketTimeout to 0 is that there is a bug in mongo-core which I still need to file an issue for. The issue is caused by the dynamically shrinking/expanding connection pool. The way the pool works, every time it adds a new connection, that connection will close after 30 seconds of inactivity. Timeout events trigger removal from the pool and also run a check comparing the number of timeouts against a limit of 30 connection attempts. There is a heartbeat which runs every 2 seconds and clears out the timeout count, but if 30 or more requests are queued up in parallel, all of them can time out between heartbeats and destroy the connection pool. For now, setting the timeout on the connections to 0 prevents inactive connections from being pruned from the pool and avoids the issue. If you want to replicate the issue, try setting the connection pool size to something around 50 and running 50 parallel queries; after that happens, the pool will be destroyed in about 30 seconds. Note that the timeout doesn't affect the heartbeat checking the replicaset, since that has its own timeout.
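
A hedged sketch of that repro; the Item model and localhost URI are assumptions:

    const mongoose = require('mongoose');
    const Item = mongoose.model('Item', new mongoose.Schema({ name: String }));

    // Pool of ~50 connections, then 50 queries fired in parallel. With
    // the buggy mongodb-core pool, all of them can time out between the
    // 2-second heartbeats, and the pool is destroyed ~30 seconds later.
    mongoose.connect('mongodb://localhost/test', { server: { poolSize: 50 } });

    for (let i = 0; i < 50; i++) {
      Item.find({}).exec(function(err, docs) {
        if (err) console.error('query failed', err);
      });
    }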

I've been really swamped with work recently so I haven't had a chance to collect all these changes together but I hope to get to it soon.

Thanks again @jakesjews. I pulled in your mongoose and mongodb-core repos and the reconnect did work. I didn't get the 'connected' and 'reconnected' events, though, that I get when using a single instance of Mongo. Also, the readyState doesn't seem to get reset; it's still 0 even after the reconnect.

I'm glad to help with testing or collecting logs.

Still having the issue with [email protected]

Anything new regarding this matter?

Experiencing the issues here as well. We had to revert the mongoose version upgrade because of connection issues. A query taking a few seconds currently times out our connection, which was previously not the case.

This issue must be dealt with, otherwise the package is not usable.
I will probably try out @jakesjews's solution when I have the time. Until then, we're using 4.4.x.

If the problem is a bug with mongo-core, this isn't really an issue with mongoose. If you're getting the issue with the most recent version of mongoose, can you file an issue in the mongodb-core repo?

Actually, it is an issue with mongoose, since mongoose has been updated to depend on the version of mongodb-core that has the issue. Maybe mongoose should be pinned back to a previous version of mongodb-core.

@jakesjews Seeing that [email protected] depends on [email protected], which in turn depends on [email protected], would applying the fixes from your fork be all that is required to fix this issue, with seemingly no side effects?

And looking at your fork, is this commit now the only change necessary for a 4.7.x and/or 4.8.x PR?

@Koslun the fork should be unnecessary now. The fix is pretty much just setting socketTimeout to 0 in the database socketOptions.

Here's what my socket options look like now:

    var opts = {
      server: {
        socketOptions: {
          keepAlive: 1,
          socketTimeout: 0
        }
      },
      replSet: {
        socketOptions: {
          keepAlive: 1,
          socketTimeout: 0
        }
      }
    };

@jakesjews Ok, thanks for the swift clarification :).

Do you know in which version of mongodb-core we will see a fix to this bug? Or alternatively have an issue, PR or commit we can track?

Any updates regarding @Koslun's comment?

Any updates for this issue?

The issue is still open from 2016 :open_mouth:

I do wonder if this issue can still happen with mongoose 5.x, mongodb driver 3.3.4, and MongoDB server 4.x? :thinking:

Does anyone know if reconnectTries gets reset when reconnection is successful?

For instance, say reconnectTries is set to 30, and once disconnected, mongoose tries 5 times and the connection succeeds.
Next time the connection is lost, what will be the counter for retries?
Will it try to reconnect 30 times?
Or 25 times?

@szabolcs-szilagyi yes, but only if you do not set useUnifiedTopology to true.

@bhaveshvyas007 yes it does. Here's the relevant code

For posterity:

If you are running Mongoose 5.x without useUnifiedTopology, read this guide to managing MongoDB connections.

If you're running Mongoose 5.x with useUnifiedTopology, this issue doesn't affect you.
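
For anyone opting in, a minimal sketch of connecting with the unified topology in Mongoose 5.x (the legacy autoReconnect/reconnectTries/reconnectInterval options do not apply under it):

    const mongoose = require('mongoose');

    mongoose.connect(process.env.MONGODB_URI, {
      useNewUrlParser: true,
      useUnifiedTopology: true
    });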
