We've been seeing a persistent memory issue since a week ago Saturday, and I'm compiling information about it here to investigate.
Wondering if it's related to this controller method for the dashboard.
Noting @icarito's comment:
I wonder, @jywarren, because I had edited docker-compose-production.yml to use fewer processes (I didn't make a PR for it). So it could be we just made it fit that way.
And this graph:
We're seeing a lot of SMTP test errors too:
Link: https://intelligence.rackspace.com/cloud/entities/en45StuOyk/checks/chXoX9GHhF/alarm/alycd3HZyu
Yes, load is very high too. From htop, and especially iotop, it appears mailman is quite active. It's the culprit for sure! Prior to May 22nd we ran it a few times a day; if we can run it every few minutes or so (not every second!) it would be fine!
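For reference, the polling rate lives in the Mailman configuration loaded by script/mailman_server; a minimal sketch, where the interval value is illustrative rather than what's actually deployed:

```ruby
# Sketch only: 120s is an illustrative value, not our production setting.
require "mailman"

Mailman.config.poll_interval = 120 # seconds between POP3 checks
```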
I, [2019-05-07T23:56:44.702410 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-08T21:33:03.762360 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-09T07:47:27.518491 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-09T08:18:47.825703 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-10T08:14:53.010705 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-10T21:45:50.739207 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-11T17:38:51.647335 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-13T03:33:15.682877 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-14T05:51:40.603184 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-14T05:53:20.857041 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-14T05:55:00.356772 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-14T05:56:40.487219 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-15T01:43:42.908744 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-16T10:13:45.703985 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-18T12:57:16.194957 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:49:27.019569 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:49:55.827419 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:50:18.722700 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:50:41.709075 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:51:00.124271 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:51:17.146210 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:51:33.745494 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:51:51.387282 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:52:09.145006 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:52:31.266559 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:53:03.176998 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:53:26.991989 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:53:54.074275 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:54:13.905343 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:54:37.736641 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:54:57.357057 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:55:15.522535 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:55:34.343241 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:55:51.964241 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:56:10.016964 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:56:42.822692 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:56:59.826809 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:57:16.178517 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:57:35.871196 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:57:59.731422 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:58:16.353160 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:58:33.608591 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:58:50.037296 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:59:06.912680 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:59:32.287362 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T08:59:59.201948 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T09:00:18.739067 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T09:00:42.144910 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T09:01:03.495556 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T09:01:20.493712 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T09:01:37.089192 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T09:01:53.921571 #1] INFO -- : Mailman v0.7.0 started
I, [2019-05-22T09:02:14.509227 #1] INFO -- : Mailman v0.7.0 started
The log is filled with cycles of these, no error:
I, [2019-06-02T02:35:26.270644 #1] INFO -- : Mailman v0.7.0 started
I, [2019-06-02T02:35:26.270851 #1] INFO -- : Rails root found in ., requiring environment...
I, [2019-06-02T02:35:56.930267 #1] INFO -- : POP3 receiver enabled ([email protected]@pop.gmail.com).
I, [2019-06-02T02:35:56.938850 #1] INFO -- : Polling enabled. Checking every 5 seconds.
Looks like mailman is crashing and being immediately respawned!
icarito@rs-plots2:/srv/plots_container/plots2$ docker ps
CONTAINER ID  IMAGE               COMMAND                  CREATED     STATUS         PORTS                     NAMES
8d13c675568e  containers_mailman  "script/mailman_serv…"   4 days ago  Up 14 seconds                            containers_mailman_1
f423dec91ebe  containers_web      "/bin/bash -c 'sleep…"   4 days ago  Up 4 days      127.0.0.1:4001->4001/tcp  containers_web_1
24f7b43efebc  containers_sidekiq  "bundle exec sidekiq…"   4 days ago  Up 4 days                                containers_sidekiq_1
070511ab43d1  redis:latest        "docker-entrypoint.s…"   4 days ago  Up 4 days      6379/tcp                  containers_redis_1
6ea8f0498b2c  mariadb:10.2        "docker-entrypoint.s…"   4 days ago  Up 3 days      3306/tcp                  containers_db_1
I've decided to stop this container for tonight in order to monitor the effect on performance.
I think we may also look at what gem updates were merged in the days leading up to this deployment. Thanks!
That's so weird about mailman, I will look at the config but I don't remember any changes to the rate.
Oh you know what? We set it to retry 3 times. Maybe these are overlapping now? It could at least have increased the rate of attempts since it retries 3 times for every scheduled run.
OK, modified it to 20 seconds, which should mean at most a retry every 5 seconds --
https://github.com/publiclab/plots2/commit/a40ea5650f2ce9ec80ee2324cea2d8c9bd98e382
That'll be the same rate as before when we added retries.
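A quick sanity check of that arithmetic: with a 20-second poll interval and up to 3 retries per scheduled run, the worst case is 4 attempts per 20-second window, i.e. one attempt every 5 seconds. This is a hypothetical calculation to illustrate the reasoning, not code from the repo:

```ruby
# Worst-case attempt rate for a polling loop with retries (hypothetical numbers).
def seconds_per_attempt(poll_interval, retries)
  attempts = 1 + retries      # the scheduled run plus its retries
  poll_interval.to_f / attempts
end

puts seconds_per_attempt(20, 3) # 20s interval, 3 retries => 5.0
```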
OK, now working on analysis after a few hours:
https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559574420/6h/endpoints
Overall looks good. But, on closer look, it's ramping up in load time:
Comparing the latter portion where it's starting to go back up:
to the earlier just after the reboot:
And then to this from a couple weeks ago before all our trouble:
Then finally just after we started seeing issues on the 22-23rd of May:
Overall it's not conclusive.
Resources:
One of the tough things about this is that it's right around where these two commits happened:
I'd like to think it relates to the addition of the retry-3-times code in https://github.com/publiclab/plots2/commit/2bc7b498ef3a05bc090ef26f316a30ec0104bcc6, which I tried tweaking today. But actually load times are still slowly growing.
This could mean that a) something else is driving it, or b) the "rescue/retry" cycle itself could be causing memory leak buildup?
Shall I comment out the rescue/retry code entirely?
Maybe the hanging while waiting for MySQL to pick up is actually taking up threads?
I'll try this. Site is almost unresponsive.
I removed the retry here: https://github.com/publiclab/plots2/commit/faa5a12e86bf7944dca43134f649947f03ca96a6
Deploying... it'll take a while.
Hmm it really doesn't seem solved... https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559577660/8h13m/endpoints
Ok I wonder if the container setup affected the mailman container at all? Because at this point we've reverted all the likely stuff from the mailman script.
OK, overnight it peaked and went back down a bit. But our problematic ones are still quite high, with peaks at about 20 seconds:
The stats range calls are taking up to 40+ seconds!
They're also taking forever on cache generation:
Could we be seeing an issue with the cache read/write?
@icarito could there be like an issue on the read/write io or something on cache generation? I'm just not sure why it would take this long to pack all the data into the cache.
Leaky gems -- check off if we're OK
Non-leaky but memory issues in any case:
I'm still seeing this massive cache generation time for stats_controller#range
and wondering if we need to tweak where the cache is stored. It looks like the default is file storage (and I checked, we have cache files in /plots2/tmp/cache/). Would we be helped at all by switching to in-memory caching or memcached, both of which are apparently pretty simple changes?
https://guides.rubyonrails.org/v3.2/caching_with_rails.html#activesupport-cache-memorystore
Also looking at https://www.skylight.io/support/performance-tips
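Per the Rails guide linked above, either store would be a one-line change in config/environments/production.rb; a sketch of both options (the size and host values are assumptions, not tested settings):

```ruby
# In-memory store: fast, but per-process, so Passenger workers
# can't share cached entries with each other.
config.cache_store = :memory_store, { size: 64.megabytes }

# Or memcached (via the dalli gem), shared across processes:
config.cache_store = :mem_cache_store, "localhost"
```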
I'll look at the email configuration now, but if it doesn't yield anything I'll merge this, turning off the begin/rescue loop: #5840
OK our next step for https://github.com/publiclab/plots2/pull/5841 is to develop a monitoring strategy for if mailman goes down.
Deploying with the new email credentials, AND the begin/rescue removal. However, I think it's worth redeploying with the begin/rescue re-instated if the memory leak is solved, because it could have been the email credential issues.
Latest error:
mailman_1  | /app/app/models/comment.rb:265:in `add_comment': undefined method `body' for nil:NilClass (NoMethodError)
mailman_1  | 	from /app/app/models/comment.rb:218:in `receive_mail'
mailman_1  | 	from script/mailman_server:31:in `block (2 levels) in <main>'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/router.rb:66:in `instance_exec'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/router.rb:66:in `route'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/message_processor.rb:23:in `block in process'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/middleware.rb:33:in `block in run'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/middleware.rb:38:in `run'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/message_processor.rb:22:in `process'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/receiver/pop3.rb:43:in `block in get_messages'
mailman_1  | 	from /usr/local/lib/ruby/2.4.0/net/pop.rb:666:in `each'
mailman_1  | 	from /usr/local/lib/ruby/2.4.0/net/pop.rb:666:in `each_mail'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/receiver/pop3.rb:42:in `get_messages'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:133:in `block in polling_loop'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:130:in `loop'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:130:in `polling_loop'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:83:in `run'
mailman_1  | 	from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:11:in `run'
mailman_1  | 	from script/mailman_server:22:in `<main>'
That's here:
Ugh, finally really publishing the comment.rb fix...
OK, we're waiting to see if the email queue flushes out and we see some return to normalcy then...
I left a comment on https://publiclab.org/notes/mimiss/06-04-2019/workshop-viii to test
Hi @jywarren I've been giving this a second look and have a theory.
First here is a graph for RAM use for the past 3 months:
It is apparent from this graph that our memory consumption has been growing for the past three months!
I went back a whole year:
Apparently, in 2019, our application has increased its memory requirements quite a bit.
The theory is that, following the trajectory of memory consumption we have been on, we may have reached a threshold where we have consumed the available RAM and have begun relying on swap, which is slowing things down considerably.
The memory increase could well be due to the size of some of our tables (rusers is the one I'm looking at). This may have a relation to #5524.
We will have to implement some optimizations, migrate the database to a different host, or add more RAM.
Pruning the database of spam users is also highly recommended.
I'm still leaning towards memory exhaustion due to app/site growth, which is causing high IO load due to swap memory "thrashing" to disk.
I've checked our passenger-memory-stats from the web container and think that we can further reduce the process pool:
I will try this as a first move to remediate performance.
I found that in Feb 2018 we had calculated that we could run 11 processes (because our app took 500MB to run).
The formula is:
max_app_processes = (TOTAL_RAM * 0.75) / RAM_PER_PROCESS
                  = 6000MB / 750MB
                  = 8
but we are also running skylightd, plus a process for fetching tweet comments, plus Sidekiq, and we also want to run the mailman process.
The majority of RAM use is in the web container:
From both the above images I gather that we can still spare one more process, especially if this will make responses quicker.
Moving to 4 process pool size.
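The formula above, expressed as a quick check (numbers are this thread's approximations, and the helper name is mine):

```ruby
# Capacity estimate: reserve 25% of RAM for the OS and other services,
# then divide the remainder by the per-process footprint.
def max_app_processes(total_ram_mb, ram_per_process_mb)
  ((total_ram_mb * 0.75) / ram_per_process_mb).floor
end

puts max_app_processes(8000, 750) # => 8, before subtracting skylightd, Sidekiq, mailman, etc.
```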
First optimization done.
Promising first 30 minutes!
Ooh!
OK, so mitigations list would be:
- rusers - perhaps building on work in https://github.com/publiclab/plots2/issues/5450
- memcached
Hey @jywarren and @icarito,
First, (and I say this without joking): this thread actually turned out to be quite a good read. It had all the elements, a mystery, a hunt, dead ends, close calls, etc.
Anyway.
Regarding the rusers table relative to #5450 and #5524, there is an _enormous_ grouping in rusers that occurred between 4/26/13 and 1/1/14.
4/26/2013: 1366934400
1/1/2014: 1388534400
UID Range: 59627 - 420114
Users: 360466
Do you want to try the first query of the test run you described in #5450 on a portion of that group?
users who have not posted any node, comment, like, or subscription and have not logged in ever
As you said, this would be an easy query since not logging in ever would cover all of the criteria that came before it.
For reference on equivalent portion size to your proposed last 6 months in the other email: In the last month we've marked ~250 first-time postings as spam. So, let's say that in the last 6 months we had ~1500 banned users due to spam.
Oh, and I guess that brings up a good point. If you want to rid yourself of spam users you can just find all of the users who have content marked as spam and then delete the users who posted them.
As was briefly touched on in one of the issues, it might be good to have users with first-time-content marked as spam immediately deleted from the database.
Hi @skilfullycurled, thank you for your input! So a majority of rusers rows are from 2013-2014. That means to me that while pruning can help reduce RAM usage, actually our major tables are rsessions and impressions.
rsessions is over 30GB.
@jywarren and @skilfullycurled - it would be great to come up with a strategy to reduce this and / or optimize queries using this table!
Also, I think memcached isn't a good fit for this issue, as it should consume more RAM, not less...
Although one can limit the memory use of memcached, I'll still try it!
Nope, from the docs above:
If you’re running multiple Ruby on Rails server processes (which is the case if you’re using mongrel_cluster or Phusion Passenger), then your Rails server process instances won’t be able to share cache data with each other. This cache store is not appropriate for large application deployments, but can work well for small, low traffic sites with only a couple of server processes or for development and test environments.
Looks like it's not too hard to solve _rsessions_:
https://stackoverflow.com/questions/10088619/how-to-clear-rails-sessions-table
@jywarren let's do this!
@icarito, I'm not sure this was ever done, but I had access to the database in 2016 and I notified everyone that the user sessions took up more space than the rest of the database by far. I was told they'd be flushed, so either they were not, or the problem remains that the database just continues to keep the sessions.
To give a feeling: as of 2016, the plots database _compressed_ as bz2 was 1.9GB (no time right now to decompress for the actual size); _uncompressed_, with the sessions removed, it was 518MB.
Thanks @skilfullycurled !!! I think I remember your input from 2016, I don't know how we missed flushing that, but our database dumps today are over 8GB compressed, likely mostly sessions.
I'll wait for confirmation from @jywarren - I'd like to run the following in production today and then we can make it into a rake task or a cron job:
DELETE FROM sessions WHERE updated_at < DATE_SUB(NOW(), INTERVAL 1 DAY);
I got too curious, the uncompressed file is 6.8GB so minus the 518MB that puts us at 6.3GB. 😆
The rsessions is actually my favorite dataset that I have. It's completely use-_less_ , but I just love that it's as large if not larger than datasets that I have that are use-_ful_! If anyone has any ideas for what to do with it, let me know!
icarito (@icarito:matrix.org) got it from here https://stackoverflow.com/questions/10088619/how-to-clear-rails-sessions-table
icarito (@icarito:matrix.org) it should log out every session that's not been active in the past day or week - we can tweak it
Just taking notes here. Sounds great.
Unstable seems to take a long time... can try
DELETE ... FROM ... WHERE ... LIMIT x
And execute as many times as needed in production.
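The batched idea, sketched out: repeat a LIMITed DELETE until it removes no rows, so no single statement locks the table for hours. This is a pure-Ruby illustration of the loop's arithmetic; the helper name and batch size are hypothetical, standing in for the real `DELETE ... WHERE updated_at < ... LIMIT 10000`:

```ruby
# Simulates issuing "DELETE ... LIMIT batch_size" repeatedly;
# returns how many passes a given backlog of stale rows would take.
def batches_needed(stale_rows, batch_size)
  passes = 0
  while stale_rows > 0
    deleted = [batch_size, stale_rows].min # rows one LIMITed DELETE removes
    stale_rows -= deleted
    passes += 1
  end
  passes
end

puts batches_needed(25_000, 10_000) # => 3
```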
After 7 hours this is still ongoing in staging. Clearly this is not how we want to do it in production, in one single batch. Another thing: after deletion the table will be fragmented, and the file size of the rsessions table will not decrease. The table needs to be dumped and recreated in order to release server resources.
My plan for doing this is the following:
where updated_at > DATE_SUB(NOW(), INTERVAL 7 DAY)
I'll try this in the stable staging instance.
Ok, awesome Sebastian - and I would guess that this may have positive implications for the expected improvements to our db performance after this mitigation is complete, if even flushing this table can take this long...
Bringing in @cesswairimu so she can try her query on stable again when @icarito is finished. That should tell us if the issues in #5917 are only related to #5490 (which is fixed) or if they are also related to #5524.
unstable is still deleting...
Leaving some notes here while doing testing in the stable staging instance.
root@tycho:/srv/plots_staging/plots2# time docker-compose exec db bash -c "mysqldump --databases plots --tables rsessions --where='updated_at > DATE_SUB(NOW(), INTERVAL 7 DAY)' -h 127.0.0.1 -u plots --password=plots" > /tmp/rsessions.sql
MariaDB [plots]> rename table rsessions to rsessions_prob;
Mysql2::Error: Table 'plots.rsessions' doesn't exist: SELECT `rsessions`.* FROM `rsessions` WHER...
root@tycho:/srv/plots_staging/plots2# time cat /tmp/rsessions.sql | docker-compose exec -T db bash -c "mysql -h 127.0.0.1 -u plots plots --password=plots"
MariaDB [plots]> drop table rsessions_prob;
Query OK, 0 rows affected (2.75 sec)
Tested logging in at https://stable.publiclab.org...
I'm ready to try this in production!
unstable is still deleting...
Doing operation on live production database:
MariaDB [plots]> drop table rsessions_prob;
Query OK, 0 rows affected (43.39 sec)
Tested https://publiclab.org - session was retained!
:tada:
mitigation done! Hopefully this will free us!
I'll leave it for tonight, site looks speedy to me... :stuck_out_tongue_closed_eyes: hopefully this is it!
OK, so mitigations list would be:
- [x] reduce process pool
- [ ] move db to google cloud db solution
- [x] reduce rsessions
- [ ] switch to memcached
Hmm, it was very fast this morning, but overall I don't see a huge difference! 😞
Nooooooooooooo! Well, there's only one other explanation and that's ghosts. I'll open up another issue and look into finding an exorcist or ghostbusters gem.
I think actually there's been improvement on I/O use, because using a 30GB table is heavy. If you look closely, the peaks seem related to StatsController... maybe we could do the stats work on staging? I can make it copy the production database regularly, say weekly?
Hey @icarito, I was wondering if you could answer some "educational" questions for me:
if you look closely the peaks seem related to Statscontroller...
Why would this be? Due to the caching? I can only think of three people who would be using it and I'm one of them and I haven't been.
maybe we could do the stats work on staging?
I've been hearing...er...seeing you use the word "staging" a lot lately. What is that and how does it play into the site/workflow? If it's a part of the docs, let me know which one and I'll take a crack at understanding it first.
I can make it copy production database regularly say weekly?
I think that'd be good. It's not so much that the freshest data are important, but between the Q&A system being changed and the recent tags migration, I suppose weekly is a good idea since it will catch any structural changes as they come in. @cesswairimu, what do you think?
This was a really awesome thread to read. Yeah, it's a great idea having the stats in staging, and copying weekly is fine too :+1:
I have had this thought of, in the future, making the stats queries a script that creates a SQL view, deleted and recreated daily or weekly by a job - maybe this could live in staging also. Would like to hear your thoughts on this, and whether it could help with the memory leaks in any way.
Hey @icarito, can we increase the RAM of the server? Maybe that'll help in speeding up the website until we improve our query response rate?
Thanks!
Thanks for your replies! I am thankful for the work you are doing, and for reading through our efforts and replying to this issue! I don't want to sound accusing or anything - I'm just looking at the data and trying to improve our site's reliability.
For instance we got a peak this morning: https://www.skylight.io/app/applications/GZDPChmcfm1Q/1560920940/5m/endpoints
We also see peaks every night (6AM UTC) on backup for a couple of hours.
Regarding staging and production, currently we have three instances:
| Instance | URL | Build log | Workspace |
|----------|-----|-----------|-----------|
| unstable | https://unstable.publiclab.org/ | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Unstable/ | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Unstable/ws/ |
| stable | https://stable.publiclab.org/ | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Stable/ | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Stable/ws/ |
| production | https://publiclab.org/ | n/a | n/a |
You are right that, documentation-wise, we should do a better job describing this process. Currently I found some docs here: https://github.com/publiclab/plots2/blob/master/doc/TESTING.md#testing-branches but it's not clear at all that these instances build when we push to those branches.
The database is currently updated manually every so often but it should be simple to automate it now that we have daily database dumps. I will set it up and ping you!
This doesn't mean we shouldn't implement more solutions, next I think a threaded webserver (Puma) could help!
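Switching to a threaded server would mostly be a config/puma.rb change; a minimal sketch, where the worker and thread counts are assumptions that would need tuning against the RAM budget discussed above:

```ruby
# config/puma.rb sketch: fewer forked processes (each ~app-sized in RAM),
# with threads inside each sharing that process's memory.
workers 2
threads 5, 5       # min, max threads per worker
preload_app!       # load the app before forking, for copy-on-write savings
```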
That is a good question! We are in the process of moving our hosting to a new provider and were hoping to deploy as a container cluster in the new hosting provider.

Since running in containers isn't immediately trivial (because our app container isn't immutable), an alternative to start is that we could move the database first to make room.

I don't think we should increase our hosting usage in our current host as we are barely within our allowed quota, but @jywarren can confirm?

Thanks for your work!
Actually, I wonder if we could temporarily boost the RAM in that container until we do the move, and if it would help short term. I think we'd be OK with that cost increasing!
Oh, @icarito, no, no, I didn't sense any accusation, not at all. I read, "this is what's happening" and I was just saying "that's odd, why would it be doing that if no one was on it...?" Along the same lines, I didn't mean to imply the documentation was poor. Only that you didn't have to explain it if there was any.
And hey, it's not an entirely unfounded accusation : ) although I am having a bit of fun pretending that I've been framed and I've gone underground and have to prove my innocence but that's a whole other screenplay that I'm working on.
Thankfully these lurid and baseless accusations ; ) on both our parts have been cleared up and we can get back to the business at hand.
Related question: Why would the stats controller be active if no one was using it or is that the mystery?
Regarding the staging, thanks for the explanation. To make sure I've got it, is saying...
I'll try this in stable staging instance.
...interchangeable with saying, "I'll try this on stable.publiclab.org"?
To the stable.publiclab.org Q -- yes! And that's built off of any push to the master branch - hope that helps!
@jywarren, yup! Got it now. Thank you!
Thanks for the clarification @skilfullycurled !
It is indeed a mystery why StatsController is so active.
Brief moments ago we had another peak that knocked us down for a few minutes:
The trigger in this case was actually the Full Text Search.
But one can see that even in this brief timeslice (3 min), StatsController was called 21 times.
I think this may be significantly affecting our baseline performance. If this usage is not known, then perhaps crawlers are hitting these endpoints? Maybe a robots.txt rule or some access control would fix it?
@jywarren thanks for the clarification, I'll look into doing it asap then.
Actually, here are the StatsController details for the previous timeslice:
Shall we robots.txt all stats routes? So /stats* basically?
OK, I did, and also exempted /api/* - we had already blocked /stats/range* but now it's all /stats*:
https://github.com/publiclab/plots2/commit/aa93dc3465b0cbaaee41ac7bec5e690437a27f5d
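The rule amounts to something like this in the robots.txt (a sketch for context; see the commit above for the exact deployed file):

```
User-agent: *
Disallow: /stats
Disallow: /api/
```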
So you don't think it's the caching?
The cache is use-generated; that is, it regenerates only when a) it has expired, AND b) a new request comes in. So something has to be requesting it for the cache to regenerate. If I can resolve a couple of unrelated issues and merge their PRs, I'll start a new publication to production tonight (otherwise tomorrow), and we can see if the robots.txt change helps at all.
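To make the "use-generated" behavior above concrete, here's a minimal plain-Ruby sketch of that pattern. It is illustrative only, not the actual plots2 caching code (which uses Rails caching); the class and method names are made up for the example:

```ruby
# A cache entry is recomputed only when a request arrives AND the
# entry has expired -- with no incoming requests, nothing regenerates.
class LazyCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # Returns the cached value, running the block only on a miss
  # or after expiry.
  def fetch(key, expires_in:)
    entry = @store[key]
    if entry.nil? || Time.now >= entry.expires_at
      value = yield
      @store[key] = Entry.new(value, Time.now + expires_in)
      value
    else
      entry.value
    end
  end
end

cache = LazyCache.new
calls = 0
2.times { cache.fetch(:stats, expires_in: 60) { calls += 1; "expensive stats" } }
# The block runs only once; the second call is a cache hit.
```

This is why a crawler hammering the stats endpoints matters: each expired entry it touches triggers a fresh (expensive) regeneration that would otherwise never happen.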
StatsController is called 5.5 times per minute.
via @icarito - so on tonight's update we can see if robots.txt changes help this.
Hey @jywarren, I saw that the robots.txt update commit was pushed to stable a few days ago. Have you noticed any improvement?
Yes, would love an update! Not sure I grabbed the correct data, but here are some images from Skylight of before the commit, after the commit, and the last ~24 hours. The red line indicates when the commit was made. On the surface it looks like the answer is yes, but it may not be significant, or I might be interpreting the data incorrectly.
Yes, I think a full analysis would be great. But the short answer is that we've almost halved our average problem response time for all site requests, from 5.5+ down to 3 or less. It's really a huge improvement. It was a combination of a) almost doubling RAM, from 8 GB to 15 GB, b) blocking a marketing bot in robots.txt, and c) blocking it in the nginx configs as well (I think by IP address range). The tough part is knowing how much the bot/StatsController was part of it, because we didn't want to hold back the overall site upgrade.
The timing was:
In any case we're doing really well now. Load average is <4 instead of ~8, and we have 6 CPUs instead of 4.
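For anyone curious what the nginx-level block might have looked like, here's a hypothetical sketch. The bot's actual IP range isn't recorded in this thread, so the CIDR below is a documentation placeholder (TEST-NET-1), not the real range:

```nginx
# Hypothetical snippet: deny a misbehaving bot by IP range
# using nginx's ngx_http_access_module.
server {
    listen 80;
    server_name publiclab.org;

    deny 192.0.2.0/24;   # placeholder range, not the real one
    allow all;

    # ... rest of the site config ...
}
```

Unlike robots.txt, which a bot can simply ignore, a deny rule at the nginx layer rejects the requests before they ever reach the Rails app.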
Closing this now!