TL;DR The issue
Since its recent rewrite to scala3, Lichess uses an order of magnitude more CPU than usual; however, the problem takes a while to kick in and shows unusual patterns.
The core of Lichess is a monolith written in scala.
It serves hundreds of pages per second, performs hundreds of chess moves per second,
while doing lots of other things. It's a big program running on a big server with plenty of CPU and memory.
Until now, lila was written in scala 2.13.10, and it was working well and performing well. We will call it lila2.
About a month ago, I started migrating it to scala 3.2.1 – we will call it lila3. It was fun and I could write about it… after the problems are solved.
On 2022-11-22 at 7AM UTC, I deployed lila3 to production, and it worked rather well, and performed rather well.
During the next days, things went smoothly as I deployed a new version every morning.
But when I let it run for more than 24h, the CPU usage started rising dramatically. It became evident that lila3 could not handle more than 48h of uptime.
In comparison, lila2 could easily stay up for 2 consecutive weeks.
lila2 CPU usage, each spike is 24h. Vertical red lines are deploys/restarts.
lila3 CPU usage, each spike is 24h. CPU usage increases on the second day of runtime.
A closer look at CPU usage
The CPU usage is not regular, even during the worst times. Instead, it is normal for a minute or two; then goes crazy for a minute or two. Then back to normal.
Close-up look. Around 20% is normal CPU usage. We see it spike up to 80% for extended periods – that is abnormal.
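One way to see which threads are burning CPU during such a spike is to sample per-thread CPU time with the standard `ThreadMXBean`. This is only a minimal sketch of the idea, not something lila runs; the class name, the 1-second window, and the reporting format are mine:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.getAllThreadIds();
        long[] before = new long[ids.length];
        for (int i = 0; i < ids.length; i++) before[i] = mx.getThreadCpuTime(ids[i]);

        Thread.sleep(1000); // sample window: 1 second

        for (int i = 0; i < ids.length; i++) {
            long after = mx.getThreadCpuTime(ids[i]);
            if (before[i] < 0 || after < 0) continue; // thread died or CPU time unsupported
            long ms = (after - before[i]) / 1_000_000;
            ThreadInfo info = mx.getThreadInfo(ids[i]);
            if (info != null && ms > 0)
                System.out.println(info.getThreadName() + ": " + ms + " ms CPU in the last second");
        }
        System.out.println("sampled " + ids.length + " threads");
    }
}
```

This samples its own JVM; against a live server you would expose it via JMX, or simply take a few `jstack <pid>` dumps while the spike is ongoing and compare the stacks.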
Memory pressure / GC issue?
I don't think the garbage collector is to blame. During the worst times, when the JVM almost maxed out 90 CPUs, the GC was only using on the order of 3s of CPU time per minute.
Allocations and memory usage also look rather normal.
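The GC-time-per-minute figure can be cross-checked from inside the JVM with the standard collector beans. A minimal sketch (the class name `GcReport` is mine; it prints cumulative numbers, so diff two samples a minute apart to get a rate):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcReport {
    public static void main(String[] args) {
        long totalMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // counts and times are cumulative since JVM start
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                + " collections, " + gc.getCollectionTime() + " ms");
            totalMs += gc.getCollectionTime();
        }
        System.out.println("total GC time ms: " + totalMs);
    }
}
```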
What about exceptions or log messages?
None that are related. We have extensive logging in lila and no new errors are popping up. No timeouts either.
I had to revert to the scala2 version of Lichess. That is very sad, because the gap between lila2 and lila3 is huge, and every day we spend with lila2 in production makes it bigger.
At this point I don't know how to diagnose and fix the problem with lila3.
If you know about large JVM deployments and have a theory about what could be causing this, please let me know.
You can use the Lichess programming discord channel or our email address.
Please focus on the problem at hand and its speedy resolution. I am not looking for long-term architecture discussions right now; we will do these after production is healthy again.
I will update this post with more info and data, as people ask for them.
For monitoring data, thread dumps and GC logs, see the top of this post.
lila2 and lila3 are compiled with java 17, and the prod server runs the same JVM:
# java -version
openjdk version "17" 2021-09-14
OpenJDK Runtime Environment (build 17+35-2724)
OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
# uname -a
Linux manta 5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
JVM args are the same as before:
-Xms30g -Xmx30g -XX:+UseG1GC
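For anyone reproducing this, unified JVM logging can record GC activity to a file alongside the flags above. The exact logging selector here is my suggestion, not what the server currently runs:

```
-Xms30g -Xmx30g -XX:+UseG1GC -Xlog:gc*:file=gc.log:time,uptime
```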
lila2 runs scala 2.13.10, lila3 runs scala 3.2.1.
SBT dependencies can be found in the lila github repo.