[World Community Grid] hit parade for November 11
News from Canada:
November 11, 2025
- Database maintenance over Friday/Saturday completed without issue. We have resolved an issue with the backup scripts, effectively increased the memory used to service database queries, and added some new indices. We expect better performance from the BOINC database going forward.
- However, the disk remains slower than in the initial benchmarking when we stood up the database. We will monitor and reach out to hosting about the Ceph placement group expansion, which caused the stuck blocks on that particular disk when the placement group the result table lives on got stuck in a "peering" state. We were informed that we should expect temporary, possibly intermittent slow IO during this Ceph maintenance window. If we can get faster disks for the BOINC database (which would require restoring the database to a new volume, as we did to migrate), we will consider a maintenance window. Right now, we are optimistic that the issues revealed in the new system by hanging database queries and database crashes can all be resolved with patches to the new BOINC daemons, and that current performance will be sufficient.
- As mentioned, this event identified several issues with the new BOINC daemons.
- MCM1 workunit creation proceeds in the Kafka topic even though the database is down. The mcm1_create_work daemon for its Kafka partition on science01...science06 tries to commit its part of the batch; the database isn't there, so it doesn't do anything, but it does commit its offset/pointer into the batch plan topic and moves on to consume the next batch plan. That means that every 10-15 minutes while the database is down, a batch is effectively skipped. We were able to fix that, and restarted MCM1 batch creation at roughly 5:00 p.m. EST on November 10th, 2025.
- We believe we have finally architected a fix for the pending validation backlog issue. This requires some non-trivial plumbing in the MCM1 batch assimilator, a Kafka connector deployed on the BOINC database node, and transitioner code changes.
- Workunit supply may remain artificially lower while we roll out the new batch assimilator builds and monitor the transitioner -> Kafka event consumption and result table interaction.
- We were able to resolve the issue with computing preferences not being updated from the website to the BOINC client and vice versa. Generally, when the BOINC database goes down, so does the event listener that handles these messages on the webserver.
- We are still working on resolving the validation backlog from over the break. With the result table bricked during the Ceph maintenance, we architected a "trust the filesystem" solution, and we are hopeful that this issue will be resolved this week.
- MAM1 was initially planned to be resumed in beta30 last week, to see if 7.07 fairly schedules work and respects --nthreads, which is a blocking issue in promoting the beta application to production. Depending on the error rate and behaviour on BOINC clients, we would then consider the stable code paths for the first production batches. Given our increased control over batch parameters with the new Kafka topic, which uses a protobuf schema to fill out the workunit and result table entries, we intend to run work in production on Linux as soon as the beta30 application is stable with an error rate lower than MCM1, excepting the GLIBC dependency, which is typically the only repeated error we see from clients on the current LibTorch code path. We will then rely on iterating the beta30 application to 7.08 and 7.09 to get GPU and Windows support, and Parquet IO for input and uploaded results.
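The update does not show how the skipped-batch issue in mcm1_create_work was actually fixed. As a rough illustration of the general pattern it describes (only advance the offset into the batch plan topic after the database write has succeeded), here is a minimal sketch. It assumes in-memory stand-ins instead of a real Kafka client or the BOINC database; `FakeDatabase`, `FakePartition` and `create_work_once` are hypothetical names and not part of the WCG code.

```python
from dataclasses import dataclass, field


class DatabaseDown(Exception):
    """Raised by the stand-in database when it is unreachable."""


@dataclass
class FakeDatabase:
    """In-memory stand-in for the BOINC database (hypothetical)."""
    up: bool = True
    workunits: list = field(default_factory=list)

    def insert_batch(self, batch_id: str) -> None:
        if not self.up:
            raise DatabaseDown(batch_id)
        self.workunits.append(batch_id)


@dataclass
class FakePartition:
    """In-memory stand-in for one partition of the batch plan topic."""
    plans: list
    committed: int = 0  # offset of the next plan to consume

    def poll(self):
        """Return (offset, plan) for the next uncommitted plan, or None."""
        if self.committed >= len(self.plans):
            return None
        return self.committed, self.plans[self.committed]

    def commit(self, offset: int) -> None:
        """Mark everything up to and including `offset` as consumed."""
        self.committed = offset + 1


def create_work_once(partition: FakePartition, db: FakeDatabase) -> bool:
    """Process one batch plan; commit the offset only after the DB write.

    Returns True if a batch was created, False if nothing was done.
    """
    record = partition.poll()
    if record is None:
        return False
    offset, plan = record
    try:
        db.insert_batch(plan)
    except DatabaseDown:
        # Do not commit: the same plan is polled again on the next attempt,
        # so no batch is skipped while the database is down.
        return False
    partition.commit(offset)
    return True
```

With the database marked as down, `create_work_once` returns False and leaves `committed` unchanged, so the same batch plan is picked up again once the database is back. The behaviour described in the update corresponds to calling `partition.commit(offset)` regardless of whether the insert succeeded.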
Links
World Community Grid homepage
World Community Grid stats for all teams
World Community Grid stats for DPC
World Community Grid stats for DPC for BOINC clients
Source: DPCH
[ 11% edited by KuuKe on 12-11-2025 06:11 ]
Kuuke's Sterrenbeelden | 英俊的兔子