[World Community Grid] hit parade of week 42



  • KuuKe — Registered: July 2001 — Not online
    Moderator DPC — professioneel gifmenger — Topic starter
Co-authors:
  • Jis — Registered: January 2001 — Last online: 01-11 14:41
  • AtHomer — Registered: February 2001 — Not online
  • Suicyder — Registered: May 2002 — Last online: 20:26

Table of contents: World Community Grid hit parades week 42
Go to the World Community Grid website
  • DPC World Community Grid hit parade of 12 October 2025 - no stats
  • DPC World Community Grid hit parade of 13 October 2025 - no stats
  • DPC World Community Grid hit parade of 14 October 2025 - no stats
  • DPC World Community Grid hit parade of 15 October 2025 - no stats
  • DPC World Community Grid hit parade of 16 October 2025 - no stats
  • DPC World Community Grid hit parade of 17 October 2025 - no stats
  • DPC World Community Grid hit parade of 18 October 2025 - no stats
Go to the DPC website
TEAM generated with: Weektopic TEAM Generator

Kuuke's Sterrenbeelden | 英俊的兔子


  • KuuKe — Moderator DPC — Topic starter
October 13, 2025
Happy Thanksgiving to our Canadian volunteers and partners.
Work on finishing deployment setup will resume tomorrow.



  • Jis
October 15, 2025
- Testing the validators right now; there have been a lot of iterations on these.
- As soon as the validator works, we will deploy across the six partitions and clear the backlog. Then we can check the transitioner interaction. If that is all good, we can finally start sending new work.
- Going to finalize object storage for the archive, instead of the previous tape backup.
https://www.cs.toronto.edu/~juris/jlab/wcg.html

https://u24.gov.ua/


  • KuuKe — Moderator DPC — Topic starter
Ah, another update:
October 18, 2025
We are sending out small batches of workunits starting tonight, with batch IDs in the range 9999900+ for MCM1, to test the new distributed, partition-aware, batch-upserting, app-specific create_work daemons. The few volunteers who get these workunits before we start releasing larger batches (as we gain confidence that the new system is working as expected) may notice that these workunits have a much smaller number of signatures and run much faster than normal. These are still meaningful workunits, but key parameters, such as the number of signatures to test per workunit, were reduced so we could get feedback quickly.
Similar to ARP1, we have moved all workunit templating and preparation to the WCG servers for MCM1. We did this for the MAM1 beta (beta30) already, but here we were able to move the rendering of workunit templates per batch directly into the create_work daemon's C++ code, where it consumes a protobuf schema from Kafka/Redpanda's schema registry and hydrates it to produce all workunits for the batch, according to the desired parameters it consumes from the "plan" topic via Kafka. Hence "app-specific" above. It then updates the BOINC database in bulk instead of calling BOINC's create_work() function. Metadata is local, partitioned, and replicated in Kafka for durability; each batch writes files to that node's 1/6th of the buckets from the BOINC dir_hier fanout directory and commits its 1/6th of the batch records to the database in non-overlapping ranges, per 10k workunits per batch.
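The bucket-to-node split described above can be sketched as follows. This is a minimal sketch, assuming a dir_hier-style scheme that hashes the filename (here via MD5) and reduces it modulo the fanout, with buckets then assigned round-robin to nodes; the fanout value, the exact hash, and the assignment rule WCG uses are assumptions, not confirmed details:

```python
import hashlib

FANOUT = 1024   # assumed uldl_dir_fanout value for illustration
NODES = 6       # the six partitions mentioned in the update

def dir_hier_bucket(filename: str, fanout: int = FANOUT) -> int:
    """Pick a fanout bucket for a file, dir_hier-style: hash the
    filename and reduce the leading hash bytes modulo the fanout."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return int(digest[:8], 16) % fanout

def node_for_bucket(bucket: int, nodes: int = NODES) -> int:
    """Assign each bucket to one node so every node owns a
    non-overlapping 1/6th of the buckets."""
    return bucket % nodes
```

Because the mapping is a pure function of the filename, every component (create_work daemon, upload handler, validator) can compute a file's owning node independently, with no shared lookup table.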
The new validators are working and deployed. In our new distributed, partitioned approach, validators process ONLY the workunits local to their host. Uploads are partitioned according to the fanout directory assigned by BOINC and routed by HAProxy to the backend node corresponding to those BOINC fanout buckets. We split the buckets between nodes: instead of using them to fan out across the filesystem (avoiding massive numbers of files in a single BOINC upload path), we fan out across the cluster. We read and write these buckets in tmpfs, so Apache serves downloads and accepts uploads in memory, and validators read in memory, while Kafka/Redpanda gets a copy of the uploads in a disk-persisted, replicated topic for durability. If a node goes down and we lose the in-memory cache of downloads and uploads, we can replay and recover.
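The tmpfs-plus-replicated-topic durability model above can be illustrated with a toy sketch. `PartitionStore`, `put`, and `recover` are hypothetical names invented for illustration; a real deployment uses actual tmpfs mounts and Kafka/Redpanda replication, which behave very differently from a Python dict and list:

```python
class PartitionStore:
    """Toy model of one node's storage: uploads live in a fast
    in-memory cache (standing in for tmpfs), and every write is also
    appended to a durable log (standing in for a replicated,
    disk-persisted Kafka topic)."""

    def __init__(self, durable_log: list):
        self.cache = {}          # in-memory upload buckets (tmpfs analog)
        self.log = durable_log   # replicated topic analog, survives the node

    def put(self, path: str, data: bytes) -> None:
        self.cache[path] = data          # fast path used by Apache/validators
        self.log.append((path, data))    # durable copy for recovery

    def recover(self) -> None:
        """After a node restart the in-memory cache is empty;
        rebuild it by replaying the durable log."""
        self.cache = {path: data for path, data in self.log}
```

The design choice this models: the hot path never touches disk, and durability is delegated entirely to the replicated log, so recovery is a replay rather than a filesystem restore.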
Each validator subscribes to a Kafka topic containing the count of uploads (a reduction over the upload events emitted to Kafka topics by the new file_upload_handlers, covering only the local buckets of that partition) and the file locations pertaining to a pair of workunits, and emits success or failure to another queue for downstream "assimilation". We have written and are testing a batch applier that collects successful validation events on each partition and batch-updates the BOINC database, so that the transitioner and scheduler can work together to evaluate the state of those workunits. Once we are confident that the batch updates from the applier work as expected, users should start seeing workunits pending validation clear to valid.
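A highly simplified sketch of that pair-validation and bulk-apply flow follows. `validate_pairs`, `batch_apply`, and `bulk_mark_valid` are hypothetical names; real validators compare scientific result files rather than raw values, and consume from Kafka topics rather than an in-memory list:

```python
from collections import defaultdict

def validate_pairs(upload_events):
    """Collect upload events per workunit; once both results of a
    pair have arrived, compare them and emit a validation event."""
    pending = defaultdict(list)
    validations = []
    for wu_id, result in upload_events:
        pending[wu_id].append(result)
        if len(pending[wu_id]) == 2:          # quorum of two reached
            a, b = pending[wu_id]
            status = "valid" if a == b else "invalid"
            validations.append({"workunit": wu_id, "status": status})
    return validations

def batch_apply(validation_events, db):
    """Applier side: gather successful validations and update the
    database in one bulk write instead of per-workunit calls."""
    valid_ids = [e["workunit"] for e in validation_events
                 if e["status"] == "valid"]
    db.bulk_mark_valid(valid_ids)
```

Batching the database writes per partition is what keeps the transitioner/scheduler interaction cheap: one bulk update replaces thousands of individual row updates.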
We are not running file_deleter or db_purge at the moment; they need to be rearchitected to match the new setup, or at minimum assessed to make sure it makes sense to start them unchanged. We have no concerns about running out of space in the database or on disk at the moment, only about making mistakes, so we will get around to assessing what, if anything, needs to change about file_deleter and db_purge soon, but not now. Likely they will also take advantage of per-workunit event data from Redpanda/Kafka instead of just talking to the BOINC database, and will operate on local partitions across the cluster. Because we are producing events for every workunit's full lifecycle to Kafka topics, we have a level of visibility and control we were never able to achieve with the legacy system. We were also able to set up prometheus node_exporter, tap into Docker stats endpoints per node across the cluster, and do likewise for Redpanda/Kafka with the helpful https://github.com/redpanda-data/observability repo, to get a Grafana dashboard going that will let us do many things, such as serve up server status pages and improve the stats pages.
https://www.cs.toronto.edu/~juris/jlab/wcg.html
