From cf32c72b8fa0ee8bdc309005296cda0704fa792d Mon Sep 17 00:00:00 2001
From: "Peter J. Holzer"
Date: Sun, 27 Nov 2022 10:19:37 +0100
Subject: [PATCH] Think about scheduling measurements and processing them

---
 doc/multiqueue      | 35 +++++++++++++++++++++++++++++++++++
 doc/processing.pipe | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+)
 create mode 100644 doc/multiqueue
 create mode 100644 doc/processing.pipe

diff --git a/doc/multiqueue b/doc/multiqueue
new file mode 100644
index 0000000..67a231f
--- /dev/null
+++ b/doc/multiqueue
@@ -0,0 +1,35 @@
+1 Each job has a period p and a maximum delay d (<= p)
+
+At startup we start every job serially
+    (potential problem: What if this takes longer than the minimum
+    period? We could sort the jobs by p ascending)
+Alternatively enqueue every job at t=now
+    (potential problem: This may clump the jobs together more than
+    necessary)
+
+In any case:
+
+If a job just finished and there are runnable jobs, we start the next
+one in the queue.
+
+At every tick (1/second?) we check whether there are runnable jobs.
+For each runnable job we compute an overdue score «(now - t) / d».
+If the maximum score is >= random.random() we start that job.
+This is actually incorrect. Need to adjust for the ticks. Divide the
+score by «d / tick_length»? But if we do that we have no guarantee that
+the job will be started with a delay of at most d. We need a function
+which exceeds 1 at this point.
+«score = 1 / (t + d - now)» (in ticks) works. It's a uniform
+distribution, which is probably not ideal. I think I want the CDF to
+rise steeper at the start. But I can adjust that later if necessary.
+
+We reschedule that job:
+    at t + p?
+    at now + p?
+    at x + p where x is computed from the last n start times?
+    I think this depends on how we schedule them initially: If we
+    started them serially they are probably already well spaced out, so
+    t + p is a good choice. If we scheduled them all immediately, it
+    isn't. The second probably drifts most. The third seems reasonable
+    in all cases.
+
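+A rough Python sketch of the tick loop described above (the Job class,
+its attribute names, and TICK are placeholders, nothing is settled yet):
+
+    import random
+    import time
+
+    TICK = 1.0  # tick length in seconds
+
+    class Job:
+        def __init__(self, run, period, max_delay):
+            self.run = run      # callable that performs the measurement
+            self.p = period     # period p
+            self.d = max_delay  # maximum delay d (<= p)
+            self.t = time.time()    # next scheduled start time
+
+    def tick(jobs):
+        now = time.time()
+        runnable = [j for j in jobs if j.t <= now]
+        if not runnable:
+            return
+        def score(j):
+            # 1 / (number of ticks left until the deadline t + d);
+            # reaches 1 when only one tick is left, so the job is
+            # guaranteed to start with a delay of at most d.
+            remaining = (j.t + j.d - now) / TICK
+            return 1.0 if remaining <= 1 else 1.0 / remaining
+        job = max(runnable, key=score)
+        if score(job) >= random.random():
+            job.run()
+            job.t += job.p      # reschedule at t + p (see above)
+
+    def loop(jobs):
+        while True:
+            tick(jobs)
+            time.sleep(TICK)
+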
diff --git a/doc/processing.pipe b/doc/processing.pipe
new file mode 100644
index 0000000..2f0d311
--- /dev/null
+++ b/doc/processing.pipe
@@ -0,0 +1,48 @@
+have a list of dependencies
+    should probably be specified in a rather high level notation with
+    wildcards and substitutions and stuff.
+
+    E.g. something like this:
+        measure=bytes_used -> measure=time_until_full
+    This would match all series with measure=bytes_used and use them to
+    compute a new series with measure=time_until_full and all other
+    dimensions unchanged.
+
+    That looks too simple, but do I need anything more complicated?
+    Obviously I also need to specify the filter. And I may not even need
+    to specify the result as the filter can determine that itself.
+    Although that may be bad for reusability.
+    Similarly for auxiliary inputs. In the example above we also need the
+    corresponding measure=bytes_usable timeseries. The filter can
+    determine that itself, but maybe it's better to specify that in the
+    rule?
+
+    At run-time we expand the rules and just use the ids.
+
+I think we want to decouple the processing from the data acquisition,
+so the web service should just write the changed timeseries into a
+queue: Touch a file with the key as the name in a spool dir. The
+processor can then check if there is anything in the spool dir and
+process it (a rough sketch is at the end of these notes). The result
+of the filter is then again added to the spool dir (make sure there
+are no circular dependencies! Hmm, that's up to the user I guess? Or
+each generated series could have a rate limit?)
+
+In addition to filters (which create new data) we also need some kind of
+alerting system. That could just be a filter which produces no data but
+does something else instead, like sending an email. So I'm not sure
+whether it makes sense to distinguish these.
+
+We should record all the used inputs (recursively) for each generated
+series (do we actually want to store the transitive closure or just the
+direct inputs? We can expand that when necessary). Doing that for each
+datapoint is overkill, but we should mark each input with a "last seen"
+timestamp so that we can ignore or scrub inputs which are no longer
+used.
+
+Do we need a negative/timeout trigger? I.e. if a timeseries which is
+used as an input is NOT updated in time, trigger the filter anyway so
+that it can take appropriate action? If we have that, how do we filter
+out obsolete measurements? We don't want to get alerted that we haven't
+gotten any disk usage data from a long-discarded host for 5 years. For
+now I think we rely on other, still-active checks to fail if a
+measurement fails to run.
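+
+A rough Python sketch of the spool-dir queue described above (SPOOL_DIR,
+the Filter class and the way a filter is invoked are placeholders,
+nothing is settled yet):
+
+    import time
+    from pathlib import Path
+
+    SPOOL_DIR = Path("spool")   # placeholder location
+
+    def enqueue(key):
+        # The web service (or a filter) marks a series as changed by
+        # touching a file named after the series key.
+        SPOOL_DIR.mkdir(exist_ok=True)
+        (SPOOL_DIR / key).touch()
+
+    class Filter:
+        def __init__(self, inputs, output, func):
+            self.inputs = inputs    # expanded input series keys
+            self.output = output    # series key it produces
+            self.func = func        # callable computing the new datapoints
+
+    def process_spool(filters):
+        # One pass: for every key in the spool dir run all filters which
+        # use it as an input and enqueue their outputs in turn.
+        SPOOL_DIR.mkdir(exist_ok=True)
+        for entry in list(SPOOL_DIR.iterdir()):
+            key = entry.name
+            entry.unlink()      # claim the work item
+            for f in filters:
+                if key in f.inputs:
+                    f.func(key)         # reads/writes the series store
+                    enqueue(f.output)   # beware of cycles (see above)
+
+    def processor_loop(filters):
+        while True:
+            process_spool(filters)
+            time.sleep(1)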