Think about scheduling measurements and processing them
parent dd1b3056ac, commit cf32c72b8f
@@ -0,0 +1,35 @@
Each job has a period p and a maximum delay d (<= p)

At startup we start every job serially
(potential problem: What if this takes longer than the minimum
period? We could sort the jobs by p asc)
Alternatively enqueue every job at t=now
(potential problem: This may clump the jobs together more than
necessary)
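A rough sketch of the two startup options (Python; the Job fields and the serial/immediate helpers are placeholders, not a fixed design):

    import time

    class Job:
        def __init__(self, name, period, max_delay, run):
            assert max_delay <= period
            self.name = name
            self.p = period      # period p
            self.d = max_delay   # maximum delay d (<= p)
            self.run = run       # the measurement callable
            self.t = None        # next scheduled start time

    def start_serially(jobs):
        # run every job once at startup, one after the other; sorting by
        # p ascending so short-period jobs are not starved if the whole
        # pass takes longer than the minimum period
        for job in sorted(jobs, key=lambda j: j.p):
            job.run()
            job.t = time.time() + job.p

    def enqueue_all_now(jobs):
        # alternative: make every job runnable immediately (may clump
        # the jobs together more than necessary)
        now = time.time()
        for job in jobs:
            job.t = now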
In any case:
If a job just finished and there are runnable jobs, we start the next
one in the queue.

At every tick (1/second?) we check whether there are runnable jobs.
For each runnable job we compute an overdue score «(now - t) / d».
If the maximum score is >= random.random() we start that job.
This is actually incorrect. We need to adjust for the ticks. Divide the
score by «d / tick_length»? But if we do that we have no guarantee that
the job will be started with at most d delay. We need a function which
exceeds 1 at this point.
«score = 1 / (t + d - now)» works. It's a uniform distribution, which is
probably not ideal. I think I want the CDF to rise steeper at the start.
But I can adjust that later if necessary.
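A minimal sketch of the per-tick check with the corrected score (Python, reusing the Job fields from the sketch above; the tick length is assumed to be one second and `start` is whatever actually runs and then reschedules the job):

    import random
    import time

    TICK = 1.0  # tick length in seconds (the "1/second" above)

    def overdue_score(job, now):
        # «score = 1 / (t + d - now)», measured in ticks: it stays below 1
        # while more than one tick remains and exceeds 1 once the deadline
        # t + d is within one tick, so the delay never exceeds d.
        remaining = job.t + job.d - now
        return float("inf") if remaining <= 0 else TICK / remaining

    def on_tick(jobs, start):
        now = time.time()
        runnable = [j for j in jobs if j.t is not None and j.t <= now]
        if not runnable:
            return
        best = max(runnable, key=lambda j: overdue_score(j, now))
        if overdue_score(best, now) >= random.random():
            start(best)  # run it and reschedule (see the options below)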
We reschedule that job.
at t + p?
at now + p?
at x + p where x is computed from the last n start times?
I think this depends on how we schedule them initially: If we
started them serially they are probably already well spaced out, so
t + p is a good choice. If we scheduled them all immediately, it
isn't. The second probably drifts the most. The third seems reasonable
in all cases.
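One way to compute x from the last n start times, as a sketch (the averaging scheme and the choice of n are my own, the notes leave this open; assumes at least one recorded start):

    def next_start(starts, p, n=5):
        # option three: x + p, with x derived from the last n actual
        # start times; subtracting i*p lines the old starts up on the
        # current phase, so averaging smooths out individual delays
        # instead of letting them accumulate (now + p) or preserving a
        # bad initial spacing forever (t + p)
        recent = list(starts)[-n:]
        x = sum(s - i * p for i, s in enumerate(recent)) / len(recent)
        return x + len(recent) * p

With starts spaced exactly p apart this reduces to «last start + p»; a single late start only shifts the next one by a fraction of the delay.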
@@ -0,0 +1,48 @@
Have a list of dependencies.
Should probably be specified in a rather high-level notation with
wildcards and substitutions and stuff.

E.g. something like this:
measure=bytes_used -> measure=time_until_full
This would match all series with measure=bytes_used and use them to
compute a new series with measure=time_until_full and all other
dimensions unchanged.

That looks too simple, but do I need anything more complicated?
Obviously I also need to specify the filter. And I may not even need
to specify the result as the filter can determine that itself.
Although that may be bad for reusability.
Similar for auxiliary inputs. In the example above we also need the
corresponding measure=bytes_usable timeseries. The filter can
determine that itself, but maybe it's better to specify that in the
rule?

At run-time we expand the rules and just use the ids.
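A sketch of what rule expansion could look like (Python; representing a series key as a dict of dimensions and a rule as match/output/aux dicts is my assumption, following the bytes_used example above):

    # a series key is a dict of dimensions, e.g.
    # {"measure": "bytes_used", "host": "web1", "disk": "sda"}

    def matches(match, key):
        return all(key.get(dim) == val for dim, val in match.items())

    def expand(rule, all_keys):
        # «measure=bytes_used -> measure=time_until_full»: every matching
        # series yields an output key with only the listed dimensions
        # replaced; auxiliary inputs (e.g. bytes_usable) are spelled out
        # in the rule and share the remaining dimensions
        for key in all_keys:
            if matches(rule["match"], key):
                yield {
                    "inputs": [key] + [dict(key, **a) for a in rule.get("aux", [])],
                    "output": dict(key, **rule["output"]),
                }

    rule = {
        "match": {"measure": "bytes_used"},
        "output": {"measure": "time_until_full"},
        "aux": [{"measure": "bytes_usable"}],
    }
    # at run-time each expanded key would then be resolved to a series id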
I think we want to decouple the processing from the data acquisition, so
the web service should just write the changed timeseries into a queue.
Touch a file with the key as the name in a spool dir. The processor can
then check if there is anything in the spool dir and process it. The
result of the filter is then again added to the spool dir (make sure
there are no circular dependencies! Hmm, that's up to the user I guess?
Or each generated series could have a rate limit?)
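A minimal sketch of the spool-dir handoff (Python; the path and the process_key hook are placeholders, and encoding the key safely as a filename is glossed over):

    import os

    SPOOL_DIR = "/var/spool/metrics"  # placeholder

    def notify_changed(series_key):
        # web service side: after storing new datapoints, touch a file
        # named after the series key in the spool dir
        os.makedirs(SPOOL_DIR, exist_ok=True)
        path = os.path.join(SPOOL_DIR, series_key)
        with open(path, "a"):
            pass
        os.utime(path, None)

    def drain_spool(process_key):
        # processor side: handle whatever is in the spool dir; removing
        # the file before processing means a change arriving while we
        # run simply re-queues the key
        for name in os.listdir(SPOOL_DIR):
            os.remove(os.path.join(SPOOL_DIR, name))
            process_key(name)  # a filter may call notify_changed() again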
In addition to filters (which create new data) we also need some kind of
alerting system. That could just be a filter which produces no data but
does something else instead, like sending an email. So I'm not sure
whether it makes sense to distinguish these.
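If alerting is just a filter that returns no data, both kinds could share one interface; a sketch (all names here are made up):

    def send_email(subject):
        print("ALERT:", subject)  # stand-in for a real notification channel

    class TimeUntilFull:
        # ordinary filter: consumes inputs, returns data for a new series
        def run(self, inputs):
            points = []  # ... compute time_until_full from the inputs ...
            return {"measure": "time_until_full", "points": points}

    class DiskAlmostFullAlert:
        # "alerting filter": same interface, but only a side effect and
        # no returned data, so nothing new lands in the spool dir
        def run(self, inputs):
            send_email("disk almost full: %r" % (inputs,))
            return None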
We should record all the used inputs (recursively) for each generated
series (do we actually want to store the transitive closure or just the
direct inputs? We can expand that when necessary.). Doing that for each
datapoint is overkill, but we should mark each input with a "last seen"
timestamp so that we can ignore or scrub inputs which are no longer
used.
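A sketch of recording only the direct inputs plus a last-seen timestamp, and expanding the closure on demand (Python; the in-memory dict stands in for whatever store is actually used):

    import time

    # output series key -> {direct input key -> when it last contributed}
    direct_inputs = {}

    def record_inputs(output_key, input_keys):
        # called once per filter run, not per datapoint
        seen = direct_inputs.setdefault(output_key, {})
        now = time.time()
        for key in input_keys:
            seen[key] = now  # the "last seen" timestamp

    def live_inputs(output_key, max_age):
        # ignore (or eventually scrub) inputs not seen for max_age seconds
        cutoff = time.time() - max_age
        return [k for k, t in direct_inputs.get(output_key, {}).items()
                if t >= cutoff]

    def all_inputs(output_key):
        # only direct inputs are stored; expand the closure when necessary
        result, todo = set(), [output_key]
        while todo:
            for key in direct_inputs.get(todo.pop(), {}):
                if key not in result:
                    result.add(key)
                    todo.append(key)
        return result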
Do we need a negative/timeout trigger? I.e. if a timeseries which is
used as an input is NOT updated in time, trigger the filter anyway so
that it can take appropriate action? If we have that, how do we filter
out obsolete measurements? We don't want to get alerted that we haven't
gotten any disk usage data from a long-discarded host for 5 years. For
now I think we rely on other still active checks to fail if a
measurement fails to run.