From cf32c72b8fa0ee8bdc309005296cda0704fa792d Mon Sep 17 00:00:00 2001
From: "Peter J. Holzer"
Date: Sun, 27 Nov 2022 10:19:37 +0100
Subject: [PATCH] Think about scheduling measurements and processing them

---
 doc/multiqueue      | 35 +++++++++++++++++++++++++++++++++++
 doc/processing.pipe | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+)
 create mode 100644 doc/multiqueue
 create mode 100644 doc/processing.pipe

diff --git a/doc/multiqueue b/doc/multiqueue
new file mode 100644
index 0000000..67a231f
--- /dev/null
+++ b/doc/multiqueue
@@ -0,0 +1,35 @@
+1 Each job has a period p and a maximum delay d (<= p)
+
+At startup we start every job serially
+    (potential problem: What if this takes longer than the minimum
+    period? We could sort the jobs by p ascending)
+Alternatively enqueue every job at t=now
+    (potential problem: This may clump the jobs together more than
+    necessary)
+
+In any case:
+
+If a job just finished and there are runnable jobs, we start the next
+one in the queue.
+
+At every tick (1/second?) we check whether there are runnable jobs.
+For each runnable job we compute an overdue score «(now - t) / d».
+If the maximum score is >= random.random() we start that job.
+This is actually incorrect. Need to adjust for the ticks. Divide the
+score by «d / tick_length»? But if we do that we have no guarantee that
+the job will be started with a delay of at most d. We need a function
+which exceeds 1 at this point.
+«score = 1 / (t + d - now)» (in ticks) works. It's a uniform
+distribution, which is probably not ideal. I think I want the CDF to
+rise steeper at the start. But I can adjust that later if necessary.
+
+We reschedule that job:
+    at t + p?
+    at now + p?
+    at x + p where x is computed from the last n start times?
+    I think this depends on how we schedule them initially: If we
+    started them serially they are probably already well spaced out, so
+    t + p is a good choice. If we scheduled them all immediately, it
+    isn't. The second probably drifts most. The third seems reasonable
+    in all cases.
+
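+A rough Python sketch of the tick loop described above (the Job class,
+its attribute names, and TICK are placeholders, nothing is settled yet):
+
+    import random
+    import time
+
+    TICK = 1.0  # tick length in seconds
+
+    class Job:
+        def __init__(self, run, period, max_delay):
+            self.run = run      # callable that performs the measurement
+            self.p = period     # period p
+            self.d = max_delay  # maximum delay d (<= p)
+            self.t = time.time()    # next scheduled start time
+
+    def tick(jobs):
+        now = time.time()
+        runnable = [j for j in jobs if j.t <= now]
+        if not runnable:
+            return
+        def score(j):
+            # 1 / (number of ticks left until the deadline t + d);
+            # reaches 1 when only one tick is left, so the job is
+            # guaranteed to start with a delay of at most d.
+            remaining = (j.t + j.d - now) / TICK
+            return 1.0 if remaining <= 1 else 1.0 / remaining
+        job = max(runnable, key=score)
+        if score(job) >= random.random():
+            job.run()
+            job.t += job.p      # reschedule at t + p (see above)
+
+    def loop(jobs):
+        while True:
+            tick(jobs)
+            time.sleep(TICK)
+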
diff --git a/doc/processing.pipe b/doc/processing.pipe
new file mode 100644
index 0000000..2f0d311
--- /dev/null
+++ b/doc/processing.pipe
@@ -0,0 +1,48 @@
+have a list of dependencies
+    should probably be specified in a rather high level notation with
+    wildcards and substitutions and stuff.
+
+    E.g. something like this:
+        measure=bytes_used -> measure=time_until_full
+    This would match all series with measure=bytes_used and use them to
+    compute a new series with measure=time_until_full and all other
+    dimensions unchanged.
+
+    That looks too simple, but do I need anything more complicated?
+    Obviously I also need to specify the filter. And I may not even need
+    to specify the result as the filter can determine that itself.
+    Although that may be bad for reusability.
+    Similarly for auxiliary inputs. In the example above we also need the
+    corresponding measure=bytes_usable timeseries. The filter can
+    determine that itself, but maybe it's better to specify that in the
+    rule?
+
+    At run-time we expand the rules and just use the ids.
+
+I think we want to decouple the processing from the data acquisition,
+so the web service should just write the changed timeseries into a
+queue: Touch a file with the key as the name in a spool dir. The
+processor can then check if there is anything in the spool dir and
+process it (a rough sketch is at the end of these notes). The result
+of the filter is then again added to the spool dir (make sure there
+are no circular dependencies! Hmm, that's up to the user I guess? Or
+each generated series could have a rate limit?)
+
+In addition to filters (which create new data) we also need some kind of
+alerting system. That could just be a filter which produces no data but
+does something else instead, like sending an email. So I'm not sure
+whether it makes sense to distinguish these.
+
+We should record all the used inputs (recursively) for each generated
+series (do we actually want to store the transitive closure or just the
+direct inputs? We can expand that when necessary). Doing that for each
+datapoint is overkill, but we should mark each input with a "last seen"
+timestamp so that we can ignore or scrub inputs which are no longer
+used.
+
+Do we need a negative/timeout trigger? I.e. if a timeseries which is
+used as an input is NOT updated in time, trigger the filter anyway so
+that it can take appropriate action? If we have that, how do we filter
+out obsolete measurements? We don't want to get alerted that we haven't
+gotten any disk usage data from a long-discarded host for 5 years. For
+now I think we rely on other, still-active checks to fail if a
+measurement fails to run.
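+
+A rough Python sketch of the spool-dir queue described above (SPOOL_DIR,
+the Filter class and the way a filter is invoked are placeholders,
+nothing is settled yet):
+
+    import time
+    from pathlib import Path
+
+    SPOOL_DIR = Path("spool")   # placeholder location
+
+    def enqueue(key):
+        # The web service (or a filter) marks a series as changed by
+        # touching a file named after the series key.
+        SPOOL_DIR.mkdir(exist_ok=True)
+        (SPOOL_DIR / key).touch()
+
+    class Filter:
+        def __init__(self, inputs, output, func):
+            self.inputs = inputs    # expanded input series keys
+            self.output = output    # series key it produces
+            self.func = func        # callable computing the new datapoints
+
+    def process_spool(filters):
+        # One pass: for every key in the spool dir run all filters which
+        # use it as an input and enqueue their outputs in turn.
+        SPOOL_DIR.mkdir(exist_ok=True)
+        for entry in list(SPOOL_DIR.iterdir()):
+            key = entry.name
+            entry.unlink()      # claim the work item
+            for f in filters:
+                if key in f.inputs:
+                    f.func(key)         # reads/writes the series store
+                    enqueue(f.output)   # beware of cycles (see above)
+
+    def processor_loop(filters):
+        while True:
+            process_spool(filters)
+            time.sleep(1)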