Think about scheduling measurements and processing them

@@ -0,0 +1,35 @@
Each job has a period p and a maximum delay d (<= p)

At startup we start every job serially
(potential problem: What if this takes longer than the minimum
period? We could sort the jobs by p asc)
Alternatively enqueue every job at t=now
(potential problem: This may clump the jobs together more than
necessary)

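Very rough sketch of the two options; nothing here exists yet, Job and
run_job are made-up names:

    import time
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        p: float          # period in seconds
        d: float          # maximum acceptable delay, d <= p
        t: float = 0.0    # next scheduled start time

    def start_serially(jobs, run_job):
        # Option 1: run every job once at startup, shortest period first,
        # so that a long serial startup hurts the tightest schedules least.
        for job in sorted(jobs, key=lambda j: j.p):
            run_job(job)
            job.t = time.time() + job.p

    def enqueue_all_now(jobs):
        # Option 2: just schedule every job for "now" and let the tick
        # loop spread them out (risk: clumping).
        now = time.time()
        for job in jobs:
            job.t = now
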
In any case:

If a job just finished and there are runnable jobs, we start the next
one in the queue.

At every tick (1/second?) we check whether there are runnable jobs.
For each runnable job we compute an overdue score «(now - t) / d».
If the maximum score is >= random.random() we start that job.
This is actually incorrect. Need to adjust for the ticks. Divide score
by «d / tick_length»? But if we do that we have no guarantee that the
job will be started with at most d delay. We need a function which
exceeds 1 at this point.
«score = 1 / (t + d - now)» works. It's a uniform distribution, which is
probably not ideal. I think I want the CDF to rise steeper at the start.
But I can adjust that later if necessary.

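Sketch of that tick check (roughly once per second), using the made-up
Job record from above and the corrected score formula; "runnable" is
assumed to mean now >= t:

    import random
    import time

    def on_tick(jobs, run_job):
        # Start at most one runnable job per tick, with a probability that
        # grows as the job approaches its deadline t + d.
        now = time.time()
        runnable = [j for j in jobs if now >= j.t]
        if not runnable:
            return

        def score(job):
            # «score = 1 / (t + d - now)»: once less than a second remains
            # the score exceeds 1 and the start is guaranteed.
            remaining = job.t + job.d - now
            return float("inf") if remaining <= 0 else 1.0 / remaining

        job = max(runnable, key=score)
        if score(job) >= random.random():
            run_job(job)
            job.t = now + job.p   # naive reschedule; options discussed below
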
We reschedule that job.
at t + p?
at now + p?
at x + p where x is computed from the last n start times?
I think this depends on how we schedule them initially: If we
started them serially they are probably already well spaced out, so
t + p is a good choice. If we scheduled them all immediately, it
isn't. The second probably drifts most. The third seems reasonable
in all cases.

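One possible reading of the third option, keeping the last few start
times on the job (again just a sketch):

    from statistics import mean

    def reschedule(job, now, history=5):
        # «x + p» where x is computed from the last n start times: estimate
        # where the latest start *should* have fallen on an ideal p-spaced
        # grid, which smooths out one-off delays.
        starts = getattr(job, "starts", [])
        starts.append(now)
        job.starts = starts[-history:]
        n = len(job.starts)
        # Align each recorded start onto the grid, average the anchors,
        # then project one period past the most recent slot.
        anchor = mean(s - i * job.p for i, s in enumerate(job.starts))
        x = anchor + (n - 1) * job.p
        job.t = x + job.p

If all recorded starts are exactly p apart this reduces to «last start + p»;
a single run that was delayed by D only shifts the next start by D/n.
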
@@ -0,0 +1,48 @@
Have a list of dependencies. They should probably be specified in a
rather high level notation with wildcards and substitutions and stuff.

E.g. something like this:
    measure=bytes_used -> measure=time_until_full
This would match all series with measure=bytes_used and use them to
compute a new series with measure=time_until_full and all other
dimensions unchanged.

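Sketch of what expanding such a rule could mean if a series is just a
dict of dimensions (the rule representation here is invented):

    def expand_rule(rule, all_series):
        # rule = (dimensions to match, dimensions to replace in the result)
        match, replace = rule
        for dims in all_series:
            if all(dims.get(k) == v for k, v in match.items()):
                # keep every other dimension, only swap the matched ones
                yield dims, {**dims, **replace}

    rule = ({"measure": "bytes_used"}, {"measure": "time_until_full"})
    series = [
        {"measure": "bytes_used", "host": "db1", "mount": "/var"},
        {"measure": "load", "host": "db1"},
    ]
    for source, target in expand_rule(rule, series):
        print(source, "->", target)   # only the bytes_used series matches
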
That looks too simple, but do I need anything more complicated?
Obviously I also need to specify the filter. And I may not even need
to specify the result as the filter can determine that itself.
Although that may be bad for reusability.
Similar for auxiliary inputs. In the example above we also need the
corresponding measure=bytes_usable timeseries. The filter can
determine that itself, but maybe it's better to specify that in the
rule?

At run-time we expand the rules and just use the ids.

I think we want to decouple the processing from the data acquisition, so
the web service should just write the changed timeseries into a queue.
Touch a file with the key as the name in a spool dir. The processor can
then check if there is anything in the spool dir and process it. The
result of the filter is then again added to the spool dir (make sure
there are no circular dependencies! Hmm, that's up to the user I guess?
Or each generated series could have a rate limit?)

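Sketch of that handoff, assuming the series key is safe to use as a file
name (real keys would need escaping) and a made-up spool path;
run_filters stands in for whatever runs the matching filters and returns
the keys of the series they produced:

    import os

    SPOOL_DIR = "/var/spool/metrics"   # hypothetical location

    def notify_changed(key):
        # web service side: touch a file named after the changed series
        open(os.path.join(SPOOL_DIR, key), "a").close()

    def process_spool(run_filters):
        # processor side: pick up pending keys, run the filters on them,
        # and re-queue whatever series the filters generated so that
        # dependent filters get their turn too
        for key in os.listdir(SPOOL_DIR):
            os.remove(os.path.join(SPOOL_DIR, key))
            for generated_key in run_filters(key):
                notify_changed(generated_key)
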
In addition to filters (which create new data) we also need some kind of
alerting system. That could just be a filter which produces no data but
does something else instead, like sending an email. So I'm not sure
whether it makes sense to distinguish these.

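If alerts share the filter interface they would just return nothing; a
toy illustration only, the signature and send_mail are invented:

    def check_time_until_full(datapoint, send_mail):
        # "alerting filter": consumes inputs like any other filter but
        # returns no new series; the side effect is the alert itself
        if datapoint["time_until_full"] < 24 * 3600:
            send_mail("disk almost full: %r" % datapoint)
        return []   # nothing new for the processor to enqueue
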
We should record all the used inputs (recursively) for each generated
series (do we actually want to store the transitive closure or just the
direct inputs? We can expand that when necessary). Doing that for each
datapoint is overkill, but we should mark each input with a "last seen"
timestamp so that we can ignore or scrub inputs which are no longer
used.

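Minimal in-memory sketch of such bookkeeping (direct inputs only;
persistence left out):

    import time

    class InputIndex:
        # per generated series: its direct input ids and when each was last used
        def __init__(self):
            self.inputs = {}   # generated_id -> {input_id: last_seen}

        def record(self, generated_id, input_ids):
            seen = self.inputs.setdefault(generated_id, {})
            now = time.time()
            for input_id in input_ids:
                seen[input_id] = now

        def scrub(self, max_age):
            # drop inputs that no filter run has touched recently
            cutoff = time.time() - max_age
            for seen in self.inputs.values():
                for input_id in [i for i, ts in seen.items() if ts < cutoff]:
                    del seen[input_id]
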
Do we need a negative/timeout trigger? I.e. if a timeseries which is
used as an input is NOT updated in time, trigger the filter anyway so
that it can take appropriate action? If we have that, how do we filter
out obsolete measurements? We don't want to get alerted that we haven't
gotten any disk usage data from a long-discarded host for 5 years. For
now I think we rely on other still active checks to fail if a
measurement fails to run.