Changelog for Oban Pro v1.4

πŸš… Streamlined Workflows

Workflows now use automatic scheduling to run workflow jobs in order, without any polling or snoozing. Jobs with upstream dependencies are automatically marked as on_hold and scheduled far into the future. After the upstream dependency executes they're made available to run, with consideration for retries, cancellations, discards, and deletion.

  • An optimized, debounced query system uses up to 10x fewer queries to execute slower workflows.

  • Queue concurrency limits don't impact workflow execution, and even complex workflows can quickly run in parallel with global_limit: 1 and zero snoozing.

  • Cancelled, deleted, and discarded dependencies are still handled according to their respective ignore_* policies.

All of the previous workflow "tuning" options like waiting_limit and waiting_snooze are gone, as they're not needed to optimize workflow execution. Finally, older "in flight" workflows will still run with the legacy polling mechanism to ensure backwards compatibility.

β€οΈβ€πŸ©ΉπŸŒŸ Upgrading Workflows in v1.4

Because workflow staging is now reactive, you should add an optimized partial index to improve the performance of workflow dependency queries:

defmodule MyApp.Repo.Migrations.AddWorkflowIndexes do
  use Ecto.Migration

  defdelegate change, to: Oban.Pro.Migrations.Workflow
end

⏲️ Job Execution Deadlines

Jobs that shouldn't run after some period of time can be marked with a deadline. After the deadline has passed the job will be pre-emptively cancelled on its next run, or optionally during its next run if desired.

defmodule DeadlinedWorker do
  use Oban.Pro.Worker, deadline: {1, :hour}

  @impl true
  def process(%Job{args: args}) do
    # If this doesn't run within an hour, it's cancelled
  end
end

Deadlines may be set at runtime as well:

DeadlinedWorker.new(args, deadline: {30, :minutes})

In either case, the deadline is always relative and computed at runtime. That also allows the deadline to consider schedulingβ€”a job scheduled to run 1 hour from now with a 1 hour deadline will expire 2 hours in the future.

πŸ§™ Automatic Crontab Syncing

Synchronizing persisted entries manually required two deploys: one to flag it with deleted: true and another to clean up the entry entirely. That extra step isn't ideal for applications that don't insert or delete jobs at runtime.

To delete entries that are no longer listed in the crontab automatically set the sync_mode option to :automatic:

[
  sync_mode: :automatic,
  crontab: [
    {"0 * * * *", MyApp.BasicJob},
    {"0 0 * * *", MyApp.OtherJob}
  ]
]

To remove unwanted entries, simply delete them from the crontab:

 crontab: [
   {"0 * * * *", MyApp.BasicJob},
-  {"0 0 * * *", MyApp.OtherJob}
 ]

With :automatic sync, the entry for MyApp.OtherJob will be deleted on the next deployment.

β€οΈβ€πŸ©ΉπŸŒŸ Upgrading DynamicCron in v1.4

Changes to DynamicCron require a migration to add the new insertions column. You must re-run the Oban.Pro.Migrations.DynamicCron migration when upgrading.

defmodule MyApp.Repo.Migrations.UpdateObanCron do
  use Ecto.Migration

  defdelegate change, to: Oban.Pro.Migrations.DynamicCron
end

v1.4.13 β€” 2024-09-13

  • [Smart] Backport unique key generation from from v1.5.0-rc.3

    The crypto hash variant of uniq_key and uniq_bmp state information are now injected into a job's meta to aid in migration to enhanced uniqueness in Pro v1.5. That allows a system to gradually populate the job's table with accurate unique keys and state maps, so the unique index can be built for v1.5 without any loss in unique checks.

v1.4.12 β€” 2024-08-07

Bug Fixes

  • [Worker] Update use of parameterized types for Ecto v3.12 release

    The recently released Ecto v3.12 changed the signature for parameterized types, e.g. enums, which broke enum use in structured args. This updates the type signature to match.

  • [Smart] Inject conf during after_cancelled/2 callbacks

    The after_cancelled/2 callback was passed a job without the current Oban configuration, which broke workflow functions such as all_jobs/2.

    Now the conf is injected accordingly, and error messages from unhandled after_cancelled/2 execeptions include the stacktrace.

v1.4.11 β€” 2024-08-07

Version constraints are loosened to allow Oban v2.18.

Bug Fixes

  • [Smart] Take a queue mutex with pending batch or workflow handlers.

    Flush handlers for batches and workflows must be staggered to prevent overlapping transactions. Without a mutex to stagger fetching the transactions on different nodes can't see all completed jobs and handlers can misfire.

    This is backported from a similar change in v1.5.

  • [Batch] Always cast state and callback values to text in batch.

    Without explicit typecasting in unioned queries, batch operations may cause an unknown type error in PostgreSQL 16.1 and above.

v1.4.10 β€” 2024-06-25

Bug Fixes

  • [Smart] Always cast uniq_key to an integer in unique dupes query.

    The raw uniq_key type is always an integer, but it was injected without any type information. That caused an indeterminate_datatype error on the latest Aurora versions.

  • [Worker] Correct scheduled_at check for deadline workers.

    A typo in the scheduled_at check prevented correct deadlines for explicitly scheduled jobs.

v1.4.9 β€” 2024-06-05

Enhancements

  • [Workflow] Use microseconds in the orig_scheduled_at value stored in meta

    Original workflow scheduled timestamps are now stored as microseconds rather than seconds to match the precision of scheduled_at. There's fallback support for existing jobs with seconds for a smooth transition.

Bug Fixes

  • [Smart] Prevent tracked global counts from getting out of sync.

    Races between update operations such as scaling or pausing, and job fetching with backoff due to errors, could cause incorrect tracked global counts. Those counts wouldn't self-correct after the initial ack, and eventually the queue would stop fetching new jobs.

  • [Smart] Omit workflow jobs with incomplete dependencies from retrying.

    Using retry_job or retry_all_jobs on an on-hold workflow job may effectively break the workflow. Now held jobs are omitted from retry calls, which are also the mechanism used by Oban Web for "Run now".

  • [Smart] Delete all pending acks after draining jobs in tests.

    While draining, acks were persisted and not cleaned up afterwards. That could lead to incorrect batch callbacks firing, which could insert jobs during the wrong test.

  • [DynamicCron] Prevent missing worker from crashing the plugin on insert.

    A single missing worker would crash the plugin when it reloaded entries for insert. Now a missing worker is handled gracefully, allowing the plugin to keep running normally.

    This was a regression from v1.3, caused by the switch from insert with uniqueness to insert_all.

v1.4.8 β€” 2024-05-30

Enhancements

  • [Batch] Heavily optimize Batch query performance.

    Applying recent learnings from optimizing workflows, this switches Batch queries to specialized, index backed queries for batch callback checks. In a benchmark containing 10 batches of 350k jobs, the optimized queries performed 450x faster (from 90ms down to 0.2ms).

    The batch callback queries are much faster even without an index, but applications that utilize batches should run the new migration.

    defmodule MyApp.Repo.Migrations.AddBatchIndex do
      use Ecto.Migration
    
      defdelegate change, to: Oban.Pro.Migrations.Batch
    end
  • [Worker] Omit nil values when dumping structured args.

    Optional fields with a nil value were always written to the database. That prevented graceful transitions away from a structured field because the optional value would linger. For example, given the following schema:

    args_schema do
      field :foo, :id
      field :bar, :id
    end

    Passing the args %{foo: 123} is stored as {"foo":123}, without an empty "bar" key.

  • [Smart] Record the original scheduled_at timestamp on snooze or retry.

    Snoozing and rescheduling a job on error overwrites the scheduled_at timestamp, which makes it impossible to determine how long a job was truly waited before execution.

    Now snoozing or erroring will inject the original scheduled_at timestamp as unix seconds into the job's meta as orig_scheduled_at.

Bug Fixes

  • [Workflow] Restore tuple type syntax to prevent double wrapped options for all functions that used the new_option type. This fixes a success typing Dialyzer warning.

  • [Testing] Allow testing all batch callbacks.

    The handle_cancelled and handle_retryable callbacks weren't listed as known callbacks and couldn't be used by the perform_callback testing helper.

  • [Batch] Safely handle jobs cancelled by the Basic engine.

    The Batcher incorrectly assumed that meta was available in all cancelled jobs. That's only the case for jobs cancelled by the Smart engine, not the Basic engine.

v1.4.7 β€” 2024-05-15

Enhancements

  • [Workflow] Add workflow_name option to label all jobs in a workflow with a common name.

    In addition to individual step names, now all jobs in a workflow can have a workflow_name that groups them together. For example, to label an ETL workflow:

    workflow = Workflow.new(workflow_name: "etl_pipeline")
  • [Workflow] Add after_cancelled/2 callback to hook into workflow job cancellations.

    The new callback is triggered after a workflow job is cancelled due to upstream jobs being cancelled, deleted, or discarded. This callback is only called when a job is cancelled because of an upstream dependency. It is never called after normal job execution.

    For example, here's a trivial after_cancelled hook that logs a warning when a workflow job is cancelled:

    def MyApp.Workflow do
      use Oban.Pro.Workers.Workflow
    
      require Logger
    
      @impl true
      def after_cancelled(reason, job) do
        Logger.warn("Workflow job #{job.id} cancelled because a dependency was #{reason}")
      end

Bug Fixes

  • [Workflow] Respect the original scheduled time when ignoring cancellations.

    Workflow jobs were made available rather than scheduled to the correct scheduled time when ignore_cancelled or ignore_discarded was triggered.

  • [Workflow] Record an Oban.Pro.WorkflowError error when cancelling dependent jobs.

    Previously, a generic ErlangError was recorded as the cancellation reason, which wasn't helpful for identifying cancelled dependents.

  • [DynamicCron] Split migration into distinct up/0 and down/0 steps to for rollback.

    The migration couldn't be automatically rolled back due to the use of alter table. Now there are separate up and down migrations that the documentation points to, though the change function exists for backward compatibility.

  • [Worker] Indicate that the result passed to after_process/3 may be nil after an error, in addition to the standard job results.

v1.4.6 β€” 2024-04-26

Enhancements

  • [Smart] Use a static pool of ets tables to track jobs for async acking.

    Producers no longer have their own ets tables for acking. Instead, there is a central, static pool of tables to reduce the total number of ets tables, and it ensures that if a producer crashes it won't lose any job acks.

    There's a substantial memory savings from this change. With 1000 queues, this reduced the initial memory used by ets tables 24MB.

Bug Fixes

  • [Workflow] Prevent race condition causing stuck workflow jobs.

    Two separate factors interplayed to cause stuck workflow jobs in queues with numerous concurrent workflows. Changing how handlers were stored, deduplicated, and applied improves the overall reliability of Batch and Workflow handlers.

  • [Workflow] Periodically rescue stuck workflows from the DynamicLifeline.

    The DynamicLifeline will now periodically rescue stuck workflows when upstream dependencies are deleted or downstream deps aren't staged properly. This ensures workflows can always be repaired and resumed without outside intervention.

    This also removes the database function and trigger because they're incompatible with partitioned tables. Moving a row from one partition to another invokes the delete trigger, which makes the trigger rescue mechanism completely incompatible with partitioned tables.

    There's no harm in the trigger for existing single-table setups, and it doesn't need to be removed, but it won't be added by migrations moving forward.

  • [DynamicCron] Prevent query timeouts by selecting truncated insertions once.

    The query duplicated selecting insertions rather than overwriting it. Now the query properly limits insertion reloading.

v1.4.5 β€” 2024-04-22

Bug Fixes

  • [Migrations] Remove use of OR REPLACE in workflow trigger migration.

    The OR REPLACE syntax wasn't added until PG14, which is newer than the minimum supported PG12.

v1.4.4 β€” 2024-04-20

Enhancements

  • [Workflow] Use a highly optimized partial index for workflow deps checks.

    Between an optional partial index and an alternative query that avoids bitmap index scans, workflow deps checks are much faster, and even sub-millisecond in most cases.

    To get the benefits of this query change you must run, or re-run, the Oban.Pro.Migrations.Workflow migration.

  • [Smart] Limit total flush handlers (Batch, Workflow) per fetch transaction.

    When flush handlers accumulate, and they're too slow, they can cause the producer's fetch transaction to fail. Now a fixed number of flush handlers are applied in each transaction, and fetching is scheduled again shortly afterwards to continue flushing if there are additional handlers.

  • [DynamicCron] Only select latest insertion timestamp when reloading crontab entries.

    Reduce the total data pulled from the oban_crons table on reload by only selecting the most recent insertion timestamp. Applications with thousands of dynamic crons could experience timeouts from the extensive data transfer. In addition, the number of recorded insertions is halved from 720 to 360 by default.

Bug Fixes

  • [Workflow] Cascade workflow changes on lifeline job discard.

    Workflow deps were left "on hold" when an upstream job exhausted retries and was discarded by the DynamicLifeline. Now discard events are handled and the appropriate workflow transition is applied after the plugin runs.

  • [DynamicPartitioner] Retain database schema when rolling back DynamicPartitioner migration.

    The CREATE SCHEMA statement is conditional, and there's no guarantee that the schema wasn't already created before rolling it back.

  • [DynamicCron] Drop :guaranteed override from job options when inserting cron jobs.

    A per-entry guaranteed override is supported for checking jobs, but naturally Oban.Job doesn't support it as an option. Now the :guaranteed key is dropped along with :timezone.

  • [DynamicCron] Shift the timezone when comparing last insertion for guaranteed checks.

    Apply per-entry timezones or a globally configured timezone when comparing a parsed time with the last insertion for guaranteed mode.

  • [DynamicCron] Clear recorded insertions on entry expression or timezone updates.

    Expression and per-entry timezone changes now clear prior insertion timestamps to prevent mistaken scheduling.

  • [Testing] Always include base :retryable value in drain_jobs/2 summary output.

    The output of drain_jobs only had a :retryable value when one or more jobs were retryable, not in all cases as the typespec implied.

  • [Testing] Correct :exhausted check for summary output from run_workflow/2.

    Exhausted jobs were incorrectly tracked as discarded when summarizing run_workflow output.

  • [Chain] Safely run chained jobs with :inline testing mode.

    In :inline mode jobs don't hit the database. Without an id chain queries cause an ArgumentError, plus, :inline isn't meant to touch the database at all.

v1.4.3 β€” 2024-04-08

Enhancements

  • [Workflow] Use a database trigger function to handle deleted deps.

    Querying the full oban_jobs table to detect abandonded workflow deps is too slow for sizable workflows. The only consistent intercept for deleted dependencies is the database, so this adds a Oban.Pro.Migrations.Workflow migration that creates a trigger specifically tailored to cleaning up deps after deleting a a workflow job.

    This approach is vastly faster and applied immediately, without waiting for a pruning event. However, it does require an optional migration to safely handle deleted workflow deps.

    See the admonition above on upgrading workflows, or the section on handling deleted deps in the docs for details.

  • [Smart] Handle Batch callback and Workflow deps checks after async acking, rather than forcing synchronous acking with a separate debounce cycle.

    This change improves the overall throughput of queues with Batch and Workflow jobs, while also increasing the responsiveness of batch callback enqeuing and deps staging.

    • Batch and Workflow checks are still grouped to avoid duplicate work
    • Acking is always synchronous, there's no blocking job execution for debounced checks
    • Queries for acking, fetching, and checking all run in a single transaction
    • Debounce options are ignored and no longer documented
  • [Testing] Mark test processes when draining to simplify async acking checks.

Bug Fixes

  • [Smart] Track producer meta changes without fetched jobs.

    An empty clause prevented tracking global changes for empty partitions, e.g. partitions without additional jobs to fetch. This was most noticeable for partitioned queues with a low limit and sparse jobs.

  • [Testing] Reload all jobs after draining in run_workflow/1,2.

    Appended jobs weren't included in the result from run_workflow/1, despite being inserted and executed. Now all jobs from the workflow are considered for summarization, or returned entirely without a summary.

v1.4.2 β€” 2024-03-26

  • [Workflow] Always resolve all workflow deps with single check.

    Workflow deps checks with a mismatched number of deps and finished jobs could be made available erroneously.

  • [Workflow] Reimplement deleted workflow handling for accuracy, performance, and memory.

    The workflow deps anti-join query was incorrect, causing it to return jobs that weren't actually part of on-hold workflows. The number of jobs returned grew with the total historic jobs, which could cause memory issues.

v1.4.1 β€” 2024-03-25

Bug Fixes

  • [Workflow] Fix workflow staging for jobs with multiple deps.

    The final optimization to workflow deps querying introduced a bug that caused jobs with multiple deps to be made available after the first dep completed. This restores the original, correct query.

v1.4.0 β€” 2024-03-21

Enhancements

  • [DynamicCron] Add :sync_mode for automatic entry management.

    Most users expect that when they delete an entry from the crontab it won't keep running after the next deploy. A new sync_mode option allows selecting between automatic and manual entry management.

    In addition, this moves cron entry management into the evaluate handler. Inserting and deleting at runtime can't leverage leadership, because in rolling deployments the new nodes are never the leader.

  • [DynamicCron] Use recorded job insertion timestamps for guaranteed cron.

    A new field on the oban_crons table records each time a cron job is inserted up to a configurable limit. That field is then used for guaranteed cron jobs, and optionally for historic inspection beyond a job's retention period.

  • [DynamicCron] Stop adding a unique option when inserting jobs, regardless of the guaranteed option.

    There's no need for uniqueness checks now that insertions are tracked for each entry. Inserting without uniqueness is significantly faster.

  • [DynamicCron] Inject cron information into scheduled job meta.

    This change mirrors the addition of cron: true and cron_expr: expression added to Oban's Cron in order to make cron jobs easier to identify and report on through tools like Sentry.

  • [Worker] Add :deadline option for auto cancelling jobs

    Jobs that shouldn't run after some period of time can be marked with a deadline. After the deadline has passed the job will be pre-emptively cancelled on its next run, or optionally during its next run if desired.

  • [Workflow] Invert workflow execution to avoid bottlenecks caused by polling and snoozing.

    Workflow jobs no longer poll or wait for upstream dependencies while executing. Instead, jobs with dependencies are "held" until they're made available by a facilitator function. This inverted flow makes fewer queries, doesn't clog queues with jobs that aren't ready, avoids snoozing, and is generally more efficient.

  • [Workflow] Expose functions and direct callback docs to function docs.

    Most workflow functions aren't true callbacks, and shouldn't be overwritten. Now all callbacks point to the public function they wrap. Exposing workflow functions makes it easier to find and link to documentation.

Bug Fixes

  • [DynamicCron] Don't consider the node rebooted until it is leader

    With rolling deploys it is frequent that a node isn't the leader the first time it evaluates. However, the :rebooted flag was set to true on the first run, which prevented reboots from being inserted when the node ultimately acquired leadership.

  • [DynamicQueues] Accept streamline partition syntax for global_limit and rate_limit options.

    DynamicQueues didn't normalize the newer partition syntax before validation. This was an oversight, and a sign that validation had drifted between the Producer and Queue schemas. Now schemas use the same changesets to ensure compatibility.

  • [Smart] Handle partitioning by :worker and :args regardless of order.

    The docs implied partitioning by worker and args was possible, but there wasn't a clause that handled any order correctly.

  • [Smart] Explicitly cast transactional advisory lock prefix to integer.

    Postgres 16.1 may throw an error that it's unable to determine the argument type while taking bulk unique locks.

  • [Smart] Preserve recorded values between retries when sync acking failed.

    Acking a recorded value for a batch or workflow is synchronous, and a crash or timeout failure would lose the recorded value on subsequent attempts. Now the value is persisted between retries to ensure values are always recorded.

  • [Smart] Revert "Force materialized CTE in smart fetch query".

    Forcing a materialized CTE in the fetch query was added for reliability, but it can cause performance regressions under heavy workloads.

  • [Testing] Use configured queues when ensuring all started.

    Starting a supervised Oban in manual mode with tests specified would fail because in :manual testing mode the queues option is overridden to be empty.