Queue Prevention Mechanisms
To prevent infinite loops and runaway job duplication in the queue recovery system, the following mechanisms are either implemented or proposed:
1. Strict Idempotency & Job ID Preservation (Implemented)
- Mechanism: Ensure that every job has a deterministic `jobId` or `uniqueId` that persists across recovery attempts.
- Benefit: Prevents `HttpQueueDriver` from generating new UUIDs when a request fails, stopping the multiplication of identical jobs.
- Status: Fixed in `JobRecoveryDaemon.ts` and `HttpQueueDriver.ts`.
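
A minimal sketch of ID preservation, assuming a hypothetical `QueueJob` shape and an in-memory stand-in for the queue backend (the real types in `HttpQueueDriver.ts` may differ). The idea is that an existing identifier is always reused, and a new one is generated only when the job has never had one, so a retried enqueue cannot create a second copy:

```typescript
// Hypothetical job shape, for illustration only.
interface QueueJob {
  jobId?: string;
  uniqueId?: string;
  payload: unknown;
}

// Reuse an existing identifier; call the generator only if none exists.
function resolveJobId(job: QueueJob, generate: () => string): string {
  return job.uniqueId ?? job.jobId ?? generate();
}

// In-memory stand-in for the queue backend, keyed by job ID.
class InMemoryQueue {
  private readonly jobs = new Map<string, QueueJob>();

  // Returns true if the job was stored, false if that ID was already queued.
  enqueue(job: QueueJob, generate: () => string): boolean {
    const id = resolveJobId(job, generate);
    if (this.jobs.has(id)) return false;
    this.jobs.set(id, { ...job, jobId: id });
    return true;
  }

  get size(): number {
    return this.jobs.size;
  }
}
```

Re-enqueueing the same job after a failed request then becomes a no-op rather than a duplicate, because the second call resolves to the same key.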
2. Progressive Backoff with Attempt Tracking (Implemented)
- Mechanism: Use an explicit backoff strategy (30s, 60s, 180s) and strictly increment the `attempts` counter during recovery.
- Benefit: Ensures that even if the network fails repeatedly, the job eventually hits `maxAttempts` and is moved to `dead_letter` instead of looping forever.
- Status: Fixed in `JobRecoveryDaemon.ts` (incrementing `_currentAttempts`).
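
The backoff schedule and attempt cap can be sketched as a pure function. The function name, the return shape, and the choice to dead-letter once the incremented counter exceeds `maxAttempts` are assumptions for illustration, not the actual logic in `JobRecoveryDaemon.ts`:

```typescript
// Delays from the backoff strategy above, in seconds.
const BACKOFF_SECONDS: number[] = [30, 60, 180];

type RecoveryDecision =
  | { action: "retry"; attempts: number; delaySeconds: number }
  | { action: "dead_letter"; attempts: number };

// Always increments the counter first, so every recovery pass moves the job
// one step closer to the cap regardless of why the previous attempt failed.
function planRecovery(currentAttempts: number, maxAttempts: number): RecoveryDecision {
  const attempts = currentAttempts + 1;
  if (attempts > maxAttempts) {
    return { action: "dead_letter", attempts };
  }
  // Clamp to the last delay if maxAttempts is larger than the schedule.
  const index = Math.min(attempts - 1, BACKOFF_SECONDS.length - 1);
  const delaySeconds = BACKOFF_SECONDS[index] ?? 180;
  return { action: "retry", attempts, delaySeconds };
}
```

Because the increment happens unconditionally, a job cannot stay in the recovery loop for more than `maxAttempts` passes.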
3. Internal RPC Network Isolation
- Mechanism: Implement a separate rate limiter or bypass for internal Docker container traffic (e.g., allowlist `172.x` IPs).
- Benefit: Prevents the "Global Rate Limiter" from blocking the "Recovery Daemon" while it tries to recover the system. That blocking was the root cause of the timeouts.
- Recommendation: Review the `RateLimiter.ts` logic carefully.
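
One way to express the allowlist is a range check on the caller's IPv4 address. This sketch assumes that "172.x" means the private block 172.16.0.0/12, which contains Docker's default bridge subnet; the function names are hypothetical and are not taken from `RateLimiter.ts`:

```typescript
// Converts a dotted-quad IPv4 string to a number, or null if it is malformed.
function ipv4ToNumber(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// Bounds of 172.16.0.0/12 (172.16.0.0 through 172.31.255.255).
const INTERNAL_RANGE_START = 172 * 2 ** 24 + 16 * 2 ** 16;
const INTERNAL_RANGE_END = 172 * 2 ** 24 + 32 * 2 ** 16 - 1;

// True when the address falls inside the assumed internal block.
function isInternalDockerAddress(ip: string): boolean {
  const value = ipv4ToNumber(ip);
  return value !== null && value >= INTERNAL_RANGE_START && value <= INTERNAL_RANGE_END;
}

// Internal callers skip the global limiter; everything else is limited.
function shouldApplyGlobalRateLimit(remoteAddress: string): boolean {
  return !isInternalDockerAddress(remoteAddress);
}
```

The address passed in should be the socket's peer address rather than a client-supplied header, since a header value can be spoofed to claim an internal origin.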
4. Circuit Breaker Pattern
- Mechanism: If `HttpQueueDriver` detects a high failure rate (e.g., > 10% of requests failing or timing out), immediately stop all recovery attempts for a cool-down period.
- Benefit: Prevents flooding the logs and the database with `pending_recovery` transitions during a partial outage.
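
A rough sketch of such a breaker, tracking recent request outcomes in a sliding window. The class name, the window size, the minimum sample count, and the 60-second cool-down are all illustrative assumptions; only the 10% threshold comes from the example above:

```typescript
interface BreakerOptions {
  failureThreshold?: number; // fraction of failures that opens the breaker
  windowSize?: number;       // number of recent outcomes to keep
  minSamples?: number;       // do not judge the rate on too few requests
  coolDownMs?: number;       // how long recovery stays paused
  now?: () => number;        // injectable clock, useful in tests
}

class RecoveryCircuitBreaker {
  private readonly failureThreshold: number;
  private readonly windowSize: number;
  private readonly minSamples: number;
  private readonly coolDownMs: number;
  private readonly now: () => number;
  private outcomes: boolean[] = []; // true means the request failed
  private openedAt: number | null = null;

  constructor(options: BreakerOptions = {}) {
    this.failureThreshold = options.failureThreshold ?? 0.1;
    this.windowSize = options.windowSize ?? 50;
    this.minSamples = options.minSamples ?? 10;
    this.coolDownMs = options.coolDownMs ?? 60 * 1000;
    this.now = options.now ?? (() => Date.now());
  }

  // Record one request outcome and open the breaker if the rate is too high.
  record(failed: boolean): void {
    this.outcomes.push(failed);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
    if (this.outcomes.length < this.minSamples) return;
    const failures = this.outcomes.filter((f) => f).length;
    if (failures / this.outcomes.length > this.failureThreshold) {
      this.openedAt = this.now();
      this.outcomes = []; // start fresh after the cool-down
    }
  }

  // Recovery is allowed unless the breaker is open and still cooling down.
  allowRecovery(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.coolDownMs) {
      this.openedAt = null;
      return true;
    }
    return false;
  }
}
```

The recovery loop would call `allowRecovery()` before each pass and `record()` after each request, so a partial outage pauses recovery instead of generating a stream of state transitions.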
5. Transition Velocity Guard (Debounce)
- Mechanism: Before marking a job as `pending_recovery`, check the `zintrust_job_transitions` table. If the job has transitioned > 5 times in the last minute, force it to `manual_review`.
- Benefit: "Fails fast" for hyper-active loops that might bypass other checks.
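
The guard itself reduces to counting recent transitions. The sketch below assumes the job's transition timestamps have already been loaded from `zintrust_job_transitions` as epoch milliseconds; the function name and parameters are hypothetical, and the table's actual schema is not shown here:

```typescript
type GuardedState = "pending_recovery" | "manual_review";

// Decide the next state from the job's recent transition history.
function nextRecoveryState(
  transitionTimesMs: number[],
  nowMs: number,
  limit: number = 5,
  windowMs: number = 60 * 1000,
): GuardedState {
  const windowStart = nowMs - windowMs;
  // Count only transitions that fall inside the last window.
  const recent = transitionTimesMs.filter((t) => t > windowStart && t <= nowMs).length;
  return recent > limit ? "manual_review" : "pending_recovery";
}
```

For example, six transitions within the last minute yield `manual_review`, while five transitions, or six spread over several minutes, still yield `pending_recovery`.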
6. Development Safeguards
- Mechanism: Add a `pull_count` column or logic to `JobStateTracker` to strictly enforce the "3 tries" rule at the database level.
- Status: Partially covered by `attempts` usage.
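
As an illustration of the guarded increment this implies, the sketch below uses an in-memory map in place of the `pull_count` column. The class is hypothetical and is not part of `JobStateTracker`; in a real database the same check-and-increment would need to be a single conditional update so that concurrent workers cannot both pass the check:

```typescript
const MAX_PULLS = 3; // the "3 tries" rule

class PullCounter {
  private readonly pullCounts = new Map<string, number>();

  // Increment the count only while it is below the limit.
  // Returns false once the job has used all of its tries.
  tryPull(jobId: string): boolean {
    const current = this.pullCounts.get(jobId) ?? 0;
    if (current >= MAX_PULLS) return false;
    this.pullCounts.set(jobId, current + 1);
    return true;
  }

  // Current count for a job, 0 if it has never been pulled.
  countFor(jobId: string): number {
    return this.pullCounts.get(jobId) ?? 0;
  }
}
```

A worker would call `tryPull` before processing and skip any job for which it returns false, independent of what the application-level `attempts` counter says.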
7. Automated DLQ Analysis
- Mechanism: A scheduled task that groups `dead_letter` jobs by error message and auto-archives them if they match known "ignorable" patterns.
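
The grouping and matching step of such a task might look like the following. The `DeadLetterJob` shape, the function name, and the idea of expressing "ignorable" patterns as regular expressions are assumptions for illustration; scheduling and the actual archive write are left out:

```typescript
// Hypothetical minimal view of a dead-lettered job.
interface DeadLetterJob {
  id: string;
  errorMessage: string;
}

interface DlqAnalysis {
  archiveIds: string[];               // jobs whose error matched an ignorable pattern
  reviewCounts: Map<string, number>;  // remaining error messages with their counts
}

function analyzeDeadLetters(jobs: DeadLetterJob[], ignorable: RegExp[]): DlqAnalysis {
  // Group jobs by their exact error message.
  const groups = new Map<string, DeadLetterJob[]>();
  for (const job of jobs) {
    const group = groups.get(job.errorMessage) ?? [];
    group.push(job);
    groups.set(job.errorMessage, group);
  }

  const archiveIds: string[] = [];
  const reviewCounts = new Map<string, number>();
  groups.forEach((group, message) => {
    // Patterns should not use the global flag, since test() would then be stateful.
    if (ignorable.some((pattern) => pattern.test(message))) {
      archiveIds.push(...group.map((job) => job.id));
    } else {
      reviewCounts.set(message, group.length);
    }
  });
  return { archiveIds, reviewCounts };
}
```

The `reviewCounts` map doubles as a summary of what still needs human attention, ordered by first occurrence of each error message.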