NoSQL Task Queue Full

The MangoNoSQL point value store uses a background batch write-behind system to persist data to disk efficiently. When point values arrive faster than they can be written, the internal queue fills up, "NoSQL Data Lost" events are raised, and point value history can be permanently lost. This is a recurring issue in Mango deployments with high-throughput data collection.

Symptoms

  • "NoSQL Data Lost" events appear in the alarm system. These are raised whenever a batch write fails to persist data to disk.
  • Point value history shows gaps or missing data for specific time periods.
  • The Internal Metrics show a steadily increasing write-behind queue size that never fully drains.
  • The Mango log contains warnings about batch write failures or queue overflow.
  • System performance degrades as the queue grows, consuming memory and thread resources.
  • Data lost events occur in bursts, often correlating with periods of high data ingestion or disk I/O contention.

Common Causes

1. Disk I/O Bottleneck

The most common cause. The underlying storage cannot write data fast enough to keep up with the incoming point value rate. This is especially common with:

  • Traditional spinning hard drives (HDD) under heavy write load.
  • Network-attached storage (NAS) or SAN with high or variable latency.
  • Virtual machine storage that shares I/O with other VMs.
  • SD cards or eMMC storage on embedded devices (e.g., BeagleBone or Raspberry Pi).

2. Too Many Data Points Logging at High Frequency

A system with thousands of data points all configured to log every value (using "All data" logging) at a fast polling rate generates an enormous volume of writes. For example, 10,000 points polling every second with "All data" logging produces 10,000 writes per second.
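The arithmetic above generalizes to any point count and poll rate. A minimal sketch (the 16-byte per-value size is an assumption for rough capacity planning only; actual on-disk size depends on data type and compression):

```python
# Sketch: estimate NoSQL write load before enabling "All data" logging.
# bytes_per_value is an illustrative assumption, not a MangoNoSQL constant.
def write_load(points, poll_period_s, bytes_per_value=16):
    """Return (writes per second, approximate GiB written per day)."""
    writes_per_s = points / poll_period_s
    bytes_per_day = writes_per_s * 86_400 * bytes_per_value
    return writes_per_s, bytes_per_day / 1024**3

wps, gib_per_day = write_load(points=10_000, poll_period_s=1)
print(f"{wps:.0f} writes/s, ~{gib_per_day:.1f} GiB/day")  # 10000 writes/s, ~12.9 GiB/day
```

Running the numbers like this before deployment makes it obvious when a logging strategy exceeds what the storage can sustain.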

3. Batch Write Configuration Too Conservative

The default NoSQL performance settings are designed for moderate workloads. High-throughput systems may need more aggressive write-behind settings (more tasks, larger batches, shorter flush intervals).

4. High-Priority Thread Pool Exhaustion

Batch write-behind tasks run in the high-priority thread pool. If this pool is too small or is saturated by other tasks (data source polling, event processing), write tasks cannot execute, and the queue fills up.

5. Backup or Purge Operations Competing for I/O

NoSQL backup operations (compressing shard files into zip archives) and purge operations (deleting old shard data) generate significant I/O. If these operations coincide with peak data collection, the combined I/O load can overwhelm the disk.

6. Database Corruption

Corrupted NoSQL shard files can cause write failures that prevent the queue from draining. See Database Corruption Recovery for corruption-specific guidance.

Diagnosis

Check for Data Lost Events

Review the Events page filtered to system events or search the log:

grep -i "NoSQL\|data lost\|batch write\|queue.*full" <MA_HOME>/logs/ma.log

Monitor Queue Size via Internal Metrics

Enable the Internal Metrics data source and add points for:

  • NoSQL Write-Behind Queue Size: Number of point values waiting to be written. This should normally be near zero and drain quickly after bursts.
  • NoSQL Batch Write Tasks Active: Number of currently running write-behind tasks.

Chart these over time. A healthy system shows brief spikes that drain quickly. An unhealthy system shows a queue that grows steadily or never returns to zero.
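The healthy-versus-unhealthy distinction can be expressed as a simple check over sampled queue sizes. This is an illustrative sketch, not a Mango API; it assumes you have already collected queue-size samples (e.g., by charting the Internal Metrics point):

```python
# Sketch: classify a chronological series of write-behind queue-size samples.
# A healthy queue spikes during bursts but drains back near zero; an
# unhealthy one trends upward or never returns to empty.
def queue_health(samples, near_zero=10):
    drains = samples[-1] <= near_zero  # ends near empty
    rising = all(b >= a for a, b in zip(samples, samples[1:]))  # monotone growth
    if drains:
        return "healthy"
    return "unhealthy (growing)" if rising else "unhealthy (not draining)"

print(queue_health([0, 5000, 12000, 300, 0]))     # burst that drains -> healthy
print(queue_health([100, 400, 900, 2500, 7000]))  # steady growth -> unhealthy (growing)
```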

Check Disk I/O Performance

# Monitor disk I/O in real time
iostat -x 2 10

# Watch for:
# - %util near 100% (disk fully saturated)
# - await > 10ms (high latency per I/O operation)
# - w/s very high (many writes per second)
# Check the specific disk where NoSQL data is stored
df -h <MA_HOME>/databases/mangoTSDB/

Check Thread Pool Status

Review Internal Metrics for:

  • High Priority Thread Pool Active Count vs. Max Size: If consistently at or near maximum, the pool is saturated.
  • High Priority Thread Pool Queue Size: If this grows, tasks are waiting for threads.

Review NoSQL Performance Settings

The current settings can be viewed on the NoSQL settings page (System Settings or the module's dedicated page). Key values:

  • Batch process manager period
  • Batch write behind spawn threshold
  • Max batch write behind tasks
  • Maximum and minimum batch size
  • Minimum time to wait before flushing a small batch

Solutions

Solution 1: Upgrade Disk I/O Performance

This is the most impactful solution for I/O-bound systems:

  • Replace HDD with SSD: Solid-state drives provide 10-100x the random write performance of traditional hard drives.
  • Use local storage: Avoid network-attached storage for the NoSQL data directory if possible.
  • Separate the NoSQL directory: Move the NoSQL data to a dedicated high-performance disk:
    # In mango.properties
    db.nosql.location=/mnt/fast-ssd/mangoTSDB
  • Use a RAID array with battery-backed write cache for enterprise deployments.

Solution 2: Reduce Data Volume

  • Switch from "All data" to "When point value changes" logging for points that do not need every single value recorded. This can reduce write volume by 90% or more for slowly changing points.
  • Use "Interval" logging to log at a fixed rate (e.g., every 60 seconds) independent of the polling rate.
  • Add tolerance (deadband) to "When point value changes" logging for numeric points. A tolerance of 0.1 on a temperature sensor prevents logging noise within 0.1 degrees.
  • Disable logging for points that are only needed for real-time display or event detection.
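The effect of a tolerance (deadband) on write volume is easy to see in isolation. A minimal sketch of the filtering logic (illustrative only, not Mango's implementation):

```python
# Sketch of tolerance (deadband) logging for a numeric point: a value is
# logged only when it moves more than `tolerance` away from the last
# logged value, so sensor noise within the band is never written.
def deadband_filter(values, tolerance):
    logged, last = [], None
    for v in values:
        if last is None or abs(v - last) > tolerance:
            logged.append(v)
            last = v
    return logged

# Noisy temperature readings jittering within 0.1 degrees:
readings = [20.00, 20.04, 19.97, 20.02, 20.35, 20.31, 20.80]
print(deadband_filter(readings, tolerance=0.1))  # [20.0, 20.35, 20.8]
```

Seven raw samples collapse to three writes; on a slowly changing point polled every second, the reduction is far larger.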

Solution 3: Tune NoSQL Performance Settings

Navigate to the NoSQL settings page and adjust:

| Setting | Default | Recommendation for High Throughput |
| --- | --- | --- |
| Batch process manager period (ms) | 100 | Reduce to 50 for faster queue processing |
| Batch write behind spawn threshold | 100 | Reduce to 50 to create tasks sooner |
| Max batch write behind tasks | 5 | Increase to 10-20 (ensure the high-priority thread pool is large enough) |
| Maximum batch size | 500 | Increase to 1000-2000 for more efficient I/O |
| Minimum batch size | 10 | Leave at default |
| Minimum time to wait before flushing | 200 ms | Reduce to 100 ms for faster writes |
Caution: Increasing Max batch write behind tasks increases the demand on the high-priority thread pool. Ensure you have enough threads allocated; each additional write task consumes one high-priority thread.
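A quick back-of-the-envelope check of that demand before raising the setting (a sketch; the workload numbers are illustrative, and it assumes, per the caution above, one thread per concurrent write task):

```python
# Sketch: worst-case high-priority thread demand vs. pool size.
# All inputs are illustrative estimates for your own deployment.
def pool_headroom(core_size, polling_tasks, max_write_tasks, other_tasks=0):
    """Spare high-priority threads at worst case (negative = saturated)."""
    demand = polling_tasks + max_write_tasks + other_tasks
    return core_size - demand

# Pool of 50 threads, ~30 concurrent polls, 20 write-behind tasks:
print(pool_headroom(core_size=50, polling_tasks=30, max_write_tasks=20))  # 0
```

A result at or below zero means raising Max batch write behind tasks should be paired with Solution 4 (a larger thread pool).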

Solution 4: Increase the High-Priority Thread Pool

# In mango.properties
high.prio.pool.size.core=50
high.prio.pool.size.max=100

This ensures enough threads are available for both data collection and write-behind tasks.

Solution 5: Schedule Backups and Purges During Off-Peak Hours

  • Configure NoSQL backup time to a period with minimal data collection (e.g., 3:00 AM if your system has lower activity at night).
  • The SQL purge already runs at 3:05 AM by default. If both operations overlap and cause I/O contention, stagger them.
  • Consider reducing backup frequency if daily backups generate too much I/O (e.g., switch to every-other-day with incremental backups in between).

Solution 6: Distribute NoSQL Data Across Disks

The NoSQL database supports distributing data across multiple drives using symbolic links:

  1. Create a links directory inside the base NoSQL database directory.
  2. Create symbolic links inside links for shard IDs that should be stored on different drives.
  3. Click Reload links on the NoSQL settings page to apply changes.

This distributes I/O across multiple physical disks, effectively multiplying write throughput.
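The steps above can be sketched in shell. Paths here are illustrative (substitute your real NoSQL base directory and second-drive mount point), and the numeric shard-ID naming is an assumption about your shard layout:

```shell
# Sketch: point one NoSQL shard at a second physical disk via a symlink.
# NOSQL_BASE would normally be <MA_HOME>/databases/mangoTSDB; the demo
# paths below are placeholders so the commands run anywhere.
NOSQL_BASE=${NOSQL_BASE:-/tmp/mangoTSDB-demo}
FAST_DISK=${FAST_DISK:-/tmp/fast-disk-demo}

# 1. Create the links directory inside the NoSQL base directory,
#    and the target directory on the other drive.
mkdir -p "$NOSQL_BASE/links" "$FAST_DISK/438"

# 2. Link shard ID 438 to the other drive (438 is a hypothetical shard ID).
ln -sfn "$FAST_DISK/438" "$NOSQL_BASE/links/438"

# 3. In Mango, click "Reload links" on the NoSQL settings page to apply.
ls -l "$NOSQL_BASE/links"
```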

Solution 7: Move to a Faster Database Backend

For extremely high-throughput systems (hundreds of thousands of writes per second), consider:

  • Moving to a dedicated time-series database with Mango configured as a publisher.
  • Using the Persistent TCP or gRPC publisher to offload point value storage to another system.

Prevention

  • Benchmark disk I/O before deploying Mango. Use fio or dd to verify the storage can sustain the expected write rate:
    # Simple sequential write test
    dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=dsync
  • Plan logging strategy before adding large batches of points. Calculate the expected write rate: (number of points) * (logging frequency) = writes per second.
  • Monitor NoSQL queue size continuously with alarm thresholds. Set an alarm when the queue exceeds 1,000 values for more than 5 minutes.
  • Use SSD storage for any deployment with more than 1,000 actively logging data points.
  • Enable incremental NoSQL backups to reduce I/O impact compared to full backups.
  • Review performance settings after adding new data sources or significantly increasing the point count.