NoSQL Task Queue Full

The MangoNoSQL point value store uses a background batch write-behind system to persist data to disk efficiently. When point values arrive faster than they can be written, the internal queue fills up, "NoSQL Data Lost" events are raised, and point value history can be permanently lost. This is a recurring issue in Mango deployments with high-throughput data collection.

Symptoms

  • "NoSQL Data Lost" events appear in the alarm system. These are raised whenever a batch write fails to persist data to disk.
  • Point value history shows gaps or missing data for specific time periods.
  • The Internal Metrics show a steadily increasing write-behind queue size that never fully drains.
  • The Mango log contains warnings about batch write failures or queue overflow.
  • System performance degrades as the queue grows, consuming memory and thread resources.
  • Data lost events occur in bursts, often correlating with periods of high data ingestion or disk I/O contention.

Common Causes

1. Disk I/O Bottleneck

The most common cause. The underlying storage cannot write data fast enough to keep up with the incoming point value rate. This is especially common with:

  • Traditional spinning hard drives (HDD) under heavy write load.
  • Network-attached storage (NAS) or SAN with high or variable latency.
  • Virtual machine storage that shares I/O with other VMs.
  • SD cards or eMMC storage on embedded devices (e.g., BeagleBone or Raspberry Pi).

2. Too Many Data Points Logging at High Frequency

A system with thousands of data points all configured to log every value (using "All data" logging) at a fast polling rate generates an enormous volume of writes. For example, 10,000 points polling every second with "All data" logging produces 10,000 writes per second.
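The arithmetic above generalizes to any point count and poll rate. A minimal sketch (the 16-byte per-value size is an assumption for rough capacity planning only; actual on-disk size depends on data type and compression):

```python
# Sketch: estimate NoSQL write load before enabling "All data" logging.
# bytes_per_value is an illustrative assumption, not a MangoNoSQL constant.
def write_load(points, poll_period_s, bytes_per_value=16):
    """Return (writes per second, approximate GiB written per day)."""
    writes_per_s = points / poll_period_s
    bytes_per_day = writes_per_s * 86_400 * bytes_per_value
    return writes_per_s, bytes_per_day / 1024**3

wps, gib_per_day = write_load(points=10_000, poll_period_s=1)
print(f"{wps:.0f} writes/s, ~{gib_per_day:.1f} GiB/day")  # 10000 writes/s, ~12.9 GiB/day
```

Running the numbers like this before deployment makes it obvious when a logging strategy exceeds what the storage can sustain.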

3. Batch Write Configuration Too Conservative

The default NoSQL performance settings are designed for moderate workloads. High-throughput systems may need more aggressive write-behind settings (more tasks, larger batches, shorter flush intervals).

4. High-Priority Thread Pool Exhaustion

Batch write-behind tasks run in the high-priority thread pool. If this pool is too small or is saturated by other tasks (data source polling, event processing), write tasks cannot execute, and the queue fills up.

5. Backup or Purge Operations Competing for I/O

NoSQL backup operations (compressing shard files into zip archives) and purge operations (deleting old shard data) generate significant I/O. If these operations coincide with peak data collection, the combined I/O load can overwhelm the disk.

6. Database Corruption

Corrupted NoSQL shard files can cause write failures that prevent the queue from draining. See Database Corruption Recovery for corruption-specific guidance.

Diagnosis

Check for Data Lost Events

Review the Events page filtered to system events or search the log:

grep -i "NoSQL\|data lost\|batch write\|queue.*full" <MA_HOME>/logs/ma.log

Monitor Queue Size via Internal Metrics

Enable the Internal Metrics data source and add points for:

  • NoSQL Write-Behind Queue Size: Number of point values waiting to be written. This should normally be near zero and drain quickly after bursts.
  • NoSQL Batch Write Tasks Active: Number of currently running write-behind tasks.

Chart these over time. A healthy system shows brief spikes that drain quickly. An unhealthy system shows a queue that grows steadily or never returns to zero.
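The healthy-versus-unhealthy distinction can be expressed as a simple check over sampled queue sizes. This is an illustrative sketch, not a Mango API; it assumes you have already collected queue-size samples (e.g., by charting the Internal Metrics point):

```python
# Sketch: classify a chronological series of write-behind queue-size samples.
# A healthy queue spikes during bursts but drains back near zero; an
# unhealthy one trends upward or never returns to empty.
def queue_health(samples, near_zero=10):
    drains = samples[-1] <= near_zero  # ends near empty
    rising = all(b >= a for a, b in zip(samples, samples[1:]))  # monotone growth
    if drains:
        return "healthy"
    return "unhealthy (growing)" if rising else "unhealthy (not draining)"

print(queue_health([0, 5000, 12000, 300, 0]))     # burst that drains -> healthy
print(queue_health([100, 400, 900, 2500, 7000]))  # steady growth -> unhealthy (growing)
```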

Check Disk I/O Performance

# Monitor disk I/O in real time
iostat -x 2 10

# Watch for:
# - %util near 100% (disk fully saturated)
# - await > 10ms (high latency per I/O operation)
# - w/s very high (many writes per second)
# Check the specific disk where NoSQL data is stored
df -h <MA_HOME>/databases/mangoTSDB/

Check Thread Pool Status

Review Internal Metrics for:

  • High Priority Thread Pool Active Count vs. Max Size: If consistently at or near maximum, the pool is saturated.
  • High Priority Thread Pool Queue Size: If this grows, tasks are waiting for threads.

Review NoSQL Performance Settings

The current settings can be viewed on the NoSQL settings page (System Settings or the module's dedicated page). Key values:

  • Batch process manager period
  • Batch write behind spawn threshold
  • Max batch write behind tasks
  • Maximum and minimum batch size
  • Minimum time to wait before flushing a small batch

Solutions

Solution 1: Upgrade Disk I/O Performance

This is the most impactful solution for I/O-bound systems:

  • Replace HDD with SSD: Solid-state drives provide 10-100x the random write performance of traditional hard drives.
  • Use local storage: Avoid network-attached storage for the NoSQL data directory if possible.
  • Separate the NoSQL directory: Move the NoSQL data to a dedicated high-performance disk:
    # In mango.properties
    db.nosql.location=/mnt/fast-ssd/mangoTSDB
  • Use a RAID array with battery-backed write cache for enterprise deployments.

Solution 2: Reduce Data Volume

  • Switch from "All data" to "When point value changes" logging for points that do not need every single value recorded. This can reduce write volume by 90% or more for slowly changing points.
  • Use "Interval" logging to log at a fixed rate (e.g., every 60 seconds) independent of the polling rate.
  • Add tolerance (deadband) to "When point value changes" logging for numeric points. A tolerance of 0.1 on a temperature sensor prevents logging noise within 0.1 degrees.
  • Disable logging for points that are only needed for real-time display or event detection.
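The effect of a tolerance (deadband) on write volume is easy to see in isolation. A minimal sketch of the filtering logic (illustrative only, not Mango's implementation):

```python
# Sketch of tolerance (deadband) logging for a numeric point: a value is
# logged only when it moves more than `tolerance` away from the last
# logged value, so sensor noise within the band is never written.
def deadband_filter(values, tolerance):
    logged, last = [], None
    for v in values:
        if last is None or abs(v - last) > tolerance:
            logged.append(v)
            last = v
    return logged

# Noisy temperature readings jittering within 0.1 degrees:
readings = [20.00, 20.04, 19.97, 20.02, 20.35, 20.31, 20.80]
print(deadband_filter(readings, tolerance=0.1))  # [20.0, 20.35, 20.8]
```

Seven raw samples collapse to three writes; on a slowly changing point polled every second, the reduction is far larger.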

Solution 3: Tune NoSQL Performance Settings

Navigate to the NoSQL settings page and adjust:

| Setting | Default | Recommendation for High Throughput |
| --- | --- | --- |
| Batch process manager period (ms) | 100 | Reduce to 50 for faster queue processing |
| Batch write behind spawn threshold | 100 | Reduce to 50 to create tasks sooner |
| Max batch write behind tasks | 5 | Increase to 10-20 (ensure the high-priority thread pool is large enough) |
| Maximum batch size | 500 | Increase to 1000-2000 for more efficient I/O |
| Minimum batch size | 10 | Leave at default |
| Minimum time to wait before flushing | 200 ms | Reduce to 100 ms for faster writes |
Caution: Increasing Max batch write behind tasks increases the demand on the high-priority thread pool. Ensure you have enough threads allocated; each additional write task consumes one high-priority thread.
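A quick back-of-the-envelope check of that demand before raising the setting (a sketch; the workload numbers are illustrative, and it assumes, per the caution above, one thread per concurrent write task):

```python
# Sketch: worst-case high-priority thread demand vs. pool size.
# All inputs are illustrative estimates for your own deployment.
def pool_headroom(core_size, polling_tasks, max_write_tasks, other_tasks=0):
    """Spare high-priority threads at worst case (negative = saturated)."""
    demand = polling_tasks + max_write_tasks + other_tasks
    return core_size - demand

# Pool of 50 threads, ~30 concurrent polls, 20 write-behind tasks:
print(pool_headroom(core_size=50, polling_tasks=30, max_write_tasks=20))  # 0
```

A result at or below zero means raising Max batch write behind tasks should be paired with Solution 4 (a larger thread pool).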

Solution 4: Increase the High-Priority Thread Pool

# In mango.properties
high.prio.pool.size.core=50
high.prio.pool.size.max=100

This ensures enough threads are available for both data collection and write-behind tasks.

Solution 5: Schedule Backups and Purges During Off-Peak Hours

  • Configure NoSQL backup time to a period with minimal data collection (e.g., 3:00 AM if your system has lower activity at night).
  • The SQL purge already runs at 3:05 AM by default. If both operations overlap and cause I/O contention, stagger them.
  • Consider reducing backup frequency if daily backups generate too much I/O (e.g., switch to every-other-day with incremental backups in between).

Solution 6: Distribute NoSQL Data Across Disks

The NoSQL database supports distributing data across multiple drives using symbolic links:

  1. Create a links directory inside the base NoSQL database directory.
  2. Create symbolic links inside links for shard IDs that should be stored on different drives.
  3. Click Reload links on the NoSQL settings page to apply changes.

This distributes I/O across multiple physical disks, effectively multiplying write throughput.
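The steps above can be sketched in shell. Paths here are illustrative (substitute your real NoSQL base directory and second-drive mount point), and the numeric shard-ID naming is an assumption about your shard layout:

```shell
# Sketch: point one NoSQL shard at a second physical disk via a symlink.
# NOSQL_BASE would normally be <MA_HOME>/databases/mangoTSDB; the demo
# paths below are placeholders so the commands run anywhere.
NOSQL_BASE=${NOSQL_BASE:-/tmp/mangoTSDB-demo}
FAST_DISK=${FAST_DISK:-/tmp/fast-disk-demo}

# 1. Create the links directory inside the NoSQL base directory,
#    and the target directory on the other drive.
mkdir -p "$NOSQL_BASE/links" "$FAST_DISK/438"

# 2. Link shard ID 438 to the other drive (438 is a hypothetical shard ID).
ln -sfn "$FAST_DISK/438" "$NOSQL_BASE/links/438"

# 3. In Mango, click "Reload links" on the NoSQL settings page to apply.
ls -l "$NOSQL_BASE/links"
```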

Solution 7: Move to a Faster Database Backend

For extremely high-throughput systems (hundreds of thousands of writes per second), consider:

  • Moving to a dedicated time-series database with Mango configured as a publisher.
  • Using the Persistent TCP or gRPC publisher to offload point value storage to another system.

Prevention

  • Benchmark disk I/O before deploying Mango. Use fio or dd to verify the storage can sustain the expected write rate:
    # Simple sequential write test
    dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=dsync
  • Plan logging strategy before adding large batches of points. Calculate the expected write rate: (number of points) * (logging frequency) = writes per second.
  • Monitor NoSQL queue size continuously with alarm thresholds. Set an alarm when the queue exceeds 1,000 values for more than 5 minutes.
  • Use SSD storage for any deployment with more than 1,000 actively logging data points.
  • Enable incremental NoSQL backups to reduce I/O impact compared to full backups.
  • Review performance settings after adding new data sources or significantly increasing the point count.