NoSQL Task Queue Full
The MangoNoSQL point value store uses a background batch write-behind system to efficiently persist data to disk. When the system receives point values faster than it can write them, the internal queue fills up, resulting in "NoSQL Data Lost" events and potential point value data loss. This is a recurring topic in Mango deployments with high-throughput data collection.
Symptoms
- "NoSQL Data Lost" events appear in the alarm system. These are raised whenever a batch write fails to persist data to disk.
- Point value history shows gaps or missing data for specific time periods.
- The Internal Metrics data source shows a steadily increasing write-behind queue size that never fully drains.
- The Mango log contains warnings about batch write failures or queue overflow.
- System performance degrades as the queue grows, consuming memory and thread resources.
- Data Lost events occur in bursts, often correlating with periods of high data ingestion or disk I/O contention.
Common Causes
1. Disk I/O Bottleneck
The most common cause. The underlying storage cannot write data fast enough to keep up with the incoming point value rate. This is especially common with:
- Traditional spinning hard drives (HDD) under heavy write load.
- Network-attached storage (NAS) or SAN with high latency.
- Virtual machine storage that shares I/O with other VMs.
- SD cards or eMMC storage on embedded devices (e.g., BeagleBone or Raspberry Pi).
2. Too Many Data Points Logging at High Frequency
A system with thousands of data points all configured to log every value (using "All data" logging) at a fast polling rate generates an enormous volume of writes. For example, 10,000 points polling every second with "All data" logging produces 10,000 writes per second.
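As a rough sizing sketch of the numbers above (the point count and poll interval are illustrative, not Mango defaults):

```shell
# Back-of-the-envelope write rate for "All data" logging.
# points and poll_interval_s are example values.
points=10000
poll_interval_s=1
writes_per_sec=$((points / poll_interval_s))
echo "${writes_per_sec} writes/sec"   # prints "10000 writes/sec"
```

For comparison, a consumer HDD sustains only a few hundred small random writes per second, which is why batching and fast storage matter at this scale.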
3. Batch Write Configuration Too Conservative
The default NoSQL performance settings are designed for moderate workloads. High-throughput systems may need more aggressive write-behind settings (more tasks, larger batches, shorter flush intervals).
4. High-Priority Thread Pool Exhaustion
Batch write-behind tasks run in the high-priority thread pool. If this pool is too small or is saturated by other tasks (data source polling, event processing), write tasks cannot execute, and the queue fills up.
5. Backup or Purge Operations Competing for I/O
NoSQL backup operations (compressing shard files into zip archives) and purge operations (deleting old shard data) generate significant I/O. If these operations coincide with peak data collection, the combined I/O load can overwhelm the disk.
6. Database Corruption
Corrupted NoSQL shard files can cause write failures that prevent the queue from draining. See Database Corruption Recovery for corruption-specific guidance.
Diagnosis
Check for Data Lost Events
Review the Events page filtered to system events or search the log:
```
grep -i "NoSQL\|data lost\|batch write\|queue.*full" <MA_HOME>/logs/ma.log
```
Monitor Queue Size via Internal Metrics
Enable the Internal Metrics data source and add points for:
- NoSQL Write-Behind Queue Size: Number of point values waiting to be written. This should normally be near zero and drain quickly after bursts.
- NoSQL Batch Write Tasks Active: Number of currently running write-behind tasks.
Chart these over time. A healthy system shows brief spikes that drain quickly. An unhealthy system shows a queue that grows steadily or never returns to zero.
Check Disk I/O Performance
```
# Monitor disk I/O in real time
iostat -x 2 10
# Watch for:
# - %util near 100% (disk fully saturated)
# - await > 10ms (high latency per I/O operation)
# - w/s very high (many writes per second)

# Check the specific disk where NoSQL data is stored
df -h <MA_HOME>/databases/mangoTSDB/
```
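If `iostat` is unavailable, a crude spot check with `dd` can expose slow synchronous writes. This is only a sketch: the path is a stand-in (run it against the disk that actually holds the NoSQL data), and `oflag=dsync` flushes every block to disk, roughly approximating the many small flushed writes the write-behind tasks perform.

```shell
# Write 1,000 x 4 KiB blocks, flushing each one (GNU dd, Linux).
# /tmp is a stand-in; point this at the NoSQL data disk in practice.
testfile=/tmp/nosql-iotest.$$
dd if=/dev/zero of="$testfile" bs=4k count=1000 oflag=dsync
rm -f "$testfile"
```

On an SSD this typically finishes in a second or two; if it takes tens of seconds, the storage is unlikely to sustain a heavy write-behind load.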
Check Thread Pool Status
Review Internal Metrics for:
- High Priority Thread Pool Active Count vs. Max Size: If consistently at or near maximum, the pool is saturated.
- High Priority Thread Pool Queue Size: If this grows, tasks are waiting for threads.
Review NoSQL Performance Settings
The current settings can be viewed on the NoSQL settings page (System Settings or the module's dedicated page). Key values:
- Batch process manager period
- Batch write behind spawn threshold
- Max batch write behind tasks
- Maximum and minimum batch size
- Minimum time to wait before flushing a small batch
Solutions
Solution 1: Upgrade Disk I/O Performance
This is the most impactful solution for I/O-bound systems:
- Replace HDD with SSD: Solid-state drives provide 10-100x the random write performance of traditional hard drives.
- Use local storage: Avoid network-attached storage for the NoSQL data directory if possible.
- Separate the NoSQL directory: Move the NoSQL data to a dedicated high-performance disk:

```
# In mango.properties
db.nosql.location=/mnt/fast-ssd/mangoTSDB
```

- Use a RAID array with battery-backed write cache for enterprise deployments.
Solution 2: Reduce Data Volume
- Switch from "All data" to "When point value changes" logging for points that do not need every single value recorded. This can reduce write volume by 90% or more for slowly changing points.
- Use "Interval" logging to log at a fixed rate (e.g., every 60 seconds) independent of the polling rate.
- Add tolerance (deadband) to "When point value changes" logging for numeric points. A tolerance of 0.1 on a temperature sensor prevents logging noise within 0.1 degrees.
- Disable logging for points that are only needed for real-time display or event detection.
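To see the effect of interval logging, take the 10,000-point example from earlier and log on a 60-second interval instead of every 1-second poll (integer math; the numbers are illustrative):

```shell
# Same 10,000 points, logged every 60 s instead of every 1 s poll.
points=10000
interval_s=60
interval_writes=$((points / interval_s))
echo "${interval_writes} writes/sec instead of ${points}"   # prints "166 writes/sec instead of 10000"
```

That is roughly a 60x reduction in write volume for the same set of points.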
Solution 3: Tune NoSQL Performance Settings
Navigate to the NoSQL settings page and adjust:
| Setting | Default | Recommendation for High Throughput |
|---|---|---|
| Batch process manager period (ms) | 100 | Reduce to 50 for faster queue processing |
| Batch write behind spawn threshold | 100 | Reduce to 50 to create tasks sooner |
| Max batch write behind tasks | 5 | Increase to 10-20 (ensure high-priority thread pool is large enough) |
| Maximum batch size | 500 | Increase to 1000-2000 for more efficient I/O |
| Minimum batch size | 10 | Leave at default |
| Minimum time to wait before flushing | 200ms | Reduce to 100ms for faster writes |
Increasing Max batch write behind tasks increases the demand on the high-priority thread pool. Ensure you have enough threads allocated. Each additional write task consumes one high-priority thread.
Solution 4: Increase the High-Priority Thread Pool
```
# In mango.properties
high.prio.pool.size.core=50
high.prio.pool.size.max=100
```
This ensures enough threads are available for both data collection and write-behind tasks.
Solution 5: Schedule Backups and Purges During Off-Peak Hours
- Configure NoSQL backup time to a period with minimal data collection (e.g., 3:00 AM if your system has lower activity at night).
- The SQL purge already runs at 3:05 AM by default. If both operations overlap and cause I/O contention, stagger them.
- Consider reducing backup frequency if daily backups generate too much I/O (e.g., switch to every-other-day with incremental backups in between).
Solution 6: Distribute NoSQL Data Across Disks
The NoSQL database supports distributing data across multiple drives using symbolic links:
- Create a `links` directory inside the base NoSQL database directory.
- Create symbolic links inside `links` for shard IDs that should be stored on different drives.
- Click Reload links on the NoSQL settings page to apply changes.
This distributes I/O across multiple physical disks, effectively multiplying write throughput.
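The steps above can be sketched as follows. All paths and the shard ID are examples run under `/tmp` for demonstration; in production the base directory would be `<MA_HOME>/databases/mangoTSDB` and the link target would sit on a different physical drive.

```shell
# Demonstration under /tmp; substitute real mount points in production.
base=/tmp/mangoTSDB-demo            # stand-in for the NoSQL base directory
alt=/tmp/second-drive/mangoTSDB     # stand-in for a directory on another drive
mkdir -p "$base/links" "$alt/451"   # 451 is an example shard ID
ln -sfn "$alt/451" "$base/links/451"
ls -l "$base/links"                 # shows the 451 link and its target
# After creating links, click "Reload links" on the NoSQL settings page.
```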
Solution 7: Move to a Faster Database Backend
For extremely high-throughput systems (hundreds of thousands of writes per second), consider:
- Moving to a dedicated time-series database with Mango configured as a publisher.
- Using the Persistent TCP or gRPC publisher to offload point value storage to another system.
Prevention
- Benchmark disk I/O before deploying Mango. Use `fio` or `dd` to verify the storage can sustain the expected write rate:

```
# Simple sequential write test
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=dsync
```

- Plan logging strategy before adding large batches of points. Calculate the expected write rate: (number of points) * (logging frequency) = writes per second.
- Monitor NoSQL queue size continuously with alarm thresholds. Set an alarm when the queue exceeds 1,000 values for more than 5 minutes.
- Use SSD storage for any deployment with more than 1,000 actively logging data points.
- Enable incremental NoSQL backups to reduce I/O impact compared to full backups.
- Review performance settings after adding new data sources or significantly increasing the point count.
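For the benchmarking bullet above, a random-write `fio` run models the small-write pattern of the write-behind system more closely than a sequential `dd` test. This is a sketch: the flags are standard fio options, but the file path, size, and runtime are example values to adjust for your disk.

```shell
# 30-second 4 KiB random-write test (requires fio; Linux).
# Point --filename at the disk that will hold the NoSQL data.
fio --name=nosql-writetest \
    --filename=/tmp/fio-testfile \
    --rw=randwrite --bs=4k --size=256m \
    --ioengine=psync \
    --runtime=30 --time_based
rm -f /tmp/fio-testfile
```

Compare the reported write IOPS against the write rate you calculated for your point configuration; sustained IOPS below that rate means the queue will eventually fill.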
Related Pages
- High CPU Usage — CPU bottlenecks that can slow down the NoSQL write-behind queue
- Out of Memory — Memory issues that may be caused by large queued data backlogs
- Data Source Performance — Tuning poll intervals and throughput to prevent queue overflow
- Internal Data Source — Monitor NoSQL queue depth and write throughput metrics