Sometimes it can be challenging to analyze why your database or data processing system is slow. I/O is one of the biggest factors contributing to sluggish performance. Therefore, beyond understanding the algorithms in play, it’s essential to grasp the I/O aspects as well. This post will introduce several practical strategies to reduce I/O in your system.
Understanding I/O in Data Systems
To better understand your data system and its necessary optimizations, it’s essential to consider how these systems interact with I/O.
Let’s revisit the setup from my previous blog: we have two datasets, T1 with N records and a join column k1, and T2 with M records and a join column k2 (with N > M). Assume that T1 data is stored in B1 data blocks and T2 in B2 blocks.
Any data operation—whether it’s a join, grouping, or transformation—happens in RAM. The more RAM you have, the better, particularly for in-memory database systems. The challenge lies in optimizing the transfer of information from storage to RAM and vice versa.
In most systems, RAM pages are eventually written to data blocks, which host one or more pages. Spilling RAM pages to storage ensures durability and prevents data loss in case of system crashes. Some platforms replicate these blocks to active or standby replicas, which accept, sync, and write new blocks to disk.
When blocks are read from storage into RAM, they are stored in a buffer cache. If a needed page isn’t found in the buffer cache, a page miss occurs, triggering a physical I/O operation—a key metric to track. A high rate of physical I/Os often indicates that your access pattern doesn’t align with how the data is stored (clustering factor), or the buffer cache area is too small for your system’s workloads.
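To make the page-miss mechanics concrete, here is a toy sketch (not modeled on any specific engine) of an LRU buffer cache that counts every miss as a physical I/O; the `BufferCache` class and its capacity are illustrative assumptions:

```python
from collections import OrderedDict

class BufferCache:
    """Toy LRU buffer cache: counts cache misses as physical reads."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> block contents
        self.physical_reads = 0

    def read(self, page_id):
        if page_id in self.pages:            # logical I/O: served from cache
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        self.physical_reads += 1             # page miss -> physical I/O
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict least recently used page
        self.pages[page_id] = f"block-{page_id}"
        return self.pages[page_id]

cache = BufferCache(capacity=2)
for page in [1, 2, 1, 3, 1, 2]:
    cache.read(page)
print(cache.physical_reads)  # → 4: reuse of page 1 hits the cache; evictions cause misses
```

The same access sequence against a larger capacity produces fewer physical reads, which is exactly the effect of sizing the buffer cache to the workload's working set.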
Guidelines for Minimizing Physical I/Os
The key is to minimize physical I/Os. Right? Here’s how:
1. Minimize I/O by Correct Data Modeling
Good data modeling aligns access patterns with how your data is organized in storage, reducing buffer cache misses and avoiding unnecessary physical I/Os. Two common methods for minimizing the number of scanned storage blocks are indexing and partitioning.
If the order in which you access data correlates with how it is stored, you minimize physical I/O; the clustering factor measures how well this relationship holds. For instance, if T1 and T2 are ordered by k1 and k2 respectively, a join only needs to scan through all pages once, resulting in an I/O complexity of O(B1 + B2). Conversely, if T2’s ordering has no useful correlation with T1’s, each record in T1 may require rescanning T2’s B2 blocks, in the worst case leading to O(N * B2) I/O complexity (assuming no index on T2).
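The difference in block reads can be sketched with hypothetical block counts for T1 and T2 (the `blocks`, `merge_join_block_reads`, and `nested_loop_block_reads` helpers below are illustrative counters, not real engine APIs):

```python
def blocks(records, per_block):
    """Split a record list into fixed-size storage blocks."""
    return [records[i:i + per_block] for i in range(0, len(records), per_block)]

def merge_join_block_reads(t1_blocks, t2_blocks):
    # Both inputs sorted on the join key: one pass over each side, O(B1 + B2)
    return len(t1_blocks) + len(t2_blocks)

def nested_loop_block_reads(t1_records, t2_blocks):
    # No index, no helpful ordering: rescan T2's blocks per T1 record, O(N * B2)
    return len(t1_records) * len(t2_blocks)

t1 = sorted(range(1000))           # N = 1000 records, ordered by k1
t2 = sorted(range(0, 1000, 4))     # M = 250 records, ordered by k2
b1, b2 = blocks(t1, 100), blocks(t2, 100)   # B1 = 10 blocks, B2 = 3 blocks

print(merge_join_block_reads(b1, b2))    # → 13 block reads
print(nested_loop_block_reads(t1, b2))   # → 3000 block reads
```

Even at this toy scale the gap is two orders of magnitude, and it widens linearly with N.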
2. Optimize for Random Access Patterns
For random access—like in real-time requests—use an index (e.g., a hash or B-tree) tailored to your needs. A hash index is particularly efficient for equality operations.
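A minimal sketch of why a hash index is cheap for equality lookups: it maps a key value directly to the block holding the record, so only that block needs a physical read. The `build_hash_index` helper and the sample T2 blocks are hypothetical:

```python
def build_hash_index(stored_blocks, key):
    """Map each key value to the block number holding its record."""
    index = {}
    for block_no, block in enumerate(stored_blocks):
        for record in block:
            index[record[key]] = block_no
    return index

# Hypothetical T2 stored in two blocks, keyed by k2
t2_blocks = [
    [{"k2": 1, "v": "a"}, {"k2": 2, "v": "b"}],
    [{"k2": 3, "v": "c"}, {"k2": 4, "v": "d"}],
]

idx = build_hash_index(t2_blocks, "k2")
block_no = idx[3]                  # O(1): which block holds k2 = 3?
record = next(r for r in t2_blocks[block_no] if r["k2"] == 3)
print(block_no, record["v"])       # → 1 c — one block read instead of scanning both
```

A B-tree index plays the same role but additionally supports range predicates, at the cost of O(log n) traversal per lookup.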
3. Optimize for Full Scan Access Patterns
When analyzing large datasets, partition by columns frequently used in WHERE or GROUP BY clauses (e.g., WHERE state IN ('CA', 'NY'), GROUP BY user_id). This approach helps prune data and reduces unnecessary block scans.
Sometimes, you may need to replicate and structure datasets differently to support various access patterns. For example, one dataset could be partitioned by user_id while another by state.
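A toy illustration of partition pruning, assuming a hypothetical `partition_by` helper and a tiny users dataset:

```python
from collections import defaultdict

def partition_by(records, column):
    """Group records into one partition per distinct column value."""
    parts = defaultdict(list)
    for r in records:
        parts[r[column]].append(r)
    return dict(parts)

users = [
    {"user_id": 1, "state": "CA"}, {"user_id": 2, "state": "NY"},
    {"user_id": 3, "state": "TX"}, {"user_id": 4, "state": "CA"},
]

by_state = partition_by(users, "state")

# WHERE state IN ('CA', 'NY'): only the matching partitions are scanned
scanned = [r for s in ("CA", "NY") for r in by_state[s]]
pruned = set(by_state) - {"CA", "NY"}
print(len(scanned), pruned)  # → 3 rows read; the TX partition is skipped entirely
```

In a real engine the pruning happens at planning time: partitions whose metadata cannot match the predicate are never read from storage at all.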
Consider using column stores for analytical purposes, as they store data from different columns in separate blocks, aiding in compression and reducing the number of blocks needed for each data retrieval.
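A small sketch contrasting the blocks a single-column aggregate must touch under row and column layouts (both layouts here are simplified illustrations, with one list standing in for each block):

```python
# Row store: each block holds whole records; a single-column scan reads all blocks.
row_blocks = [
    [("u1", "CA", 10), ("u2", "NY", 20)],
    [("u3", "TX", 30), ("u4", "CA", 40)],
]

# Column store: each column lives in its own block(s).
col_blocks = {
    "user_id": ["u1", "u2", "u3", "u4"],
    "state":   ["CA", "NY", "TX", "CA"],
    "amount":  [10, 20, 30, 40],
}

# SELECT SUM(amount): the row layout touches 2 blocks, the column layout only 1.
row_reads = len(row_blocks)
col_reads = 1  # just the "amount" block
total = sum(amount for block in row_blocks for (_, _, amount) in block)
print(total, row_reads, col_reads)  # → 100 2 1
```

Storing a single column contiguously also means neighboring values are similar, which is why column stores compress so well.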
4. Be Aware of Additional I/O Due to Indexes
If an index is non-clustered, it exists separately from the dataset, meaning the query engine may require multiple physical I/Os—first to access the index and then to retrieve the actual data record. Inserting and modifying records also necessitates index updates, which result in additional I/O.
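A sketch of that double fetch, counting every block access as one I/O (the `read_block` helper and the block layouts are illustrative):

```python
io_counter = {"physical": 0}

def read_block(store, block_no):
    io_counter["physical"] += 1   # every block fetch counts as one I/O
    return store[block_no]

# Non-clustered index: entries live apart from the data and point into heap blocks
index_blocks = [[{"key": 42, "heap_block": 1, "slot": 0},
                 {"key": 7,  "heap_block": 0, "slot": 0}]]
heap_blocks = [[("row-7",)], [("row-42",)]]

entry = next(e for e in read_block(index_blocks, 0) if e["key"] == 42)
row = read_block(heap_blocks, entry["heap_block"])[entry["slot"]]
print(row, io_counter["physical"])  # → ('row-42',) 2 — one I/O for the index, one for the row
```

With a clustered index the leaf level is the data itself, so the second fetch disappears; the trade-off is that only one clustered order can exist per table.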
5. Apply Bulk Operations
Many data systems allow bulk loading of data directly into data blocks in one transaction, which is far more efficient than inserting records one by one. In many systems, bulk operations don’t involve the buffer cache at all.
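Using SQLite as a stand-in, a sketch comparing row-at-a-time inserts (one commit per row) with a batched insert inside a single transaction; timings will vary by system, and the key point is one commit instead of thousands:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (k1 INTEGER, payload TEXT)")
rows = [(i, f"row-{i}") for i in range(10_000)]

# One-by-one: each insert is its own transaction and commit
t0 = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO t1 VALUES (?, ?)", row)
    conn.commit()
one_by_one = time.perf_counter() - t0

conn.execute("DELETE FROM t1")
conn.commit()

# Bulk: one batched statement, one transaction, one commit
t0 = time.perf_counter()
conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)
conn.commit()
bulk = time.perf_counter() - t0

count = conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]
print(count)  # → 10000
```

On a disk-backed database the gap is far larger than in this in-memory demo, because every per-row commit can force a synchronous write.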
6. Compaction Operations
Compaction (or VACUUM) defragments data and compacts it across fewer blocks, decreasing I/O when accessing data. However, be mindful that compaction can be CPU and I/O intensive and should be scheduled wisely.
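SQLite's VACUUM makes a concrete example: after deleting half the rows, the file still occupies the same number of blocks until it is compacted (the temp path and row sizes here are illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("BEGIN")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "x" * 500) for i in range(5000)])
conn.execute("COMMIT")

conn.execute("DELETE FROM t WHERE id % 2 = 0")  # leaves holes in the data pages
before = os.path.getsize(path)

conn.execute("VACUUM")   # rewrite the file, packing live rows into fewer pages
after = os.path.getsize(path)
print(after < before)    # → True: the compacted file occupies fewer blocks
```

Running this during peak load would compete with the workload for CPU and I/O, which is why compaction belongs in a maintenance window or a throttled background job.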
7. Compression
Similar to compaction, compression reduces the size of data stored in blocks, minimizing I/O and network usage. However, be aware that this may require additional computational resources to decompress data.
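A small sketch with zlib showing how compression shrinks the number of blocks a dataset occupies; the 8 KB block size and the sample rows are assumptions for illustration:

```python
import zlib

BLOCK_SIZE = 8192
rows = "".join(f"user_id={i},state=CA,amount=100\n" for i in range(2000))
raw = rows.encode()

compressed = zlib.compress(raw, level=6)

raw_blocks = -(-len(raw) // BLOCK_SIZE)                # ceiling division
compressed_blocks = -(-len(compressed) // BLOCK_SIZE)
print(raw_blocks, compressed_blocks)  # repetitive data compresses to far fewer blocks

# The trade-off: CPU spent decompressing on every read
assert zlib.decompress(compressed) == raw
```

Fewer blocks per read means fewer physical I/Os and less network transfer for replicated or remote storage, so the extra CPU is usually a good trade on I/O-bound systems.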
