Rewrite descriptions of some of the sorting algorithms for clarity

nessex · Nov 17, 2023 · c1c1389
1 parent 28ca257

Showing 6 changed files with 42 additions and 17 deletions.
diff --git a/src/sorts/comparative_sort.rs b/src/sorts/comparative_sort.rs
@@ -2,6 +2,13 @@
 //! whole numbers to support all the same use-cases as the original radix sort including
 //! sorting across multiple keys or partial keys etc.
 //!
+//! The purpose of this sort is to ensure that the library can provide a simpler interface. Without
+//! this sort, users would have to implement both `RadixKey` for the radix sort, _and_ `Ord` for
+//! the comparison sort. With this, only `RadixKey` is required.
+//!
+//! While this sort is generally slow, it is still faster than setting up a full radix sort
+//! in situations where there are very few items.
+//!
 //! ## Characteristics
 //!
 //!  * in-place
@@ -12,7 +19,7 @@
 //!
 //! This is even slower than a typical comparison sort and so is only used as a fallback for very
 //! small inputs. However for those very small inputs it provides a significant speed-up due to
-//! having essentially no overhead.
+//! having essentially no overhead (from count arrays, buffers etc.) compared to a radix sort.
 
 use crate::sorter::Sorter;
 use crate::RadixKey;
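
Reviewer note: the point about only needing `RadixKey` can be illustrated with a minimal sketch. The `Key` trait below is a hypothetical stand-in defined locally so the example is self-contained, and the insertion sort is an assumption chosen for brevity; it is not necessarily how `comparative_sort` is implemented.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a radix key trait (an assumption for this sketch):
/// a value exposes `LEVELS` bytes, with level 0 being the least significant.
trait Key {
    const LEVELS: usize;
    fn get_level(&self, level: usize) -> u8;
}

impl Key for u32 {
    const LEVELS: usize = 4;

    fn get_level(&self, level: usize) -> u8 {
        (*self >> (level * 8)) as u8
    }
}

/// Compare two keys byte by byte, most significant level first.
fn cmp_by_levels<T: Key>(a: &T, b: &T) -> Ordering {
    for level in (0..T::LEVELS).rev() {
        match a.get_level(level).cmp(&b.get_level(level)) {
            Ordering::Equal => continue,
            other => return other,
        }
    }
    Ordering::Equal
}

/// In-place insertion sort that only needs `Key`, not `Ord`.
/// It allocates no count arrays or buffers, which is why this kind of
/// sort can win on very small inputs despite its quadratic worst case.
fn comparative_sort_sketch<T: Key>(data: &mut [T]) {
    for i in 1..data.len() {
        let mut j = i;
        while j > 0 && cmp_by_levels(&data[j - 1], &data[j]) == Ordering::Greater {
            data.swap(j - 1, j);
            j -= 1;
        }
    }
}
```

Because the comparison is derived from the key levels, a type that implements the key trait needs no separate `Ord` implementation to be sorted this way.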

diff --git a/src/sorts/out_of_place_sort.rs b/src/sorts/out_of_place_sort.rs
@@ -7,7 +7,7 @@
 //! ### Standard out_of_place_sort
 //!
 //! This implementation is a very simple out-of-place counting sort. The only notable optimization
-//! is to process data in chunks to take some advantage of multiple execution ports in the CPU.
+//! is to process data in chunks to take some advantage of multiple execution ports in each CPU core.
 //!
 //! ### out_of_place_sort_with_counts
 //!
@@ -27,15 +27,15 @@
 //! the stable ordering of values.
 //!
 //! This provides a significant performance benefit when there are many identical values as
-//! typically a pair of identical would prevent the CPU from using multiple execution ports. With
-//! this variant however, the CPU can safely and independently work on two identical values at the
+//! typically a pair of identical values would prevent the CPU from using multiple execution ports.
+//! With this variant, however, the CPU can safely and independently work on two identical values at the
 //! same time as there is no overlapping variable access in either the output array or the prefix
 //! sums array.
 //!
 //! ### lr_out_of_place_sort_with_counts
 //!
-//! As with the other variants, this combines the left-right optimization with counting the next
-//! level.
+//! As with the other with_counts variant, this combines the left-right optimization with counting
+//! the next level.
 //!
 //! ## Characteristics
 //!
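
Reviewer note: a minimal sketch of the left-right idea described in this file, assuming `u32` values and a single byte per level. The function name and the exact loop shape are hypothetical; the actual implementation also processes data in chunks, which is omitted here.

```rust
/// One out-of-place counting pass over a single byte of each `u32`,
/// using a separate left and right write head for every bucket.
fn lr_counting_pass(src: &[u32], dst: &mut [u32], level: usize) {
    assert_eq!(src.len(), dst.len());
    let byte = |v: u32| ((v >> (level * 8)) & 0xff) as usize;

    // Histogram of the chosen byte.
    let mut counts = [0usize; 256];
    for &v in src {
        counts[byte(v)] += 1;
    }

    // `starts[b]` is the first slot of bucket `b`;
    // `ends[b]` is one past the last slot of bucket `b`.
    let mut starts = [0usize; 256];
    let mut ends = [0usize; 256];
    let mut running = 0;
    for b in 0..256 {
        starts[b] = running;
        running += counts[b];
        ends[b] = running;
    }

    // Walk the input from both ends. Items taken from the front fill their
    // bucket forwards and items taken from the back fill it backwards, so
    // two identical values never touch the same slot or the same counter,
    // and the relative order of equal bytes is preserved.
    let (mut i, mut j) = (0, src.len());
    while i < j {
        let left = src[i];
        let lb = byte(left);
        dst[starts[lb]] = left;
        starts[lb] += 1;
        i += 1;

        if i == j {
            break;
        }

        j -= 1;
        let right = src[j];
        let rb = byte(right);
        ends[rb] -= 1;
        dst[ends[rb]] = right;
    }
}
```

The two writes in each loop iteration use different arrays (`starts` and `ends`) and different output slots, which is the property the doc comment relies on for using multiple execution ports.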

diff --git a/src/sorts/recombinating_sort.rs b/src/sorts/recombinating_sort.rs
@@ -1,12 +1,15 @@
 //! `recombinating_sort` is a multi-threaded, out-of-place, unstable radix sort unique to rdst. It
-//! operates on a set of tiles, which aresub-sections of the original data of roughly the same size.
+//! operates on a set of tiles, which are sub-sections of the original data of roughly the same size.
 //!
 //! It works by:
 //!  1. Sorting each tile out-of-place into a temp array
 //!  2. Calculating prefix sums of each tile
 //!  3. Splitting the output array based upon the aggregated counts of all tiles
 //!  4. Writing out the final data for each global count ("country" in regions sort terminology) in parallel
 //!
+//! Because each thread operates on separate tiles, and then separate output buckets, this is parallel from start to finish.
+//! The intermediate tiles mean this requires 2n memory relative to the input, plus some memory for each set of counts, and incurs two copies for each item.
+//!
 //! ## Characteristics
 //!
 //!  * out-of-place
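
Reviewer note: the four steps can be sketched sequentially as below, assuming `u32` values and one byte per level. The function name and `tile_size` parameter are hypothetical, and the loops that would run in parallel (one thread per tile, then one per output bucket) are written as plain loops to keep the sketch short.

```rust
/// Single-threaded sketch of one recombinating pass over one byte of each value.
fn recombinating_pass_sketch(data: &mut [u32], level: usize, tile_size: usize) {
    assert!(tile_size > 0);
    let byte = |v: u32| ((v >> (level * 8)) & 0xff) as usize;

    // The temp array is the second `n` of the 2n memory requirement.
    let mut tmp = vec![0u32; data.len()];
    let mut tile_counts: Vec<[usize; 256]> = Vec::new();

    // 1. Sort each tile out-of-place into the temp array (first copy).
    for (src, dst) in data.chunks(tile_size).zip(tmp.chunks_mut(tile_size)) {
        let mut counts = [0usize; 256];
        for &v in src {
            counts[byte(v)] += 1;
        }
        let mut heads = [0usize; 256];
        let mut running = 0;
        for b in 0..256 {
            heads[b] = running;
            running += counts[b];
        }
        for &v in src {
            let b = byte(v);
            dst[heads[b]] = v;
            heads[b] += 1;
        }
        tile_counts.push(counts);
    }

    // 2. Per-tile prefix sums, kept as a running read offset into each tile.
    let mut tile_offsets: Vec<usize> =
        (0..tile_counts.len()).map(|t| t * tile_size).collect();

    // 3 and 4. Walk the global buckets in order; each bucket's slice of the
    // output is filled from the matching range of every tile (second copy).
    let mut out = 0;
    for b in 0..256 {
        for (t, counts) in tile_counts.iter().enumerate() {
            let start = tile_offsets[t];
            let len = counts[b];
            data[out..out + len].copy_from_slice(&tmp[start..start + len]);
            tile_offsets[t] += len;
            out += len;
        }
    }
}
```

Each item is copied once into `tmp` and once back into `data`, matching the "two copies for each item" cost noted in the doc comment.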

diff --git a/src/sorts/regions_sort.rs b/src/sorts/regions_sort.rs
@@ -35,8 +35,8 @@
 //!
 //! ## Notes
 //!
-//! This may not be entirely the same as the algorithm described by the research paper. Some things
-//! did not seem to matter, and have been omitted for performance reasons.
+//! This may not be entirely the same as the algorithm described by the research paper. Some steps
+//! did not seem to provide any value, and have been omitted for performance reasons.
 
 use crate::sorter::Sorter;
 use crate::sorts::ska_sort::ska_sort;

diff --git a/src/sorts/scanning_sort.rs b/src/sorts/scanning_sort.rs
@@ -1,10 +1,25 @@
-//! `scanning_sort` is a custom algorithm for rdst which is a multi-threaded, MSB first radix sort.
+//! `scanning_sort` is a custom algorithm for rdst. It is a multi-threaded, MSB first radix sort.
 //!
-//! Scanning sort works by scanning over the output buckets, picking up data that shouldn't be there
-//! and putting it in a per-thread temporary store. It then writes any appropriate data it currently
-//! holds in that thread-local store to the current output bucket. After that, the thread moves on
-//! to the next available bucket (each one is mutex locked) and repeats the process until all output
-//! buckets are completely filled with the correct data.
+//! Scanning sort works by:
+//!
+//!  1. Chunking the input array into buckets based on the counts for this level
+//!  2. Creating a worker for each rayon global thread pool thread (roughly, one per core)
+//!  3. Creating a temporary thread-local buffer for each worker (one vec for each radix)
+//!  4. Having each worker:
+//!  4.1. Iterate over the buckets, trying to gain a mutex lock on one
+//!  4.2. On first lock of a bucket, partition it into [correct data | incorrect data] in-place
+//!  4.3. Scan over the contents of the bucket, picking up data that shouldn't be there and putting it in the thread-local buffer
+//!  4.4. Write any buffered contents that _should_ be in this bucket, into the bucket
+//!  4.5. Repeat from 4.1 until all buckets are completely filled with the correct data
+//!
+//! Along the way, each output bucket has a read head and a write head, which point to the latest content read and written respectively.
+//! When the read head reaches the end of the bucket, there is no more content to be buffered by any worker.
+//! When the write head reaches the end of the bucket, that bucket contains all data that should be there, and is marked completed.
+//! Once there are no more buckets that can be locked by the worker (all remaining buckets are locked), each worker exits.
+//! Once all buckets are completed, and all workers have exited, the sort is finished.
+//!
+//! Each thread-local buffer can hold up to 128 values for each radix, or 32,768 values in total (128 * 256 radix values). There's one buffer per thread, so the total amount of memory can add up to quite a lot.
+//! 128 values was chosen based upon performance numbers from benchmarking, and is not currently configurable.
 //!
 //! ## Characteristics
 //!
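
Reviewer note: the buffer sizing can be checked with a little arithmetic. The constant and function names below are hypothetical, and the estimate assumes 256 radix values per level (which is what 32,768 / 128 implies) and counts only the buffered values themselves, not any `Vec` bookkeeping.

```rust
/// Values buffered per radix, as stated in the doc comment.
const VALUES_PER_RADIX: usize = 128;
/// Assumption for this sketch: one byte per level, so 256 radix values.
const RADIX_VALUES: usize = 256;

/// Maximum number of values one worker's thread-local buffer can hold.
fn buffer_capacity_per_thread() -> usize {
    VALUES_PER_RADIX * RADIX_VALUES
}

/// Rough upper bound, in bytes, on buffered values across all workers
/// when sorting items of type `T`.
fn buffer_bytes<T>(threads: usize) -> usize {
    threads * buffer_capacity_per_thread() * std::mem::size_of::<T>()
}
```

For example, under these assumptions 16 workers sorting `u64` values could buffer up to `buffer_bytes::<u64>(16)` bytes, which works out to 4 MiB.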

diff --git a/src/sorts/ska_sort.rs b/src/sorts/ska_sort.rs
@@ -17,8 +17,8 @@
 //!
 //! ## Performance
 //!
-//! This is generally slower than `lsb_sort` for smaller inputs, but for larger inputs the memory
-//! efficiency of this algorithm makes it take the lead.
+//! This is generally slower than `lsb_sort` for smaller types T or smaller input arrays. For larger
+//! types or inputs, the memory efficiency of this algorithm can make it faster than `lsb_sort`.
 
 use crate::sorter::Sorter;
 use crate::utils::*;