This repository implements a parallel sum algorithm across multiple programming languages (C++, Go, Haskell, and Python), demonstrating various parallel programming paradigms and their performance characteristics in a controlled environment.
- Platform: Linux (WSL2) x86_64
- Processor: 6 physical cores, 12 logical cores
- Memory: 7.61 GB
- Compiler/Runtime Versions:
  - GCC (C++17)
  - Go 1.11+
  - GHC 8.0+
  - Python 3.8.10
- Uses RAII principles with `std::thread` and modern C++ features
- Leverages `std::accumulate` for each thread's partial reduction (auto-vectorizable under `-O3`)
- Key optimization:

```cpp
partial_sum = std::accumulate(numbers.begin() + start_index, numbers.begin() + end_index, 0LL);
```

- Thread-safe design through data partitioning
- Performance: Best mean execution time (167.33 ms ± 121.64 ms)
- Utilizes goroutines and channels for communication
- Efficient memory management through Go's runtime
- Performance: Second-best mean execution time (203.35 ms ± 180.57 ms)
- Employs Software Transactional Memory (STM)
- Pure functional approach with MVars for synchronization
- Performance: Third-best mean execution time (745.29 ms ± 1132.93 ms)
- GIL-constrained threading model
- NumPy-optimized operations where possible
- Performance: Baseline reference (2446.33 ms ± 3875.45 ms)
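A minimal sketch of this model, assuming the `SumThread(numbers, start_idx, end_idx)` shape shown in the fragment later in this document; the class body and the NumPy-backed summation are assumptions, not the repository's exact code:

```python
import threading

import numpy as np

class SumThread(threading.Thread):
    """Sums one contiguous slice of the input array."""

    def __init__(self, numbers, start_idx, end_idx):
        super().__init__()
        self.numbers, self.start_idx, self.end_idx = numbers, start_idx, end_idx
        self.partial = 0

    def run(self):
        # np.sum releases the GIL for large arrays, so slices can overlap;
        # a pure-Python loop here would be fully serialized by the GIL.
        self.partial = int(np.sum(self.numbers[self.start_idx:self.end_idx]))

def parallel_sum(numbers, num_threads):
    chunk = len(numbers) // num_threads
    threads = [SumThread(numbers, i * chunk,
                         len(numbers) if i == num_threads - 1 else (i + 1) * chunk)
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(t.partial for t in threads)
```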
- C++: Lowest variability (coefficient of variation, CV: 72.45%)
- Go: Moderate variability (CV: 88.50%)
- Haskell: High variability (CV: 151.51%)
- Python: Highest variability (CV: 157.89%)
- ANOVA results: F(3, 596) = 42.04, p = 1.18e-24
- Tukey HSD Analysis:
- C++ vs Python: Significant (p < 0.001)
- Go vs Python: Significant (p < 0.001)
- Haskell vs Python: Significant (p < 0.001)
- C++ vs Go: Non-significant (p = 0.9987)
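A hedged sketch of how these figures can be reproduced with SciPy; the lognormal samples below are placeholders for the 150 recorded runs per language:

```python
import numpy as np
from scipy import stats

# Placeholder data: substitute the real 150 timings per language.
rng = np.random.default_rng(0)
timings_ms = {lang: rng.lognormal(5.0, 1.0, 150)
              for lang in ("cpp", "go", "haskell", "python")}

samples = list(timings_ms.values())
f_stat, p_value = stats.f_oneway(*samples)  # one-way ANOVA across the four groups
df_between = len(samples) - 1
df_within = sum(len(s) for s in samples) - len(samples)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.3g}")

# Pairwise Tukey HSD (SciPy >= 1.8); entry (i, j) compares group i vs group j.
print(stats.tukey_hsd(*samples))
```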
- Small datasets:
  - All languages perform similarly
  - Mean execution times within a 5% range
  - Low coefficient of variation (<2%)
- Medium datasets:
  - C++ and Go maintain consistent performance
  - Python shows linear degradation
  - Haskell begins showing increased variance
- Large datasets (n = 25M):
  - C++: Best scaling (493.23 ms)
  - Go: Linear scaling (701.25 ms)
  - Haskell: Superlinear growth (3885.12 ms)
  - Python: Significant degradation (12866.13 ms)
- C++: Lowest memory footprint (0.49% mean usage)
- Go: Efficient garbage collection (0.67% mean usage)
- Haskell: Higher memory overhead (3.28% mean usage)
- Python: Significant memory usage (2.84% mean usage)
- Linear scaling up to physical core count (6); a thread-sweep sketch follows this list
- Diminishing returns beyond logical core count (12)
- False sharing effects visible in large datasets
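A quick sweep to reproduce this behaviour; `chunked_sum` is illustrative rather than the repository's harness, and the thread counts bracket the test machine's 6 physical / 12 logical cores:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

data = np.random.default_rng(0).integers(0, 100, 25_000_000, dtype=np.int64)

def chunked_sum(workers):
    # Split the array into contiguous chunks, one per worker thread.
    bounds = np.linspace(0, len(data), workers + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda i: int(data[bounds[i]:bounds[i + 1]].sum()),
                         range(workers))
        return sum(parts)

for workers in (1, 2, 4, 6, 8, 12, 16):
    start = time.perf_counter()
    chunked_sum(workers)
    print(f"{workers:2d} threads: {(time.perf_counter() - start) * 1e3:7.2f} ms")
```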
- Cache effects visible as performance jumps at each boundary (a size-sweep sketch follows):
  - L1 cache boundary (~32 KB)
  - L2 cache boundary (~256 KB)
  - L3 cache boundary (~12 MB)
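A rough probe of these boundaries, an uncalibrated sketch rather than the harness: single-threaded `np.sum` throughput should step down as the working set crosses each cache level listed above.

```python
import time

import numpy as np

# Working-set sizes straddling the ~32 KB L1, ~256 KB L2, and ~12 MB L3 caches.
for kib in (16, 32, 64, 128, 256, 512, 4096, 12288, 32768):
    arr = np.ones(kib * 1024 // 8, dtype=np.int64)  # 8 bytes per element
    start = time.perf_counter()
    for _ in range(100):
        arr.sum()
    elapsed = time.perf_counter() - start
    print(f"{kib:6d} KiB: {arr.nbytes * 100 / elapsed / 2**30:6.1f} GiB/s")
```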
- Algorithmic Complexity
  - Theoretical: O(n/p), where p = thread count; e.g., n = 25M split across p = 12 threads leaves roughly 2.1M elements per thread
  - Practical: Limited by memory bandwidth
  - Cache coherency overhead significant at thread boundaries
- Memory Access Patterns
  - Sequential access benefits from hardware prefetching
  - Thread-local summation minimizes false sharing
  - NUMA effects visible in large datasets
- Synchronization Overhead
  - Minimal in C++ (final reduction only)
  - Channel-based in Go (negligible impact)
  - STM overhead in Haskell (significant at scale)
  - GIL contention in Python (major bottleneck)
- Execution Pipeline
```python
import subprocess
import psutil

def run_single_test(self, executable, size, num_threads, lang):
    cmd = [executable, str(size), str(num_threads)]  # argument order assumed
    process = psutil.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    metrics_samples = []
    while process.poll() is None:  # sample metrics until the run finishes
        metrics_samples.append(self._measure_process_metrics(process))
    return metrics_samples
```
- Real-time metrics collection
- Process isolation per test
- Resource monitoring (CPU, memory, threads)
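The metrics method itself is not shown above; a plausible shape for `_measure_process_metrics`, restricted to documented psutil calls (its exact body in the harness is an assumption):

```python
import psutil

def _measure_process_metrics(self, process):
    """Snapshot CPU, memory, and thread statistics for one benchmark process."""
    try:
        with process.oneshot():  # batch the underlying /proc reads
            return {
                # cpu_percent(interval=None) returns 0.0 on its first call
                "cpu_percent": process.cpu_percent(interval=None),
                "memory_percent": process.memory_percent(),
                "num_threads": process.num_threads(),
            }
    except (psutil.NoSuchProcess, psutil.ZombieProcess):
        return None  # the benchmark exited between samples
```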
- Compilation Strategy
  - C++: `-O3 -pthread` optimizations
  - Go: Native build with race detector
  - Haskell: `-O2 -threaded` RTS options
  - Python: Bytecode interpretation via CPython (no JIT)
- Sampling Framework
  - Sample size: 150 runs per language
  - Confidence level: 95%
  - Power analysis: power (1 − β) > 0.95
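As a sketch of the confidence-interval machinery (the normal synthetic data stands in for one language's 150 timings):

```python
import numpy as np
from scipy import stats

def mean_ci(samples_ms, confidence=0.95):
    samples_ms = np.asarray(samples_ms)
    sem = stats.sem(samples_ms)  # standard error of the mean
    return stats.t.interval(confidence, df=len(samples_ms) - 1,
                            loc=samples_ms.mean(), scale=sem)

# Placeholder: 150 synthetic runs with the C++ row's mean and std dev.
runs = np.random.default_rng(1).normal(167.33, 121.64, 150)
low, high = mean_ci(runs)
print(f"95% CI for the mean: [{low:.2f} ms, {high:.2f} ms]")
```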
- Variance Analysis

| Language | Mean (ms) | Std Dev (ms) | CV (%) |
|----------|-----------|--------------|--------|
| C++      | 167.33    | 121.64       | 72.45  |
| Go       | 203.35    | 180.57       | 88.50  |
| Haskell  | 745.29    | 1132.93      | 151.51 |
| Python   | 2446.33   | 3875.45      | 157.89 |
- Distribution Characteristics
  - Non-normal distributions (Shapiro-Wilk test):
    - C++: W = 0.5886, p = 8.92e-19
    - Go: W = 0.6002, p = 1.59e-18
    - Haskell: W = 0.6073, p = 2.27e-18
    - Python: W = 0.6411, p = 1.33e-17
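These W and p values are what `scipy.stats.shapiro` reports; a minimal check on placeholder data:

```python
import numpy as np
from scipy import stats

runs = np.random.default_rng(2).lognormal(5.0, 1.0, 150)  # placeholder timings
w_stat, p_value = stats.shapiro(runs)
print(f"W = {w_stat:.4f}, p = {p_value:.3g}"
      + (" -> reject normality" if p_value < 0.05 else ""))
```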
Size: 100,000

| Language | Mean (ms) | CV (%) |
|----------|-----------|--------|
| C++      | 103.77    | 1.29   |
| Go       | 104.60    | 1.12   |
| Haskell  | 104.08    | 1.15   |
| Python   | 104.62    | 1.25   |
- Negligible performance differences
- Cache-resident data sets
- Linear scaling with threads
| Language | Mean (ms) | CV (%) |
|----------|-----------|--------|
| C++      | 103.40    | 0.48   |
| Go       | 105.70    | 1.11   |
| Haskell  | 145.27    | 34.25  |
| Python   | 406.15    | 0.49   |
- Memory hierarchy effects become visible
- Thread synchronization overhead emerges
- GC pressure in managed languages
Size: 25,000,000

| Language | Mean (ms) | CV (%) |
|----------|-----------|--------|
| C++      | 493.23    | 16.59  |
| Go       | 701.25    | 12.38  |
| Haskell  | 3885.12   | 17.08  |
| Python   | 12866.13  | 22.14  |
- Memory bandwidth saturation
- NUMA effects dominant
- Garbage collection overhead significant
- Thread-local accumulation
- Cache-line alignment
- NUMA-aware thread pinning
- False sharing mitigation
```cpp
// C++: Lock-free accumulation
std::vector<std::thread> threads;
std::vector<SumThread> sum_threads;
```

```haskell
-- Haskell: MVar-based synchronization
results <- replicateM numWorkers newEmptyMVar
```

```python
# Python: Thread pooling
thread = SumThread(numbers, start_idx, end_idx)
threads.append(thread)
```
- Compiler Optimizations
  - Loop unrolling
  - Vectorization
  - Constant folding
  - Dead code elimination
- Language-Specific Limitations
  - Python: GIL contention
  - Haskell: Garbage collection pauses
  - Go: Channel communication overhead
  - C++: Cache coherency protocol
- Hardware Constraints
  - Memory bandwidth saturation
  - Cache line bouncing
  - NUMA access patterns
  - Thread scheduling overhead
- Implementation Improvements
  - SIMD vectorization
  - Cache-oblivious algorithms
  - Work-stealing schedulers
  - Dynamic thread pooling
- Measurement Enhancements
  - Hardware performance counters
  - Cache miss profiling
  - Branch prediction statistics
  - Memory bandwidth utilization
- Performance Hierarchy
  - C++ provides the best raw performance
  - Go offers a good balance of performance and safety
  - Haskell shows competitive small-scale performance
  - Python is suitable for prototype development
- Scaling Characteristics
  - Linear scaling up to the physical core count
  - Memory bandwidth becomes the bottleneck at scale
  - Thread synchronization overhead increases with dataset size
- Statistical Significance
  - ANOVA confirms significant differences (p = 1.18e-24)
  - Tukey HSD shows clear language performance tiers
  - Non-normal distributions suggest complex performance factors