💻 Parallel and Distributed Computing Unit 2 – Parallel Computer Architectures
Parallel computer architectures revolutionize computing by enabling multiple processing elements to work simultaneously on complex problems. This approach divides tasks into smaller sub-problems, allowing for concurrent execution and improved performance.
Key concepts include speedup, scalability, and parallel overhead. Various architecture types exist, such as shared memory and distributed memory systems. Memory organization, interconnection networks, and programming models play crucial roles in parallel computing efficiency.
Flynn's Taxonomy classifies parallel architectures based on the number of instruction streams and data streams
Single Instruction, Single Data (SISD): sequential execution on a single processing element
Single Instruction, Multiple Data (SIMD): multiple processing elements execute the same instruction on different data simultaneously (vector processors)
Multiple Instruction, Single Data (MISD): multiple processing elements execute different instructions on the same data; rarely used in practice
Multiple Instruction, Multiple Data (MIMD): multiple processing elements execute different instructions on different data simultaneously (most common parallel architecture)
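To make the SIMD/MIMD distinction concrete, here is a minimal sketch assuming a C compiler with OpenMP support; the array names, sizes, and operations are purely illustrative, not part of the original material:

```c
#include <omp.h>

#define N 1024

/* SIMD style: one instruction stream applied to many data elements.
   The compiler is asked to vectorize the loop body. */
void scale_simd(float *a, float s) {
    #pragma omp simd
    for (int i = 0; i < N; i++)
        a[i] *= s;
}

/* MIMD style: different instruction streams on different data;
   here two threads execute two different routines concurrently. */
void mimd_example(float *a, float *b) {
    #pragma omp parallel sections
    {
        #pragma omp section
        scale_simd(a, 2.0f);                     /* one thread: scaling */
        #pragma omp section
        { for (int i = 0; i < N; i++) b[i] += 1.0f; }  /* another thread: offset */
    }
}
```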
Shared memory architectures allow multiple processing elements to access a common global memory space
Uniform Memory Access (UMA): all processing elements have equal access time to the shared memory (symmetric multiprocessors)
Non-Uniform Memory Access (NUMA): access times to the shared memory vary depending on the location of the processing element and the memory (distributed shared memory systems)
Distributed memory architectures have each processing element with its own local memory, and communication occurs through message passing over an interconnection network
No global memory space, data must be explicitly exchanged between processing elements
Hybrid architectures combine shared and distributed memory characteristics, such as clusters of shared memory nodes connected by a high-speed network
Memory Organization in Parallel Systems
Shared memory systems provide a global address space accessible by all processing elements
Simplifies programming by allowing direct communication through shared variables
Requires synchronization mechanisms (locks, semaphores) to ensure data consistency and avoid race conditions
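A minimal sketch of why such synchronization is needed in a shared address space, assuming OpenMP; the counter and iteration count are illustrative:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;

    /* Without protection, concurrent increments of the shared counter
       race with each other and updates can be lost. */
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        #pragma omp atomic          /* remove this line to observe the race */
        counter++;
    }

    printf("counter = %ld\n", counter);  /* 1000000 with the atomic in place */
    return 0;
}
```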
Distributed memory systems have each processing element with its own local memory
No global address space, communication occurs through message passing
Requires explicit data partitioning and communication, increasing programming complexity
Cache coherence maintains consistency between local caches and shared memory in shared memory systems
Ensures that all processing elements have a consistent view of the shared data
Achieved through hardware mechanisms (snooping protocols, directory-based protocols) or software techniques (cache invalidation, update propagation)
Memory consistency models define the allowable orderings of memory operations in parallel systems
Sequential Consistency (SC): memory operations from all processing elements appear to execute in a single global order that respects each program's order
Relaxed consistency models (Total Store Order, Partial Store Order) allow some reorderings of memory operations to improve performance, relying on explicit fences or synchronization where ordering matters
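The classic "store buffering" test illustrates the difference; a minimal sketch using C11 atomics and POSIX threads (variable names and the chosen memory orders are illustrative assumptions):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x = 0, y = 0;
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With seq_cst (sequential consistency), r1 == 0 && r2 == 0 is impossible.
       With memory_order_relaxed, or under TSO with plain stores and loads,
       both loads may read 0 because each store can sit in a store buffer
       past the other thread's load. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```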
False sharing occurs when multiple processing elements access different parts of the same cache line, causing unnecessary invalidations and updates
Can degrade performance by increasing cache coherence traffic and memory latency
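A common mitigation is to pad per-thread data so each element occupies its own cache line; a minimal sketch assuming 64-byte cache lines and OpenMP (names and sizes are illustrative):

```c
#include <omp.h>

#define MAX_THREADS 64
#define CACHE_LINE  64   /* assumed cache-line size in bytes */

/* Each thread's counter is padded to a full cache line so that updates
   by different threads never invalidate each other's cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_counter counts[MAX_THREADS];

void count_events(const int *events, int n) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (events[i]) counts[id].value++;   /* no false sharing */
    }
}
```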
Interconnection Networks
Interconnection networks enable communication between processing elements in parallel systems
Network topology defines the arrangement and connections between nodes in the network
Common topologies include bus, mesh, torus, hypercube, and fat tree
Topology affects network diameter, bisection bandwidth, and scalability
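For concreteness, the standard diameter and bisection-width formulas for a few of these topologies can be written out directly; a sketch using the usual textbook definitions (k and d are the mesh side length and hypercube dimension):

```c
/* 2D mesh of k x k nodes (p = k*k):
   diameter = 2*(k-1) hops, bisection width = k links. */
int mesh_diameter(int k)  { return 2 * (k - 1); }
int mesh_bisection(int k) { return k; }

/* Hypercube of dimension d (p = 2^d nodes):
   diameter = d hops, bisection width = 2^(d-1) links. */
int hypercube_diameter(int d)  { return d; }
int hypercube_bisection(int d) { return 1 << (d - 1); }

/* A shared bus has diameter 1 hop but bisection width of only 1 link,
   which is why buses do not scale beyond a small number of nodes. */
```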
Routing determines the path a message takes from the source to the destination node
Static routing uses fixed paths based on the source and destination addresses
Dynamic routing adapts the path based on network conditions (congestion, failures)
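Dimension-order (XY) routing in a 2D mesh is a simple example of static routing: a message first moves along the X dimension until the column matches the destination, then along Y. A minimal sketch (the coordinate representation is an assumption):

```c
#include <stdio.h>

/* Static XY routing: correct the X coordinate first, then Y.
   The path depends only on source and destination, not on network load. */
void route_xy(int sx, int sy, int dx, int dy) {
    int x = sx, y = sy;
    while (x != dx) {                 /* phase 1: move along X */
        x += (dx > x) ? 1 : -1;
        printf("hop to (%d,%d)\n", x, y);
    }
    while (y != dy) {                 /* phase 2: move along Y */
        y += (dy > y) ? 1 : -1;
        printf("hop to (%d,%d)\n", x, y);
    }
}
```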
Flow control manages the allocation of network resources (buffers, links) to prevent congestion and ensure fair access
Credit-based flow control uses credits to track available buffer space at the receiver
On-off flow control uses control signals to start and stop the transmission of data
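A minimal sketch of the credit-based idea, written as a plain C simulation rather than real link hardware; the buffer depth and names are assumptions:

```c
#define RECV_BUFFERS 4

/* Sender-side view of credit-based flow control: one credit per free
   buffer slot at the receiver. Sending consumes a credit; when the
   receiver drains a buffer it returns a credit to the sender. */
typedef struct { int credits; } link_t;

/* Credits start equal to the receiver's buffer depth. */
link_t make_link(void) { link_t l = { RECV_BUFFERS }; return l; }

int try_send(link_t *l) {
    if (l->credits == 0)        /* no buffer space at receiver: stall */
        return 0;
    l->credits--;               /* consume one credit per packet sent */
    return 1;
}

void credit_returned(link_t *l) {
    l->credits++;               /* receiver freed a buffer slot */
}
```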
Network performance metrics include latency (time to send a message), bandwidth (amount of data transferred per unit time), and throughput (number of messages delivered per unit time)
Network contention occurs when multiple messages compete for the same network resources, leading to increased latency and reduced throughput
Can be mitigated through efficient routing, flow control, and load balancing techniques
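A widely used first-order cost model ties these metrics together: the time to send an n-byte message is roughly a fixed startup latency plus the message size divided by bandwidth. A sketch with placeholder constants:

```c
/* Simple alpha-beta communication model:
   T(n) = alpha + n / beta
   alpha = per-message startup latency (s), beta = bandwidth (bytes/s). */
double message_time(double n_bytes, double alpha, double beta) {
    return alpha + n_bytes / beta;
}

/* Example: with alpha = 1e-6 s and beta = 1e10 bytes/s, a 1 MB message
   costs about 1e-6 + 1e6/1e10 = 1.01e-4 seconds. */
```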
Parallel Programming Models
Shared memory programming models allow multiple threads to access and modify shared data
Examples include POSIX Threads (Pthreads), OpenMP, and Java threads
Requires synchronization primitives (locks, barriers) to coordinate access to shared resources
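A minimal Pthreads sketch of lock-based coordination on shared data; the thread count and per-thread work are illustrative:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

long total = 0;                               /* shared data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    long local = 0;
    for (int i = 0; i < 100000; i++)
        local++;                              /* private work, no lock needed */
    pthread_mutex_lock(&lock);                /* serialize only the shared update */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("total = %ld\n", total);           /* 4 * 100000 */
    return 0;
}
```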
Message passing programming models involve explicit communication between processes through send and receive operations
Examples include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM)
Requires data partitioning and distribution among processes
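A minimal MPI sketch of explicit send/receive between two processes (run with something like mpirun -np 2; the tag and payload are arbitrary):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value;
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* from rank 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```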
Data parallel programming models express parallelism through operations on large data structures
Examples include High Performance Fortran (HPF) and partitioned global address space (PGAS) languages (Unified Parallel C, Co-Array Fortran)
Compiler and runtime system handle data distribution and communication
Task parallel programming models express parallelism through the decomposition of a problem into tasks that can be executed concurrently
Examples include Cilk, Intel Threading Building Blocks (TBB), and OpenMP tasks
Runtime system manages task scheduling and load balancing
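A minimal OpenMP task sketch, using a naive recursive Fibonacci purely to show task creation, synchronization, and runtime scheduling (the input value is arbitrary):

```c
#include <stdio.h>
#include <omp.h>

/* Each recursive call becomes a task; taskwait joins the two child tasks
   before their results are combined. The runtime schedules the tasks. */
long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}

int main(void) {
    long result;
    #pragma omp parallel
    #pragma omp single          /* one thread creates the root of the task tree */
    result = fib(20);
    printf("fib(20) = %ld\n", result);
    return 0;
}
```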
Hybrid programming models combine multiple programming models to exploit different levels of parallelism
Examples include MPI+OpenMP for distributed memory systems with shared memory nodes
Allows efficient utilization of hierarchical parallel architectures
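A minimal hybrid MPI+OpenMP sketch: one MPI process per node, OpenMP threads inside each process, and the partial results combined with MPI_Reduce (the loop body is a placeholder for real per-node work):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory parallelism inside each MPI process. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += 1.0;                   /* placeholder for per-node work */

    /* Message passing across processes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```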
Performance Metrics and Evaluation
Execution time measures the total time required to complete a parallel program
Affected by the number of processing elements, problem size, and parallel overhead
Speedup quantifies the performance improvement of a parallel program compared to its sequential counterpart
Ideal speedup equals the number of processing elements; achieved speedup is limited by Amdahl's Law and parallel overhead
Efficiency measures the fraction of time processing elements spend performing useful work
Calculated as the ratio of speedup to the number of processing elements
Ideal efficiency is 1, but is reduced by parallel overhead and load imbalance
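These definitions are easy to compute from measured times; a small sketch where T1 is the sequential time and Tp the time on p processing elements:

```c
/* Speedup S(p) = T1 / Tp; Efficiency E(p) = S(p) / p.
   Example: T1 = 100 s, Tp = 30 s on p = 4 gives S ~ 3.33 and E ~ 0.83. */
double speedup(double t1, double tp)            { return t1 / tp; }
double efficiency(double t1, double tp, int p)  { return (t1 / tp) / p; }
```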
Scalability assesses a parallel system's ability to maintain performance as the problem size and number of processing elements increase
Strong scaling keeps the problem size fixed and increases the number of processing elements
Weak scaling increases both the problem size and the number of processing elements proportionally
Load balancing ensures that work is evenly distributed among processing elements
Static load balancing partitions the work before execution based on a priori knowledge
Dynamic load balancing redistributes the work during execution based on the actual progress of processing elements
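With OpenMP, the contrast between static and dynamic load balancing shows up directly in the loop schedule; a sketch where the work() function is an assumed stand-in for irregular per-item cost:

```c
#include <omp.h>

extern double work(int i);   /* assumed: per-item cost varies with i */

double total_static(int n) {
    double sum = 0.0;
    /* Static: iterations are split into fixed chunks before execution;
       cheap to schedule, but uneven per-item cost causes load imbalance. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++) sum += work(i);
    return sum;
}

double total_dynamic(int n) {
    double sum = 0.0;
    /* Dynamic: idle threads grab the next chunk at run time,
       trading scheduling overhead for better balance. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
    for (int i = 0; i < n; i++) sum += work(i);
    return sum;
}
```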
Performance profiling tools help identify performance bottlenecks and optimize parallel programs
Examples include Intel VTune Amplifier, TAU (Tuning and Analysis Utilities), and HPCToolkit
Provide information on execution time, communication patterns, and resource utilization
Challenges and Limitations
Amdahl's Law limits the achievable speedup of a parallel program based on the fraction of the program that must be executed sequentially
Diminishing returns as the number of processing elements increases
Emphasizes the need to minimize the sequential portion of the program
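Amdahl's Law in its usual form, with f the fraction of the program that can be parallelized; a small worked sketch:

```c
/* Amdahl's Law: S(p) = 1 / ((1 - f) + f / p)
   f = parallelizable fraction, p = number of processing elements.
   With f = 0.9: S(10) ~ 5.3, S(100) ~ 9.2, and S approaches 10 as p grows;
   the 10% serial portion caps the speedup no matter how many
   processing elements are added. */
double amdahl_speedup(double f, double p) {
    return 1.0 / ((1.0 - f) + f / p);
}
```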
Scalability limitations arise from the increased communication and synchronization overhead as the number of processing elements grows
Network latency and bandwidth constraints limit the efficiency of communication
Synchronization bottlenecks, such as global barriers, can limit the scalability of parallel algorithms
Load imbalance occurs when work is not evenly distributed among processing elements
Can result from unequal task sizes, data dependencies, or dynamic behavior of the application
Limits the utilization of processing elements and reduces overall performance
Data dependencies and communication patterns can limit the available parallelism and impact performance
Fine-grained dependencies may require frequent communication and synchronization
Irregular communication patterns can lead to network congestion and reduced efficiency
Fault tolerance becomes increasingly important as the scale of parallel systems grows
Higher probability of component failures in large-scale systems
Requires techniques for detecting and recovering from failures (checkpointing, redundancy)
Energy efficiency is a critical concern in large-scale parallel systems
Power consumption grows with the number of processing elements and communication infrastructure
Requires energy-aware design and management techniques (dynamic voltage and frequency scaling, power-aware scheduling)
Real-world Applications and Case Studies
Scientific simulations leverage parallel computing to model complex phenomena
Examples include climate modeling, molecular dynamics, and computational fluid dynamics
Require large-scale parallel systems to handle the computational demands and data volumes
Big data analytics utilize parallel processing to extract insights from massive datasets
Examples include web search engines, social network analysis, and machine learning
Benefit from parallel algorithms for data partitioning, distributed storage, and efficient query processing
Computer graphics and animation rely on parallel computing for rendering and simulation
Examples include movie special effects, video game engines, and virtual reality applications
Exploit parallelism at multiple levels (pixel, object, frame) to achieve real-time performance
Cryptography and security applications use parallel computing for efficient encryption, decryption, and cryptanalysis
Examples include secure communication, digital signatures, and password cracking
Benefit from parallel algorithms for key generation, hashing, and brute-force attacks
Bioinformatics and computational biology apply parallel computing to analyze biological data
Examples include genome sequencing, protein structure prediction, and drug discovery
Require parallel algorithms for sequence alignment, pattern matching, and molecular simulations
Financial modeling and risk analysis employ parallel computing for real-time decision making
Examples include portfolio optimization, option pricing, and fraud detection
Benefit from parallel algorithms for Monte Carlo simulations, time series analysis, and machine learning
Industrial design and engineering utilize parallel computing for simulation and optimization
Examples include computer-aided design (CAD), finite element analysis (FEA), and computational fluid dynamics (CFD)
Exploit parallelism to reduce design cycles and improve product quality