Parallel and Distributed Computing

💻 Parallel and Distributed Computing Unit 2 – Parallel Computer Architectures

Parallel computer architectures revolutionize computing by enabling multiple processing elements to work simultaneously on complex problems. This approach divides tasks into smaller sub-problems, allowing for concurrent execution and improved performance. Key concepts include speedup, scalability, and parallel overhead. Various architecture types exist, such as shared memory and distributed memory systems. Memory organization, interconnection networks, and programming models play crucial roles in parallel computing efficiency.

Key Concepts and Terminology

  • Parallel computing involves multiple processing elements working simultaneously to solve a problem
    • Achieved by dividing the problem into smaller sub-problems that can be solved concurrently
  • Speedup measures the performance improvement gained by using parallel computing compared to sequential computing
    • Calculated as the ratio of sequential execution time to parallel execution time
  • Scalability refers to a parallel system's ability to maintain performance as the problem size and number of processing elements increase
  • Amdahl's Law predicts the theoretical speedup of a parallel program based on the fraction of the program that can be parallelized
    • Expressed as: $Speedup = \frac{1}{(1 - P) + \frac{P}{N}}$, where $P$ is the fraction of the program that can be parallelized and $N$ is the number of processing elements (a worked sketch follows this list)
  • Gustafson's Law addresses the limitations of Amdahl's Law by considering the ability to solve larger problems with more processing elements
  • Parallel overhead includes communication, synchronization, and load imbalance costs that reduce the efficiency of parallel execution
  • Granularity refers to the size of the tasks or sub-problems assigned to each processing element
    • Fine-grained parallelism involves smaller tasks, while coarse-grained parallelism involves larger tasks
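
A minimal sketch in C of how Amdahl's Law behaves numerically, using an assumed parallelizable fraction of P = 0.9 and a few illustrative processor counts (the values are not from the text):

```c
/* Amdahl's Law: speedup = 1 / ((1 - P) + P / N) for an assumed P. */
#include <stdio.h>

int main(void) {
    double P = 0.9;                      /* parallelizable fraction (assumed) */
    int counts[] = {2, 4, 8, 16, 1024};  /* illustrative processor counts */

    for (int i = 0; i < (int)(sizeof counts / sizeof counts[0]); i++) {
        int N = counts[i];
        double speedup = 1.0 / ((1.0 - P) + P / N);
        printf("N = %4d -> speedup = %.2f\n", N, speedup);
    }
    /* As N grows, the speedup approaches 1 / (1 - P) = 10: the serial
       fraction caps the achievable gain. */
    return 0;
}
```

Even with 90% of the program parallelized, the speedup can never exceed 10, which is why minimizing the sequential fraction matters so much.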

Types of Parallel Architectures

  • Flynn's Taxonomy classifies parallel architectures based on the number of instruction streams and data streams
    • Single Instruction, Single Data (SISD): sequential execution on a single processing element
    • Single Instruction, Multiple Data (SIMD): multiple processing elements execute the same instruction on different data simultaneously (vector processors)
    • Multiple Instruction, Single Data (MISD): rarely used in practice, multiple processing elements execute different instructions on the same data
    • Multiple Instruction, Multiple Data (MIMD): multiple processing elements execute different instructions on different data simultaneously (most common parallel architecture); the SIMD and MIMD styles are sketched in the example after this list
  • Shared memory architectures allow multiple processing elements to access a common global memory space
    • Uniform Memory Access (UMA): all processing elements have equal access time to the shared memory (symmetric multiprocessors)
    • Non-Uniform Memory Access (NUMA): access times to the shared memory vary depending on the location of the processing element and the memory (distributed shared memory systems)
  • In distributed memory architectures, each processing element has its own local memory, and communication occurs through message passing over an interconnection network
    • No global memory space, data must be explicitly exchanged between processing elements
  • Hybrid architectures combine shared and distributed memory characteristics, such as clusters of shared memory nodes connected by a high-speed network
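
As a rough illustration of the SIMD and MIMD styles from Flynn's Taxonomy, this sketch uses OpenMP pragmas and assumes an OpenMP-capable C compiler (e.g. built with `-fopenmp`); the arrays and loop bodies are purely illustrative:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void) {
    /* SIMD style: one instruction stream applied to many data
       elements at once (vector lanes within a single core). */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* MIMD style: multiple threads, each with its own instruction
       stream, working concurrently on different parts of the data. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = 2.0f * c[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}
```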

Memory Organization in Parallel Systems

  • Shared memory systems provide a global address space accessible by all processing elements
    • Simplifies programming by allowing direct communication through shared variables
    • Requires synchronization mechanisms (locks, semaphores) to ensure data consistency and avoid race conditions
  • In distributed memory systems, each processing element has its own local memory
    • No global address space, communication occurs through message passing
    • Requires explicit data partitioning and communication, increasing programming complexity
  • Cache coherence maintains consistency between local caches and shared memory in shared memory systems
    • Ensures that all processing elements have a consistent view of the shared data
    • Achieved through hardware mechanisms (snooping protocols, directory-based protocols) or software techniques (cache invalidation, update propagation)
  • Memory consistency models define the allowable orderings of memory operations in parallel systems
    • Sequential Consistency (SC): memory operations from all processing elements appear to execute in a single global order that preserves each program's order
    • Relaxed consistency models (Total Store Order, Partial Store Order) allow certain reorderings of memory operations to improve performance, relying on explicit synchronization (fences) where ordering matters
  • False sharing occurs when multiple processing elements access different parts of the same cache line, causing unnecessary invalidations and updates (sketched in the example after this list)
    • Can degrade performance by increasing cache coherence traffic and memory latency
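
The false-sharing effect described above can be made concrete with a small POSIX Threads sketch; this is an illustrative example assuming a 64-byte cache line, not code from the text (compile with `-lpthread`):

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

/* a and b are adjacent in memory and typically sit in the same cache
   line, so updates from different threads keep invalidating each
   other's copy of the line even though no data is actually shared.
   Padding each counter out to its own 64-byte cache line (or giving
   each thread a private local counter) removes the coherence traffic. */
static struct { long a; long b; } counters;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) counters.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", counters.a, counters.b);
    return 0;
}
```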

Interconnection Networks

  • Interconnection networks enable communication between processing elements in parallel systems
  • Network topology defines the arrangement and connections between nodes in the network
    • Common topologies include bus, mesh, torus, hypercube, and fat tree
    • Topology affects network diameter, bisection bandwidth, and scalability (compared in the sketch after this list)
  • Routing determines the path a message takes from the source to the destination node
    • Static routing uses fixed paths based on the source and destination addresses
    • Dynamic routing adapts the path based on network conditions (congestion, failures)
  • Flow control manages the allocation of network resources (buffers, links) to prevent congestion and ensure fair access
    • Credit-based flow control uses credits to track available buffer space at the receiver
    • On-off flow control uses control signals to start and stop the transmission of data
  • Network performance metrics include latency (time to send a message), bandwidth (amount of data transferred per unit time), and throughput (number of messages delivered per unit time)
  • Network contention occurs when multiple messages compete for the same network resources, leading to increased latency and reduced throughput
    • Can be mitigated through efficient routing, flow control, and load balancing techniques
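
To make the topology metrics concrete, the following sketch compares the standard diameter and bisection-width formulas for a hypercube and a square 2D mesh; the node count of 64 is an arbitrary assumption (compile with `-lm`):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    int n = 64;  /* node count; assumed to be a power of two and a perfect square */

    /* Hypercube with n = 2^d nodes: diameter d, bisection width 2^(d-1). */
    int d = (int)round(log2((double)n));
    printf("hypercube: diameter %d, bisection width %d\n", d, 1 << (d - 1));

    /* k x k 2D mesh with n = k^2 nodes: diameter 2(k-1), bisection width k. */
    int k = (int)round(sqrt((double)n));
    printf("2D mesh:   diameter %d, bisection width %d\n", 2 * (k - 1), k);

    return 0;
}
```

For the same number of nodes, the hypercube offers a smaller diameter and a larger bisection width, at the cost of more links per node.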

Parallel Programming Models

  • Shared memory programming models allow multiple threads to access and modify shared data
    • Examples include POSIX Threads (Pthreads), OpenMP, and Java threads
    • Requires synchronization primitives (locks, barriers) to coordinate access to shared resources
  • Message passing programming models involve explicit communication between processes through send and receive operations
    • Examples include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM)
    • Requires data partitioning and distribution among processes
  • Data parallel programming models express parallelism through operations on large data structures
    • Examples include High Performance Fortran (HPF) and partitioned global address space (PGAS) languages (Unified Parallel C, Co-Array Fortran)
    • Compiler and runtime system handle data distribution and communication
  • Task parallel programming models express parallelism through the decomposition of a problem into tasks that can be executed concurrently
    • Examples include Cilk, Intel Threading Building Blocks (TBB), and OpenMP tasks
    • Runtime system manages task scheduling and load balancing
  • Hybrid programming models combine multiple programming models to exploit different levels of parallelism
    • Examples include MPI+OpenMP for distributed memory systems with shared memory nodes (sketched after this list)
    • Allows efficient utilization of hierarchical parallel architectures
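
A minimal hybrid MPI+OpenMP sketch, assuming an MPI installation and an OpenMP-capable compiler (built with something like `mpicc -fopenmp`):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* MPI provides message passing between distributed-memory processes;
       MPI_THREAD_FUNNELED means only the main thread makes MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP provides shared-memory threads within each process (node). */
    #pragma omp parallel
    {
        printf("process %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

MPI ranks map onto the distributed-memory nodes while OpenMP threads exploit the shared memory inside each node, matching the hierarchical structure of most clusters.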

Performance Metrics and Evaluation

  • Execution time measures the total time required to complete a parallel program
    • Affected by the number of processing elements, problem size, and parallel overhead
  • Speedup quantifies the performance improvement of a parallel program compared to its sequential counterpart
    • Ideal speedup is equal to the number of processing elements, but is limited by Amdahl's Law and parallel overhead
  • Efficiency measures the fraction of time processing elements spend performing useful work
    • Calculated as the ratio of speedup to the number of processing elements
    • Ideal efficiency is 1, but is reduced by parallel overhead and load imbalance (see the sketch after this list)
  • Scalability assesses a parallel system's ability to maintain performance as the problem size and number of processing elements increase
    • Strong scaling keeps the problem size fixed and increases the number of processing elements
    • Weak scaling increases both the problem size and the number of processing elements proportionally
  • Load balancing ensures that work is evenly distributed among processing elements
    • Static load balancing partitions the work before execution based on a priori knowledge
    • Dynamic load balancing redistributes the work during execution based on the actual progress of processing elements
  • Performance profiling tools help identify performance bottlenecks and optimize parallel programs
    • Examples include Intel VTune Amplifier, TAU (Tuning and Analysis Utilities), and HPCToolkit
    • Provide information on execution time, communication patterns, and resource utilization
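
A short sketch of how speedup and efficiency are computed from measured timings; the timing values below are purely illustrative assumptions:

```c
#include <stdio.h>

int main(void) {
    double t_serial   = 100.0;  /* sequential execution time in seconds (assumed) */
    double t_parallel = 16.0;   /* parallel execution time in seconds (assumed)   */
    int    p          = 8;      /* number of processing elements                  */

    double speedup    = t_serial / t_parallel;  /* 6.25 */
    double efficiency = speedup / p;            /* about 0.78 */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```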

Challenges and Limitations

  • Amdahl's Law limits the achievable speedup of a parallel program based on the fraction of the program that must be executed sequentially
    • Diminishing returns as the number of processing elements increases
    • Emphasizes the need to minimize the sequential portion of the program
  • Scalability limitations arise from the increased communication and synchronization overhead as the number of processing elements grows
    • Network latency and bandwidth constraints limit the efficiency of communication
    • Synchronization bottlenecks, such as global barriers, can limit the scalability of parallel algorithms
  • Load imbalance occurs when work is not evenly distributed among processing elements
    • Can result from unequal task sizes, data dependencies, or dynamic behavior of the application
    • Limits the utilization of processing elements and reduces overall performance
  • Data dependencies and communication patterns can limit the available parallelism and impact performance
    • Fine-grained dependencies may require frequent communication and synchronization
    • Irregular communication patterns can lead to network congestion and reduced efficiency
  • Fault tolerance becomes increasingly important as the scale of parallel systems grows
    • Higher probability of component failures in large-scale systems
    • Requires techniques for detecting and recovering from failures (checkpointing, redundancy)
  • Energy efficiency is a critical concern in large-scale parallel systems
    • Power consumption grows with the number of processing elements and communication infrastructure
    • Requires energy-aware design and management techniques (dynamic voltage and frequency scaling, power-aware scheduling)

Real-world Applications and Case Studies

  • Scientific simulations leverage parallel computing to model complex phenomena
    • Examples include climate modeling, molecular dynamics, and computational fluid dynamics
    • Require large-scale parallel systems to handle the computational demands and data volumes
  • Big data analytics utilize parallel processing to extract insights from massive datasets
    • Examples include web search engines, social network analysis, and machine learning
    • Benefit from parallel algorithms for data partitioning, distributed storage, and efficient query processing
  • Computer graphics and animation rely on parallel computing for rendering and simulation
    • Examples include movie special effects, video game engines, and virtual reality applications
    • Exploit parallelism at multiple levels (pixel, object, frame) to achieve real-time performance
  • Cryptography and security applications use parallel computing for efficient encryption, decryption, and cryptanalysis
    • Examples include secure communication, digital signatures, and password cracking
    • Benefit from parallel algorithms for key generation, hashing, and brute-force attacks
  • Bioinformatics and computational biology apply parallel computing to analyze biological data
    • Examples include genome sequencing, protein structure prediction, and drug discovery
    • Require parallel algorithms for sequence alignment, pattern matching, and molecular simulations
  • Financial modeling and risk analysis employ parallel computing for real-time decision making
    • Examples include portfolio optimization, option pricing, and fraud detection
    • Benefit from parallel algorithms for Monte Carlo simulations, time series analysis, and machine learning
  • Industrial design and engineering utilize parallel computing for simulation and optimization
    • Examples include computer-aided design (CAD), finite element analysis (FEA), and computational fluid dynamics (CFD)
    • Exploit parallelism to reduce design cycles and improve product quality

