Outline for 3/11/98
- Last time: Distributed Shared Memory
- Administrative:
- Come next time with questions about the exam.
- Remember my offer to accept questions for the exam.
- Objective: Harvesting processing cycles across the network.
Issues
- Why?
- Which tasks are candidates for remote execution?
- Where to find processing cycles? What does "idle" mean?
- When should a task be moved?
- How?
Motivation for Cycle Sharing
- Load imbalances. Parallel program completion time determined by slowest thread. Speedup limited.
- Utilization. The trend from shared mainframes to networks of workstations replaced scheduled cycles with statically allocated cycles.
- "Ownership" model
- Heterogeneity
Which Tasks?
- Explicit submission to a "batch" scheduler (e.g., Condor) or transparent to the user.
- Should be demanding enough to justify overhead of moving elsewhere. Properties?
- Proximity of resources.
- Example: move query processing to site of database records.
- Cache affinity
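One way to read "demanding enough to justify overhead" is as a simple cost comparison. The sketch below is an illustration only: the function name, its parameters, and the linear transfer-cost model are assumptions, not something taken from Condor or any other system.

```python
def should_run_remotely(local_time_s, remote_time_s, state_bytes,
                        bandwidth_bytes_per_s, fixed_overhead_s=0.0):
    """Return True if remote execution is expected to finish sooner.

    Assumed cost model (hypothetical): the remote completion time is the
    remote run time plus the time to ship the task's state plus a fixed
    setup cost (negotiation, process creation).
    """
    transfer_s = state_bytes / bandwidth_bytes_per_s
    return remote_time_s + transfer_s + fixed_overhead_s < local_time_s


# A long job on a loaded local machine: 20 s + 10 s transfer + 1 s setup < 60 s.
long_job = should_run_remotely(60.0, 20.0, 10_000_000, 1_000_000, 1.0)

# A short job with the same state size: the transfer alone outweighs the gain.
short_job = should_run_remotely(2.0, 1.0, 10_000_000, 1_000_000, 1.0)
```

Under this model, proximity of resources shows up as a smaller `state_bytes` term: moving a query to the database site ships the query rather than the records.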
Finding Destination
- Defining "idle" workstations
- Keyboard/mouse events? CPU load?
- How timely and complete is the load information (given message transit times)?
- Global view maintained by some central manager with local daemons reporting status.
- Limited negotiation with a few peers
- How binding is any offer of free cycles?
- Task requirements must match machine capabilities
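The points above can be combined in a small sketch of a central manager choosing a destination from daemon reports. All names and thresholds here (`StatusReport`, `IDLE_INPUT_S`, `MAX_LOAD`, `MAX_REPORT_AGE_S`) are hypothetical values picked for illustration; a real system would choose its own definition of "idle".

```python
from dataclasses import dataclass


@dataclass
class StatusReport:
    host: str
    arch: str
    os: str
    cpu_load: float      # e.g. a recent load average
    input_idle_s: float  # seconds since the last keyboard/mouse event
    reported_at: float   # manager's clock when the report arrived


@dataclass
class Task:
    name: str
    arch: str
    os: str


# Assumed thresholds, for illustration only.
IDLE_INPUT_S = 15 * 60     # no keyboard/mouse activity for 15 minutes
MAX_LOAD = 0.3             # CPU load must be below this
MAX_REPORT_AGE_S = 120     # older reports are treated as unreliable


def is_idle(report, now):
    """A machine counts as idle only if its report is fresh enough to trust."""
    fresh = (now - report.reported_at) <= MAX_REPORT_AGE_S
    return (fresh
            and report.input_idle_s >= IDLE_INPUT_S
            and report.cpu_load <= MAX_LOAD)


def find_destination(task, reports, now):
    """Pick the least-loaded idle machine whose capabilities match the task."""
    candidates = [r for r in reports
                  if is_idle(r, now) and r.arch == task.arch and r.os == task.os]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.cpu_load).host
```

The freshness check is one crude answer to the timeliness question: stale information is ignored rather than trusted. It does not answer how binding an offer is, since the owner may return between the report and the task's arrival.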
When to Move
- At task invocation. Process is created and run at chosen destination.
- By process migration, once the task is already running at some node. Its execution state must move with it.
- For adjusting load balance (generally not done)
- On arrival of workstation's owner (vacate, when no longer idle)
How - Negotiation Phase
- Condor example: Central manager, with each machine reporting its status and properties (e.g., architecture, OS). The manager regularly matches submitted tasks against available resources.
- Decentralized example: Select a peer and ask whether its load is below a threshold. If it agrees to accept work, send the task. Otherwise keep asking around (until the probe limit is reached).
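The decentralized scheme can be sketched as a bounded random probe. This is a minimal illustration, not any particular system's protocol: `load_of` is a hypothetical stand-in for a network request to the peer, and random selection of peers is an assumption.

```python
import random


def probe_for_destination(peers, load_of, threshold, probe_limit, rng=None):
    """Ask up to probe_limit randomly chosen peers; return the first that accepts.

    load_of(peer) stands in for a request/reply exchange with the peer
    (hypothetical). A peer "agrees" when its load is below the threshold.
    Returns None if no probed peer accepts, in which case the task stays local.
    """
    rng = rng or random.Random()
    candidates = list(peers)
    rng.shuffle(candidates)
    for peer in candidates[:probe_limit]:
        if load_of(peer) < threshold:
            return peer
    return None


# Usage: three peers, only "c" is lightly loaded.
loads = {"a": 2.5, "b": 1.8, "c": 0.1}
chosen = probe_for_destination(loads, loads.get, threshold=0.5, probe_limit=3)
```

The probe limit caps the negotiation cost per task. With a limit smaller than the number of peers, a lightly loaded machine may be missed, which is the price of not maintaining a global view.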
How - Execution Phase
- Issue - Execution environment.
- File access - possibly without the user having an account on the destination machine, or a network file system to provide access to the user's files.
- UIDs?
- Remote System Calls (Condor)
- On original (submitting) machine, run a "shadow" process (runs as user)
- All system calls made by the task at the remote site are "caught" and sent as messages to the shadow, which executes them on the submitting machine.
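The shadow idea can be illustrated in miniature. The sketch below is not Condor's actual protocol or wire format: the class names `Shadow` and `RemoteStub`, the JSON message layout, and the four supported operations are all assumptions, and the "network" is just a function call.

```python
import json


class Shadow:
    """Runs on the submitting machine, as the user, and performs the real I/O."""

    def __init__(self):
        self._files = {}
        self._next_handle = 3  # arbitrary starting handle number

    def handle(self, message):
        """Decode one request, execute it locally, and return the encoded reply."""
        request = json.loads(message)
        op, args = request["op"], request["args"]
        if op == "open":
            handle = self._next_handle
            self._next_handle += 1
            self._files[handle] = open(args["path"], args["mode"])
            result = handle
        elif op == "read":
            result = self._files[args["handle"]].read(args["n"])
        elif op == "write":
            result = self._files[args["handle"]].write(args["data"])
        elif op == "close":
            self._files.pop(args["handle"]).close()
            result = None
        else:
            raise ValueError("unsupported operation: " + op)
        return json.dumps({"result": result})


class RemoteStub:
    """Linked with the task at the remote site; forwards each call to the shadow."""

    def __init__(self, send):
        self._send = send  # function that delivers a message and returns the reply

    def _call(self, op, **args):
        reply = self._send(json.dumps({"op": op, "args": args}))
        return json.loads(reply)["result"]

    def open(self, path, mode="r"):
        return self._call("open", path=path, mode=mode)

    def read(self, handle, n):
        return self._call("read", handle=handle, n=n)

    def write(self, handle, data):
        return self._call("write", handle=handle, data=data)

    def close(self, handle):
        return self._call("close", handle=handle)
```

Wiring them together in one process is just `stub = RemoteStub(Shadow().handle)`; in a real deployment `send` would cross the network. Because every file operation executes on the submitting machine under the user's identity, the task needs neither an account nor the user's files at the destination.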
How - Process Migration
- Checkpointing current execution state (both for recovery and for migration)
- Generic representation for heterogeneity?
- Condor has a checkpoint file containing register state, memory image, open file descriptors, etc. The checkpoint can be returned to the Condor job queue.
- Mach - package up the processor state and let the memory working set be demand-paged into the new site.
- Messages in-flight?
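Checkpoint and restart can be illustrated at the application level. This is only an analogy: Condor saves register state and a memory image, which Python cannot capture directly, so the sketch assumes a task that keeps its entire state in explicit fields. The class `SumTask` and the JSON encoding are hypothetical choices; a text encoding is used because it is one simple form of machine-independent representation.

```python
import json


class SumTask:
    """Toy long-running task: sums the integers 0..n-1 one step at a time."""

    def __init__(self, n, i=0, total=0):
        self.n = n
        self.i = i
        self.total = total

    def run(self, max_steps):
        """Run at most max_steps iterations; return True when the task is done."""
        steps = 0
        while self.i < self.n and steps < max_steps:
            self.total += self.i
            self.i += 1
            steps += 1
        return self.i >= self.n


def checkpoint(task):
    """Capture the full execution state in a machine-independent form."""
    return json.dumps({"n": task.n, "i": task.i, "total": task.total})


def restart(blob):
    """Rebuild the task from a checkpoint, possibly on a different machine."""
    return SumTask(**json.loads(blob))


# Start on one machine, vacate after 4 steps, resume elsewhere.
task = SumTask(10)
task.run(4)
blob = checkpoint(task)    # could be shipped to a new site or back to a job queue
resumed = restart(blob)
finished = resumed.run(100)
```

The same blob serves both purposes named above: kept on stable storage it supports recovery, and shipped to another machine it supports migration. The sketch sidesteps the hard parts the notes raise, such as open files and messages in flight.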