|
| Task: | Beowulf |
| Group: | iainr |
| Stage: | 1 |
Configuration and management of Beowulf cluster.
After going to tender, the contract was placed with Workstations UK, who originally supplied a cluster of 64 nodes with 100Mb Fast Ethernet. Unfortunately this was found to be unreliable, and they were unable to provide a solution which worked and met the tender spec. We have since decided to use desktop PCs in order to avoid going through another lengthy tender process.
We have come out slightly ahead in that we can duplicate the 64 node cluster and also provide a 16 node cluster with higher bandwidth/lower latency networking (Myrinet).
There are two small Beowulf clusters being used by research groups at KB and FH for specific tasks; neither is for general usage.
Because of the various delays and the need for more work on DICE, we will start off with a minimal service and upgrade to a full service as we can.
We want to be able to do a hands-off installation of groups of nodes, or all of the nodes, via PXE (but with safeguards so that we don't accidentally nuke the thing).
Will re-installing ~80 DICE machines at once have any kind of load effect on the system as a whole?
Making the Beowulf server a DICE installation server would make things nice and self contained.
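One way to address both the safeguard and the load question above is to reinstall in small batches and to require a typed confirmation that names the number of nodes affected. The following is only a sketch under those assumptions: `batches` and `confirm_reinstall` are hypothetical helpers, not part of DICE or lcfg, and the step which actually flags a node for PXE reinstallation is left out.

```python
from typing import Iterator, List, Sequence


def batches(nodes: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield the node names in groups of at most `size`."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def confirm_reinstall(nodes: Sequence[str], typed_answer: str) -> bool:
    """Accept only an answer which spells out how many nodes will be wiped.

    A bare "y" or "yes" is rejected, so a slip of the finger cannot
    trigger a reinstall of the whole cluster.
    """
    expected = f"reinstall {len(nodes)} nodes"
    return typed_answer.strip() == expected


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    cluster = [f"node{n:02d}" for n in range(1, 11)]
    for group in batches(cluster, 4):
        print("would reinstall together:", ", ".join(group))
```

Keeping the batch size as a parameter means the answer to the load question can be found by experiment: start small and raise it until the install server starts to struggle.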
We cannot upgrade packages underneath running jobs, so we need a method to schedule updaterpms other than at reboot/nightly update.
Actually we can probably treat the nodes as dcs laptops and use PBS to schedule running laptopupdate periodically.
If we are using a batching system we may be able to use it to schedule software updates or at least mark downtime on the nodes.
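As a rough illustration of the idea of pushing updates through the batch system, the sketch below generates one PBS job script per node which does nothing but run the update command. It assumes that `laptopupdate` can be run non-interactively with sufficient privilege from a batch job, which is not established here. The helper name `update_job_script` is hypothetical, and making the job exclusive on its node depends on the PBS configuration and is not shown.

```python
def update_job_script(node: str, update_command: str = "laptopupdate") -> str:
    """Return the text of a PBS job script which runs the update on one node.

    `#PBS -N` names the job and `#PBS -l nodes=<hostname>` asks PBS to
    place it on that specific node.
    """
    lines = [
        "#!/bin/sh",
        f"#PBS -N update-{node}",
        f"#PBS -l nodes={node}",
        update_command,
        "",  # so the script ends with a newline
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical node names; each script would then be handed to qsub.
    for name in ("node01", "node02"):
        print(update_job_script(name))
```

Because the update is just another job in the queue, it waits behind whatever is already running on the node rather than changing packages underneath it.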
There is likely to be an ongoing support issue in installing packages which support parallel code (MPI, PVM, etc.), particularly wrt security and integrating them with whichever resource management software we use.
There are policy issues wrt user priorities and resource usage which have to be sorted out with the userbase.
Conventional wisdom says have a single point (head node) for users to interact with the Beowulf; any scheduling scheme is better than relying on human nature (see also security below).
There are basically two ways of controlling the load on the nodes: batch processing jobs across the nodes (PBS, NQE) or treating the cluster as one large virtual SMP machine (MOSIX). We could use either, or sit one on top of the other (running PBS over MOSIX has its supporters).
It would make most sense to optimise the smaller cluster towards high performance jobs (short runs with massively parallel code) and the larger cluster for high throughput (run the same problem over and over again with different conditions).
If the division is likely to be doing collaborative work (particularly wrt DataGrid) we should consider looking at Globus.
Urghh, lots of people do lots of different things, most of them not nice; rsh-type security(!) is probably the most common.
Given we are likely to be running all sorts of weird and wonderful daemons and protocols with strange trust relationships, we should restrict access on the individual nodes and hide them behind a firewall.
We want to monitor, probably with archival data and alerts.
Node availability.
Stats on system load, memory usage, disk usage for individual nodes.
Stats on jobs going through the beowulf.
Hardware health (this may not be possible with desktop PCs).
Run status of key daemons on nodes.
Do we want to go the whole hog of process accounting?
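For the node availability item in the list above, a very small check might look like the sketch below. It treats a node as "up" if a TCP connection to its ssh port succeeds, which is an assumption about what availability should mean here; it says nothing about load, daemons or hardware health. The function names and node names are hypothetical.

```python
import socket
from typing import Dict, Iterable, List


def is_reachable(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def availability(hosts: Iterable[str], port: int = 22,
                 timeout: float = 2.0) -> Dict[str, bool]:
    """Check each host in turn and record whether it answered."""
    return {host: is_reachable(host, port, timeout) for host in hosts}


def down_nodes(status: Dict[str, bool]) -> List[str]:
    """Return the sorted names of the nodes which did not answer."""
    return sorted(host for host, up in status.items() if not up)


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    status = availability(["node01", "node02"], timeout=1.0)
    print("down:", down_nodes(status))
```

Run from the head node at intervals, with the results appended to a file, this would give a crude archive; alerting would be a matter of acting on a non-empty `down_nodes` result.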
Documentation on how tools interface with the system (MPI and PBS don't know about each other and scripts need to be written to set up the MPI environment).
Documentation on whatever scheduling policy is used.
Tools to allow users to check on the state of the system and running jobs (preferably without loading the system).
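As an example of the kind of glue needed between PBS and MPI mentioned above: PBS gives a job the file named in `$PBS_NODEFILE`, with one hostname per allocated processor, while LAM is booted from a boot schema listing each host once. The sketch below converts one into the other. The `cpu=N` key is an assumption about what the installed LAM version accepts, and `lam_boot_schema` is a hypothetical helper, not something shipped with either package.

```python
import os


def lam_boot_schema(nodefile_text: str) -> str:
    """Turn PBS nodefile contents into LAM boot schema text.

    Repeated hostnames are collapsed into one line per host, with the
    number of repeats recorded as the CPU count. Host order is kept.
    """
    counts = {}
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host:
            counts[host] = counts.get(host, 0) + 1
    return "".join(f"{host} cpu={n}\n" for host, n in counts.items())


if __name__ == "__main__":
    # Inside a PBS job, PBS_NODEFILE points at the list of allocated nodes.
    path = os.environ.get("PBS_NODEFILE")
    if path:
        with open(path) as nodefile:
            print(lam_boot_schema(nodefile.read()), end="")
```

A generic job script for users could write this output to a temporary file and hand it to `lamboot` before starting the MPI program.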
I get to wimp out on this slightly as it partly depends on the arrival/install dates of hardware and the software choices. Also the electrical work is largely outside our hands.
We have rpms running under 6.2 for lam-mpi (over ssh) and PBS. Since PBS seems to be the standard scheduling system, I would propose using it, at least in the first instance.
There are lcfg entries to build a private net (192.168...) and set up dns for same. Initially we'll run with one interface on a private net and one on a dcs/informatics subnet, partly to make testing easier. This would make some hosts available in .inf if other people need to build/test stuff. Once we're happy that the hardware is all working OK, we will lock off access to the nodes (maybe drop everything onto a private subnet) and only allow access to the head nodes.
A cluster of nodes with one interface on wire-m all of which are accessible to some netgroup via ssh. Required software will be installed and configured to use the private network for message passing/routing etc. but will probably not be optimised for load balancing and some per user configuration of ssh or other connection service may be required. There will be no job scheduling and software upgrades will be done as scheduled maintenance announced via motd/eduni.dcs.sys.
It should be possible to power up all the nodes with a minimum of fuss and without crippling the CO responsible.
This will provide job scheduling and load balancing for the nodes and required software will be integrated with the scheduling software.
User access to the individual nodes will be removed and both interfaces on each node will be configured for private subnets (or subnets inaccessible to normal users).
Security on the nodes will be increased: we will probably have some kind of limited firewall running on anything accessing the private networks, and something monitoring logins.
Software updates will be run as scheduled jobs to avoid upgrading anything "underneath" running jobs.
Full site-specific documentation on using the scheduling system will be provided, including example scripts where appropriate.
As the minimum service for the 64-node cluster above, but the Myrinet routing will be hand configured (this will have to be done every time a node is rebooted) and there may be a very limited set of software which will support Myrinet.
As the full service for the 64-node cluster, but Myrinet routing will be configured automagically.
Actions for minimum setup on both clusters.
Install the secondary nics in the nodes
Set up lcfg entries for all the nodes, including secondary NIC entries, and configure them as laptops w.r.t. updaterpms; they also get to be nameservers.
Install the head nodes and servers as NIS servers.
Set up an auth netgroup for authorised users
Set up the 4108gl switch on wire M and with the Priv666 VLAN
Install the backup server and connect it to both subnets; we will use it as the fileserver to begin with. We should probably configure it as an install server.
Run the power, ethernet and the ps/2 cables to the first two shelves.
Stack the first 8 gx240 nodes on each shelf and install 7.1; these nodes are now operational as compute nodes.
Look into wake on LAN, cos switching on 64 nodes via the power buttons is not going to be fun.
Set up the 530s on the new racking and install them; these are now operational as compute nodes.
Run the first set of nodes for 24 hours and check on temperature issues; we need to use a probe as we don't have onboard monitoring.
Assuming no temp problems add nodes as required and stir when approaching the boil
Add the 530s to the old racking and set up the Myrinet switch.
Set up the server when it comes and reconfigure the backup server to be purely a backup server.
Set up the hot spares. The 530 can pretty much be configured as per the standard nodes, just needing the fibre and ethernet connections to be swapped; the gx240 hot spares can run as test machines but will have to be rebooted in order to join the cluster (or we accept VLANing their traffic across other switches).
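On the wake on LAN item in the list above: a magic packet is six 0xFF bytes followed by the target MAC address repeated 16 times, normally sent as a UDP broadcast. The sketch below builds and sends one. Whether the NICs and BIOS in the gx240s and 530s actually honour it is not established here, and the MAC address, broadcast address and port shown are placeholders.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a wake-on-LAN magic packet for a MAC such as 00:11:22:33:44:55."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on bad hex
    return b"\xff" * 6 + mac_bytes * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet for `mac` as a UDP broadcast."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    # Placeholder MAC address; a real list would come from the node inventory.
    packet = magic_packet("00:11:22:33:44:55")
    print(len(packet), "byte magic packet built")
```

Looping `wake` over the node MAC addresses with a short pause between them would also stagger the power-on surge, which may matter given the electrical work.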
Install PBS
Configure installed software to use PBS and produce generic scripts for users
Write some nice documentation for users on how all this works.
Drop user access to the individual nodes and configure firewalling on all machines with access to the 530s' private subnet.
Install other software as required (MOSIX, whatever)
As per above
Investigate the possibility of setting up jobs which run across all nodes
The SMC 1255TX-LP's are here and have been tested in testnode2 using the tulip drivers.
I suspect that there's a bunch of work to be done on this in terms of longer-term development.
Install system.
Kerberos Infrastructure
SNMP monitoring (could be used as a testbed).
Networking/firewalls.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh.