The Dice Project


Task:Beowulf
Group:iainr
Stage:1


Description

Configuration and management of Beowulf cluster.

Report changes

2002-03-08
slight update to minimum install
2002-03-03
defined minimum/full beowulf service and how to get there
2002-02-28
updated to do list of tasks
2002-02-18
Added actions, updated current situation.

I get to wimp out on this slightly as it partly depends on the arrival/install dates of hardware and the software choices. Also the electrical work is largely outside our hands.

Minimum service for 64 node GX240 cluster.

A cluster of nodes, each with one interface on wire-m, all of which are accessible to some netgroup via ssh. Required software will be installed and configured to use the private network for message passing/routing etc., but it will probably not be optimised for load balancing, and some per-user configuration of ssh or other connection service may be required. There will be no job scheduling, and software upgrades will be done as scheduled maintenance announced via motd/eduni.dcs.sys.
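As an illustration of what "configured to use the private network for message passing" might involve, the sketch below builds a machine list of private-network addresses for the 64 nodes, of the kind many message-passing launchers read. The subnet 192.168.10.0/24, the host numbering and the function name are assumptions made up for this example; none of them come from the actual cluster configuration.

```python
import ipaddress


def private_machine_list(subnet="192.168.10.0/24", node_count=64, first_host=1):
    """Return one private-network address per node, in node order.

    The subnet and numbering are hypothetical placeholders, not the
    real cluster addressing.
    """
    network = ipaddress.ip_network(subnet)
    # hosts() yields the usable addresses, skipping network and broadcast.
    hosts = list(network.hosts())
    start = first_host - 1
    if start < 0 or start + node_count > len(hosts):
        raise ValueError("subnet too small for the requested nodes")
    return [str(addr) for addr in hosts[start:start + node_count]]


if __name__ == "__main__":
    # One address per line, suitable for redirecting into a machine file.
    for address in private_machine_list():
        print(address)
```

Generating the list from a single subnet definition keeps the public wire-m addresses out of the message-passing configuration, which is the point of the minimum service described above.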

It should be possible to power up all the nodes with a minimum of fuss and without crippling the CO responsible.

Full service for 64 node GX240 cluster.

This will provide job scheduling and load balancing for the nodes and required software will be integrated with the scheduling software.

User access to the individual nodes will be removed and both interfaces on each node will be configured for private subnets (or subnets inaccessible to normal users).

Security on the nodes will be increased: probably some kind of limited firewall running on anything accessing the private networks, plus something monitoring logins.

Software updates will be run as scheduled jobs to avoid upgrading anything "underneath" running jobs.
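The idea of running updates as scheduled jobs can be sketched with a toy per-node queue: the update is submitted like any other job, so it cannot start until whatever is already running on that node has finished. This is a minimal model for illustration only; the class, method and job names are hypothetical and do not correspond to any particular scheduler, since the scheduling software has not been chosen here.

```python
from collections import deque


class NodeQueue:
    """Toy FIFO queue for a single node: at most one job runs at a time."""

    def __init__(self):
        self.pending = deque()
        self.running = None

    def submit(self, job):
        """Queue a job (a user job or a software update) behind existing work."""
        self.pending.append(job)

    def finish_current(self):
        """Mark the running job as finished and return its name."""
        done, self.running = self.running, None
        return done

    def start_next(self):
        """Start the next queued job only if the node is idle.

        Returns the job that is running after the call.
        """
        if self.running is None and self.pending:
            self.running = self.pending.popleft()
        return self.running


if __name__ == "__main__":
    node = NodeQueue()
    node.submit("user-simulation")
    node.start_next()
    node.submit("software-update")
    node.start_next()  # no effect: the user job still holds the node
    node.finish_current()
    print(node.start_next())
```

Because the update waits its turn in the same queue, nothing is upgraded "underneath" a running job; the cost is that an update can be delayed by long-running work.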

Full site-specific documentation on using the scheduling system will be provided, including example scripts where appropriate.

Minimum service for 16 node 530 cluster.

As minimum service for the 64 node cluster above, but the Myrinet routing will be hand configured (this will have to be done every time a node is rebooted) and there may be only a very limited set of software which supports Myrinet.

Full service for 16 node 530 cluster.

As full service for the 64 node cluster, but Myrinet routing will be configured automagically.

Actions for minimum setup on both clusters.

Actions to bring 530 cluster up to full service.

Actions to bring gx240 cluster up to full service.

Current Status as of 2002/03/08 10:47:32
Nodes
We have all the nodes and they all test out ok (hurray). Sample nodes have Redhat 7.1 with lcfg installed and are being used to work on kernel drivers and other assorted software.
Power
Power to the racking has been installed.
Racking
We are short 8 beams for the racking which are on back order, no delivery date as yet but it isn't holding us up (yet). Key Industrial have a delivery date of ~25 Feb. I need to chase this up.
KVM's
PS/2 Cabling order is currently stuck in the admin triangle.
Ethernet
The 4108gl and PSU/cards are here, waiting for George to do his magic, though he probably won't have time and we'll configure it manually.

The SMC 1255TX-LP's are here and have been tested in testnode2 using the tulip drivers.

Myrinet
Cards and switch have arrived; one of the cards has been quickly tested in a 530.
Hardware Monitoring
Doesn't look like we will be able to do this in the short term: the gx240s don't appear to have any monitoring hardware and the PC87365 on the 530s isn't supported. We may be able to get CPU info out of the Max 1617s on the Xeons but this apparently only works with 2 processors installed :(.

I suspect that there's a bunch of work to be done on this in terms of longer-term development.

Backup server
Is set up and installed.
Dependencies

Actions

As we're running way behind on this we are now aiming to get all the nodes up and running and leave it at that until (and if) there is some slack in DICE to do more work.

Current

2002-02-19
Port dhcp component to 7.1 and re-write to new lcfg standards.

Completed

2002-02-19
Port grub component to 7.1 and re-write to new lcfg standards. Basic component is now in place.
References



Unless explicitly stated otherwise, all material is copyright The University of Edinburgh