The Dice Project


Task:Network
Group:gdmr,cms,jmho
Stage:1


DICE Deployment: Network

George Ross

2002/03/20 16:34:05

Report changes

2002-03-20
Note non-filter access control requirements
Note that internal DNS-system use of hosts-format does not imply its external availability
DNSsec signing of zones for stage 2
DNSsec implies local DNS processing
More on NTP
First pass at site DICE wires
"Current tasks" and "actions" removed to the stage1progress page
2002-02-12
Merge existing "_T1" and "_whereAt" documents into one

Transition

As a bootstrapping procedure it is intended to provide an initial core DICE network based mainly on existing legacy mechanisms, thereby giving a stable though not fully-featured platform for others to work from. Once sufficient of the other parts of DICE are up and running it will then be possible to migrate most network provision to DICE. (Perimeter filtering will have to remain at least partly outside DICE for some time yet.)

Meantime it's important not to assume that machines are necessarily installed in their final network location!

Basic requirements

We're spread across five sites, with an EdLAN connection to each. We want there to be network robustness in the face of things going down, but we don't want to have to replicate absolutely everything. If communications to a site break down we need there to be a quick way to set up at least narrow-bandwidth paths to restore sufficient connectivity.

Physical Infrastructure

KB is reasonably OK. There are enough free switch ports for up to 40 new 100baseTX connections, though there is little room for expanding the number of Gbit-attached machines. This may need to be addressed in the medium term.

FH might need another card in the server-room switch, but apart from that should be OK on the 100baseTX front. Again, Gbit-connected machines may need to be considered later.

SB hasn't quite had all its switches fully deployed yet. We'll need to keep an eye on the server-area 100baseTX count. Again, Gbit-connected machines may need to be considered later.

BP needs some work to bring things fully up to spec. Once that's done there should be sufficient network space for full DICE deployment. There is some concern about the amount of physical space available in the machine room.

The KB EdLAN link has been upgraded to 1000baseSX. We're waiting to hear back from EUCS on the best way to upgrade FH, SB and BP -- they would have to be 1000baseLX over single-mode (SM) fibre, but it's not clear yet if there are fibres in place and where they should be terminated.

Inter-site communications

Options:
  1. Pass lots of common DICE subnets to all the sites using VLANs across EdLAN.
  2. Keep distinct subnets at the various sites, and route between them using EdLAN and the sites' external-access routers.
  3. As above, but run a transit subnet as a VLAN to all the sites and have a fast router at each to connect to it.

Moving to OSPF would make some things possible which aren't with RIP, and some other things easier. In particular, RIP(v1) assumes identical subnet masks; and feeding the same subnets into EdLAN using RIP from several places interacts badly with its internal routing. On the other hand, OSPF does make some things, notably route filtering, more complicated. Adopting OSPF would have to be coordinated with EUCS.
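
To illustrate the RIP(v1) point: a RIPv1 update carries only an address, with no mask, so a receiver has to fall back on the classful mask (or on the mask of its own interface in the same network). The following is a minimal sketch using Python's standard ipaddress module; the prefixes are made-up examples, not real DICE subnets, and the helper names are hypothetical.

```python
import ipaddress

def classful_prefixlen(address: str) -> int:
    """Return the classful (pre-CIDR) prefix length implied by the first octet."""
    first_octet = int(address.split(".")[0])
    if first_octet < 128:
        return 8    # class A
    if first_octet < 192:
        return 16   # class B
    if first_octet < 224:
        return 24   # class C
    raise ValueError("class D/E addresses have no classful network mask")

def ripv1_view(prefix: str) -> str:
    """What survives in a RIPv1 update: the network address only, with no mask."""
    return str(ipaddress.ip_network(prefix).network_address)

# Hypothetical subnets of one class B network, with different masks.
subnets = ["172.16.4.0/24", "172.16.4.0/26"]

# Once the mask is dropped, the two subnets can no longer be told apart.
advertised = {ripv1_view(p) for p in subnets}
```

Both example subnets collapse to the same advertised address, which is why mixed mask lengths need RIPv2 or OSPF, both of which carry the mask explicitly.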

Emergency "JCB-proofing" links to a site could be wireless, MegaStream, ISDN, modem or whatever happened to be available at the time. The University phone system doesn't offer ISDN. BT's ISDN2e lines cost roughly 300 pounds each for installation, plus 100 pounds quarterly rental; other suppliers might be around too.
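
As a rough worked example of the ISDN2e figures quoted above, the sketch below totals installation plus rental. The line counts and durations are purely illustrative assumptions; only the two prices come from the text.

```python
INSTALL_POUNDS = 300      # approximate installation charge per line (from the text)
QUARTERLY_POUNDS = 100    # approximate quarterly rental per line (from the text)

def isdn_cost(lines: int, years: int) -> int:
    """Approximate total cost in pounds of keeping `lines` ISDN2e lines for `years` years."""
    return lines * (INSTALL_POUNDS + 4 * years * QUARTERLY_POUNDS)

# One line for its first year: 300 + 4 * 100 pounds.
first_year_one_line = isdn_cost(lines=1, years=1)

# A hypothetical four lines kept for three years.
three_years_four_lines = isdn_cost(lines=4, years=3)
```

On these figures a single line costs about 700 pounds in its first year and 400 pounds a year thereafter.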

For KB it is also intended to redeploy the former EdLAN link components now that the main link has been upgraded to Gbit. The existing CS-EE 10base2+fibre IP-level backup connection will cease to exist on March 25th, and the redeployed components will replace it with a 100Mb ether-level connection under the control of Spanning Tree Protocol. This will provide some redundancy against switch faults at KB without the firewalling complications of the current setup.

Common subnets

There are several big problems with running lots of subnets between the sites:

The whole thing is just too messy to even contemplate a diagram. It's very definitely a last-resort choice, should there be some problem with both of the other two schemes, but it's hard to conceive of any such eventuality.

Distinct subnets

Keeping distinct subnets at each site solves those problems, but pushes all the inter-site traffic through the "front-door" filtering routers. Those would therefore need to be fast. There would also have to be at least a second router at each site for redundancy, though it could probably be smaller. Options: IPtables on Linux; IPfilter on big Suns; Cisco or equivalent, as EE are expecting to use for the new DEE network. We really don't like having dependencies on expensive one-off boxes, which argues against pure switch/router solutions. IPfilter is mature and proven; Suns are more expensive than PCs, but Ultra-5s or Blade-100s with GigaSwift cards might suffice. IPtables is still very definitely under development. On the other hand, the routing architecture for this design would be straightforward, and could use RIP (or perhaps RIPv2) throughout. There would be some load-sharing, though appropriate kernel support in the hosts being routed would be required in order to achieve maximal effect.

There's a diagram of what this would look like below. The semi-circles represent (one or more parallel) routers. Note that Appleton Tower would continue to run as a satellite site -- for simplicity the existing Sun routers at AT would be relocated or retired, and there would then be no inter-VLAN routing done there at all (hence there's no router shown there in the diagram). However, it makes a lot of sense to run the lab as a satellite of SB rather than of KB, particularly if SB's upgraded Gbit connection were to be terminated at AT rather than OC, so that's what's shown.

[[Diagram here...]]

One thing to be aware of with this approach (though one we do have at present with the "wireless" wire) is that any subnet which was shared across sites for whatever reason could only be routed at one site, and so all machines on that subnet would appear to be part of that routing site. For low-traffic or "unsupported" subnets that might not be a problem.

Transit subnet

We can avoid the problem of requiring a fast "front door" to each site by adding a private transit subnet. The diagram below is a logical representation of how this would look in practice. The semi-circles at each site represent routers. The ones which connect to EdLAN would do perimeter filtering; the ones which don't connect directly to EdLAN could do less, or perhaps no, filtering.

Bear in mind that this is the logical diagram. Physically the connections to the transit subnet run as VLANs over the same pieces of fibre as sites' external connections, so this arrangement doesn't give any additional redundancy against external connectivity problems. That would have to come through completely separate connections.

[[Diagram here...]]

Routing and filtering are both more complicated under this scheme but there are several compensatory benefits:

Note that subnets shared across sites still present some problems, though of a different nature. By using OSPF we would be able to advertise routes through more than one of our external routers. Against that, the route that any packet took to a host on a shared subnet would not necessarily be optimal, so we would not want there to be much external traffic to such hosts (the transit routers don't count here, as there should be very little external traffic sent directly to them, if any). Also, partitioning of the VLAN at EdLAN would result in some hosts on the subnet being unable to communicate with the others.

This option has its attractions, but we would need to liaise with EUCS to make it work. Initial discussions have been promising, and several implementation strategies appear possible. Which one is adopted would depend mainly on how they interacted with the EdLAN routers, as they all involve roughly the same amount of work at our end. Specifically, do we run in one OSPF area spanning all the sites or several; and do we share a common OSPF incarnation in the EdLAN routers with the rest of EdLAN, or do they run a separate incarnation for us? There may also be implications for aggregation and stubbiness, and for route filtering, but these shouldn't cause major problems either way.

We are aiming towards this scheme, though with the option of adopting the second if either the costs turn out such that its simplicity wins or the necessary routing framework cannot be set up jointly with EUCS.

South Bridge and Forrest Hill

At present SB and FH are run as though they were one site, with common VLANs carried across EdLAN. For DICE it is intended that they be split apart into two completely separate sites, each with its own subnets and local servers. This has implications for other tasks in evaluating their resource requirements.

Printing subnet

The printing group would like to establish a common subnet across all the sites. The most straightforward way to fit this into the schemes above is to have the print servers be dual-homed onto this shared wire and an appropriate site-specific one. These machines would then not advertise routes to the wire, or indeed not advertise any routes at all, effectively making it private (indeed, an RFC 1597 private subnet, as now specified by RFC 1918, might be used). The print servers would have to be set up in such a way that partition of their private subnet did not cause them problems.
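
A minimal sketch of the "don't advertise the private wire" idea follows. The interface names and prefixes are hypothetical (192.0.2.0/24 is a documentation prefix standing in for a site wire), and the sketch models only the first option in the text, where routes to the shared printing wire are suppressed; a print server that advertises nothing at all would be simpler still.

```python
import ipaddress

# The three private address blocks (the same ranges in RFC 1597 and RFC 1918).
PRIVATE_BLOCKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_private_wire(prefix: str) -> bool:
    """True if the prefix lies wholly inside one of the private blocks."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(block) for block in PRIVATE_BLOCKS)

def advertised_routes(interfaces: dict) -> list:
    """Prefixes a dual-homed host would advertise: everything except private wires."""
    return [prefix for prefix in interfaces.values() if not is_private_wire(prefix)]

# Hypothetical dual-homed print server: one site wire, one shared printing wire.
print_server = {"eth0": "192.0.2.0/24", "eth1": "192.168.50.0/24"}
routes = advertised_routes(print_server)
```

Only the site wire appears in the result, so no other machine would learn a route to the printing subnet.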

Network infrastructure machines

It is proposed that each DICE site should have a "network infrastructure" machine. There are various site-specific tasks which would be better performed locally, and these could usefully be combined on a modest-spec Linux machine with a reasonable amount of memory. These would include: local DNS cache, backup router, and site switch management host. This is in addition to the transit and external routers.

Perimeter (and interior) filters

Even in the Kerberised DICE world perimeter filters are still a Good Thing. They deny outsiders useful information, which with any luck will result in most of them going elsewhere. They provide defence in depth against misconfiguration and bugs. They protect the things which aren't Kerberised. And they can be used to control egress as well as ingress of packets.

Clearly the transit subnet model proposed above implies that the filter rulesets should be unified, given that packets for any site can, in principle, arrive through any other. Complete unification would not be necessary if the distinct subnets model were adopted, as there would be no point in a site's filter rules admitting all traffic destined for the other sites if those other sites' machines would not normally be accessible through the site (though JCB-proof paths might require that at least some other-site rules be incorporated). Of course, all sites' filter rules would accept all traffic originating from the other sites.

At present KB (and AT) filters are generated by combining rule files and executable scripts, each of which is designed to incorporate the rules necessary for a particular filtering task; while other sites are using more conventional static configuration files. Neither of these mechanisms would be suitable as-is: KB's generated rulesets are quite site-specific, while static files appear to be too inflexible. This area will require more investigation, though one possible approach would be to extend the KB mechanism to produce meta-ruleset scripts which would be suitable for execution at each site to generate the eventual site-specific rulesets.
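
One way the meta-ruleset idea might work is sketched below: a single list of rule templates is expanded with per-site parameters to give each site's ruleset. Everything here is an assumption for illustration: the site parameters, interface names and prefixes are invented, the rules are written in an IPfilter-like syntax, and this is not the existing KB mechanism.

```python
from string import Template

# Hypothetical site data; the prefixes are documentation addresses, not real subnets.
SITES = {
    "KB": {"local_net": "192.0.2.0/25", "ext_if": "hme0"},
    "SB": {"local_net": "192.0.2.128/25", "ext_if": "hme1"},
}

# Hypothetical meta-ruleset: one rule per entry, with $placeholders filled in per site.
META_RULES = [
    "pass in quick on $ext_if proto tcp from any to $local_net port = 22",
    "block in log on $ext_if from any to $local_net",
]

def generate_ruleset(site: str) -> list:
    """Expand the shared templates with one site's parameters."""
    params = SITES[site]
    # substitute() raises KeyError if a template names a parameter the site lacks,
    # which catches incomplete site definitions before anything is deployed.
    return [Template(rule).substitute(params) for rule in META_RULES]

kb_rules = generate_ruleset("KB")
sb_rules = generate_ruleset("SB")
```

Since the output is plain text, the expansion could in principle run on a DICE machine with the result copied to the filter machine afterwards.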

A complicating factor is that the best choice for perimeter filter software appears to be IPfilter, which does not run on current DICE platforms. Perhaps generating the meta-ruleset scripts on DICE machines for subsequent execution on the filter machines would suffice. This requires more thought.

Whatever network model and filtering software is used, the potential for asymmetric routing paths makes it desirable that the rulesets used be state-free in general. This is unfortunate, as stateful rulesets can be somewhat easier to write, but the alternative risks connections being broken as the underlying routing shifts. However, some statefulness would certainly be beneficial for transitory connections such as DNS queries and xdm's chooser mechanism. (The alpha-test versions of IPfilter 4.0 contain state-synchronisation code, so eventually all the perimeter filters could perhaps share state. This is still some way off, however!)
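
The asymmetric-routing problem can be shown with a toy model. This is a sketch only: the classes and host names are hypothetical, the "filters" track nothing but source/destination pairs, and the state-free rule is a crude stand-in for an "established"-style TCP flag test.

```python
class StatefulFilter:
    """Admits inbound replies only for flows it has itself seen leaving."""

    def __init__(self):
        self.flows = set()

    def outbound(self, src: str, dst: str) -> bool:
        self.flows.add((src, dst))   # remember the flow so its replies can return
        return True

    def inbound(self, src: str, dst: str) -> bool:
        # A reply is a recorded flow with its endpoints swapped.
        return (dst, src) in self.flows

def stateless_inbound(tcp_flags: set) -> bool:
    """State-free approximation: admit TCP segments carrying ACK, refuse a bare SYN."""
    return "ACK" in tcp_flags

kb_filter, sb_filter = StatefulFilter(), StatefulFilter()

# The request leaves through KB's perimeter filter.
kb_filter.outbound("inside-host", "outside-host")

# If the reply comes back the same way it is admitted...
reply_via_kb = kb_filter.inbound("outside-host", "inside-host")
# ...but if routing has shifted and it arrives at SB, that filter has no state for it.
reply_via_sb = sb_filter.inbound("outside-host", "inside-host")

# The state-free rule gives the same answer wherever the reply arrives.
reply_stateless = stateless_inbound({"ACK"})
```

The stateful filter at the second site drops a perfectly legitimate reply, which is the connection breakage described above; the state-free rule does not depend on which path the packet took.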

The initial proposal would be to create a unified ruleset based on the union of the existing ones, with site-specific hooks if the distinct subnets model is adopted. This would then be reviewed and adjusted before deployment, and could of course be altered in the light of experience.
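
As a first approximation, the union and the candidate site-specific hooks could be derived mechanically, as in the sketch below. The rule strings are invented placeholders rather than real filter syntax, and the function names are hypothetical. Rule order usually matters in real filters, so a naive union like this would only ever be a starting point for the review mentioned above.

```python
def union_rules(rulesets: dict) -> list:
    """Ordered union: each distinct rule once, in order of first appearance."""
    seen, unified = set(), []
    for site in sorted(rulesets):          # fixed site order keeps the result stable
        for rule in rulesets[site]:
            if rule not in seen:
                seen.add(rule)
                unified.append(rule)
    return unified

def site_specific(rulesets: dict) -> dict:
    """Rules that appear at exactly one site: candidates for site-specific hooks."""
    hooks = {}
    for site, rules in rulesets.items():
        elsewhere = set()
        for other, other_rules in rulesets.items():
            if other != site:
                elsewhere.update(other_rules)
        hooks[site] = [rule for rule in rules if rule not in elsewhere]
    return hooks

# Hypothetical existing rulesets, one list per site.
existing = {
    "KB": ["pass ssh", "pass smtp to mailhub", "block telnet"],
    "SB": ["pass ssh", "block telnet", "pass xdm chooser"],
}

unified = union_rules(existing)
hooks = site_specific(existing)
```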

On the assumption that legacy machines would be brought inside the DICE perimeter, the unified ruleset would also have to take account of legacy machine requirements.

In addition to establishing a sound perimeter, it would certainly be desirable for some internal machines to apply their own additional network filtering. Some additional internal firewalling of groups of machines might also be required; how this would be implemented remains to be considered.

Note also that there are other network-based access control mechanisms in place. Two which will certainly have to be reviewed are: the NFS share permissions, as it is the intention that all filesystems should be mountable everywhere; and TCP-wrapper rules, so as to ensure that consistent rules are applied across all sites.

DICE subnets and DNS

It would be cleanest overall if DICE and legacy subnets were kept disjoint, though in practice a complete separation probably won't be possible. There will be a requirement for non-.inf machines on .inf wires in any case, such as for the EdLAN routers, so having legacy machines as well wouldn't cause any additional namespace management problems at least from the DICE end. However, legacy sites will need to ensure that they have mechanisms in place to apply DICE-controlled addressing information to their legacy systems.

It is proposed in the first instance that the existing "makeDNS" program, as used for .dcs and fairly extensively around the rest of the University, be used to generate the DNS zone files. This utility transforms the well-understood /etc/hosts format into files suitable for feeding to a DNS master. It does have some limitations, and is certainly in need of an overhaul, but it should serve well enough for the first phase of the project. It is assumed that some form of remote file editing mechanism will be available to simplify the process, at least from its user's point of view. Note however that the suggested use of a hosts-format source file from which the DNS zones are generated does not imply any commitment to making the information available in such a format outside the DNS system itself.
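
For readers unfamiliar with the approach, the kind of transformation involved looks roughly like the sketch below. This is not makeDNS itself: the function name, the sample entries (documentation addresses under example.org) and the choice to turn hosts-file aliases into CNAME records are all assumptions made for illustration.

```python
import ipaddress

def hosts_to_records(hosts_text: str, domain: str) -> tuple:
    """Turn /etc/hosts-style lines into forward (A/CNAME) and reverse (PTR) record lines."""
    forward, reverse = [], []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments, then skip blank lines
        if not line:
            continue
        address, name, *aliases = line.split()
        fqdn = f"{name}.{domain}."
        forward.append(f"{name}\tIN\tA\t{address}")
        for alias in aliases:
            forward.append(f"{alias}\tIN\tCNAME\t{name}")
        # reverse_pointer gives e.g. "10.2.0.192.in-addr.arpa" for 192.0.2.10
        pointer = ipaddress.ip_address(address).reverse_pointer
        reverse.append(f"{pointer}.\tIN\tPTR\t{fqdn}")
    return forward, reverse

# Hypothetical hosts-format source using documentation addresses.
SAMPLE = """
# address     name   aliases
192.0.2.10    alpha  www
192.0.2.11    beta
"""

forward, reverse = hosts_to_records(SAMPLE, "example.org")
```

A real generator would also have to emit SOA and NS records, manage serial numbers and split the reverse records into the appropriate zones.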

The .inf space is currently managed jointly with .dcs; this should be split apart as soon as practicable.

None of the Informatics zones is signed at present. Doing so is not a stage-1 task, but should be considered again early in stage 2.

The existing dns object has been upgraded for bind9 and ported into DICE using "minimal conversion". Full ngeneric conversion has yet to be done, one of the problems being maintaining backward compatibility with the KB legacy Suns which also use the same code to configure their DNS.

DNS service is a cheap operation for modern machines, and there is really little reason not to run slave nameservers on all DICE systems. In any case, full end-to-end DNSsec implies that response signatures be checked as close to the calling application as possible. Dropping Hesiod will reduce the size of the zones carried considerably; and some optimisation in terms of the reverse zones would be possible though perhaps hardly worthwhile.

There are also advantages in establishing central servers at each site to carry all the DICE and legacy zones, and to act as caches through which most other machines would be configured to forward external queries. This might be done on the sites' network infrastructure machines.

At least two legacy names, dns.dcs.ed.ac.uk and dns2.dcs.ed.ac.uk, are widely known and will require to be perpetuated more-or-less indefinitely. MX and other backward-compatibility records will also be required for .dcs, .dai and .cogsci, and probably other legacy domains too. These domains could, of course, be served from the DICE nameservers; there's no particular reason to set up anything separate. However, the widely-known addresses for these machines are on external subnets, and at least in the first instance only Suns are likely to have sufficient protection available to them.

In the longer term there is the question of whether addresses are a property of the network, being given by it to machines, or whether they do actually belong to the machines and so come from lcfg. This question is likely to result in considerable debate, and isn't addressed further here! One argument in favour of the former viewpoint is that layering in networks is a good thing from the understandability and maintainability point of view.

Logging

Having a central log repository has been found to be beneficial at KB, and it is proposed that the practice continue. Exactly what should be logged remains to be decided. The network infrastructure machine would be a good place for logging to be done.

VPNs

Establishing a VPN ("virtual private network") endpoint would have many advantages for laptop users. Remote dialin users could be made to appear as though they were on the internal DICE network, which would get around awkward access restrictions, anti-spamming measures and the like; and the wireless network could have its security tightly screwed down if users had an alternative to direct network access.

Perhaps this is a stage-2 task, but if so it's one that's worth investigating early on.

Switch management

Switch management will be moved to a uniform system for DICE, running on the sites' network infrastructure machines. This will cover DICE, legacy and assorted hangers-on wires. The bits passing through the switches don't, of course, have to be DICE bits, regardless of how the switches are managed. This is dependent on a version of rfe able to edit configuration files at several locations.

Local network monitoring will also be performed, as is currently done, with common index pages pointing out to generated site-specific pages.

KB switches are currently managed using my package.

SB and FH switches are managed using my package, but need to be split apart. Some of the link names could do with being a little more descriptive.

BP isn't managed using any package yet; but that'll come when the network there is overhauled.

KB, FH and SB are all currently monitored from KB, but will move to being site-local once the switch configurations are moved to DICE.

NTP

Apart from Kerberos expecting to find machines' clocks synchronised, there are some other benefits from keeping time under control:

The existing time synchronisation network has three stratum-2 servers at KB, with all the other machines using them as time sources. (Serving NTP is a lightweight operation, as the daemons ramp the interval between queries up to several tens of minutes once everything has stabilised, so this doesn't cause any load problem.)
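
A back-of-envelope check of the load claim is sketched below. The client count is a made-up figure, and the 64-second and 1024-second poll intervals are the usual ntpd default minimum and maximum, assumed here rather than taken from our configuration.

```python
def queries_per_second(clients: int, poll_interval_s: int) -> float:
    """Average query rate seen by one server if every client polls it once per interval."""
    return clients / poll_interval_s

# Hypothetical population of 1000 clients, each polling every server.
startup_rate = queries_per_second(clients=1000, poll_interval_s=64)     # before ramp-up
settled_rate = queries_per_second(clients=1000, poll_interval_s=1024)   # after ramp-up
```

Even with every client polling every server, the settled rate on these assumptions is about one small UDP exchange per second per server.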

For the initial DICE deployment it is proposed to keep this existing setup. Once the dust has settled, the intention is to (logically) disperse the S2 servers across the Division sites for robustness, possibly also adding a fourth.

Two other aspects of our NTP net are adequate for the purpose for now but should be revisited again later:

Other things to think about

Where are we now?

  1. "At KB" we currently have:
  2. At FH/SB we have: FH and SB will split into two independent sites in the DICE world.
  3. At BP we will shortly have:
  4. The KB and centre-area external subnets will stay as existing. The transit subnet has been set up, with basic routing at KB using the legacy Suns; routing of this at other sites will have to wait for OSPF.

DICE wires at each site

The exact mix of DICE wires for each site has yet to be decided finally. There may well be no pressing reason for uniformity. Although in principle any machine should function in the DICE world just as well whichever wire it's attached to, in practice we'll want to assign them to particular subnets for the following reasons:
  1. At KB we might expect to have:
  2. At SB we might expect to have:
  3. At FH we might expect to have:
  4. At BP we might expect to have:

The "wireless" network is somewhat anomalous, as it currently exists across all four sites but is routed only at KB. As the use of a VPN endpoint is to be considered in stage 2, which will cover the wireless network as well as external access, the existing setup will be retained for now.

How do we get from here to there?

  1. We'll start off with a shared DICE development wire and the suggested transit wire. The existing experimental DICE wire CS-M should be suitable. The namespace is currently managed in common with .dcs, but this should be split out onto a new DICE management machine as soon as practicable. The existing subnets continue in place meantime.
  2. Once the rest of the DICE infrastructure is sufficiently in place we set up the individual sites' DICE wires, a site at a time, at which time existing development machines can migrate and new machines be installed. Subnets carrying the legacy non-DICE machines will remain in place as necessary.
  3. Existing non-DICE machines and subnets mostly fade away over several years.
  4. A minimal presence is then maintained for the old .dcs, .dai and .cogsci domains, probably indefinitely. Nameserver names are known in too many places for it to be realistic to think about expunging them all, and legacy mail (and web?) will require addressing information in the old domains even if the machines which service the requests live in the new world.

For future study

For future study:

Dependencies

The network task is dependent on the following other things happening:

The fall-back position, should the transit network not be possible for some reason, is to put all the traffic through the main "external" routers. If necessary, faster hardware might have to be thrown in, though this could be decided later in the light of experience.


1_network.html,v 2.7 2002/03/20 16:34:05 gdmr Exp



Unless explicitly stated otherwise, all material is copyright The University of Edinburgh