Contents

Agenda

List of Participants

Executive Summary

Synopsis

Comments by Participants

Presentation Materials

 

 

 

Proceedings prepared by
Todd Hansen, Mike Gannis, and Hans-Werner Braun

 

 

About the cover:

This image is a screen capture from a demonstration of the Cichlid visualization tool, presented at the workshop by Jeff Brown (NLANR). The visualization represents the logical layout of the vBNS, according to on-line vBNS documentation (see http://www.vbns.net/logical.html). Blue cubes correspond to MCI-vBNS POPs, blue pyramids correspond to current and planned Aggregation Points, red spheres correspond to current and planned vBNS Approved Institutions, and gray spheres correspond to the networks of current and planned vBNS Partner Institutions. The green cylinder represents an OC-48 link, purple cylinders represent OC-12 links, green cylinders represent OC-3 links, blue cylinders represent DS-3 links, and thin purple lines represent slower links of various kinds. The small red 'cubes' represent packets — they do not represent any real feature of the network, but were added to encourage the workshop participants to think about what types of meaningful data they could visualize this way.

The workshop and these proceedings were sponsored by National Science Foundation Cooperative Agreement No. ANI-9807479 with the National Laboratory for Applied Network Research at the University of California, San Diego. The Government has certain rights to this material. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other institutions.

Workshop Agenda

Thursday, 1 July 1999

Introductions

Hans-Werner Braun, National Laboratory for Applied Network Research

Status and Goals

NSF perspective

Bill Decker, NSF

I2/Abilene perspective

Guy Almes, Internet2/Advanced

NGI perspective

Phil Dykstra, Army Research Laboratory

Specific Work in Progress

Les Cottrell, SLAC DOE

Andy Germain, NASA

Matt Zekauskas, Internet2

Hans-Werner Braun, NLANR Measurement and Analysis Team

Tony McGregor, NLANR Active Measurement Program (AMP)

Related Work

End-to-end engineering considerations

Matt Mathis, NLANR/PSC

Needs for university network researchers

Arne Nilsson, North Carolina State University

Routing topology and stability measurements

Craig Labovitz, Merit

CANARIE

René Hatem, CANARIE

Discussions

How can we improve measurement and analysis?

How can NGI provide appropriate information to PITAC?

What are our action items as a result of this workshop?

List of Participants

Name, Affiliation

Guy Almes, Internet2/Advanced

Javad Boroumand, NSF

Hans-Werner Braun, NLANR/UCSD

K Claffy, CAIDA/UCSD

Les Cottrell, SLAC

Bill Decker, NSF

Phillip Dykstra, ARL/DOD/DREN

Mike Gannis, NLANR/SDSC

Andy Germain, NASA

Todd Hansen, NLANR/UCSD

René Hatem, CANARIE

Craig Labovitz, MERIT/University of Michigan

Matt Mathis, NLANR/NCNE/PSC/CMU

Tony McGregor, U of Waikato/NLANR

Arne Nilsson, ECE/North Carolina State University

Kevin Thompson, MCI WorldCom

Matt Zekauskas, Advanced/Internet2

 

Challenges and Opportunities for
Measurement and Analysis in a
High Performance Computing Environment

Executive Summary

The participants represented the NSF High Performance Connections program, the Internet2 consortium, and the Next-Generation Internet (NGI) initiative.

One objective of the workshop was to identify strategies and opportunities that derive from high-performance networking environments. These networks can be viewed as testbeds with unique features (unusually high data rates, closed rather than unbounded network topologies, and sizes that make it feasible to instrument a relatively large proportion of their nodes and links). What are the unique opportunities for measurement and analysis?

Another objective was to recommend a research agenda for the participants working in high-performance networking environments. Can we define what we should accomplish, and by when? What are the "needs," what are the "wants," and when should we have results to satisfy them? Realistically, what can we accomplish, and by what dates?

In addition to general measurement and analysis concerns, the NGI program faces a review in the near future and must provide information on the progress and successes of advanced networks in supporting cutting-edge science. Are the meritorious applications that justify the program getting the performance they need? If not, why not? What information is needed, and how and when can it be obtained? How can the reporting format be simplified so that someone who isn’t an expert in networking can easily understand it? NGI is concerned with application-to-application performance and not just with backbone metrics; this entails getting measurements from the researchers’ perspectives.

There does not appear to be systematic measurement of bandwidth, latency, and QoS to the desktops of the 100 Mbps and 1 Gbps testbeds, and in the 1999 review the President’s Information Technology Advisory Committee (PITAC) "was unable to learn how well the NGI testbeds are operating."

Payoffs of having a wide-area network analysis infrastructure include:

• A systemic view of Internet complexity

• Development of commonly agreed-on scalable service models

• Service metrics and accountability for resource consumption: traffic signatures, aggregation of transactions vs. long-term flows; workload profiles, changes in usage; applicability to the real world

• Increased stability of the routing and addressing infrastructure

• Global Internet problem isolation and resolution

• Support for security and engineering of the system in a global context, across both political and ISP boundaries.

Participants’ suggestions to improve and coordinate measurement and analysis activities included:

• Definition of standard file formats and reporting mechanisms for all of the data being collected in support of consistent measurement and reporting throughout the backbone networks, within gigapops, and at universities.

• A portable passive monitor (OCXmon) that can be set up to test a specific network location would be a good thing for each connection site campus to have.

• Active monitors (e.g., AMP systems) should be moved further into campus to get a more precise idea of application-to-application performance.

• More instruments are needed along multiple points in a connection path to compartmentalize where TCP problems appear. Researchers need the ability to trace traffic close to the sender and receiver of a TCP connection, as well as at locations within the wide-area network.

• Correlation of data acquired through active and passive monitoring efforts.

• Open exchange of data, measurement tools, and analysis tools among researchers.

• To ensure better cooperation from campuses, perhaps participation in measurement activities should be required as a qualification for being an "NGI site" (it is too late to do this for HPC sites).

Action items for NGI participants in support of the next PITAC review were identified as:

• Identify and begin testing 100 NGI sites with 100 Mbps end-to-end capability

• Set up a measurement mesh between NGI sites

• Build an acceptable "traffic report"

• Keep better track of the top speeds and applications on the network

• Put greater focus on the campus and on end systems.

Near-term NGI measurement objectives included:

1. Use a few OCXmons at a few NGI sites to detect high-performance flows, in order to determine what degree of performance real applications are getting without having to find and instrument the applications themselves.

2. Start testing and tuning performance between the 100 100-Mbps sites so that we can demonstrate 60+ Mbps on a regular basis.

3. Demonstrate a few "gigabit" flows in a few special cases.

4. Using passive OCXmon flow analysis, diagnose sender/receiver TCP/app behavior and make recommendations on how to tune/improve specific high speed applications. This would require significant development time and is subject to privacy concerns, but participants believed it would be of great benefit in improving application performance.

Synopsis

Hans-Werner Braun opened the workshop with introductions and a statement of objectives. The participants represented the NSF High Performance Connections program, the Internet2 consortium, and the Next-Generation Internet (NGI) initiative.

One objective of the workshop was to identify strategies and opportunities that derive from high-performance networking environments; these networks can be viewed as testbeds with unique features (unusually high data rates, closed rather than unbounded network topologies, and sizes that make it feasible to instrument a relatively large proportion of their nodes and links). What are the unique opportunities and challenges that face network measurement researchers? (Some challenges include multiple high-performance service providers, demanding applications, new and experimental technologies and protocols, limited High-Performance Network Service Provider aggregation points, and restricted numbers of users and uses of the networks.)

Another objective was to recommend a research agenda for the participants working in high-performance networking environments. Can we define what we should accomplish, and by when? What are the "needs," what are the "wants," and when should we have results to satisfy them? Realistically, what can we accomplish, and by what dates?

Status and Goals

Bill Decker, Director of the NSF’s Advanced Network Infrastructure program, discussed expectations for measurement and analysis research from the perspective of the NGI and the need for information:

• Treating the NGI as a testbed – Using the results of our research to see how wide-area networks are performing.

• Enabling applications – The NSF should identify next-generation applications that have been enabled by the network (i.e., that would not exist or would not be as effective).

• Citing results – NSF is very interested in being able to cite the knowledge, tools, and results gained from NGI networking.

Decker urged the workshop participants to identify the short-term and long-term measurement, analysis, and reporting tasks that need to be accomplished to inform the President’s Information Technology Advisory Committee (PITAC) and the NGI review next spring. It would be very good for the network research and analysis community to be able to demonstrate that scientists can do better science using high-performance networks because we have used our measurements to improve the functionality of these networks.

Decker had two requests for workshop participants resulting from these expectations:

• Come up with some plans for near-term and long-term use of the measurement and analysis work that has been undertaken. In the near term, produce some reports on the current state and progress of the research; in the long term, create plans and identify additional focus areas.

• Help the NSF consider what the nature of the funding programs in this area should be – what are the hot topics, who are the communities of interest, what is an appropriate level of support, where should we be going?

Guy Almes of Internet2 discussed measurement and analysis from an Internet2/Abilene perspective. Almes believes that the network and its applications each drive the other – that applications’ continual need for greater capabilities drives network engineering, and that new capabilities provided by advances in networking motivate developers to create more ambitious applications. The high-performance networks are characterized by a combination of high bandwidth, wide area, and intrinsically bursty applications; there is a need for multicast, QoS and measurement. He then discussed Abilene’s network infrastructure and plans for increasing bandwidth and the number of interconnections. Consistent with Joint Engineering Team planning, they are planning three high-speed peering points with the vBNS, ESnet, and other federal networks at the NGIXes. Almes’ "bottom line" is that Internet2 must support inter-university networking and advanced applications, must support a university/gigapop/backbone/NGIX infrastructure with multiple support organizations and multiple kinds of engineering, and must have a commitment to cooperative end-to-end solutions.

His measurement perspectives are that monitoring traffic utilization (periods, formats, etc.) is easy and "boring" but important to do well, that active measurements are good for throughput and can give very accurate one-way delay and loss numbers, and that passive measurement devices such as OCXmon are especially useful because each monitor is useful to all other sites.

A discussion raised several noteworthy points:

• Matt Mathis noted that all TCP/IP bugs appear as less-than-expected performance from the network. These problems discourage use of the infrastructure. "If the network performed better, then people would use it more." (As an example, as modems have increased in speed, Internet utilization by home computer users has increased because they found the net more enjoyable to use.)

• Regarding NGI’s questions: We know what we can currently measure. Can we use those measurements to answer the questions we face? Are the meritorious applications that justify the program getting the performance they need? If not, why not? How can the reporting format be simplified so that someone who isn’t an expert in networking can easily understand it?

• Does Abilene have OCXmon? "No, and it’s something we should consider."

• Support of scientific applications is different from supporting the commodity Internet, and Congress understands this. But unless high-performance networks for research and engineering are limited to being intranets, we have to think about turning Internet2/NGI into a commodity. For campuses to purchase OC-3 and OC-12 services from a provider as a commodity, measurements of performance will be necessary. There is no way at present for HPC or NGI to give a researcher an "allocation of connectivity" – we can’t yet specify it, measure it, or verify it.

• The high-performance backbone networks (vBNS and Abilene) represent a "universe that is closed from end to end." This may allow us to take measurements that can’t be made on the unbounded commodity Internet, and to discover things that we otherwise would not learn.

• Matt Mathis explained that there is a need to instrument along multiple points in a connection path to compartmentalize where TCP problems appear. He sees a need to have the ability to trace traffic close to the sender and receiver of a TCP connection, as well as at locations within the wide-area network.

Phil Dykstra of the Army Research Laboratory presented the NGI perspective. He gave an overview of the Defense Research and Engineering Network (DREN) and the Large Scale Networking (LSN) Joint Engineering Team (JET) that coordinates six major networks (Abilene, DREN, ESnet, NISN, NREN, and vBNS), and described the peering relationships and interconnects of these networks. (A map/logical diagram of the peering relationships is included in these proceedings.)

According to Dykstra, NGI has three goals: (1) to enable research, (2) to develop testbeds, and (3) to enable high-speed and/or data-intensive applications. There are two identified subdivisions of the testbed goal:

 Goal 2.1 – hook up 100 sites at 100 times current end-to-end network speeds (essentially 100 Mbps)

 Goal 2.2 – hook up 10 sites at 1000 times current end-to-end network speeds (essentially 1 Gbps)

Developing a list of 100 sites to meet goal 2.1 is a significant challenge. In addition to the "last mile problem" in qualifying sites, it is also difficult to identify the point of contact at each institution.

Dykstra made it clear that NGI is concerned with application-to-application performance and not just with backbone metrics. This means we need to get measurements from the researchers’ perspectives. He also mentioned the NGI/PITAC desire for an "Internet Traffic Report" similar to the one currently available for the commodity Internet – a simple presentation that gives non-specialists an idea of the payoff that these expensive networks provide. However, the commercial "traffic report" is derided within the measurement and analysis community as hopelessly inaccurate. "The challenge is to produce something like the Internet Traffic report that we don’t consider bogus."

Dykstra discussed various methods of network optimization for better end-to-end throughput, such as raising the MTU of links, reducing latency, and reducing loss. (He also advocated turning off slow alternate interfaces to avoid using them by mistake.)
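Why these three levers matter can be illustrated with a widely cited steady-state approximation for TCP throughput (Mathis, Semke, Mahdavi, and Ott, 1997): rate ≤ (MSS / RTT) × C / sqrt(p), with C ≈ 1.22. This model was not part of Dykstra's presentation and is used here only as an assumption; the function name and the path values below (70 ms round trip, loss rate of 1 in 10,000) are hypothetical.

```python
import math

def tcp_rate_bound_mbps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on steady-state TCP throughput, in Mbps.

    Assumed model: rate <= (MSS / RTT) * C / sqrt(p). It ignores
    window limits, timeouts, and host bottlenecks.
    """
    if mss_bytes <= 0 or rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("need mss > 0, rtt > 0 and 0 < loss_rate <= 1")
    bytes_per_second = (mss_bytes / rtt_s) * c / math.sqrt(loss_rate)
    return bytes_per_second * 8 / 1e6

# Hypothetical cross-country path: 1500-byte MTU (1460-byte MSS).
baseline = tcp_rate_bound_mbps(1460, 0.070, 1e-4)

# Each of the three levers applied on its own.
larger_mtu = tcp_rate_bound_mbps(8960, 0.070, 1e-4)    # 9000-byte MTU
lower_latency = tcp_rate_bound_mbps(1460, 0.035, 1e-4)  # half the RTT
lower_loss = tcp_rate_bound_mbps(1460, 0.070, 2.5e-5)   # a quarter of the loss

for label, value in [("baseline", baseline), ("larger MTU", larger_mtu),
                     ("lower latency", lower_latency), ("lower loss", lower_loss)]:
    print(f"{label:14s} {value:8.1f} Mbps")
```

Under this model the bound scales linearly with segment size and inversely with round-trip time, but only with the inverse square root of the loss rate, so loss must fall by a factor of four to double the achievable rate.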

Dykstra’s action items for NGI are to:

• Start tabulating and testing the "100 sites" of goal 2.1

• Set up a measurement mesh between NGI sites

• Build an acceptable "traffic report"

• Keep better track of the top speeds and applications on the network

• Put greater focus on the campus and on end systems.

In a discussion, Hans-Werner Braun asked whether it was possible to identify machines that are "more equal than others" – that are privileged to ping to make measurements. A method must be found to allow security systems to discriminate between tests and attacks.

In addition, there seemed to be widespread advocacy for "test" machines on campuses near the DMZ to make measurements of campus network performance.

Specific Work in Progress

Les Cottrell of SLAC spoke on end-to-end monitoring in ESnet for high-energy and nuclear physics (HENP) and the relevance of ping. PingER treats the Internet as a black box. It provides useful real-world measures of network round-trip response time, loss, reachability, and jitter. It is low-cost, widely available, mature, and well understood; it does not require clients to install software; it needs no special privileges at monitor sites; and it needs 100 bps per link and about 600 KB per month per link. It does not make one-way measurements.
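The four ping-derived measures named above can be sketched in a few lines of Python. This is an illustration only, not PingER's actual code: the function name is hypothetical, lost probes are represented as None, and jitter is taken (as an assumption) to be the mean absolute difference between consecutive round-trip times, which may differ from PingER's own definition.

```python
from statistics import median

def summarize_pings(rtts_ms):
    """Summarize one batch of ping probes to a single remote host.

    rtts_ms: round-trip times in milliseconds, in send order,
             with None for each probe that got no reply.
    """
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("no probes were sent")
    replies = [r for r in rtts_ms if r is not None]
    summary = {
        "loss": (sent - len(replies)) / sent,  # fraction of probes lost
        "reachable": bool(replies),            # any reply at all?
        "min_rtt_ms": None,
        "median_rtt_ms": None,
        "jitter_ms": None,
    }
    if replies:
        # Assumed jitter definition: mean absolute change between
        # consecutive successful round-trip times.
        diffs = [abs(b - a) for a, b in zip(replies, replies[1:])]
        summary["min_rtt_ms"] = min(replies)
        summary["median_rtt_ms"] = median(replies)
        summary["jitter_ms"] = sum(diffs) / len(diffs) if diffs else 0.0
    return summary

# Hypothetical batch of five probes, one of them lost.
print(summarize_pings([80.0, 82.0, None, 81.0, 90.0]))
```

Because only round-trip replies are observed, nothing in such a summary can separate the outbound path from the return path, which is the limitation noted above.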

The monitoring program is being undertaken at 18 sites worldwide (5 on ESnet, 2 on vBNS, 2 in Canada, 7 in Europe, 2 in Asia). There are 1261 monitoring-remote-site pairs and 379 unique hosts at 272 sites, and 50 "beacon sites" in 27 countries. They do more than 1 million probes of the Internet per day, and have more than four years of data.

PingER tools also are deployed by the Cross-Industry Working Group (XIWT) at 10 monitoring sites, 150 pairs. This is mostly full-mesh pinging, and is for a mainly commercial community of interest.

Surveyor/RIPE measurements use a dedicated PC running Unix at key sites, with GPS for clock synchronization; this allows measurements of one-way delay and loss. The community of interest is Internet2 clients. The HENP community is cooperating with Surveyor, and is using PingER analysis tools on Surveyor data.

Cottrell compared PingER results with Surveyor data between SLAC, FNAL, and CERN from Nov 1998 to May 1999. The two methods are complementary and the results agree well. Cottrell suggests that Surveyor be used for high detail and PingER every half hour for general trend analysis. They also intend to correlate AMP data (SLAC now hosts an AMP machine).

Networks have been improving with time with respect to round trip time and packet loss – Cottrell has seen a ten-times improvement in the last five years. Limiting and blocking of ping was first noticed in 1996, and is an attempt to protect against attacks (ping of death and smurfing). It is currently a small effect, seen on about 2% of hosts. ESnet and vBNS/Internet2 seem well-configured to provide good service within and between their nets. Intercontinental service is "poor to bad."

Monitoring is needed today to manage bandwidth; in the future it will be needed to gauge and manage QoS. End users need monitoring to know what to expect, write SLAs, set baselines, identify problems, and make plans.

More information is available at http://slac.stanford.edu/pubs/slacpubs/7000/slac-pub-7961.html and at http://slac.stanford.edu/com/net/wan-mon.html.

Andy Germain followed with a presentation on NASA’s networks. He began by noting that NASA is now a network user and not a provider and therefore has no internal knowledge of the network. He wanted us to focus on end-to-end performance as well as how to use the network effectively.

Matt Zekauskas presented on Internet2 and work in progress. Abilene currently is making active measurements with Surveyor; they are doing no flow monitoring or passive data collection, but would like to start. Within the Internet2 consortium, individual universities and gigapops are doing flow measurement using Coral/OCXmon, Netflow, and RTFM. They are developing SNMP utilization statistics, and want to develop some sort of network QoS analysis based on active and passive measurements as well as link utilization statistics. This analysis would help them in their end goal, which is to provide users with good, reliable end-to-end application performance.

Zekauskas described the need for a standard file format and reporting mechanisms for all of the data being collected in support of consistent measurement and reporting throughout the Internet2 structure, at universities, within gigapops, and within the backbone networks. He thinks a portable machine that can be set up to test a specific network location would be a good thing for each connection site campus to have. He also discussed the need for multicast measurement tools.

Hans-Werner Braun and Tony McGregor of the NLANR measurement and analysis team described their measurement architectures and goal of creating a general network analysis infrastructure. They view the challenges and opportunities of wide-area network research as:

• Expanding the network infrastructure to support measurements and analysis

• Formulating questions that realistically can be answered

• Supporting outside researchers with data and other assistance

• Improving the quality and broadening the availability of the data

• Improving and integrating analysis and visualization tools

• Aggregating various data sets for correlation

• Validating data and results

• Reporting the results for high-performance environments accurately, consistently, and understandably to the community and to non-specialists

Braun described the OCXmon passive monitoring system and its capabilities. They can collect passive traces on the following media: Ethernet 10/100, DS3, FDDI, OC-3, OC-12 ATM (and POS soon). NLANR has developed several analysis tools to analyze the data. Cichlid can be used to view the characteristics of trace data. A daily Max Throughput chart is being developed in an effort to characterize the type of traffic going through the links.

McGregor described AMP, the Active Measurement Program. He mentioned the need to move AMP systems further into campus to get a more precise idea of application-to-application performance. AMP monitor PCs have no GPS card, and can only measure round-trip timings rather than one-way delays – does this matter? He also discussed the need for validation, the need for a light-weight throughput test, and an idea for triggered throughput tests. He also brought up the IPPM protocol as a solution to data analysis problems. McGregor finished with a discussion about the idea of integrating active measurement data with other data collected such as passive data.

Braun sees the payoffs of having a wide-area network analysis infrastructure as:

• Giving a systemic view of Internet complexity

• Addressing the persistent lack of commonly agreed-on scalable service models

• Service metrics and accountability for resource consumption: traffic signatures, aggregation of transactions vs. long-term flows; workload profiles, changes in usage; applicability to the real world

• Increasing the stability of the routing and addressing infrastructure

• Global Internet problem isolation and resolution

• Support for security and regulation of the system in a global context, across both political and ISP boundaries.

The challenges that face us are:

• How can we make the most of the flood of data — gigabytes per day — that we are collecting?

• How can we help to educate the larger community?

Related Work

Matt Mathis discussed what TCP tells us about the network. He enumerated the types of problems we can diagnose from simply taking passive trace data from various measurement points. From any tap on the forward data path you can get classic flow statistics (total data, total time, average rate, packet sizes), round trip time, loss rate, and window; you can tell if bottlenecks are upstream or downstream, you can identify some application duplex problems, and you can determine whether the rate is network-limited or host-limited. From a pair of passive monitors, you can diagnose receiver mistuning, find receiver CPU or application bottlenecks, and locate receiver bugs (closer to the receiver is better). From a pair of monitors near the sender, you can identify sender mistuning, sender CPU or application bottlenecks, and sender bugs.
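As an illustration of the "classic flow statistics" available from a single tap, the following Python sketch reduces one flow's packet records to total data, total time, average rate, and a packet-size histogram. The record layout (timestamp in seconds, length in bytes) and the function name are assumptions made for this example; they are not the OCXmon trace format or any NLANR tool.

```python
from collections import Counter

def flow_stats(packets):
    """Classic per-flow statistics from passively captured packets.

    packets: list of (timestamp_s, length_bytes) tuples that all
             belong to the same flow.
    """
    if not packets:
        raise ValueError("flow has no packets")
    times = [t for t, _ in packets]
    total_bytes = sum(length for _, length in packets)
    total_time = max(times) - min(times)
    return {
        "total_bytes": total_bytes,
        "total_time_s": total_time,
        # The rate is undefined for a flow seen at a single instant.
        "avg_rate_bps": total_bytes * 8 / total_time if total_time > 0 else None,
        "packet_sizes": Counter(length for _, length in packets),
    }

# Hypothetical four-packet flow observed over two seconds.
stats = flow_stats([(0.0, 1500), (0.5, 1500), (1.0, 40), (2.0, 1500)])
print(stats["total_bytes"], stats["total_time_s"], stats["avg_rate_bps"])
```

The remaining diagnoses Mathis listed (round trip time, loss rate, window, and the location of bottlenecks) require matching data segments with their acknowledgments, which is beyond this sketch.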

Mathis noted that users have no experience with network applications operating the way they should, so they don’t know when they should complain.

Proposed in discussion was the concept of a Web site that a user’s endpoint machine could access that would diagnose TCP tuning problems and suggest ways to increase performance. Alas, this idea does not work – measurements to diagnose problems should be made next to the sender rather than the receiver. Perhaps the only useful thing such a Web site could tell a client is whether the receiver (client) socket buffer is set up correctly. The group also discussed the idea of setting up a monitor to watch the network for any applications having trouble and to diagnose their problems; network administrators who had this tool could contact the user and help them fix their applications. The consensus of the workshop participants was that this was an interesting idea but probably would be viewed as too great a violation of users’ privacy.

Arne Nilsson of North Carolina State University gave his views of the needs of university network researchers. NCSU has been involved in high-speed networking with NC-REN (1985), VISTAnet gigabit testbed (1988), North Carolina Information Highway (1994) OC-3 and OC-12 ATM, and VITALnet (1996) 2.4 Gbps between North Carolina university campuses. Ongoing measurement activities are on the NCSU backbone network and OCXmon measurements on the North Carolina Networking Initiative’s NC GigaPoP. Their current measurements take approximately 60 traces per day, with 500,000 data points per trace; the interval between traces is randomized with a mean of about 20 minutes. They log the arrival time for packets, packet lengths, source, and destination ports. They are examining the type of traffic according to port (http, ftp, etc.), the distribution of port traffic, and long-range trends in traffic. Nilsson described developing multi-level traffic models to analyze the performance of their networks. Their objective is to generate maps of long-term trends. They use many homegrown statistics tools, but they also use SAS packages; SAS can handle 1 million data points.
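Classifying traffic by port, as described above, can be sketched as follows. This is a minimal illustration and not the NCSU tooling: the record layout (source port, destination port, packet length) and the function name are assumptions, and the port table lists only a few standard well-known ports.

```python
from collections import defaultdict

# A few standard well-known ports; a real table would be much longer.
WELL_KNOWN_PORTS = {20: "ftp-data", 21: "ftp", 25: "smtp", 53: "dns", 80: "http"}

def byte_share_by_service(records):
    """Fraction of traced bytes attributed to each service.

    records: iterable of (src_port, dst_port, length_bytes) tuples.
    A packet is attributed to whichever endpoint uses a well-known
    port, so both directions of a transfer are counted together.
    """
    totals = defaultdict(int)
    for src_port, dst_port, length in records:
        service = (WELL_KNOWN_PORTS.get(dst_port)
                   or WELL_KNOWN_PORTS.get(src_port)
                   or "other")
        totals[service] += length
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {service: count / grand_total for service, count in totals.items()}

# Hypothetical four-packet trace.
trace = [(1025, 80, 1500), (80, 1025, 500), (1030, 21, 100), (4000, 5000, 400)]
print(byte_share_by_service(trace))
```

Repeating this tally for each trace gives one distribution per sampling interval, the kind of time series from which long-range trends could then be studied.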

Nilsson’s conclusion: "It is awfully difficult to make traffic predictions!" Asked how the attendees could help his efforts, he responded that they could do so by providing access to data and by coming up with statistical measurements.

Craig Labovitz presented his work with routing topology and his analysis of the churn of the global routing tables. "We see lots of problems with BGP growing much past where it is today." Withdrawing and reinserting a route takes a toll on the network. In practice, even a multi-homed network is not guaranteed 100% uptime, because it can take 30 minutes for the routes to switch. CIDR has solved the problem of the number of prefix entries; it has not solved the problem of the number of routes. He raised the question of how to provide a fault-tolerant Internet service. He also noted that many networks pull their routes every night in order to reconfigure their routers. Labovitz has performed some interesting tests, such as inserting a route failure into the network and watching how long it took for the change to propagate to different locations.

René Hatem followed with a presentation on CANARIE, the Canadian high-performance backbone program that runs CAnet-2 and CAnet-3 with the objective of providing high-speed connectivity to scientific researchers. They currently are using Surveyor and OCXmon devices, flow analysis with cflowd, and other measurement tools. As with the American networks, CANARIE’s measurement and analysis activities are intended to keep their sophisticated users informed on the status and performance of the network, to "improve the network user’s experience end-to-end," to support operations and engineering, and to demonstrate return on investment to their funding agency. Hatem also agreed with the previously expressed opinion that traffic will flow in higher quantities as network performance increases and users figure out how to use it. He also expressed a need for more tools for doing performance analysis.

Discussions

Bill Decker began a lengthy discussion session with the topic of what we as a measurement community should set as our short-term and long-term goals.

He began by listing NGI’s interests: advanced network research, NGI testbeds, NGI applications, geographic research, minority access, and technology transfer.

NGI testbeds enable NGI applications by providing the necessary end-to-end bandwidth, low latency, acceptable Quality of Service, and good security for a critical mass of end-users. (Decker noted that security hasn’t been dealt with in a very substantial way.) The 1 Gbps testbed (NGI goal 2.2 of Phil Dykstra’s presentation), which is at the leading edge of current commercial off-the-shelf (COTS) products and technologies, is of particular importance. The goal is to bring 10 sites into the testbed, but the number of end-user machines at each site is still undetermined.

Unfortunately, there does not appear to be systematic measurement of bandwidth, latency, and QoS to the desktops of the 100 Mbps and 1 Gbps testbeds, and in the 1999 review the President’s Information Technology Advisory Committee (PITAC) "was unable to learn how well the NGI testbeds are operating."

The NGI needs to learn the number of testbed network desktops and sites. It would also be very helpful if the agencies reported daily averages and peak-minute measurements of these metrics between many endpoints and for many links (much as http://www.internettrafficreport/ does).

An easily digested reporting scheme, as exemplified by the "Internet Traffic Report" (http://www.internettrafficreport.com/), is needed (preferably a more meaningful one than the example).
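One way to reduce a day of raw samples to the two headline figures requested above, a daily average and a peak-minute value, is sketched below in Python. The input layout (second of the day, measured throughput in Mbps) and the function name are assumptions chosen for illustration; the same reduction would apply to latency or loss samples.

```python
from collections import defaultdict

def daily_report(samples):
    """Daily average and busiest one-minute average of a metric.

    samples: iterable of (second_of_day, mbps) tuples, where
             second_of_day is an integer from 0 to 86399.
    """
    per_minute = defaultdict(list)
    values = []
    for second, mbps in samples:
        per_minute[second // 60].append(mbps)  # bucket by minute of day
        values.append(mbps)
    if not values:
        raise ValueError("no samples for this day")
    minute_means = {m: sum(v) / len(v) for m, v in per_minute.items()}
    peak_minute = max(minute_means, key=minute_means.get)
    return {
        "daily_avg_mbps": sum(values) / len(values),
        "peak_minute": peak_minute,
        "peak_minute_avg_mbps": minute_means[peak_minute],
    }

# Hypothetical samples covering the first three minutes of a day.
print(daily_report([(0, 10.0), (30, 20.0), (60, 50.0), (90, 70.0), (120, 5.0)]))
```

Reporting both figures addresses the distinction between typical and high-end use: the daily average reflects ordinary load, while the peak minute shows what the path delivered at its busiest.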

As stated earlier in the workshop, NGI’s action items are to:

• Start tabulating and testing the "100 sites" of goal 2.1

• Set up a measurement mesh between NGI sites

• Build an acceptable "traffic report"

• Keep better track of the top speeds and applications on the network

• Put greater focus on the campus and on end systems, including the "last mile" problem and TCP tuning.

In the 1999 review of the NGI program, the maximum reported end-to-end performance was about 10 million bytes per second using IP and 0.6 MB per second for TCP. These are disappointing figures. As the NGI is implemented, there is room for substantial progress. (However, it was noted that no one seems to know the source of these figures – people keep citing them, but their authenticity is unknown.)

Hans-Werner Braun noted that we need to be able to put together a "compelling story" on one or two pages, with references to more information. We also should differentiate between "typical" and "high-end" users.

What responsibilities do the connection awardees have to help us meet these requirements? In free discussion, the idea was advanced that perhaps the sites can obtain testimonials from end users about how the high-performance network has enabled their research.

Bill Decker asked whether it is feasible to set up a measurement mesh between NGI sites. If so, how and what kind?

Hans-Werner Braun answered that AMP is easy to procure and install, while Surveyor and OCXmon are significantly more expensive and time-intensive to deploy. An AMP station can be built and installed in about one week, for about $1,500.

Javad Boroumand suggested that participation in measurement activities be required as a qualification for being an "NGI site."

Matt Mathis asked whether a measurement mesh between NGI sites optimally should be an N² mesh, or should we measure N/2 point-to-point links? (The latter would be easier to do, and it would be easier to do analysis to demonstrate high performance.) Perhaps what we need to do is find the big flows, and track down the users who create them (vs. broadcasting a request for "power users" to identify themselves).
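To give a sense of scale for the two options, the Python sketch below counts the measurements each would require. It assumes that "N/2 point-to-point links" means pairing each site with exactly one partner site, which is one reading of the remark rather than something the discussion spelled out. The function names are hypothetical.

```python
def full_mesh_pairs(n):
    """Distinct site pairs in a full mesh (each pair counted once)."""
    return n * (n - 1) // 2


def full_mesh_directed(n):
    """Measurements if each direction of every pair is tested separately."""
    return n * (n - 1)


def paired_links(n):
    """Links to measure if every site is paired with one partner."""
    return n // 2


n = 100  # the "100 sites" of NGI goal 2.1
print(full_mesh_pairs(n), full_mesh_directed(n), paired_links(n))
```

For the 100 sites of goal 2.1 this gives 4950 distinct pairs (9900 if each direction is measured separately) against 50 paired links, which shows why the second option is so much easier to run and analyze.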

It was proposed that we might have three tiers of performance demonstrations. The first tier would be our machines with measurements for all 100 sites. The second tier would be a list of machines that we know can perform at the targeted speed. The third tier would be a list of applications and machines that we have tested or measured at the expected speed. It was mentioned again that we should use trace data to identify TCP tuning problems.

Bill Decker asked whether it is possible to identify "interesting" high performance applications. The PITAC reviewers want to see something fundamentally next-generation, fundamentally higher-performance by an order of magnitude than the current state of the art.

Guy Almes asked whether the PITAC reviewers might be more interested in a dozen examples of projects that enable good science – application-specific examples of high-performance networking success stories. If we then had measurements to back up the capabilities for the rest of the 100 sites, that would be sufficient information.

Hans-Werner Braun recommended that resources should be made available to upgrade users’ infrastructure. He noted that it’s astonishing that there are sites that can’t give us 100BASE-T, only 10BASE-T.

Bill Decker noted that we do best when we try to answer the specific points that the oversight group raises. We should help them answer the questions that they think it’s important to answer.

Matt Mathis asked, "Do users really need as much bandwidth as they say or think they do? Or do they want something N times as fast as what they have, and what they have has a bottleneck that limits its performance to a fraction of what it should be?"

Phil Dykstra asked whether it would be possible to set up a Web page to tell users about their sites’ performance. Matt Mathis replied that we should teach people to run tcpdump on their own machines and then ship the dump file to someplace with automated analysis capabilities. The big problem is that you can’t diagnose the sender from close to the receiver (at least, not with the existing sniffing infrastructure).

Phil Dykstra asked about creating an NGI version of Linux, tuned and instrumented for high performance. Matt Mathis answered that other people have advanced this idea, and there’s a move afoot to do just that.

Guy Almes suggested that one of the things we should be doing is measuring bps rates from department to department, in addition to rates to remote sites, to help identify and document the situation. "My intuition is that PITAC people find bits-per-second rates meaningful."

Can we set up AMP machines to demonstrate "good fractions" of OC-3 or OC-12 between two points? Can we show 70% of 100 Mbps?

Matt Mathis noted that "The goal of a good network manager should be to be invisible – to have people ask ‘So what do you do, anyway?’ But the people who get the attention and praise are the ones who visibly solve problems – and often the problems are there because they did things wrong in the first place."

"We tend to think of the network as being from connector-to-connector, in the wall ... while most users think of it as socket-to-socket, in the desktop machine, including software."

At the end of the discussion session, the group ended up with the following NGI measurement objectives on the whiteboard. The group believed that these would help address some of the concerns of the PITAC.

1. Packet-header trace monitors plus analysis to detect high performance flows. This resulted from something Hans-Werner Braun had started. By looking at OCXmon flow traces, you can pick out the top-Mbps flows on a regular basis. This would allow us to passively see what performance real applications are getting without having to actually find the applications, instrument them, etc. A few OCXmons at a few NGI sites would give us a sampling of how we are doing.

2. "Routine" 100 Mbps. This of course is an NGI goal 2.1 objective. NGI would like to start testing and tuning performance between the goal 2.1 sites so that we can demonstrate 60+ Mbps on a regular basis. (This performance could be verified in part by #1.)

3. Assembling some "OC-12 facts." It would be good to demonstrate a few "gigabit" flows in support of the goal 2.2 objective. Rates of 100 Mbps should be "routine," but we should at least be able to demonstrate "gigabit" in a few special cases.

4. Scan for applications that we can help. Using passive OCXmon flow analysis again, Matt Mathis believes that it will be possible to diagnose sender/receiver TCP/app behavior and make recommendations on how to tune/improve high speed applications. This item will require significant development time, and is subject to privacy concerns, but we felt that such passive analysis would be of great benefit to improving application performance.
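The passive analysis in objective 1 can be sketched in a few lines of Python: group packet-header records into flows, compute each flow's average rate, and report the fastest ones. The record layout below is a hypothetical stand-in, not the actual OCXmon trace format, and the function name and the simple bytes-over-duration rate estimate are likewise assumptions made for illustration.

```python
from collections import defaultdict


def top_flows(packets, k=3):
    """Return the k flows with the highest average rate in Mbps.

    packets: iterable of (timestamp_s, src, dst, sport, dport, proto,
    length_bytes) tuples (hypothetical record format).
    """
    # Per flow: [first timestamp, last timestamp, total bytes]
    stats = defaultdict(lambda: [None, None, 0])
    for ts, src, dst, sport, dport, proto, length in packets:
        s = stats[(src, dst, sport, dport, proto)]
        s[0] = ts if s[0] is None else min(s[0], ts)
        s[1] = ts if s[1] is None else max(s[1], ts)
        s[2] += length

    ranked = []
    for key, (first, last, nbytes) in stats.items():
        duration = last - first
        if duration <= 0:
            continue  # a single-packet flow has no measurable rate
        ranked.append((nbytes * 8 / duration / 1e6, key))

    # Highest average rate first.
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked[:k]
```

Because only headers and timestamps are needed, a report like this can be produced without instrumenting any application, although the privacy concerns noted in objective 4 apply to the flow endpoints it reveals.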

An immediate action item is for NGI to identify the 100 "goal 2.1" testbed sites that can expect "routine 100 Mbps." Another is for the NGI networks to decide which of those sites they might like to monitor.

Comments By Participants

Phillip Dykstra

Army Research Laboratory

First, my thanks to Hans-Werner for hosting the measurements workshop at SDSC last week. I thought it was useful and productive.

At the end of the strategy session (July 1st), the group ended up with the following measurement objectives on the whiteboard. We felt that these would help address some of the concerns of the PITAC.

1. OCXmons to detect high performance flows

This comes from something Hans-Werner just started. By looking at OCXmon flow traces, you can pick out the top Mbps flows on a regular basis. This would allow us to passively see what real applications are getting without having to actually find the apps, instrument them, etc. A few OCXmons at a few NGI sites would give us a sampling of how we are doing.

2. "Routine" 100 Mbps

This of course is an NGI goal 2.1 objective. We would like to start testing and tuning performance between the goal 2.1 sites so that we can demonstrate 60+ Mbps on a regular basis. Would be verified in part by #1.

3. Some OC-12 "facts"

It would be good for us to demonstrate a few "gigabit" flows in support of the goal 2.2 objective. 100 Mbps should be "routine", but we should at least be able to demonstrate "gigabit" in a few special cases.

4. Scan for applications that we can help.

Using passive OCXmon flow analysis again, Matt Mathis thinks that it will be possible to diagnose sender/receiver TCP/app behavior and make recommendations on how to tune/improve high speed applications. This one will take some significant development time, and is subject to privacy debates, but we felt that such passive analysis would be of great benefit to improving application performance.

As part of the above, I argued that we need to start getting specific about who the goal 2.1 testbed sites are, so that we know which ones to test, and between which to expect "routine 100 Mbps." I invite the NGI networks to start compiling which sites they consider part of goal 2.1, and which of those sites they might like to monitor.

Presentation Materials

The following pages reproduce materials that were presented in the form of slides, handouts, or screen images during the workshop.