Summary of Research Activities - January 2004

Development and distribution of measurement and analysis tools

~ Continuing development of new metrics and real-time analysis for PMA

Pere Barlet has started working in Hamilton. Initially he and Jörg Micheel did some performance testing of his application, with help from Jesper Peterson and Stephen Donnelly of the Endace team. It turns out that the SmartBits traffic generator is a good commercial tool, but it does not support generating traffic for a broad variety of end systems and applications. Further testing was postponed until the DAG-based transmit solution is accessible.

Pere and Jörg have also done some analysis of SMARTxAC for use with PMA. Pere's application is designed for doing the reporting at CESCA. SMARTxAC is a traffic analysis tool that reports interesting information for network management. It also has a Web interface through which the analysis results can be accessed graphically. Some of these graphs (for each organization connected to the monitored network and for each day/week/month) are:

  • Time-evolution of the traffic by applications (in units of megabits per second and packets per second).
  • Traffic by applications (in units of bytes and packets)
  • Traffic by destinations (where destinations are the interesting external networks defined by the network administrator).
  • Traffic by application x destination
  • Traffic by protocol
  • Traffic using undefined IP addresses
  • Traffic using unknown ports

Pere began testing the SMARTxAC analysis software using the PMA daily traces to generate some graphical reports. He got some interesting results using the traces of one monitor, but the source code would need to be modified in order to get the data from all of the monitors at the same time. The tool also reports a lot of data that is of no interest here, since the IP addresses inside the PMA traces are anonymized. PMA, moreover, needs lightweight processes that generate as few big datasets and large files as possible. Pere and Jörg discussed how to continue and found two alternatives. The first would be to develop a new version of Pere's tool, more focused on the requirements of the PMA project. This version should be simpler than the tool currently in use at CESCA, since all of the reports related to IP addresses (the most expensive ones) are no longer needed. The other alternative would be to develop a lightweight live-capture tool in order to get graphical results in real time (instead of using CoralReef's crl_flow). This new tool might run inside each PMA monitor, and each monitor might run its own Web server through which these results could be accessed in real time. This second option was judged more interesting and challenging from a research point of view, although it also will be more difficult. One of the driving factors was opportunity: Pere has several weeks available in Hamilton to work in one stretch, with all the gear at hand to test application performance, including traffic generators and all sorts of DAG cards up to 10 Gigabits.

Pere Barlet has begun designing and creating this new real-time capture/visualization tool for PMA's monitors; the final goal is to work with 10 Gbps links. He developed a capture program that aggregates packets into flows (as CoralReef does). Flows are aggregated at a higher level than IP flows, since each combination of ports and protocol is translated into its application. The lpf algorithm is applied to IP addresses, which are translated into ASs (or subnetworks, institutions, etc.). ASs, network names, etc. will be anonymized before storage and visualization. The capture engine is multithreaded, so the program continues to capture data while it is processing or dumping the captured traffic. The application is programmable; one has to provide three files listing the autonomous systems and the application (TCP/UDP port) definitions. Jörg and Pere discussed introducing a flow engine in addition to Pere's AS engine; they decided to further develop the solution towards a complete demo by the end of February. The data collected by the monitor will be anonymized and then stored in an RRD database, where it can be aggregated and sampled into minutes, hours, days, weeks, months, etc.
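
The aggregation described above can be sketched as follows. This is a minimal illustration and not Pere's code: it assumes that the "lpf" step is a longest-prefix lookup from IP address to AS, and the prefix table, AS numbers, port table, and function names are all hypothetical.

```python
import ipaddress
from collections import defaultdict

# Hypothetical definition tables; the real tool reads its definitions from files.
APPLICATIONS = {("tcp", 80): "web", ("tcp", 25): "mail", ("udp", 53): "dns"}
PREFIX_TO_AS = {
    ipaddress.ip_network("192.0.2.0/24"): "AS64500",
    ipaddress.ip_network("192.0.2.128/25"): "AS64501",
    ipaddress.ip_network("198.51.100.0/24"): "AS64502",
}


def lookup_as(address):
    """Return the AS of the most specific (longest) matching prefix."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in PREFIX_TO_AS if ip in net]
    if not matches:
        return "unknown"
    return PREFIX_TO_AS[max(matches, key=lambda net: net.prefixlen)]


def lookup_application(protocol, src_port, dst_port):
    """Map the protocol plus either port to an application name."""
    for port in (dst_port, src_port):
        app = APPLICATIONS.get((protocol, port))
        if app is not None:
            return app
    return "unknown"


def aggregate(packets):
    """Sum bytes and packets per (source AS, destination AS, application)."""
    flows = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for src, dst, protocol, sport, dport, length in packets:
        key = (lookup_as(src), lookup_as(dst),
               lookup_application(protocol, sport, dport))
        flows[key]["bytes"] += length
        flows[key]["packets"] += 1
    return dict(flows)
```

Because the key holds AS and application names rather than addresses and ports, the number of distinct flows stays small, which is what makes per-interval storage in a round-robin database practical.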

In many ways, this solution is very complementary to what Jörg has worked on with Chris Gross; the model is different, but the output is compatible in the sense that we can display the various views (data) through the same mechanisms.

Chris has made progress on his sensor, focusing mostly on flows. But the machines he is developing the sensor on aren't functional at this time. He plans to continue development on the sensor on SDA as soon as it is collecting traffic. Chris also has been giving conceptual thought to his flow engine with respect to making improvements other than flows to the existing system.

~ Progress on the reimplementation of AMP and the development of a new testing architecture

Tony McGregor completed the code for the remote server execution. He also spent some time rearranging the way most of the file name constants are defined to make them simpler and more flexible. He now has 13,696 lines of code. He has been reviewing OpenSSL, because he wants to do both client and server authentication.

Tony worked through a number of security issues with the amplet code, including getting client authentication sorted and how to extract information from the remote certificate. He wrote some documentation about creating a certificate authority, and the certificates and keys needed for the amplets. Tony got most of the way through the final install scripts for the amplet code. He has set up a fresh machine and made a copy of its disk, etc., and is doing the final debug. The next step will be the soak test and also resolving any portability issues.

Tony had been using his home Linux gateway as a testing machine for the amplet code (amongst others) but was having some trouble with connections hanging. Xing suggested that this might be because it had an old 2.2.16 kernel, so he upgraded the kernel to 2.6.1 and installed new versions of various tools that wouldn't work with the old kernel. That fixed the problem. Unfortunately, the driver for the satellite connection that he uses wouldn't compile with the new kernel, so he ultimately had to revert to the old kernel again -- but at least we know what the problem is. Tony will put together a new machine for Xing to use for his testing.

~ IPMP

Matthew Luckie worked on state diagram figures that describe how the IPMP protocol works. He and Tony McGregor have been discussing the clustering project with several people, trying to improve the performance of the algorithm, and have proposed some new parameters based on the half/full duplex behavior of the connection.

Matthew Luckie reports that Mark Allman has set up a team to review the IPMP protocol, to take it forward within the Internet Engineering Task Force (IETF).

~ Path Visualization Tool

Lana Kennedy spent considerable time trying (unsuccessfully) to debug a persistent array index error in pathvis. She re-evaluated the pathvis implementation, and is rewriting the code to be cleaner, easier to understand, and a lot more streamlined. She is changing the code from an odd C/C++ hybrid to straightforward C++, and changing the design from using awkward arrays to the C++ STL list template. Lana and Ben Reesman got the program to compile and run, but the apparently buggy output suggests an inconsistency in either the file reading code or the comparison code. She plans to test to make sure that it's the code that is the problem, and not the data, and then subject it to a bunch of test cases. She is looking at the actual data to create test cases.

Lana also worked with the gd graphics library. She installed two other libraries that gd needs, libpng and zlib, and is currently trying to compile the sample program from the gd website and to use the library for the pathvis output.

Extending the Network Analysis Infrastructure (NAI) in support of new and developing HPC needs

~ New (and developing) strategically important measurements and deployments

Bud Hale reports various new (and developing) strategically important measurements and deployments. Bud has ordered AMP monitors for the Internet2 sites. He is still pinging Matt Zekauskas to allocate the IP address space for the Internet2 GigaPOP AMP machines, as Matt committed to do back in December. Brian Court of CENIC has told Bud that he would like to have AMP monitors at seven CENIC sites in California. The AMP machine recently deployed to Taiwan has arrived but at last report was still held in customs. Bud has provided all the information possible to get it released. No word as yet from Taiwan as to whether or not this has happened.

Three new PMA deployments are in various stages of preparation, including the OC192mon for Internet2:

~ OC192

Stephen Donnelly spent several weeks testing the OC192a monitor. While Stephen tested the monitors in 10GigE LAN and WAN and OC192, Jim was able to configure the Adtech to provide the needed test data. It appears the issues of running two DAG 6.1 cards in the same machine have been resolved. Stephen seems very happy with the results. Both of the OC192 monitors are installed in the machine room racks. The taps are installed and the fiber to take the data to the monitors is in and hanging from the taps. Jim is waiting for the hanging fiber to be connected to the DTF router to begin collecting traces. Jay Dombrowski and Lyle Carlson of SDSC have assured us they would connect our taps into the router as soon as the taps arrived; it's now up to them to connect the taps. Jim has requested that this be expedited, but has not received a response.

nai-p-sda, San Diego Supercomputer Center, SDSC ~

Jim and Bud Hale met with Grant Duvall of CAIDA in regard to the CalREN signal taps. This should facilitate the connection of the nai-p-sda (San Diego Abilene) monitor to the Abilene network through the GigE CalREN connection here at SDSC.

An aggravating sequence of problems with the passive monitor at San Diego Supercomputer Center (nai-p-sda) appears to finally be solved, and the monitor is ready to collect traces. Jim Hale's diagnosis of the problem being the RAID controller was correct: the backplane connecting the four SATA hard disks was bad, causing two disks to not function. After a month-long battle with tech support, we finally received a backplane replacement. The monitor detected all the disks, but this new backplane also had problems -- the fourth disk in the collection could not be formatted. Jim tried swapping the disks, and the fourth disk position continued to be the problem. Jim installed the machine without the failing disk until a replacement could be installed. As the monitor began collecting traces, problems began with another disk. It was obvious the controller on the motherboard was bad. Rather than dealing with tech support again and all the delays that would cause, Jim bought another motherboard and installed it, which seems to have solved the problem. (Jim will send the other board back on warranty.)

The box is now capturing data. This is a link which uses 802.1Q VLANs, and Jörg Micheel had to tune ipanon to accept those record formats and increase the fixed capture size to 60+16 Bytes to get all the TCP/IP header data. Jim is talking to Jay to get more information as to which VLAN is which and what the nature of the residual IP communication is.
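
An 802.1Q tag adds four bytes between the Ethernet source address and the original EtherType, so the IP header starts four bytes later in a tagged frame and a fixed-size capture has to grow to keep the same header data. The sketch below shows how the IP header can be located in either case. It is an illustration only, not ipanon code; the function name is hypothetical, and it assumes at most one VLAN tag per frame.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # IEEE 802.1Q tag protocol identifier


def locate_payload(frame):
    """Return (vlan_id, ethertype, payload_offset) for an Ethernet frame.

    vlan_id is None for untagged frames. Assumes at most one 802.1Q tag.
    """
    # Bytes 0-11 hold the destination and source MAC addresses.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_VLAN:
        return None, ethertype, 14
    # Tagged frame: 2 bytes of tag control information, then the real EtherType.
    tci, inner_type = struct.unpack_from("!HH", frame, 14)
    vlan_id = tci & 0x0FFF  # the low 12 bits of the TCI carry the VLAN ID
    return vlan_id, inner_type, 18
```

Grouping records by the returned VLAN ID would be one way to see which VLAN carries which traffic.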

nai-p-psc, Pittsburgh Supercomputing Center, Pittsburgh, PA ~

Jim Hale was contacted by Kathy Benninger of the Pittsburgh Supercomputing Center. She informed him that she has received all the OC48MON equipment we've sent. She is expecting to have the monitor (nai-p-psc) installed Friday, though she won't have the taps installed until after February 9 when a key technician will return from vacation.

~ Special Traces

Jörg Micheel reviewed a note from Nevil Brownlee in Auckland that some of his Auckland-8 data copying operations had caused a major stir at ITSS; due to a misunderstanding between Jörg and Nevil, the accounting was still enabled and we had accumulated an NZ$780 traffic bill during the Christmas holidays. Luckily, Nevil clarified the situation and bailed us out the next day, and we now have permission to copy data during the night hours and the weekend.

Jörg also had to establish which of the Auckland-8 files have made it to San Diego and regenerate those that did not, restricted by the nightly copy window for Auckland-to-San Diego. Things are hampered by the crontab run scripts; Jörg is using rsync, so data only gets copied once, and if a copy is interrupted, the next night's run will pick up from there.

~ IPv6 and IPv6 Scamper

Matthew Luckie made quite a bit of progress on scamper. He split scamper.c, the logic for driving scamper, into a series of smaller and more general purpose units. He also managed to find and fix some PMTU bugs and performance limitations.

Matthew conducted IPv6 path-MTU discovery between all the AMP IPv6 monitors to check that the code worked. It seemed to work just fine. He discovered several tunnels. One is out of AARnet (expected, as they are in Australia and native international peering is probably a bit of a stretch at this point). There was a bizarre IPv6 tunnel out of Oregon that reported its MTU as 1480 for one hop, and then 1460 for the next. Matthew queried Joe St Sauver about this. Joe says that the tunnel is IPv6 in IPv4; but Matthew thinks the interface is reporting an incorrect MTU value.
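
A short sketch of how such per-hop results can be scanned for tunnel-like MTU drops follows. It is not scamper code: the function name and the sample path are hypothetical, and only the 1480 and 1460 values come from the report (the 1500-byte starting MTU is an assumption). A plain IPv6-in-IPv4 tunnel adds one 20-byte IPv4 header, so a single 20-byte drop is what such a tunnel would normally produce.

```python
IPV4_HEADER = 20  # bytes added by IPv6-in-IPv4 encapsulation (no IPv4 options)


def mtu_drops(hop_mtus):
    """Return (hop_index, previous_mtu, mtu, drop) wherever the path MTU decreases."""
    drops = []
    path_mtu = hop_mtus[0]
    for index, mtu in enumerate(hop_mtus[1:], start=1):
        if mtu < path_mtu:
            drops.append((index, path_mtu, mtu, path_mtu - mtu))
            path_mtu = mtu  # the path MTU can only shrink along the path
    return drops


# Hypothetical path: 1500 assumed at the source, then the two reported values.
observed = [1500, 1500, 1480, 1460, 1460]
for hop, before, after, drop in mtu_drops(observed):
    layers = drop / IPV4_HEADER
    print(f"hop {hop}: {before} -> {after} ({drop} bytes, {layers:g} IPv4 header(s))")
```

Two consecutive 20-byte drops are consistent either with a second layer of encapsulation or with an interface that reports the wrong MTU; the sketch only surfaces the drops and cannot tell those explanations apart.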

He also found a router on Abilene that returned "network unreachable" instead of "packet too big" in response to 1500-byte packets. The router is located on the way to amp-missouri; Matthew sent them an email to confirm this odd behavior. He also found a couple of IPv6 routing anomalies out of NYSERnet and reported them to Bill Owens, who fixed them. One of their routers was not taking routes from a particular source which made chunks of their IPv6 space non-routable.

Bill Owens is in the process of putting together a Linux box with a GigE card that will connect into a port on a router that has an MTU of 4470. He would like us to conduct a survey of Abilene-connected hosts to show where the MTU bottlenecks are. The last time a survey was conducted from NYSERnet, it found that not a single campus was connected to Abilene with >1500 byte path MTUs. Bill thinks this is a useful statistic; the last time this happened, the information was presented at an Internet2 Joint Techs meeting.

Matthew did a survey via the system manager for new IPv6 addresses for adding to AMP monitors. In the process, he found a few sites who were blocking SSH, and a few sites that were down. He sent e-mail to a few site administrators asking them if they would be willing to allow IPv6 tests to their site; one response so far has been very enthusiastic.

Matthew also used the AMP mesh to show that the anycast address for routing to the closest 6to4 gateway is not well done in the USA, or else there are no good 6to4 gateways here. (See RFC3068 for details about this address.) Basically he could not find any of the AMP machines on networks that route to a 6to4 gateway in North America. He also found that most of the international AMPs were routed to close-by 6to4 tunnel entry points.
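
For readers unfamiliar with the addressing involved: RFC 3068 assigns 192.88.99.1 as the anycast address of 6to4 relay routers, and RFC 3056 forms a site's 6to4 prefix by appending its 32-bit IPv4 address to 2002::/16. The sketch below uses Python's standard ipaddress module to show that mapping; the function name is hypothetical and the code is unrelated to the AMP measurements themselves.

```python
import ipaddress

RELAY_ANYCAST = "192.88.99.1"  # 6to4 relay anycast address (RFC 3068)


def six_to_four_prefix(ipv4):
    """Return the 2002:V4ADDR::/48 prefix that corresponds to an IPv4 address."""
    v4 = int(ipaddress.IPv4Address(ipv4))
    # 16 bits of 0x2002, then the 32-bit IPv4 address, then 80 zero bits.
    prefix = (0x2002 << 112) | (v4 << 80)
    return ipaddress.IPv6Network((prefix, 48))


print(six_to_four_prefix(RELAY_ANYCAST))  # 2002:c058:6301::/48
```

Because the relay address is anycast, each network is delivered to whichever relay its routing considers closest, which is why the choice of relay differs from one AMP site to another.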

Matthew has written a plan to get funded by WIDE to continue the development of scamper. He also has done some work on scamper's PMTU support for Linux.

Outreach, application support, utilization improvement, and documentation activities

~ Presentations and Conference/Meeting Participation

Ronn Ritke attended the Town Hall Meeting at SDSC. Ronn participated in a TeraGrid breakout session that had a network focus.

Tony McGregor and Matthew Luckie attended the NZNOG conference.

Ronn Ritke gave a presentation on NLANR/MNA as part of a measurement session in the Joint Techs/APAN meeting, and met privately with a number of people.

~ Collaborations and activities supporting network research

Fey Sheu informed Ronn Ritke that Taiwan will develop their own active measurement infrastructure at the GigaPOPs, and possibly further onto campuses.

Tony McGregor resolved a problem with the calorie database rejecting connections that was triggered by a user (Sevcan Bilir, a new Ph.D. student with The University of Texas at Dallas) who couldn't use the traceroute request page. He also has started making a new "day in the life" dataset for Sevcan. Tony provided information on AMP data for Hank Nussbacher, who posted a request on the Internet2 measurement mail list.

Connie Logg and Les Cottrell at SLAC plan to use the algorithm from Tony McGregor's event detection work of a couple of years ago as part of their bandwidth estimation work. (Warren Matthews gave them a copy of Tony's PAM paper.)

Jörg Micheel received e-mails from various users with interest in the data. Felix Hernandez from UNC inquired about the new NCAR-1 and Leipzig data sets, and we've provided him with descriptions. Joel Summers of the University of Wisconsin at Madison had some problems with large files (>4GB) in the Leipzig data, and so have we. The problem may be with the original server in Leipzig giving incorrect file sizes and/or not providing access to files of that size; there are several Linux LARGEFILE (libc support) issues that might be the cause. Murray Jorgensen of the University of Waikato approached us about some data previously processed by UNC DiRT, and asked whether we could provide converters and more background.

Jim Hale met with Margaret Murray and the system administrators of CAIDA. Discussions focused around details of movement and consolidation of CAIDA, NLANR, and HPWREN equipment -- what equipment has been requested to be moved, what equipment is being removed, and what equipment would be in better places in new locations. A considerable amount of cooperative data was shared. Everyone viewed the meeting as successful and beneficial.

Jim's meeting with CAIDA led to other meetings that included tours of the San Diego Network Access Point (SD-NAP) at SDSC. This provided valuable information, including information regarding the source of the regenerative tap -- it turned out that the NLANR portion of the tap had not yet been activated. We will confer with Jay Dombrowski regarding how the taps are routed and cabled.

Ronn Ritke discussed NLANR/MNA international activities with Doug Gatchell of NSF. Doug was very interested in the collaborations with Korea and Australia and their efforts to develop their own local active measurement infrastructure. Ronn also discussed AMP and PMA plans with Kevin Thompson of NSF. Kevin reviewed presentation slides on current activities and provided comments.

Ronn, Hans-Werner Braun, and John Towns had a videoteleconference to discuss potential collaboration projects for NLANR DAST and MNA. Jim Williams of TransPac has told Ronn that he is interested in having NLANR/MNA continue the measurement collaboration in their next proposal.

Ronn met with Vijay Samalam, Networking Director at SDSC, to update him on current NLANR/MNA activities. Ronn also had several meetings with Peter Arzberger of UCSD about upcoming PRAGMA events and activities.

Ronn also spoke with Jay Dombrowski about OC192/10GigE DTF measurements at SDSC. Jörg wants to move forward with these measurements. Ronn met with Stephanie Sides of Cal-[IT]2 for an OptIPuter status update, and spoke with Phil Papadopoulos of SDSC about possible future OptIPuter measurements; a testbed has been implemented.

Ronn Ritke corresponded with Jian-Bo Gao of UCLA. Wireless traces taken by Todd Hansen were sent to Jian-Bo for statistical analysis and modeling. Ronn also spoke with Matt Zekauskas about IP addresses for Observatory Project AMPs and the Ann Arbor PMA.

~ Documentation, Web work, networked data, publications

Ronn Ritke and Mike Gannis revised, updated, and reorganized the NLANR/MNA International Collaborations white paper. Mike, Ronn, and Gail Bamber designed posters describing NLANR AMP and PMA activities for an upcoming presentation to NSF. (Two posters will be made; Kevin Thompson will have one for display at NSF.) Gail and Mike also created a handout to accompany the poster.

Maureen Curran wrote overview paragraphs on NLANR/MNA's OC192 efforts and the PMA special traces for the new poster (and for other boilerplate uses). She also wrote succinct overviews for MNA in general, AMP reimplementation plans, and the real-time PMA measurement and analysis efforts.

Maureen collected new information on highlight activities from the current weekly and distributed it into the appropriate categories, and used some of the information in the new special traces update paragraph.

Maureen wrote a new overview of NLANR/MNA activities. At the same time, she excerpted and developed parts into stand-alone paragraphs on individual efforts: brief overview paragraphs for MNA, PMA, and AMP each in general, OC192 Measurement and Analysis, PMA Special Traces, PMA real-time efforts, AMP IPMP, Current AMP goals/reimplementation, and international deployments. This information also will be used on the mna.nlanr.net website "More Information" page, to which the Highlights section of the front page will link. Using the new boilerplate language and poster paragraphs, Ronn and Maureen created a new set of slides which can be used as the foundation for at least three different slide sets.

Maureen continued work on the Web page templates and PMA pages and infrastructure. She wrote the text for an update and detailed background for the OC192 Web page, set up the infrastructure of each of the three navbars (MNA, PMA, AMP) for cross-browser compatibility, and created a placeholder page for PMA pages. She laid out, set up, and tested the entire PMA Web page template system, including designing a new template for the special traces (and other PMA pages without the full navbar), and created all of the PMA templates. Maureen also found a way to use an embedded style sheet to control the size of the <H1> tags on our website. Page titles marked up as <H1> should be easier for search engines to find (and should rank higher) than titles in <H2> or <H3>, and with the style sheet we won't have obnoxious, oversized page titles. Tony and Maureen considered how to automatically update the "Last modified on" date for each Web page, and Tony wrote a script to do this (which will be good for version control with multiple users).
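
One way such a "Last modified on" update could work is sketched below. This is not Tony's script, whose details the report does not give; it assumes each page contains the literal text "Last modified on" followed by a date, and the function name and date format are hypothetical.

```python
import os
import re
import time

# Matches the stamp text up to the next tag or line end (assumed page layout).
STAMP = re.compile(r"Last modified on [^<\n]*")


def stamp_page(path):
    """Rewrite the 'Last modified on' text in one page from the file's mtime.

    Returns True if the file was changed.
    """
    mtime = os.path.getmtime(path)
    date = time.strftime("%d %B %Y", time.localtime(mtime))
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    updated = STAMP.sub(lambda _match: "Last modified on " + date, text)
    if updated == text:
        return False
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(updated)
    # Restore the original mtime so that rewriting the stamp does not itself
    # count as a modification the next time the script runs.
    os.utime(path, (mtime, mtime))
    return True
```

Taking the date from the file system rather than from the editor means the stamp stays correct regardless of which user last edited the page.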

Maureen also had the current issue of NATimes reprinted, and prepared a FedEx shipment to Tony for distribution at the NZNOG. The shipment also included copies of the International Collaborations handouts.

Ben Reesman restored missing coordinates to several AMP sites. Tony McGregor sent software that presents a programmatic interface in C and PERL to retrieve AMP data from the data library. Ben attempted to write PHP bindings for it; writing "extensions" to PHP using the Zend engine turned out to be more complicated than he expected, and the PHP engine has several limitations that may make it incompatible with the AMP data library; he is still examining these issues. Ben also made modifications to the AMP scripts, incorporating changes that Tony suggested.

Chris Gross worked on development of the Web log system.

Ongoing measurement and analysis, networked data, and infrastructure support

Tony McGregor did some tidying up on the system_manager as a result of some issues Matthew Luckie pointed out.

Jörg Micheel noticed safety net hits during data collection (an indication of data loss) while starting a long trace on SDA. It turns out that ipanon using gdbm is rather slow and won't deal with 2+ megabytes per second of incoming data. Jörg used the alternative IP hashing function to get around this problem; hashing is also used for the daily collections. This is the second major problem with the gdbm library in two weeks; Klaus Mochalski also reported that ipanon (the tool we use for trace anonymization) slows down dramatically once the trace database reaches about 100 MBytes. Jörg thinks we should consider using alternative methods for storing IP mappings temporarily. SDA produces some 5 gigabytes of data per hour, a very rich source of data, and with the 3x80GB RAID0 array we should be in a position to collect two days in a row with no problem. Jörg intends to start such a trace at the beginning of next week, and then to use the monitor for further real-time work.
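
The two-day estimate can be checked with a few lines of arithmetic. The figures are those quoted above; the sketch assumes decimal gigabytes and ignores file system overhead, and the function name is hypothetical.

```python
def hours_until_full(capacity_gb, rate_gb_per_hour):
    """Hours of continuous capture before the array is full."""
    return capacity_gb / rate_gb_per_hour


array_gb = 3 * 80          # three 80 GB disks striped as RAID0
rate_gb_per_hour = 5       # approximate SDA data rate from the report

hours = hours_until_full(array_gb, rate_gb_per_hour)
average_mb_per_s = rate_gb_per_hour * 1000 / 3600

print(f"{hours:.0f} hours ({hours / 24:.0f} days) of capture")
print(f"average rate about {average_mb_per_s:.2f} MB/s")
```

Under these assumptions the array holds 48 hours of data, which matches the two-day estimate with little margin, and the average rate of about 1.4 MB/s sits below the 2+ MB/s at which the gdbm-based ipanon was seen to fall behind.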

At Endace, Jörg has been working with colleagues to build a SATA-based 2U server with 8 disks in a RAID0 that can deliver a full 250 MBytes/sec sustained transfer rate to disk. The goal is to support full dual Gigabit recording without loss. (Last October they built a SCSI system with six drives, but it is too expensive and messy to assemble and maintain.) They are making good progress with SATA, but a single controller won't be adequate, so Jörg is considering a 2x4 solution. The 1x8 delivers very spiky throughput, oscillating between 400 and 150 MBytes/sec using Hans-Werner Braun's famous dskwtst application. The actual throughput from the capture card to the RAID0 will differ, but dskwtst provides a very good first cornerstone measure. Jörg will share some graphs next week.
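
The 250 MBytes/sec target follows from the line rate of two Gigabit links, and it explains why the spiky 1x8 result is a concern. The sketch below works through that reasoning; the sample values are hypothetical apart from the 400 and 150 MBytes/sec extremes quoted above, and it ignores any buffering that might absorb short dips.

```python
GIGABIT_MB_PER_S = 1_000_000_000 / 8 / 1_000_000  # 125 MB/s per link at line rate


def required_mb_per_s(links):
    """Disk bandwidth needed to record the given number of Gigabit links."""
    return links * GIGABIT_MB_PER_S


def shortfall_samples(samples_mb_per_s, required):
    """Return the throughput samples that fall below the required rate."""
    return [s for s in samples_mb_per_s if s < required]


required = required_mb_per_s(2)    # 250.0 MB/s for dual Gigabit recording
samples = [400, 150, 380, 200]     # hypothetical dskwtst-style readings
print(required, shortfall_samples(samples, required))
```

An array whose average exceeds 250 MBytes/sec can still lose data if its troughs fall below that rate for longer than the capture buffers can cover.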

~ Servers, system disk, and upgrades - AMP

Bud Hale reported an anomaly on the AMP server. The script that creates the /etc/hosts file seems to encounter an occasional error on AMP. Several times Bud noticed that the file had not been correctly created the night before. Bud is continuing to look for the cause of the intermittent error. It does not cause problems with data collection but it does cause a problem using host names directly to the remote sites.

Matthew Luckie helped Bud with setting up a serial console on FreeBSD.

As reported last month, the data disk fill on AMP was approaching the upper limit. Bud had intended to start archiving at the beginning of the New Year holiday weekend, but a problem (apparently caused by a name server for HPSS) that prevented connection to the HPSS account persisted most of the weekend and prevented archiving until the am_slave process on AMP stopped. On Sunday Bud came into SDSC and worked with machine room operator Tim McNew to get the name server to enable the connection to the HPSS and begin archiving. This brought the data disk fill down to roughly 75 percent. After a week this increased to the low eighty percent level; on VOLT the fill level has reached the ninety percent level. Bud ran the archive script on the VOLT server over the Martin Luther King holiday weekend, and the VOLT data disk fill was brought down to the low- to mid-seventy percent level. The AMP data disk fill has been increasing; the archive on that server should be needed again in early February.

As the NLANR infrastructure grows, a need for better access to remote monitors has been identified. Bud Hale continues to research options. Bud and Jim have looked at simple remote rebooting methods such as the DataProbe iBoot and the CPS Heart Beat Rebooter. Not long ago they implemented a remote serial port console access to the amp-surf (SURFnet in Amsterdam, Holland) monitor that is proving very valuable to diagnose anomalies; unfortunately, it requires access through another host on site. Other options are the Intel IPMI chipset and the Cyclades Corporation PCI-bus Remote Access AlterPath SM100 card solution. Jim Hale has both an IPMI and a Cyclades SM100 card and is testing them.

As reported last month, Bud and Jim were in the process of acquiring new machines for the AMP and VOLT data collectors/servers. Bud spent a great deal of effort correcting paperwork errors caused by our salesperson at CPP submitting two different quotes for the same equipment -- the equipment could not be picked up until the purchase order was corrected. Bud and Jim were finally able to take delivery, and after some discussion CPP has assigned a new salesperson to deal with us. Bud and Jim were busy for a couple of weeks loading and testing the new AMP and VOLT servers. That is a vital task now that the data collection has grown and the data disks fill so rapidly (see note above).

There was some setback on the acquisition of machines for amplets. Bud and Jim have been testing an Intel board from RackSaver, but they have learned that Intel has terminated the production of that board. RackSaver provided a substitute board for test. Bud started the test but quickly learned of a BIOS problem causing the onboard Intel Pro 10/100 network interface to fail. RackSaver reported that a BIOS upgrade flash will be required to enable it to operate.

~ Servers, system disk, and upgrades - PMA

Jim Hale added the additional disk chassis to the PMA server, giving it nearly a terabyte of storage. The additional storage was not in the original design of the machine, so the new gear is external. Since we've maxed out the controller, a new server design will soon be needed to fulfill the needs of the program.

Jörg Micheel and Jim completed the planned upgrade of pma.nlanr.net to double the available disk space for traces. Jim could not preserve the existing data on the RAID0, which has since kept Jörg busy with restoring the data content from the various sources, primarily from the HPSS. The data collectors were down for a day, but all is in good shape now. Jörg spent several days restoring all the online traces on the pma.nlanr.net server after the upgrade to 1 terabyte of storage space. All of the long trace files from the various sources are now online and represent roughly 60% of all the data accessible. Jörg also restored most of the metadata and the collection of old trace files (samples of every 15th of each month since the beginning of collection). The daily traces are still coming in, and once the next six to eight weeks are online the 1TB will be pretty much all used up; this forces us to plan well ahead for the next upgrade within two or three months, since we intend to bring more long trace files on line. We should try and pick a new 4U chassis with plenty of room for more cheap ATA disk drives and capacity to go up to 16 drives, and continue working with 250 GB units, rather than the current 120 GB units.

Jim and Bud Hale are in the process of developing a DAG card testing capability here at SDSC. This is sorely needed to provide more assurance that cards in monitors being shipped are good. They continue to work on shipping methods to reduce the monitor damage in shipment, and are working with a professional shipper to ship PMA machines now. Jim had them construct a special shipping container and packing; the expense was quite reasonable, and they believe it will prove quite valuable.

Jim spoke with Grant Duvall of CAIDA, who has offered to work with him on developing a test bench using the Smart Bits equipment for testing PMA monitors before they leave to remote locations.

Existing measurement sites maintenance and troubleshooting:

A total of 19 remote sites in the NAI infrastructure received attention during this period. 11 have been resolved and the monitors are again collecting data. 8 were still being investigated, or pending site action, at the end of the period. (Outages are considered "open" until the monitor is again collecting data.)

  AMP - 11 problem sites: 7 resolved, 4 open
  PMA -   8 problem sites: 4 resolved, 4 open

~ AMP machines

Outages remain very low, the lowest in quite a while. A nagging problem keeps arising -- sites continually seem to be applying port blocking. Bud Hale constantly needs to check the collected AMP site data, because in the recent past up to ten sites per week have applied ICMP echo request blocking. When this happens, we need to confer with on-site people to request them to remove the block. In the December 2003 report we mentioned that we had to work with many sites to get them to remove ICMP echo request blocks. In January only amp-uiowa (University of Iowa) and amp-aarn (Australia Net) were blocking all ICMP echo requests. Bud communicated with site administrators and succeeded in having these blocks removed. Some sites still block ICMP echo requests from selected networks, and Bud is working to overcome this problem.
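
The weekly check for newly blocked sites could be partly automated along the following lines. This is a hedged sketch rather than a description of the AMP tooling: the input format (per-site lists of sent and received echo counts) and the function name are hypothetical.

```python
def fully_blocked_sites(results):
    """Return sites from which no ICMP echo reply was received at all.

    results maps a site name to a list of (sent, received) echo counts,
    one pair per test in the reporting period.
    """
    blocked = []
    for site, samples in sorted(results.items()):
        sent = sum(s for s, _ in samples)
        received = sum(r for _, r in samples)
        if sent > 0 and received == 0:
            blocked.append(site)
    return blocked


week = {
    "amp-uiowa": [(10, 0), (10, 0)],   # no replies: candidate for blocking
    "amp-msu": [(10, 10), (10, 9)],    # normal
    "amp-jpl": [],                     # no tests sent: not reported
}
print(fully_blocked_sites(week))  # ['amp-uiowa']
```

Total loss can also mean that the monitor is down, so a list like this only shortlists sites for follow-up with the on-site staff. Sites that block requests from selected networks only would need a per-source breakdown instead.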

The amp-ucf monitor (University of Central Florida, Orlando) failed. A replacement machine was prepared and shipped. That particular monitor is installed behind a firewall using a translated internal address. In the past this caused no problem, but now the connection is lost each time the system manager is used to start data collection on the new monitor. Bud is continuing to study this and hopes to have it working soon. Site amp-uc (University of Cincinnati) had a power supply failure in the monitor. Bud shipped a replacement and it was installed only three days after the problem arose.

Sites with brief outages this month included amp-msu (Michigan State University, Lansing), amp-jpl (Jet Propulsion Laboratory), and amp-utk (University of Tennessee, Knoxville). Site amp-msu experienced a failed Ethernet switch port, and amp-jpl had a brief outage caused by a faulty Ethernet cable. At amp-utk, new and uninformed people had come into network operations; they were unaware of the purpose of the AMP monitor, so they powered it down to see if anyone complained! Bud was able to get the machine back on line in a short time ... and the site people are better informed now.

Other sites with outages this month were amp-uci (University of California, Irvine), amp-uwm (University of Wisconsin, Milwaukee), and amp-uvm (University of Vermont). Site amp-uci apparently suffered an operating system hang, since it was corrected by a reboot performed by the technician at the site. Site amp-uwm was rejecting ssh connections, and a power cycle corrected that problem. The outage had only been partial and appears to have been caused by the system manager failing to receive some updates. Site amp-uvm appeared to have had a similar problem.

As noted in previous reports, Bud has been monitoring the amp-surf device (SURFnet in Amsterdam, Holland) for the anomaly of connection loss due to loss of path to the default router. That has been a persistent problem since the installation of the machine. Bud has been using an "out-of-band" connection, an ifconfig down/up through the remote serial port access, to correct this. This month the problem suddenly disappeared, for no identifiable reason, for a period of two weeks. But the problem returned, and Bud again found the machine operating fine but with the routing table indicating that it could not connect to the gateway. Since that incident it has kept the connection for the rest of the month.

~ PMA machines

Bud is working on getting the machines at Tel Aviv University, Israel and the machine at University of Buffalo returned to San Diego.

As with the AMP address space mentioned above, Bud has been pinging Matt Zekauskas to finish connecting the OC12 POS monitor to the Internet2 Ann Arbor site. The monitor is connected to the Ethernet but the DAG interfaces are not connected. Matt has had the optical signal splitter package since December but he has not yet installed it. Ronn Ritke will see Matt at the Joint Techs meeting and will talk to him about this if still necessary, but there is still no word from either him or Dan Pritt as to when that site will get the fiber connections.

Bud also continued to contact Matt Grover at the nai-p-fla (University of Florida, Gainesville) site to complete the installation of that monitor. Again, progress had been slow. Dan Miller, the IT department director there, took a role in getting our monitor reconnected -- they were still in the throes of rearranging their equipment, but he expedited this and we finally have the site connected and monitoring traffic. However, at the end of the month only one of the cards was detecting. Bud has a new contact at that site who will try to get it fully operational.

We had been experiencing a strange anomaly at the Texas GigaPOP (nai-p-txg) -- the site should have given us lots of data, but the monitor was collecting nothing at all. Jörg had Jim Hale contact the site to determine what was going on. Site administrator Jason Tasker told him that a few months ago some re-cabling had been done and something may have been disconnected. Jason had corrected the cabling issue, and by the next morning the trace summary indicated that 98 MB had been collected.

We thought we'd gotten past the problems with the monitor at AMPATH (nai-p-amp). Working with Eric Johnson, it was determined that one of the DAG cards was not functioning. We promptly sent two replacement cards and worked with Eric to get them installed. After connecting into the machine and running some applications we were satisfied we'd repaired the monitor. Since then something took down one of the cards and Jörg decided it probably was a problem with the machine. Jim prepared a replacement monitor to be shipped to AMPATH.

Jim had been trying to get the Mid-Atlantic Crossroads GigaPOP (nai-p-max) in Washington D.C. rebooted for some time. This machine would be a perfect candidate for a proposed remote system management solution, such as the Intelligent Platform Management Interface (IPMI) or the Cyclades SM100 card. We finally decided to replace it, and Jim was preparing a replacement monitor when a SCSI drive failed, bringing down the nai-p-amp machine. Jim expedited his efforts to prepare and ship a replacement machine. He also supported the OC192 monitors to enable Stephen Donnelly to test the OC192 cards while the nai-p-amp and nai-p-max machines were still here at SDSC.

The nai-p-max replacement has shipped. Jim spent some time with Grant Duvall from CAIDA on the Spirent test equipment, and used the equipment to test the monitor before shipping it out. After a few adjustments the monitor tested just fine. Now that we are familiar with the equipment, it will be possible to test all the monitors before they leave San Diego, as well as to determine the functionality of all the older cards we have.

Site nai-p-mra (Merit Communications at University of Michigan, Ann Arbor) had a brief outage. Jörg was able to start it collecting traces again. Jim and Bud have been working with Salvadore Hernandez and Bert Rossi at Merit to resolve the problem. It appears that our monitor had, for some unknown reason, become configured to OC3-ATM when it should be OC12-POS.

Jörg Micheel investigated the cause of problems with NCG (NCAR GigE), which captures for several days in a row, then starts producing silly-sized trace files until it is rebooted. It appears that the card reaches a state in which it no longer reacts to resets or any other commands given via the control interface. Since the daily collection goes through a complete cycle every time, the card then ignores the reset and the fixed-length record commands and starts collecting data in the wrong format. The ipanon process, which supports fixed-length records only, subsequently discards all the good data, because it is in variable-length records, and produces empty trace files. Once in this state the monitor continues to behave in this way until it is reset by a reboot (the reboot issues a PCI bus reset, to which the card responds). Jörg upgraded the firmware to version 2.4.12 to see whether this will make a difference.
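A simple size check on each daily trace file could flag this failure mode early. The sketch below is a hypothetical illustration only. The 64-byte record length, the function names, and the assumption that trace files are stored uncompressed are all ours and are not taken from the PMA collection software. The idea is that a healthy fixed-length trace is non-empty and its size is an exact multiple of the record length.

```python
import os

ASSUMED_RECORD_LEN = 64  # hypothetical fixed record size in bytes


def classify_size(size_bytes, record_len=ASSUMED_RECORD_LEN):
    """Classify a trace file by size alone.

    "empty"      - nothing survived anonymization (the symptom described above)
    "misaligned" - size is not a whole number of fixed-length records
    "ok"         - size is consistent with fixed-length records
    """
    if size_bytes == 0:
        return "empty"
    if size_bytes % record_len != 0:
        return "misaligned"
    return "ok"


def classify_trace(path, record_len=ASSUMED_RECORD_LEN):
    """Apply classify_size to an uncompressed trace file on disk."""
    return classify_size(os.path.getsize(path), record_len)
```

Running such a check after each collection cycle would show that a monitor has entered the bad state on the first affected day, rather than after several days of empty traces.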

~ Management And Administrative

Jörg Micheel has been in contact with David Huang, who is a very sharp young UCSD student who wants to work with us on trace analysis, and has asked him to work with us on postprocessing the long trace files we have recently generated.

Maureen and Tony McGregor discussed the resumes of applicants for the AMP programmer position and I've been looking over applications as they arrive. We seem to be developing quite a good pool of fresh graduates to choose from. Maureen and Tony are developing a new, more difficult skills test.

Our processor hardware supplier, RackSaver, has assigned a new Account Representative to us. We're hoping the company took note of the problems we were having with them and is not just trying to patch the problem with a new face.

Hans-Werner Braun, Ronn Ritke, Tony McGregor, and Jörg Micheel conducted the weekly NLANR/MNA managers conference calls. Maureen Curran and Lana Kennedy prepared reports.

- 30 -

Thanks to Mike Gannis for his help with completing this report.

last modified: 5/11/04