Summary of Research Activities - June 2004
Development and distribution of measurement and analysis tools~ Progress on the reimplementation of AMP and the development of a new testing architecture I made a bunch of changes to the amplet code so that it can more easily be packaged as a Debian package. Perry Lorier is going to do the actual packaging and then the CRCnet guys plan to deploy it on CRCnet (which is also running the latest IPMP kernel on all their routers). I did some work to make the package compile on Debian Sarge and FreeBSD 5.2.1. Ivan Koga, from Brazil, gave me access to his machines to do the debugging, and he will be using a pre-release of the new version there. There were just a few small issues (e.g. the prototype for exit has moved). I have put together a new release with these fixes, but have not put it on the Web page yet because I want to fix a few other things. I have modularized most of the code and added separate jitter and random packet size graphs, with the corresponding additions to the AMP PHP extension. [Tony McGregor] An AMP machine (at the NZ National Library) went live this period, using the new amplet code. It is the first actual deployment of the new code and seems to be working well so far. I gave the site admin a login, so he can check the machine over, and added it into the international mesh. I spent most of a day hacking up the version of the central code that runs on erg (the NZ AMP data server) so that it will display the graphs from the new code. [Tony McGregor] I applied for and received an OptiPuter account. The plan is to deploy software-only AMP code on several nodes and do application-to-application measurements. Using grid-type machines is a whole new arena for me, so there will probably be a bit of a learning curve, if it can actually be done. I compiled the amplet code on one of the OptiPuter nodes.
There were a couple of problems related to the termcap library and a change in the way error messages are reported which I will need to roll back into the main code distribution. The code will not compile on the other node; apparently the openssl install on that node is broken. [Tony McGregor] Continuing development of new metrics and real-time analysis for PMA ~ A longish thread of communication with Klaus Mochalski, Klaus Degner in Leipzig regarding identifying application traffic (partly as a follow up on some dialogs that I have had in the aftermath of publishing early pictures on the OC192MON data at Indy). We are on the same page that we want to get running code implemented very quickly, and there is a chance we can deploy new code on relatively short notice. Also a longish thread on the same subject with some research friends on the East Coast. I am intending to keep the dialog going. [Jörg Micheel] ~ IPMP and IPMP cross-traffic-from-trace (ctft) generator I spent most of one week implementing and debugging IPMP flow counters for FreeBSD and Linux kernels. Learning two different ways of locking data structures, setting timers, and allocating VM pages for allocating flow structures was worthwhile but frustrating. I finished the new code by the deadline to get it distributed on crc.net. I have been playing with the implementation since. [Matthew Luckie] The idea of the flow counters is to pin-point loss events after they occur. For example: mjl12@ttk:~$ sudo ./ipmp_ping-flow/ipmp_ping -4s 1400 -c 3 -w 0 pir We have sent 3 packets here, but only two came back. We know that the middle packet made it to 10.1.240.1, but not to 10.1.255.253 due to the flow counter indicating that 3 packets from the same flow were seen at one point but only 2 packets at the next. The flow counter is the last column in the above printout, and we count from 0. 
[Matthew Luckie] I have since discovered a bug in my Linux implementation that will cause the sender to panic when attempting to send packets larger than the interface can send without fragmentation. I have a fix for this but it is fairly non-critical. I did not notice it in my testing as the kernel behavior changed between 2.4.20 and 2.4.26. In 2.4.20, the kernel would silently fragment a packet (even if the DF bit was set). In 2.4.26, it just panics with some comment about the stack. [Matthew Luckie] For those interested in the implementation: I allocate a hash table sized as a power of two (in this case 512 entries). Each entry has a head structure that has a timer in it, and then a pointer to a linked list with flows in order of when they expire, with the next node to expire at the head of the list. When we get a flow cache hit, the flow moves to the tail of the list. We search the list from the tail back to the head when we see an IPMP echo packet go past, which means we should cover the active entries first. If we run out of space in the table, the head node is dropped to make way for a new flow counter. The flow counter is allocated 4 bits, which means we wrap the counter every 16 packets. We allocate a single timer for each hash table entry as the first entry to expire in each list is at the head. When the timer goes off, the list is culled for expired entries. [Matthew Luckie] I will be sending the implementation and a revised draft to Endace in the coming weeks to push along the IPMP transmit function they are implementing in some cards. [Matthew Luckie] If I get a chance, I will include MRTG-like graphs of using IPMP to profile and debug a path on CRCnet known to have substantial loss and RTT spikes. The paper is targeted at network operators so the idea is to make a clear case that IPMP would be of benefit to them. [Matthew Luckie]
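As an illustration of the flow-counter table described above, here is a minimal Python sketch; it is not the kernel code. It assumes a per-bucket capacity (BUCKET_LIMIT) and an idle timeout (FLOW_TIMEOUT), neither of which is given in the report. It uses an ordered dictionary in place of the linked list, so the tail-to-head search becomes a direct lookup, and it replaces the kernel timer with an explicit cull() call. The class, method, and flow-key names are made up for the example.

```python
from collections import OrderedDict

TABLE_SIZE = 512                      # power of two, as in the report
COUNTER_MASK = 0xF                    # 4-bit counter: wraps every 16 packets
BUCKET_LIMIT = 8                      # assumption: capacity per hash bucket
FLOW_TIMEOUT = 30.0                   # assumption: idle timeout in seconds


class FlowTable:
    """Toy model of one router's IPMP flow-counter table."""

    def __init__(self):
        # One ordered dict per bucket: head = next flow to expire,
        # tail = most recently seen flow.
        self.buckets = [OrderedDict() for _ in range(TABLE_SIZE)]

    def _bucket(self, flow):
        return self.buckets[hash(flow) & (TABLE_SIZE - 1)]

    def see_packet(self, flow, now):
        """Count one echo packet and return the counter stamped on it."""
        bucket = self._bucket(flow)
        entry = bucket.get(flow)
        if entry is None:
            if len(bucket) >= BUCKET_LIMIT:
                bucket.popitem(last=False)    # out of space: drop the head
            entry = bucket[flow] = [0, now]
        stamped = entry[0]                    # counting starts from 0
        entry[0] = (stamped + 1) & COUNTER_MASK
        entry[1] = now + FLOW_TIMEOUT         # refresh the expiry time
        bucket.move_to_end(flow)              # cache hit: move to the tail
        return stamped

    def cull(self, now):
        """Stand-in for the per-bucket timer: drop expired flows."""
        for bucket in self.buckets:
            while bucket:
                flow = next(iter(bucket))
                if bucket[flow][1] > now:
                    break                     # head not expired, so none are
                del bucket[flow]


# Three packets of one flow cross router A; the middle one is lost before B.
flow = ("src", "dst", 1)                      # hypothetical flow key
router_a, router_b = FlowTable(), FlowTable()
seen_at_a = [router_a.see_packet(flow, t) for t in (0.0, 0.1, 0.2)]
seen_at_b = [router_b.see_packet(flow, t) for t in (0.0, 0.2)]
print(seen_at_a, seen_at_b)                   # [0, 1, 2] [0, 1]
```

In the demo, the last packet to return carries counter 2 from router A but counter 1 from router B. That is the pattern used in the ping example to place the loss between two hops.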
Extending the Network Analysis Infrastructure (NAI) in support of new and developing HPC needs~ Special Traces I have collected a first 4 hour data set at the OC192MON sitting on the Indianapolis - Chicago Abilene link, filling up all of the 2x141GB data array. Anonymization is showing the strange pattern again, it took 40 minutes to compile the first 10 minutes of trace file, the later ones took nearly 2 hours, and 90 minutes into the trace I am now faced with a memory allocation crash coming from within the gdbm library. I have started tackling the issue, but it is hard, and I am oscillating between debugging and finding a workaround, and going for a complete reimplementation of this part of the anonymization code, for speedup and reliability. I would like to publish some backbone data first though. [Jörg Micheel] After researching alternative database systems to replace gdbm, and developing a new DBM-based version, the IPLS-CHIN system is presently busy finishing the anonymization and copy of the trace run, with the database currently at 2GB size, and growing, and I am confident we will have this trace published early next week. To make it accessible to the public I now have to really switch to the new pma.nlanr.net server, it would not fit the old storage array. [Jörg Micheel] I have kept the OC192MON at Indy busy finishing the anonymization of the four hour trace from June 1st and I am finally done. Data is accessible on the new PMA system at SDSC for FTP/HTTP retrieval. Further AMD64 surprises included trouble with gzip/gunzip, which would start reporting errors on decompressing files bigger than 4GB because of a typecast error. I have written a little program to understand the issue and it turns out that gcc actually has short (16) and int (32) the same size as on IA32 platforms (i386 as we know them), but long are 64bit on the AMD64 and trigger surprises for some programs. 
Well, the speed improvement is still worth the trouble and I have managed to let the machine skim through the 66 Gigabytes of compressed trace data in merely one evening, something which for the Auckland-8 data set took close to a week. Preliminary pictures at: http://198.202.74.122/Special/ipls3/ (new PMA server). [Jörg Micheel] The analysis of the OC192c Abilene backbone data is rich with discoveries. A few samples (and observations): [Jörg Micheel] http://198.202.74.122/Special/ipls3/20040601-214000-0.html
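The gzip/gunzip problem with files over 4GB reported above is consistent with a 64-bit size being pushed through a 32-bit type. The following Python sketch is an illustration only: the report does not identify the actual cast inside gzip, and the function names and the file size are made up. It prints the widths of short, int and long for whatever machine runs it, and shows what an unsigned 32-bit cast does to a size just over 4GB.

```python
import ctypes


def c_type_bits():
    """Return the width in bits of short, int and long for the host C ABI."""
    return {
        "short": ctypes.sizeof(ctypes.c_short) * 8,
        "int": ctypes.sizeof(ctypes.c_int) * 8,
        "long": ctypes.sizeof(ctypes.c_long) * 8,
    }


def truncate_to_u32(value):
    """Mimic casting a 64-bit quantity to an unsigned 32-bit integer."""
    # ctypes integer types do no overflow checking; the value simply wraps.
    return ctypes.c_uint32(value).value


size = 4 * 1024**3 + 123          # hypothetical file size just over 4GB
print(c_type_bits())              # the width of long depends on the platform
print(size, "->", truncate_to_u32(size))   # the 32-bit view keeps only 123
```

On a platform where long is 64 bits, code that stores a file size in a long and later narrows it to an int loses the upper bits in the same way.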
While I am busy working away on the data I have turned all three monitors at IPLS into 8x90-second data collectors like the rest of the PMA infrastructure (code I2A, I2C and I2K for the three links to Atlanta (OC48c), Chicago and Kansas City). The resulting files have exploded the daily trace collection; it is going to double or triple. As a result, with the shortage of disk space on the old PMA system, the automatic scripts have reduced the number of days kept online to just two (yesterday's and today's data). We need to move to the new machine urgently. [Jörg Micheel] I have continued post-processing the Abilene-III trace data set to also have the one hour view. I have worked on the matrix of HTML files to make browsing easier for the end user. I am pretty close, with just a few HTML files missing. Progress visible here: http://198.202.74.122/Special/ipls3/. [Jörg Micheel] ~ New (and developing) strategically important measurements and deployments At the end of last period, we successfully installed, or nearly installed, six PMA monitors, all of them with CDMA time support, plus two new AMP monitors. Two more (NCAR and AMPATH) are in progress. This appears to be a new record for the team. Quite a lot of follow-up work took place in the aftermath. [Jörg Micheel] We have been working with the I2 folks in Indy to connect the second (remaining) OC192MON to the legacy network. As this happened, we learned that one of the fiber tap ports was not delivering signal to the card. Caroline Carver unfortunately had to get back into the Qwest POP the next day to address this. Jim and I are excited about the excellent support we have received from the team at Indy. As of the end of this week all three systems are spinning comfortably. As an aside, the AMP system at the node is now also working after Caroline had looked into the wiring of the -48VDC power supply.
[Jörg Micheel] Due to some minor connectivity problems, the OC192 in Indianapolis named ipls-kscy (Kansas City) had not yet come on-line. I had the great pleasure to work with Caroline Carver of Internet2 correcting the problem. Jörg and I asked her to look into correcting the connectivity issues that were preventing this monitor from producing. Caroline was able to re-terminate the network cable and get the monitor on the network. At which point Jörg noticed the monitor was seeing no signal on one of the cards. Caroline went back to the P.O.P. and while I monitored test applications at my end, she was able to track down where the connection was that had failed. She very easily followed my instructions and made the event a very simple and logical procedure. [Jim Hale] I worked with Caroline Carver and John Hicks on the IPLS-ATLA OC48 monitor in Indianapolis. It suddenly stopped seeing data though the monitor was functioning just fine. The problem was the fiber having been moved to the wrong ports on the DAG cards. It was the first place we looked. When we were unable to improve things, I brought Jörg in and I guess Jörg just explained the port situation better. He immediately got it functioning. IPLS-ATLA is collecting just fine. [Jim Hale] Good news with respect to a potential router clamp instrumentation as an upgrade to the present IPLS backbone taps. It appears that we will have the necessary moral, technical, and possibly also financial, support to move closer to a full instrumentation. More work needs to be done, I have done first inquiries with Endace to obtain a price proposal for some additional equipment required. [Jörg Micheel] Purdue. The GIGEMON at Purdue is moving forward, but not fixed. Turns out that Scott Ballew managed to issue the wrong IP addresses to us, he is going to fix this in the next few days. The splitter is installed, he is intending to connect the system soon. [Jörg Micheel] Purdue. 
We have worked with Scott Ballew to get this GIGEMON live. The system went onto the Internet on Tuesday and we were jointly troubleshooting one of the fiber ports. We get a good signal on outgoing traffic, but not on inbound. With some emails back and forth we still have not drilled down to the cause; Scott has committed to getting a signal reading done. He would also like to see tty-style access and Linux getty is playing up on the SuperMicro SATA machine; we have not figured out why. Also, Linux does not like our dual-IP on a single IP subnet and only one IP address would respond to pings and ssh connect attempts. I am considering abandoning the second IP and simply using eth1 on all SuperMicro machines. More in the PMA trouble-shooting and upgrades section below. [Jörg Micheel] Purdue. Scott Ballew has reenabled access to the GIGEMON at Purdue and I have carried out some further testing. The monitor is still suffering from loss of sync at one of the two ports; we have to wait for Scott to sort this out via measurements. I have continued looking at the remote management (IPMI) option and, while experimenting, have shot the system one more time. I now know what the issue is, but I have since moved to playing with the identical SDA monitor, to keep the burden of having to reboot the system local (with Jim). [Jörg Micheel] ODU. Bud got the system going in a concerted effort with Sheila Beilsmith at Old Dominion last week. This is now an upgrade from a previous FORE-based OC3MON to a DAG3.2 system (still at OC3c ATM, but with precision time stamping). I got the machine spinning, with a few hiccups that Jim managed to resolve for me. Great for the system count! [Jörg Micheel] Jim has settled the netmask issue with the system at ODU and also finished the configuration for the new AMPATH system, including CDMA support. He is presently approaching Ernesto at FIU in Miami to get the box out into the field. [Jörg Micheel] SDA.
We had the previous SDA monitor shipped to Purdue and got a new box in at SDSC last Friday with Jim. It appears that an old bug (dagrom utility spinning in an endless loop) still has not been fixed with the new 2.4.13 software release from Endace; I had to put the same patch back in that I applied previously. Looks like a bug report is due to Endace. SDA is spinning happily as of today; we will need to keep watching it. [Jörg Micheel] PSC. The Pittsburgh GigaPOP OC48MON is spinning happily. We are enormously thankful to Kathy Benninger at PSC for her tireless support over the last few months! An effort that truly has paid off: this system is collecting interesting data. [Jörg Micheel] APN. After an initial email from Jim we have been following up with Linda Winkler to understand the destiny of our OC3MON at StarTAP (which by now is almost a legend as one of the longest running in the infrastructure) and we have been told that StarTAP was finally shut down last week (end of May 2004), and with it our monitor there. Jim is checking UCSD property rules on how we can put the box to rest without necessarily having to ship it back to San Diego. [Jörg Micheel] ANL. Linda Winkler came back to us with a (justified) complaint by ANL netops about potential security hazards with this system and I have done a scan and rework of all the services on the box and also configured /etc/hosts.allow appropriately. I am hopeful this has cleared the issue for her. [Jörg Micheel] AMPATH. We have put some effort into finding the issue with legacy Ethernet support on the Dell 750 and have worked out that the new chipset is an Intel e1000. Locating the drivers and loading them seems to have fixed the problem for Jim. We are waiting for an additional CDMA time receiver to arrive. I have given Jim the recipe to solder the extra 1PPS leads for the DAG cards, and with those the box should be ready for shipment to Miami some time next week.
[Jörg Micheel] The new PMA monitor for AMPATH is now ready to be shipped. I had been waiting for the arrival of a new CDMA timer (which arrived Friday) to ship with the unit. Special cabling was required to connect the timer to the DAG cards. This new unit was completed and ready to go to Florida; I got the approval of the AMPATH crew. I have had a chance to really work with this unit and tried to learn more about how these systems work. I closed up some security issues and the monitor is ready to ship and install. [Jim Hale] The new Dell PowerEdge 750 AMPATH passive monitor did finally go out to Florida this week. The shipping process began at the beginning of the week. For some reason it took about four days to finally get shipped. This new shipping method is going to need close attention. It took longer to get it shipped than it will take to ship it. [Jim Hale] I have been looking for ways to place some further PMAMONs using existing equipment, primarily OC48MONs, which we have spare. Research has shown that OC48c links are quickly fading from the network map. Instead, there are more and more 10 Gigabit links, in particular many OC192c PoS. The links that I have found are GigaPOP access links, such as MAX, NOX, SOX, and APAN/TransPAC. I have approached Linda Winkler, John Hicks, Matt Zekauskas and some (so far anonymous) contacts at NOX and SOX to see if they might be happy hosting a system. From Chris Robb and John Hicks I have learned that the OC48c from LA is going to disappear, while the Chicago OC48c will be replaced by OC192c (if TransPAC's application under INRC succeeds) or will cease to exist in about 2-3 months. In the current situation it is very difficult to plan any measurement activity; all we have is a commitment from John Hicks to get us involved once the future is clearer and predictable. [Jörg Micheel] Continuing my search for further partners to place existing PMA gear among the HPC networks revealed the following. [Jörg Micheel] GLORIAD.
We have received an enthusiastic response from Greg Cole (NCSA) to our proposal for installation of a pair of OC3MONs at Northwestern to monitor the links towards Russia and China. Jim and I are checking some technical details and we are awaiting for Greg to confirm the installation process at NU. [Jörg Micheel] NYSERnet. I tried a blind Web request with NYSERnet and received a long reply the next morning from Bill Owens detailing possible instrumentation ideas. At present, the most likely opportunity is to monitor the link from NYSERnet to Abilene at NYC on GigE-LX/LH. We have also discussed moving the dormant BUF monitor at the University of Buffalo towards their OC3c link at the same location. Bill also offered some opportunity for 10GigE monitoring, which we should consider later down the track. He also shared details about the MAN LAN (Manhattan Landing) which is a Layer 2 Exchange for NY and overseas partners, which we could monitor at the VLAN level at a SPAN port. There is also a new HOPI activity that we should learn more about. It appears as if a three-way communication between Bill, Matt Z and us could reveal some further ideas for future monitoring. [Jörg Micheel] I have prepared an OC12 Passive monitor for Bill Owens at NYSERNet. I have also got a Gigabit monitor in the works to add to the NYSERNet monitoring collection. [Jim Hale] We have come very close in the dialog with Bill Owens at NYSERnet on getting an OC12MON (back) into Buffalo as well as a GIGMON monitoring the NYSERnet link to Abilene. Jim is working through technical details with Bill. We are also waiting for confirmation from Greg Cole at NCSA for the go-ahead on the two systems at NU for GLORIAD. Very little work on the extra gear from Endace for the Indy instrumentation, but I have some communications from Jason Hurd that I need to follow up on. 
[Jörg Micheel] Working towards the NLR, I am learning more about DWDM (and how ~~ Chris Small of Internet2 informed me that the Internet2 AMP monitor at Sunnyvale, CA, was installed and connected to the network. I ran the system manager to start it Thursday night, and it came online on Friday morning. A system manager run was started on Friday to distribute the new HPC list with that new site included. [Bud Hale] As of this period, four of the eleven Internet2 AMP sites are up and collecting data. Chris Small installed the monitor at Internet2 at Chicago this week. I initialized the machine on Thursday, and I followed that with a system manager run to upload the HPC.list file to all sites with the amp-i2ch (Internet2, Chicago) IP address. Also this week I prepared and shipped three additional monitors for the Internet2 sites. I prepared these last machines for Internet2 Denver, DC and NYC. [Bud Hale] The third NZ AMP machine (at the national library) went live this period, using the new amplet code. [Tony McGregor] ~ IPv6 and IPv6 Scamper I have been working with Bill Owens to find MTU bottleneck locations on I2, probing with Scamper from two hosts. The first host is capable of sending 4470 byte packets and the second is capable of sending 9000 byte packets. So far, the targets have been AMP monitors in the HPC mesh - hosts that we know exist. Bill has also probed PlanetLab hosts, but has not found a PMTU >1500 to those either. Bill wants to extend this to cover all of I2, and has supplied me with a BGP table to pull out ASes and address ranges to form an address list. The idea is to tag MTU bottleneck locations at core/regional/campus levels. He will do the tagging. He would like to see jumbo-capable AMP monitors that could be used to highlight connectivity problems. For example, some routers are not sending ICMP fragmentation required messages, which slows down connection establishment markedly.
Jörg has also put me in touch with Grover Browning to help form an address list to cover Abilene. Now, I have basically got a small script to extract AS#s from the output of 'show ip bgp' on a Cisco router and to produce an address list based on the advertised prefixes. [Matthew Luckie] Bill and Kenjiro Cho have supplied me with useful bug reports when Scamper does something bizarre, and also their experiences with the code. So far I am plugging the gaps promptly thanks to the information that Scamper outputs in debug mode. One is a fairly simple case of ignoring cloned routes in Linux that cause a machine to remember PMTUs for paths that we have already discovered, which cause Scamper to probe with the PMTU for the path, rather than the outgoing interface's MTU. The other is a bit more challenging to figure out. It involves ignoring responses that have an RTT that is not considered sane. I am triggering assertions in code, and need to debug the root cause. [Matthew Luckie] More work on the Scamper file format. I have specifications for the MTU data to be recorded, but they are not very detailed nor do they match Scamper's behaviour. I have been working on devising something a bit more generic and extensible. [Matthew Luckie] In the process of implementing flow counters, I found and reported a bug in FreeBSD http://docs.freebsd.org/cgi/mid.cgi?20040609115052.D24917 which has been confirmed. I guess I need to file a PR to get it fixed. [Matthew Luckie]
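The address-list step mentioned above (pulling AS numbers and advertised prefixes out of 'show ip bgp' output) might look roughly like the sketch below. This is not Matthew's script. It assumes the usual fixed-width Cisco table, in which the AS path begins under the "Path" column heading and continuation rows leave the Network column blank. Only rows marked valid ("*") are kept, and rows with an empty AS path (locally originated routes) are skipped. The function name is made up for the example.

```python
from collections import defaultdict


def prefixes_by_origin_as(show_ip_bgp_text):
    """Map each origin AS number to the sorted prefixes it advertises."""
    lines = show_ip_bgp_text.splitlines()
    # Find the table header; column positions are taken from it.
    header = next(
        (l for l in lines
         if "Network" in l and "Next Hop" in l and "Path" in l),
        None,
    )
    if header is None:
        return {}
    net_col = header.index("Network")
    hop_col = header.index("Next Hop")
    path_col = header.index("Path")

    result = defaultdict(set)
    current_prefix = None
    for line in lines[lines.index(header) + 1:]:
        if not line.startswith("*"):           # keep valid routes only
            continue
        network = line[net_col:hop_col].strip()
        if network:                            # blank = same prefix as above
            current_prefix = network
        # Path column holds AS numbers followed by an origin code (i, e, ?).
        as_numbers = [tok for tok in line[path_col:].split() if tok.isdigit()]
        if current_prefix is None or not as_numbers:
            continue                           # empty path or malformed row
        result[int(as_numbers[-1])].add(current_prefix)
    return {asn: sorted(prefixes) for asn, prefixes in result.items()}
```

The last AS number in the path is taken as the origin of the prefix. Long prefixes that wrap onto a second line in the router output are not handled here.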
Outreach, application support, utilization improvement, and documentation activities
Of 13 papers accepted by the SIGCOMM Network Troubleshooting Workshop (Portland, Oregon, August 30 - Sept 3), Matthew Luckie was the primary author of one and a coauthor of another. Worked on the paper which will be submitted to SIGCOMM NetTs. I am using their comments as motivation to get some things implemented and experiments done for my thesis. I had a useful meeting with Tony where he went over the paper with a fine-toothed comb, and I later addressed his comments. I also talked with Maureen about using her expertise to tidy it up further; her effort on editing the two papers is very much appreciated. She does an awesome job. I also had very useful feedback this week from Matt Brown in particular, as well as Tony, James, Jörg, and Perry. I am very happy with the IPMP paper, which is at: http://www.wand.net.nz/~mjl12/nts35-luckie.ps; the Scamper paper is available at http://www.wand.net.nz/~mjl12/nts40-cho.pdf. [Matthew Luckie] I edited Matthew's two papers that were accepted for the SIGCOMM Network Troubleshooting Workshop (NetTs): "Path Diagnosis with IPMP" (Matthew and Tony) and "Identifying IPv6 Network Problems in the DualStack World" (with Kenjiro Cho and Bran Huffaker). [Maureen C. Curran] Read through and commented on the final draft of Matthew's paper to the SIGCOMM Network Troubleshooting Workshop. [Tony McGregor] I met with Vijay Samalam to go over a set of the NLANR review handouts. [Ronn Ritke] Emails with Paul Love and Jörg Micheel, to plan for presentations on NLANR/MNA activities at the July Joint Techs meeting. The draft agenda has us listed on Monday at 3pm. [Ronn Ritke] Discussed a possible real-time software demonstration for SC2004. I spoke with Greg Lund and Mike Gannis about a demo time slot for Jörg.
[Ronn Ritke] ~ Collaborations and activities supporting network research Matt Zekauskas came back to us with a list of proposed research as a follow-up to our meeting in Indianapolis. I replied with a staged plan for collaboration with Stas Shalunov at I2, and provided a pointer for early access to the most recent OC192c backbone data captured the previous week on the link from Indianapolis to Chicago (IPLS-CHIN). [Jörg Micheel] I have been approached by a fellow by the name of Yenyung from George Washington University to provide data from MRA between end of January and mid February and I have spent most of the week getting those traces off the HPSS. Am done today. His research is also on looking at the spread of a particular worm. [Jörg Micheel] I gave Paola Grosso from SLAC access to the throughput tests. He is using them with the Web services query request interface that Warren has done. [Tony McGregor] I granted Chris Costa access to run AMP throughput tests (with the usual warnings). [Tony McGregor] Emails and phone call with Greg Cole concerning future GLORIAD measurements. Greg will work with groups in Russia and China to create a list of addresses for one-way AMP measurements. [Ronn Ritke] Applied for an OptiPuter account. The plan is to deploy software-only AMP code on several nodes and do application-to-application measurements. [Tony McGregor] I had a query from Sevcan Bilir about my IP to AS translator (IPAS). She is a PhD student at the University of Texas Dallas. In answering her question, I found a small bug in the sample code (not related to her question) and corrected that. I also noticed the database is no longer being updated. I will need to work on that. [Tony McGregor] Chris Costa of CENIC wishes to use AMP to conduct ad hoc tests on CENIC backbone legs. Also, Chris indicated that CENIC is planning to request AMP monitors at the CENIC backbone nodes. Chris is working with Brian Court of CENIC to follow through with the AMP requests.
Jim and I had the pleasure of working with Chris Costa while he was with the ACT department here and UCSD, before joining CENIC. [Bud Hale] A conversation with Kevin Walsh about his measurement activities and the July Joint Techs meeting. [Ronn Ritke] When traveling to/from the Joint Techs meetings, have firm plans to continue the face to face dialog with I2 and the GigaPOPs on future passive measurement works. [Jörg Micheel] Meeting with Matt Zekauskas about the IND router clamp and Fall I2 meeting. [Ronn Ritke] Conversations with Peter Arzberger and Teri Simas on Sunday. Some emails with Peter about the AMP with the CUDI group in Mexico City. Emails with Teri on software questions for Greg Cole, in preparation for the next PRAGMA meeting in San Diego. [Ronn Ritke] Phone call to Chris Bruja at CISCO about some possible future plans. [Ronn Ritke] Some talk with Glenn Larratt from Rice about duplex match problems with their machine. [Tony McGregor] We enjoyed a visit from Vinton Cerf. It was fascinating to hear of his experiences in earlier network development as well as about the work he is currently doing with MCI. [Bud Hale] Reviewed a paper for the British Computer Journal. [Tony McGregor] ~ Documentation, Web work, networked data, publications Discussions with Ronn Ritke and Greg Lund about publicizing NLANR successes (in particular, international cooperative efforts and GLORIAD participation) in media such as Federal Computer Week. We intend to develop a "story pitch" briefing sheet to be used by media relations people. [Mike Gannis] Conversations with Mike Gannis about possible PR ideas for Dave Hart. I mailed Dave Hart a copy of the NLANR/MNA review handouts. [Ronn Ritke] I continued to work on the PHP extension for AMP. I have mostly added the code to make it compatible with both the old and new versions of the files (from the old amp code base and the new amplet package). 
I just have to check and debug the code that determines when the next change in any of the relevant time zones is and readjusts the offset between the data and the display time zones; then the functionality of the extension should be complete. [Tony McGregor] Continued coding the first data page for the new central code. I am now able to display an RTT graph in any time zone from the old data. The time zone preference is remembered from session to session either by a cookie or through a user login, backed by a PostgreSQL database. I am pretty pleased with the progress. I am currently working on integrating the new and old data formats into the library. It is a bit tricky because the two are recorded in different time zones--the new data is recorded in UTC--so knowing the time zone is necessary to work out the file name. I think I am now over the main hump of starting up, except that I need to create the new package structure (configure, CVS archive, automake etc). That will take some work. I currently have a single daily graph which can display mean, median, max, min, stddev and jitter. It has user-selectable bin sizes and an (optional) ymax value. It is still a bit rough, though. [Tony McGregor] Updated published papers Web page is posted. http://moat.nlanr.net/Papers/ [Lana Kennedy, Maureen Curran] Met with Lana and we went over creating a single file "database" of the citings and the collaborations. Lana has already started learning about m4s so she can create the file and system for us. Earlier I had talked to Tony about m4 vs. PHP, vs. wiki, etc. m4 and a flat file is what we are thinking about. [Maureen C. Curran] Started learning about m4. I found a little tutorial and generated a test Web page with an m4 I created. It is really neat, and I am learning a lot about how m4s work. It also makes it much easier for me to understand how Maureen uses them for the Web. Maureen sent me the m4s she uses, and I started looking through them.
[Lana Kennedy] Went to the Tech Seminar on Wikis. They are pretty interesting open-ended web pages. I plan to do some follow-up research on them for a student Web page. [Lana Kennedy] Worked on Citings; had some graphics created that are very similar to the ones on the current Citings page. Maureen gave them a thumbs-up, so I added text to them to reflect the percentages. I also looked into plotutils as a potential source for generating further pie charts for citings, but the charts are only 2-D, and we would prefer a 3-D chart. I think OpenGL might work. [Lana Kennedy] Added the AMP package to the tools index page. http://mna.nlanr.net/tools.html [Maureen C. Curran] I finished off the yearly graphs/pages for the FTP logs, I started parsing the http logs as well. It will be interesting which method of retrieval turns out to be the most used. I really am looking forward to seeing the comparison graphs. [Chris Gross]
Chris has made some progress with his annual graphs and we have discussed how to handle a particular anomaly. He is looking into integrating HTTP access data next. Maureen continues to steer the process here. Chris is now in a study break and I am looking forward to seeing him making big leaps forward on the visualizations over the next couple of weeks. [Jörg Micheel] Jörg, Chris, and I went over some possible ways to handle an anomaly which occurred in May 2003, where someone at HP downloaded the same IPLS file hundreds of times, which skews the graph from 2-6 or 7 GB, to 1900 GB downloaded for a day, thereby making the rest of the month so small as to be indecipherable in the yearly views. Jörg wants to emphasize the normal, regular data. Therefore, Chris's graph leaves that data out, and I will be putting an annotation on the graph page. Added the yearly graphs to the index of the PMA Stats page; I also added the expression "by month" to the titles of the users, files, and traffic htmlm4 files. I tried to add the note re the outlier (May) being removed from the yearly graph, but had permission problems again. [Maureen C. Curran] Created a couple of sample navbar images for the Florida-I dataset. [Lana Kennedy]
Ongoing measurement and analysis, networked data, and infrastructure support ~ With Hans-Werner we have made some definitive steps towards getting HPWREN passive monitoring data published and analyzed. I spent a good day researching options on how we can transfer the data from HPWREN onto PMA without causing too much CPU and administrative overhead. We went in circles a few times, including my rebuilding SSH and SSHD to include the cipher NONE. Eventually we settled on a private network consisting only of a twisted pair drop cable between the two machines, serving as a private Gigabit pipe, with RSH transferring the data automatically every night. [Jörg Micheel] To get this sorted, we had to engage in a major troubleshooting party involving the three of us: Jim, Hans-Werner and myself. This incident has re-triggered my thought that there ought to be a new competition in the same spirit as the Internet Landspeed Record. This new category should be the Internet Troubleshooting Record, where the score is the link data transfer rate (e.g., Gigabit Ethernet) multiplied by the number of miles that the various parties are apart (for instance 6500 miles between Hamilton and San Diego). The reason I have not gone much further with this is that I am struggling to settle on whether you get double points for full duplex connectivity and whether there should be demerit points for using cell phones. The standard would just involve email as a means of communication. We surely scored high the other week when we fixed that OC48MON connectivity in Indy. [Jörg Micheel] ~ Servers, system disk, and upgrades - AMP Archiving of AMP and VOLT finished, leaving the disk fill between 77 and 80 percent across the eight data disks. It would be possible to make use of more of the 36 GigaByte disks from the de-commissioned PMA server.
The plan would be to keep /disk0 through /disk7 the same size on AMP and VOLT so that the directory layout stays identical between the two. This was implemented on the AMP data collector. At the next archiving, the data disk fill proceeded as expected and was at the low eighty percent level. However, /disk0 and /disk1 were at the low fifty percent level since those were replaced with 36 GigaByte disks. With spare 36 Gig drives I plan to replace the /disk0 and /disk7 on VOLT to match AMP. [Bud Hale] Following Tony's evaluation of both FreeBSD and Debian Linux, the machine tested with FreeBSD has been reworked with Debian Linux. So both new servers have the Linux OS installed and are ready to be brought online as the new servers. Also, on that subject, this week SDSC NetOps issued two additional IP addresses for the new servers. The new servers are to be configured with two IP addresses each, such that on failure of either machine the remaining machine will take over serving in a more seamless manner. [Bud Hale] Tony decided last week that both of the new AMP Servers should have the Debian Linux operating system installed. This week I began converting the FreeBSD Server to match. After installing the Operating System, I transferred the relevant control and security files and upgraded the kernel to match the original system. Short of a little clean up work the system is ready for development into the AMP system. [Jim Hale] Upgrading the kernel to the 2.6.2 version on the VOLTDB machine was not as simple as upgrading on the AMPDB server. I did not realize how much setup Tony had done before me. I ran into an odd problem: when the new kernel was installed, SSH began blocking connections. I would have liked to find out why, though it was faster to just flush the system and start again. I have had limited time to work on this. With the 2.2.20 kernel, the 1Gig interface was not recognized, and with the 2.6.2 kernel, the 100 Mb interface is not being recognized.
[Jim Hale] Continuing to experience anomalies with the system manager software on photon, specifically processes update_systems, build_XXX.list.pl and build_hosts.pl. It appears the list and hosts building processes fail to recognize and incorporate entries, and this appears to be the root cause of missing sites and data in the AMP Web interfaces data pages. At least three relatively new AMP sites are failing to make it to the Web interfaces data page, although those sites are collecting and transferring network data. Those sites are amp-purdue (Purdue U.), amp-i2su (I2 at Sunnyvale) and amp-ucam (Cambridge U. in England). [Bud Hale] I have the HPSS HSI interface working on AMP and VOLT. I plan to suggest to Tony some changes to the doarchive script on AMP and VOLT to move away from the FTP interface. [Bud Hale] Discussed the HSI HPSS interface with Tony. Consequently I communicated with Mike Gleicher, the IBM contractor who created the HSI HPSS interface. In response to a question from Tony, Mike confirmed that no Perl HSI module exists similar to the Perl FTP module. Mike made a number of suggestions that I will be following up on. [Bud Hale] I have done some work with Larry Diegel of SDSC on the HPSS HSI interface. Larry is developing a recommendation on the best method to assure data archived by the HSI interface is secure before it is deleted from the AMP/VOLT server. [Bud Hale] We had a disk fail on amp, which Jim replaced. I aliased amp's IP addresses onto volt, which kept the Web server up and working well while we fixed things. Jim and I copied the data across from volt (the time zone difference worked in our favour and I took over after Jim went to bed). As I write this I am just waiting for a final copy to finish and I hope to have am_slave restarted before I go to bed. [Tony McGregor] We suffered a disk failure in the AMP server this week. The failure occurred on the same disk that failed about two months ago.
The worst part was that I replaced the disk three times before I found a working one. The disk is now in, and the server is up and running again. I have not paid much attention to the AMP project procedures recently, so it was good for me to practice the routines again. [Jim Hale] I have been researching the 10 Gigabit Network Interface Cards available from various manufacturers. Intel seems to believe their Intel PRO/10GBE LR (long range, single mode) Server Adapters are available from their distributors. So far I have not found a distributor that can get hold of one or even quote a price on them. Distributors like ComputerOne do not expect to be able to sell them before the middle or the end of June. I have seen this in the past. Someone has to have them. So my search continues. Distributors do have the Intel PRO/10GBE SR (short range, multi mode) Server Adapter for about $4500.00. As for the S2io XFRAME 10GB interface, S2io has not received them from their manufacturer, SGI. They too are not expected until the middle to the end of June and are expected to sell for $7000.00. [Jim Hale] ~ Servers, system disk, and upgrades - PMA I spent a good half day researching alternative database systems to replace gdbm, which was causing issues for the anonymization procedure with those huge trace files (and the number of IP addresses to be handled) at the Indianapolis router node. It is nearly impossible to find something simple and straightforward, like a disk-based B-tree implementation, which is robust and performant. Most freeware out there is SQL or even more complicated, way beyond the needs of this application, which just exchanges 32-bit numbers for 32-bit placebos. I finally convinced myself to give Berkeley DBM another try, which I had to abandon some 4+ years ago for being utterly broken and performing poorly, and once I had done all the work it turned out that it is now a much more mature, industrial-strength solution than it was a few years back.
A test run against the previous gdbm version revealed some inconsistencies, which I traced to further bugs in the gdbm solution (assigning a fake IP address more than once), and I am finally happy with the new DBM-based version. [Jörg Micheel] We have had a crash of the pma.nlanr.net server. Jim came into SDSC late on Friday to fix it and it is working again. I have not had the time to switch the machine over to the new system. Jim has also told me that SDSC HPSS admins are waiting for us to discontinue ftp access and switch to HSI. I wonder how we will be able to do that with the new FreeBSD AMD64 system; possibly we will have to get the sources and compile our own binaries. I will look into this. [Jörg Micheel] I noticed that the PMA server had become unreachable--the server seemed to have become unresponsive at the console. After rebooting the server, I noticed that the interfaces in the server were functioning just fine, but I could not reach any other units. I traced it to one of the new Gigabit switches. After moving the connection to the other new switch, the server performed normally. I did not pursue the failed switch any further at the time, but will look into it later. [Jim Hale] Because of a network issue with the unit at Purdue, I began researching serial over LAN options with the identical GigE monitor here at SDSC. This is very interesting research. I should have all 16 collecting monitors inspected in about a month, and when the process is complete, I will issue a report on the findings. [Jim Hale] Endace. I have filed a bug report on the dagrom spinlock issue with the 4.3GE cards and received a reply from Stephen Donnelly with a pointer to the latest 2.4.14 release, along with a hint that the cause might be the stable image on the cards. There is a chance the problem might go away with upgrading that image (which typically should not be done by a customer, but ... :-).
They have also accepted my patch workaround to the dagrom tool to mitigate the problem. [Jörg Micheel] I have been working with Jason Hurd (Endace North America Sales Rep) to find a workable solution for some extra gear we need in Indianapolis and things are looking not too bad, more work to be done here. I have also filed a request for repair on one of the DAG4.2 OC48c cards we found had failed in Pittsburgh. Response was positive, we have an RMA, my belief is that Jim will be following up from San Diego. [Jörg Micheel] NetOptics. I have emailed this company to understand if they would be in a position to provide us with a DWDM filter for future NLR activities. I have also asked to clarify so called -G- type splitter solutions. Have not received a reply as yet. [Jörg Micheel] Existing measurement sites maintenance and troubleshooting: A total of 15 remote sites in the NAI infrastructure received attention during this period: nine have been resolved and the monitors are again collecting data. Five were still being investigated, or pending site action, at the end of the period. (Outages are considered "open" until the monitor is again collecting data.) And one site has been discontinued. AMP - 9 problem sites: 7 resolved, 2 open ~ AMP machines Site amp-ukans remains down after it was disconnected due to security concerns. I am continuing to communicate with Martin Huerter on reconnecting it so it can be examined to determine if there has been a compromise. As pointed out by Tony a short time back, we take all those suspicions very seriously. I worked with Martin to demonstrate that the machine had not been hacked. He was then able to get the site security people to get it back online. [Bud Hale] Site amp-vanderbilt (Vanderbilt U, Nashville, Tenn.) was unreachable by ssh. That meant it was not receiving updates. It was responding to ICMP echo request and collecting data. The site technician rebooted by reset switch to no avail. 
Curiously, ssh login was restored by power cycling the machine as a final effort. [Bud Hale] Site amp-utexas (U. of Texas, Austin) was moved to a new network. That was completed and the machine was collecting data. However, it recently went offline. Investigation revealed the new switch port it had been moved to had failed. That was corrected and the machine came back online. [Bud Hale] Site amp-yale (Yale U.) failed. Investigation indicated it was reachable and was collecting ICMP data. However, no traceroute data was being collected, though it was still reachable by ssh login, so we asked site tech Jeremy George to investigate for us. I restored the UDP trace route function by rebooting it remotely. The monitor now appears to be quite functional and collecting data but is failing to transfer that data to the AMP/VOLT servers. Tony and Jeremy resolved the problem. It turned out a port on their switch was behaving very oddly, letting through some traffic but not the other. [Bud Hale, Tony McGregor] Also down was amp-rnpb (RNPnet, Brasil). This outage was merely an OS hangup. It was put back online with a power cycle. The down time was due to a local holiday and site people were absent. In the recent past the site has been requiring frequent reboots to stay online. I am conferring with the site technician to coordinate a machine replacement. [Bud Hale] Site amp-clemson (Clemson U) had a short outage. That was due to an inadvertent ethernet disconnect by a site technician. It was quickly corrected when the site was notified. However, it went down again and was brought back online with a power cycle. A third outage occurred such that a power cycle will not put it back online. The site technician indicates that it may be a power supply. [Bud Hale] Chris Small of Internet2 reported the amp-i2in (IPLS POP site in Indianapolis) was installed but would not come up. It sounded as if the -48 VDC power source line were reversed. 
While Caroline Carver was at the POP, working with Jim, I asked her to reverse the power lines. The machine powered up and came online. I then completed the setup and started the system manager to start data collection and transfer. That was completed and the site is collecting and transferring data. [Bud Hale] Site amp-uah (U. of Alabama, Huntsville) went down late in the week. Investigation revealed that to be caused by an inadvertent disconnect by site people running network tests. The monitor was reconnected when site people were notified of the disconnect. [Bud Hale] I was contacted by Bruce Morgan from AARNet, who was concerned that their amp machine may have been compromised because it was generating traffic to an unroutable address. The destination turned out to be the PNGS (Pacific Northwest Gigapop) monitor, which was decommissioned because the rack space was too expensive. Even though the site was listed as having no machine in the database, it had also been left marked as active and other sites were still trying to test to it. I have changed the script that generates the config files so that if there is no machine at a site, it will not be tested to, even if it is marked active. I also did a system_manager run to send out the updated configuration files. [Tony McGregor] ~ PMA machines Jim, Hans-Werner and I have worked on surveying and, where necessary, tightening the sysadmin status of PMA machines in the field. Jim has been successful in getting the AMPATH folks to respond on swapping the monitor in Miami one more time (I am very relieved). Jim and I continue experimenting to enable the remote reboot (IPMI) capabilities with the SuperMicro server machines using the SDA system. [Jörg Micheel] Jörg got the Old Dominion University monitor spinning again, after the monitor had been shipped back here for repairs. Sheila Brink (formerly Beilsmith) mentioned that adjustments to the subnet mask had not automatically changed the broadcast address.
As I was trying a command line adjustment to test the solution before adding the correction to the interfaces file, the unit went off-line. Sheila later rebooted the unit. For a time the monitor refused to function, then a short time later it began to operate on its own. Sheila managed to get the monitor going again. A hard correction was made and Sheila announced the problem is solved. [Jim Hale] I have been in contact with Linda Winkler, pursuing getting the APN monitor back on-line; after a short time she informed me she had solved the connection problem. The unit is at STARTAP, and STARTAP has been shut down. [Jim Hale] Jim acquired a CDMA time receiver for the AMPATH site. It arrived on Friday; he will probably have that on site early next week. [Jim Hale, Bud Hale] It appeared sites nai-p-odu (Old Dominion U), nai-p-txg (Texas GigaPop), and nai-p-ncg (National Center for Atmospheric Research, GigE) were online but not collecting traces. Site nai-p-odu was restored, but nai-p-txg and nai-p-ncg are still not collecting. Jim is working with people at both sites to restore them. It seems the nai-p-txg site needs a reboot. There is a replacement monitor at the nai-p-ncg site waiting to be installed. Scot Colburn is the site technician at NCAR. Scot is very helpful and will attend to it as soon as possible. [Jim Hale, Bud Hale] NCG. Jim has sent a replacement machine to NCAR; one of the drives had apparently died. [Jörg Micheel] Site nai-p-apn (APAN OC3 monitor at Argonne Nat. Lab) is off line. [Bud Hale] Worked this week to resolve some issues at the IPLS site in Indianapolis. It appears that after the OC48 monitor there was operational and collecting traces, the fiber connectors got moved from the Dag card receive ports to the transmit ports. That condition persisted after a detailed conversation with site people to explain why only the receive ports are used.
This points out the need to develop a method to prevent attempts to connect fiber connectors to the transmit ports such as a secure blocking plug for that port. [Jim Hale, Bud Hale] ~ Management And Administrative Conducted the references for the AMP FTE; incorporated Tony's concerns into the questions. Although the concerns for each of the candidates are actually covered pretty well in the general list of reference questions developed some time ago, so they needed just a bit of re-emphasis. We came to the conclusion that the various concerns we all had are turning out to be unfounded for both the top candidates. For clarity, I put together and polished PDF versions of the references that I conducted, and sent them to Tony. Ronn and I had a conference call with Tony to discuss the results of the references and Tony's ultimate decision. I called Tony's choice and using the vague, HR mandated protocol language, asked if he were to be offered the job would he be interested - yes he is and will sign and accept the position when UCSD's HR sends the formal offer. Ronn and I discussed salary and relocation reimbursement parameters. Ronn worked with Barbara Carstens, Amy, et al. who came up with a good way to handle the moving expenses reimbursement issue. I also wrote and sent the email for the folks who did not get interviewed, apologizing for the extreme delay since the time they took the programming skills test. [Maureen C. Curran] Worked with Ronn on the screening matrix for the applicants for the AMP FTE. I had previously developed a rough version of the five categories and knew who was in which category which made this a straightforward--though a bit memory-bending--exercise. Sent him the Word doc form and five selection criteria. Then we heard back from Amy and had to meet to do it all over again, including the original resume one for all applicants, not just the interviewees one. 
Apparently when HR worked with Ronn to post the position, they mixed things up regarding the criteria. Ronn exchanged emails with Amy Han about the wording for the FTE hiring paperwork, and then hand-carried the forms to Vijay and on to SDSC HR, where he turned them in to Amy Han. [Maureen C. Curran, Ronn Ritke] We enjoyed a visit (albeit short) by Kevin Thompson this period. Multiple strategic planning meetings and discussions. [Hans-Werner Braun, Ronn Ritke, Tony McGregor, Jörg Micheel] Quite a few interesting discussions regarding future possibilities. It occurred to me that, unlike other network research activities, passive measurements, being link technology dependent, are extremely vulnerable to the rapid changes occurring in the research networking landscape. Those risks come in two forms. One is that the performance envelope is being pushed forcefully (OC12c -> OC48c -> OC192c -> lambdas), in turn forcing us to acquire expensive matching equipment to enable _any_ data collection (even if the actual link loads are comparably small and we could "get away" with much less). The second problem is the lifespan and administrative side of the PMA network of monitors. If you try to "get in early" in the lifespan of a circuit, to have a 2-3 year return from the effort, you are paying a premium on the gear needed. Often, that option is not even available (as was the case with OC48c about 3 years ago, or with OC192c/10GigE equipment until recently). If you engage only when it becomes feasible or affordable, the circuit is retired shortly afterwards (examples: Abilene backbone two years ago, University of Florida Gainesville and the APAN/TransPAC situation as above, to date) and you face frequent transactions on the installation side to get the gear into and out of the field, placing a high burden on the human side of things, both for PMA and at the sites hosting the systems. In practice, we pretty much face both problems nearly all of the time.
[Jörg Micheel] I emailed some additional text to Lee Dolan to complete the visiting scholar paperwork for Klaus Mochalski. [Ronn Ritke] Weekly NLANR/MNA managers conference calls. [Hans-Werner Braun, Ronn Ritke, Tony McGregor, Jörg Micheel] Reports. [Maureen Curran, Lana Kennedy] - 30 -