Summary of Research Activities - April 2004
Development and distribution of measurement and analysis tools~ Continuing development of new metrics and real-time analysis for PMA I have been approached by Pere Barlet from UPC Barcelona to continue the collaboration on real-time work which we started during his visit to Hamilton in January and February. He has sent an email detailing his plans. [Jörg Micheel] ~ Progress on the reimplementation of AMP and the development of a new testing architecture The beta of the new AMPlet package is ready for release, pending resolution of copyright/license issues, on which I am continuing to work. (I will be meeting with the UCSD legal analyst re technology copyright when I travel to SDSC late next month.) I did some work on the AMPlet code, tidying things up after the integration of the throughput test code, doing a little more code documentation and extending the error messages when a self test fails. [Tony McGregor] I extended the amplet remote command processing facility so that it now returns the command status (command exists and is executable, command does not exist, too few resources) and the standard output of the command to the caller. That required creating a new thread to read from the SSL connection and write the status and stdout to a pipe that the caller can read from. The iperf and bandwidth estimation tools really needed that to run reliably. [Tony McGregor] ~ IPMP and IPMP cross-traffic-from-trace (ctft) generator At Mark Allman's suggestion, Tony and I wrote a paper on how IPMP would be used to debug path faults (reordering, loss, jitter, capacity) and how the techniques differ from those that are currently used. The idea is to build support from operators for the protocol. The paper is titled "User Level Path Diagnosis with IPMP" and discusses how using IPMP would allow operators to understand where performance faults like reordering, loss, and jitter occur. It compares the IPMP methods to tulip methods that use existing protocol features.
[Matthew Luckie, Tony McGregor] I had some discussions with Endace that ended with them offering to fund a student to create IPMP transmit code for the GigE DAG card. [Tony McGregor] Tony has been talking to a student to get an IPMP transmit firmware for the DAG. I am going to help supervise him. [Matthew Luckie]
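The amplet remote command facility described earlier in this section (command status plus standard output, relayed through a pipe by a separate thread) might look roughly like the following. This is only a sketch in Python, not the AMPlet code: the status strings and the function name run_remote_command are hypothetical, the child's stdout stands in for the SSL connection, and treating every start-up OSError as "too few resources" is a simplifying assumption.

```python
import os
import shutil
import subprocess
import sys
import threading

# Hypothetical status strings for the three outcomes named in the report.
STATUS_OK = b"OK"                      # command exists and is executable
STATUS_NOT_FOUND = b"NOT_FOUND"        # command does not exist
STATUS_NO_RESOURCES = b"NO_RESOURCES"  # too few resources to start it


def run_remote_command(argv):
    """Start argv; return a file object yielding a status line, then stdout.

    A background thread plays the part of the thread that reads from the
    SSL connection: here it reads the child's stdout and relays it into a
    pipe that the caller reads from.
    """
    read_fd, write_fd = os.pipe()

    def relay():
        # Closing the write end when done signals end-of-file to the caller.
        with os.fdopen(write_fd, "wb") as sink:
            if shutil.which(argv[0]) is None:
                sink.write(STATUS_NOT_FOUND + b"\n")
                return
            try:
                proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
            except OSError:
                # Assumption: any failure to start is reported as a
                # resource shortage (e.g. fork failing).
                sink.write(STATUS_NO_RESOURCES + b"\n")
                return
            sink.write(STATUS_OK + b"\n")
            for chunk in iter(lambda: proc.stdout.read(4096), b""):
                sink.write(chunk)
            proc.stdout.close()
            proc.wait()

    threading.Thread(target=relay, daemon=True).start()
    return os.fdopen(read_fd, "rb")


if __name__ == "__main__":
    with run_remote_command([sys.executable, "-c", "print('hello')"]) as out:
        print(out.read().decode())
```

Sending the status line first lets the caller decide whether any output is worth waiting for, which is the property a long-running test such as iperf would depend on.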
Extending the Network Analysis Infrastructure (NAI) in support of new and developing HPC needs~ New (and developing) strategically important measurements and deployments I prepared the AMP monitor for Beijing, China. I took care of the documentation for shipment, and the monitor was sent. [Bud Hale] We had a second query for an AMP machine from China. We are going to suggest that the two Chinese sites coordinate amongst themselves, with our support. [Tony McGregor] Also continued to prepare the Internet2 machines. The first machines for the Internet2 Chicago location have been prepared. I am now relocating the mounting brackets to a center mount position and testing with the -48 VDC sourced power supply. Two Internet2 monitors were shipped: the first of the two will go into the Indianapolis I2 POP, and the second will go into the Chicago I2 POP. [Bud Hale] Received a request for an AMP monitor at Cambridge University in England. [Bud Hale] Meeting with Bruce Morgan at PAM. AARnet is upgrading the Australian backbone to 10GE. The two 10GE links from Sydney to the US are configured the same way as the previous two links to the US. The one NLANR AMP in Sydney may be able to measure both new high-speed links to the US. Also went over future plans for links between Australia and Asia. [Ronn Ritke] While at PAM2004 talked with Paul Schopis (OARnet) about hosting an OC48 monitor. Jörg followed up on this and it looks like it will be put on their link to the IND GigaPOP. [Ronn Ritke] Am taking several measures to increase site participation in the PMA project. Have emailed Kostas Pentikousis at SUNY Stony Brook, who is enthusiastic, but needs more information before proceeding. Similarly, approached Marwan Sleiman from UCONN, but have not had a response as yet. [Jörg Micheel] Bill Cleveland (ex Bell Labs) at Purdue is looking into an opportunity at his university for another monitor. 
I still need to get in touch with Greg Cole (NCSA) to see if we can arrange to put a pair of OC3MONs at Northwestern, for GLORIAD. [Jörg Micheel] Am discussing via emails with Matt Z, Rick Summerhill the IPLS instrumentation. [Jörg Micheel] ~ IPv6 and IPv6 Scamper Spent some time doing some work on the 'ring' tool getting it to work under Linux, and thinking about how I might do reverse path tracerouting with Scamper for IPv6 networks. The basic idea is to source route ttl limited probes via the end host back to the source of the traceroute, which will give us the reverse path. (This is in the queue to be worked on a later date, as time permits.) [Matthew Luckie] Exchanged emails with David Malone regarding 6to4 tunnel discovery. I sent him some data on what I did with AMP to discover the location of each AMP monitor's closest 6to4 anycast tunnel. I also sent him an address list to help with his work. David was referred to me by Pekka Savola based on some earlier work I did with Scamper so that was nice. David also offered to host a Scamper monitor. [Matthew Luckie] I sent an email to Bill Owens when I was preparing for the talks as I wanted to talk about something I saw on Internet2's IPv6, but found that NYSERNet was largely not routing to a bunch of the other AMP monitors. The visualizations for those couple of days are pretty interesting: [Matthew Luckie]
Outreach, application support, utilization improvement, and documentation activities~ Impact: Data-Users and Citings I am touched by the support we have received from the AMPATH team at FIU. Both Ernesto Rubi and Eric Johnson went the extra mile to help NLANR with time and equipment and also to make my stay worthwhile and productive. Eric noted how useful AMP has been for him to troubleshoot and respond to reported network changes and misconfigurations that lead to outages and quality-of-service degradation reflected in trouble tickets. [Jörg Micheel] ~ Papers Submitted a paper entitled "User Level Path Diagnosis with IPMP" to the SIGCOMM Workshop - "Network Troubleshooting: Research, Theory and Operations Practice Meet Malfunctioning Reality." [Matthew Luckie, Tony McGregor] I am the co-author of another paper being submitted to SIGCOMM on debugging path differences between dual stacked v4/v6 nodes. The primary author on that paper is Kenjiro Cho (WIDE). [Matthew Luckie] A paper with Jian-Bo Gao was beyond the scope of one conference and was submitted to another conference. [Ronn Ritke] ~ Presentations and Conference/Meeting Participation PAM2004, Juan-les-Pins, Antibes, France, April 19-20, 2004 ~ Attended PAM and presented my paper. It went reasonably well; I had a few questions afterwards. I also talked to quite a few people about various projects that we might work on together. I had offers to host 2 new international AMP machines. [Tony McGregor] Attended the PAM2004 meeting. While there, met with Nevil Brownlee about future PAM meetings and talked with Paul Schopis (OARnet) about hosting an OC48 monitor. [Ronn Ritke] Quite to my surprise, this year's PAM turned out to be very productive and fruitful. There were a number of very useful and interesting presentations which can serve as input for future developments and new ideas.
[Jörg Micheel] Joint Engineering Team (JET) Meeting, Arlington, VA, April 2004 ~ Ronn, Tony, and Jörg attended the measurement workshop discussion session via teleconference; a main topic was the requirements on measurements. I think the JET Meeting was very refreshing in terms of receiving an update from all the JET parties on their status and goals, and I am thankful to Kevin Thompson for inviting us. In the course of the discussion we also got feedback from Matt Zekauskas that Internet2 is positive about placing the 10 Gigabit monitors into Abilene at Indianapolis. [Jörg Micheel] CAIDA-WIDE workshop at ISI: attended and presented. The talks went reasonably well. I spoke on IPv6 DNS misconfigurations that I saw in data I obtained from a DNS walk, and on Scamper developments (path-MTU, alternative path discovery, measurement of those alternative paths, reverse-path traceroute). Brad and Kenjiro had some useful suggestions for future Scamper developments. [Matthew Luckie] ~ Collaborations and activities supporting network research Working with Matt Z and Rick Summerhill re IPLS instrumentation. [Jörg Micheel] Pere Barlet will be continuing his collaboration with us, which will build on the real-time work he performed during his visit to Hamilton in January and February. He has introduced us to a colleague of his working on active measurements. It looks like there could be an opportunity to develop a bandwidth estimation module for AMP; Tony had some preliminary discussions with him. [Jörg Micheel] In discussion with Ernesto Rubi and Eric Johnson, FIU (AMPATH), we determined the needs and expectations that AMPATH has towards NLANR in our joint project, and it appears that real-time analysis broken down on a per-VC basis would be interesting and useful. AMPATH already uses MRTG and it would be a matter of complementing those graphs with more detailed information (see http://www.net.fiu.edu/mrtg/ampathgsr.html).
Also we had a long phone conference with one of Harvey Newman's (CalTECH) colleagues at FIU who is exploring the integration of AMP and PMA data into MonaLISA for Grid monitoring. [Jörg Micheel] The guys at UFL have also briefed me on their active monitoring project; they have already deployed around 30 1U systems with GigE interfaces around the campus and are also facing some 20-30 other units around Florida. I spoke with V. Alex Brennen <vab@cryptnet.net>, a new hire who will be responsible for this project. I briefed him on some of the AMP goals and architecture and urged him to be in touch with Tony, as there is a good chance for some excellent collaboration, specifically because of the plans to look into extending AMP into the campus and the status of the AMP software beta package. [Jörg Micheel] Spent a considerable amount of time talking to various people about PMA related matters. With Arne Oslebo and Olav Kvittem I discussed the possibility of receiving a couple of Gigabit traces from UNINETT Norway. With Markus Peuhkuri, Helsinki University of Technology, we discussed the opportunity of getting a couple of OC48c traces from the Finnish research network. With Constantinos Dovrolis of GATECH we explored the opportunity of placing any type of monitor, possibly a GIGEMON, at their network access. Constantinos is also keen to get his hands on our latest Gigabit and 10Gigabit traces and we will be sharing analysis results. From Paul Schopis of OARnet we have a positive response regarding placing an OC48MON at their feed to Abilene, which turns out to be Indianapolis, and we could arrange that along with the planned instrumentation of the IPLS backbone node next period. I took the opportunity to announce the availability of further PMAMONs during a session break at PAM2004. [Jörg Micheel] Brief meeting with Margaret Murray regarding TeraGrid monitoring. [Jörg Micheel] Spoke with Teri Simas about PRAGMA6 meeting preparations.
[Ronn Ritke] Discussed an SDSC TeraGrid measurement proposal with Tony, Jörg, Bud, and Jim. Spoke with Margaret Murray and Vijay Samalan about it as well. We hope to have a conference call soon. [Ronn Ritke] I answered some questions on AMP and PMA for the OptiPuter group. [Ronn Ritke] Met with Amy Andrews from DOD about NLANR measurement activities. She requested a copy of the international collaborations white paper. [Ronn Ritke] I exchanged some email with Phil Dykstra in regard to collaborating with him on testing S2io 10GigE NICs. He is happy with that. And perhaps adding the NLANR voice to his request might help him get the cards loaned from S2io. [Bud Hale] Am spending considerably more time on network analysis for HPWREN, which will hopefully soon involve more of NLANR/Jörg, as Jörg has expressed interest in the passive data we are collecting for his/NLANR purposes. Jim had put a machine together for me that is powerful enough to do 24/7 traces plus analysis. However, presumably it being an AMD64 machine running 5.2.1, it is just as flaky as a similar (though different system board) AMD64 machine I have at home. Before the last crash I had an initial automated setup working that collects a 24 hour trace, then starts a new one, analyzes the old trace, and emails the results. Currently it displays the top N HPWREN addresses for sources and destinations. [Hans-Werner Braun, Jörg Micheel] Worked with Ian Pratt re PAM2004. [Jörg Micheel] PAM2005 reportedly may happen in Boston, MA, with Mark Crovella as the General Chair. [Jörg Micheel] Discussions with Klaus Mochalski in Leipzig regarding his proposal to visit SDSC in July and August this year, also in regard to Klaus Degner's visit October-January 2005. [Jörg Micheel] Ronn and I wrote and submitted an application to CISCO for the development of a 10GigE AMP. Bud and Jim also helped with costing and ideas for equipment. Most of the budget was for equipment. 
(2x Intel 10GigE nic, 2x S2io 10GigE nic, 2xPCI-x PCs). [Tony McGregor, Ronn Ritke] Phone call to John Towns to coordinate the NLANR review presentations. [Ronn Ritke] ~ Documentation, Web work, networked data, publications Took the long planned and extensively developed new NLANR/MNA home page live. The new home page's primary attribute is the "Latest Pings" section, which will be regularly updated. I designed the format of the section so it can be quickly scanned to see the contents. Chose which activities to use as the first "Pings"; wrote very brief descriptions for each and compiled the related link addresses. Also chose the activities for the highlights section. Lana had put together a top "20" activities list for me as background. Also wrote some text re "MOAT" so that visitors to our site will not be confused when they run across the acronym. Added NLANR/DAST to the page, using some language on their mission that I grabbed from their Web site. Had already written the acknowledgment text some time ago. http://mna.nlanr.net/ [Maureen C. Curran] I created a highlights list with pointers to information about our main projects. [Lana Kennedy] Before being able to take the home page live, I ran into one of the weirdest problems/bugs I have ever come across with a Web page: when you moused over the links in the directory section (bottom right hand side), only in this section, only in Internet Explorer, it expanded the whole right column to 3-4 times the original size. I ended up redesigning the backend of the whole page (and making some design changes as well), had it completely validated (html validation service, came out with no errors on two different validators - not lint pickers, but real validators), and STILL it had the bug. 
So I emailed my Web guru buddies at SDSC and one came back with the answer: contrary to the accepted and supposedly preferred practice of using percentages, when you have an image that you are using as an <hr> function, if you specify the width in pixels, you do not have any funky formatting or movable layout. One tiny little thing, which I had disregarded when I did my line by line checks since there is one in the other column that works fine. [Maureen C. Curran] Using the language that I wrote recently (overviews of each subproject), I totally revamped the Network Analysis Infrastructure (NAI) page. It has current status overviews of the PMA and AMP projects (and uses the new current maps for each). This is linked from the opening paragraph on our work as well as in the directory. I have temporarily excluded the BGP/network routing stuff from it. I will be adding it back (or rather splitting it off) when I have the chance to write the material. http://mna.nlanr.net/infrastructure.html [Maureen C. Curran] Designed and created the "historic" Web page banner with both our logo and NSF's, sent out to moatstaff for cross browser checks because it has a lot of text (came out fine even on the problematic browsers). It is to be used on pages which the community may still want to see for historical info or background, but which have not been modified or updated, so may have broken links and old info. Received helpful feedback from Gail, who suggested I unbold the primary text; it has a much cleaner look now. Uploaded the new version to moat. [Maureen C. Curran] Tony did a quick awk script for me to be able to generate a list (in most accessed order) of the number of times a Web page has been accessed over x period. This will provide me with information so I can prioritize the order in which pages will have the Historic Page banner added to their html. [Maureen C. Curran] Created the MNA template m4 system templates (head, content, and bottom).
Reviewed all the primary Web pages (on moat) and made a list of them and what type of page style/format they have (somewhat current, old, really old). [Maureen C. Curran] Worked w/Hans-Werner on the m4 system relative to those html files in the top directory of var/Web. Ran some test files; he took care of the permission problem and the system works great. [Maureen C. Curran] Converted the following current Web pages on moat to the m4 makefile system: [Maureen C. Curran]
Worked for several days trying to shape the hardware Web page. It turns out that quite a bit of background information is necessary to explain the nitty gritty details (and pitfalls) and I have so far only worked on the fibre optic splitter part. The question is whether too much detail will discourage the technically savvy from reading the parts that are still important, and how to structure the document for the various readers with different levels of background knowledge. Maureen and I discussed some of this during the Thursday conference call and she has offered assistance as soon as she is finished with her work on reports. [Jörg Micheel] Special Traces multiple index pages. Following up on something that Jörg and I discussed a bit ago, I created a new format for the Special Traces index page (table with 2 columns) to clean up the format. Checked the publishing dates of each Special Trace currently available, put them into chronological order for a second index page (most recent first). Added text re options for overview of Special Traces, cleaned up some of the descriptive language as well. Designed a format for the data set comparative studies index page. Then I created the three new index pages, worked through a few of the format aspects by trial and error, and tested across browsers. Jörg and I met when he came to town and he gave me a full list of metrics by which the Special Traces can be grouped (as well as which belong in each group). We now have five groups (just had two for the test/draft versions). [Maureen C. Curran] Added San Diego-I to the Special Traces multiple index pages and created the new comparison groupings index page with the groups and members that Jörg had laid out for me. After which, I sent out an announcement about these pages (received good comments about them). [Maureen C. Curran] Have edited several scripts and Web pages relating to PMA infrastructure, pma.nlanr.net/Sites/ now reflecting the actual status.
[Jörg Micheel] Started on revamping some of the aspects of my current implementation of the Weblogs after an excellent chat with Jörg. Got a number of graphs done and did some of the final touches to the draft script that will generate graphs based on the archived ftp activity from past months. Right now it has two main areas where it can break. The first is when someone erroneously puts an "!" in their anonymous login user name; the second is that Perl complains about variables not being initialized if there is no data on the first day of a month. The first of these two bugs is the worse one, as the second has no effect on the output; it just looks wrong. Right now the graphs show the amount of data transferred, the number of users that month, and the number of files transferred. [Chris Gross] Updated the AMP template m4 system templates to include changes that I have developed while working on the PMA pages recently. Converted the IPv6 and IPMP main pages to the template. After which, Tony converted many of the other primary AMP pages. [Maureen C. Curran] I spent a couple of hours reading about php, in preparation for implementing the new central site Web pages. I updated about 20 amp pages to use the new template, including the splash pages that Ben did. [Tony McGregor] I had asked Lana to take the AMP boiler plate language and the language of a couple of the existing AMP pages (intro, active, etc.) and create a rough draft of a new front page for AMP. She did an excellent job, but it does need to be split up into a couple of pages. Since I did not have time to do this before taking the new MNA home page live, I decided to use the version that Lana has done (it is up to date and looks loads better than the existing one) and Tony agreed. [Maureen C. Curran] Created a new temporary home page for AMP and sent it to Maureen, who recently took the page live.
The text is too long, but it serves as a good placeholder for now, and Maureen will show me how to distill it later to the proper length. [Lana Kennedy] I made relatively significant changes to the data structure management for the data library. It now uses custom memory management which improves the overall performance significantly. I am working to generalize the list management and memory management code in the amp datalibrary. In particular, I am experimenting with using large memory blocks that are all chained together and then carving the memory that I need out of them, rather than calling the system for the memory. [Ben Reesman] I implemented a system to estimate the file sizes and allocate memory accordingly, replacing a simpler linked-list based approach that I had previously used. I put some effort into tweaking the logic that estimates the number of records in a file by its uncompressed size, and the results have improved. [Ben Reesman] I am trying to come up with a reasonable way to decide when the system database needs to be queried, and also a reasonable set of policies for what queries to conduct and when. Since there is no benefit in performing all the different queries together, it makes sense to run only the queries for the data that the user wants; however, it is important not to allow inconsistencies to creep in if some of the data is newer than other parts of it. [Ben Reesman] ~ Ordered more print copies of the current issue of the NATimes for PAM2004. They were sent along with the OC192/10GigE - International one-page handouts. Ronn handled distribution and reports that there was a very good response to these handouts. [Ronn Ritke, Maureen Curran] Worked on graphics for the upcoming PAM2004 conference in France. Assisted Ronn Ritke in preparing his presentation for PAM2004. [Mike Gannis] Assisted Ronn Ritke with PowerPoint slides for upcoming review. [Mike Gannis] I did an outline and started to prepare my slides for the NSF review.
[Tony McGregor] I emailed my draft NSF review slides to Tony and Jörg for comments. [Ronn Ritke]
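The data library memory management mentioned above (large blocks chained together, with the needed memory carved out of them instead of asking the system each time) can be illustrated with a small sketch. It is written in Python for brevity and is not the AMP data library code; the class name BlockArena, its methods, and the default block size are all assumptions made for the illustration.

```python
class BlockArena:
    """Carve small buffers out of a chain of large blocks.

    Hypothetical sketch: names and the default block size are assumptions,
    not taken from the AMP data library.
    """

    def __init__(self, block_size=1 << 20):
        self.block_size = block_size
        self.blocks = []   # chain of large blocks, oldest first
        self.offset = 0    # next free byte in the newest block

    def _add_block(self, min_size):
        # One request to the system allocator serves many later allocations.
        self.blocks.append(bytearray(max(self.block_size, min_size)))
        self.offset = 0

    def alloc(self, size):
        """Return a writable view of `size` bytes carved from the arena."""
        if size < 0:
            raise ValueError("size must be non-negative")
        if not self.blocks or self.offset + size > len(self.blocks[-1]):
            self._add_block(size)
        start = self.offset
        self.offset += size
        return memoryview(self.blocks[-1])[start:start + size]

    def reset(self):
        """Free everything at once, keeping the first block for reuse."""
        del self.blocks[1:]
        self.offset = 0


if __name__ == "__main__":
    arena = BlockArena(block_size=64)
    record = arena.alloc(16)      # carved from the first block
    record[:4] = b"amp\0"
    arena.alloc(60)               # does not fit, so a second block is chained
    print(len(arena.blocks))
```

The usual trade-off of this design is that individual records cannot be freed; memory is released for the whole arena at once, which suits data that is loaded and discarded together.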
Ongoing measurement and analysis, networked data, and infrastructure support~ Servers, system disks, and upgrades - AMP We have received five new AMP monitors from the vendor. These new machines are in one RU chassis that are only fourteen inches deep. These will be used for the Cambridge site as well as the Russian site. [Bud Hale] I spent some time trying to build a new kernel for the Debian amp server, but did not make much progress because I had to ask Jim or Bud to reboot it each time I made a bad kernel. Jim took that over and has it running. I did a few more performance tests. [Tony McGregor] I had the opportunity to concentrate on compiling Debian Linux Kernels. This came up with the new Debian AMP Server "AMPDB". It was an excellent exercise. I learned a lot and only had to sacrifice one machine to the KERNEL GODS. [Jim Hale] We had a whirlpool routing loop at XTRA, my ADSL service provider, and with Tony we have tried using AMP to trace and debug the problem. It appears as if only parts of the network were affected, I could not reach the US, but I could log into Waikato and Auckland, and go further from there in a second hop. The loop eventually disappeared after six hours, but it looks like it might be useful to keep my home machine as another node in the AMP infrastructure, if we can somehow figure a way to pay the traffic charges. [Jörg Micheel] I spent some time ordering and assembling the AMD 64 +3200 processor based network data collection machine for HWB for the HPWREN network. Upon installation, problems appeared to be occurring. After the AMD64 ITTRACKER network data collection monitor began showing problems, I offered to assemble a XEON based collector from a SuperMicro board left over from a previous project. So far the monitor seems to be showing positive results. I am anxiously awaiting results in monitor performance as well as network analysis. [Jim Hale] In reaction to the severe break-in experienced at SDSC, the ftp to HPSS was turned off. 
I discovered this upon starting the archive process on AMP. After much coordination with the HPSS people, a temporary work-around was applied. That was a TCP-wrapper type of access solution such that the legacy ftp is allowed from the AMP and VOLT servers. However it is emphasized that this is temporary and we are strongly urged to move to the HSI HPSS interface as soon as possible. I am pursuing the implementation of AMP data archiving on that interface. [Bud Hale] I continued to consult with the HPSS group on the effects of the recent security compromises at SDSC. A result of the compromises is that the HPSS group plans to terminate the use of ftp and pftp on the HPSS and stresses the need to implement the HSI interface. I am continuing to work on implementing that for the AMP and VOLT server archiving. [Bud Hale] I am continuing to implement the HSI HPSS interface. I have the latest HSI client on both the AMP and VOLT servers in /usr/local/bin. I have a request in to install the DCE keytab for the ACTMON user account. Also it seems that Colleen Shannon of CAIDA has some UNIX script code for HSI I can plagiarize and save some time and effort. [Bud Hale] The security exploit mentioned above was through an SSH vulnerability. There was some fear that some NLANR infrastructure monitors may have been compromised. I have been examining NLANR sites, one by one, to look for indications of break-ins. As time permitted, checked the AMP site monitors, one by one, for possible changes that might indicate a compromise. So far I have found nothing to indicate that any one of the NLANR sites was compromised. [Bud Hale] We experienced a disk failure on the AMP server. Jim was able to come in and start to work on replacing it right away. However the situation was worsened by the failure of a second disk. That also caused a bit of confusion and expanded the data recovery problem. Full recovery was completed.
During the recovery process issues arose regarding the Web page access with one machine down. It appears that long delays were encountered in user access. The cause of the delays is not clear and some additional study of the architecture of the AMP and VOLT servers and the Web page links may be needed. [Bud Hale, Jim Hale] ~ Servers, system disk, and upgrades - PMA Continued to work with CAIDA to improve the time distribution to NLANR and CAIDA equipment in the machine room. Working on details in connecting the NetOps GPS receiver to the NLANR and CAIDA TDS-24 time distribution units. These units have been using an EndRun Technologies Praecis CDMA time receiver. However the time tagging accuracy using a GPS receiver PPS signal is said to be at least an order of magnitude better. After some troubleshooting and corrections we have the GPS receiver adapter circuit working. However additional troubleshooting revealed the GPS PPS input to the NLANR TDS unit is inoperable. But, the CAIDA TDS unit GPS PPS input is functional. (The CAIDA unit is an Endace production unit while the NLANR unit currently installed there is an engineering prototype from Endace.) As of now we have the CAIDA TDS unit driven by the GPS receiver and the NLANR TDS driven by the EndRun Tech. CDMA receiver to continue providing time signals to the OC192 monitors and the IPv6 AMP monitoring in amp-sdsc. A number of options exist, including connecting the NLANR devices to the CAIDA TDS unit and/or repairing the GPS input to the NLANR unit. I communicated with Endace to attempt to get diagrams to use to repair the prototype TDS24 in the NLANR racks. However I was surprised when Endace replied that maintenance data could not be supplied since the TDS24 system is proprietary.
[Bud Hale, Jim Hale] In regard to Endace TDS units, during a staff meeting this period we were reminded that some equipment, including a TDS-24 unit purchased from Endace for the Indianapolis router clamp installation, was left in Indianapolis for an Internet2 engineer to do some experiments. I am following up on this equipment. John Hicks has not provided an inventory of that equipment as yet. He has promised to get that done shortly and get it ready for shipping. [Bud Hale] I have continued to work on putting together the components for the new PMA server; worked out accessibility problems with some components. The new PMA server was completed this period. It is a dual AMD Opteron, two TB server meant to replace the old PMA. [Jim Hale] I got the opportunity to work closely with Jörg again during his latest visit. This trip we installed the new PMA server. The server's early results seem very promising. I look forward to seeing the solution to the crashing problem. I think Jörg suspects these units could have some fine memory requirements. [Jim Hale] Back in San Diego I have spent time with the team, and with Jim we have launched the next generation pma.nlanr.net, which is the dual AMD Opteron machine. Not unexpectedly, we have run into some hardware problems, which we are about to corner, and hopefully we will be in a position to make it a stable platform soon. [Jörg Micheel] I noticed on the 28th we stopped receiving fping notices from the PMA server. Connection to PMA got really slow and failed half of the time. The new PMA server became unreachable as well as SDA and the OC192 machines. Further investigations showed the server was unable to reach the assigned nameserver. After changing the nameserver, I noticed the /etc/hosts file lacked entries. I corrected the file, though it looks like I will need to continue the investigation. [Jim Hale] Presently, we are working on enabling IPMI remote server management on the SDA monitor, as a prototype for upcoming PMAMON systems.
[Jörg Micheel] Continuing to work on issues with the new PMA server. I am working with freebsd.org and Tyan Corporation on compatibility of the joint systems. Both are investigating questions. [Jim Hale] The HPSS has been down this week, with no ftp access, which is a major pain when it comes to securing the data collected. I have been in touch with Rachel Chrisman to get HSI-kerberos passwords, when, all of a sudden, tonight the system started operating via ftp again, which is a relief. [Jörg Micheel] Existing measurement sites maintenance and troubleshooting: A total of 13 remote sites in the NAI infrastructure received attention during this period: 5 have been resolved and the monitors are again collecting data. Eight were still being investigated, or pending site action, at the end of the period. (Outages are considered "open" until the monitor is again collecting data.) AMP - 7 problem sites: 5 resolved, 2 open ~ AMP machines The amp-surf (SURFnet in Amsterdam, Holland) site lost the network interface again. Through the "out-of-band" access the ifconfig "down" and "up" commands restored the interface. I have decided to ship a new 3Com GigE NIC to that site, but if the NIC is the cause it will take several months to verify that conclusion. The site then had three more outages this period. All were caused again by the GigE interface losing its connection to the Ethernet, and were corrected through the "out-of-band" login. I have acquired some additional GigE NICs and will ship one to the site technician next week. I will also ship a GigE NIC to Chris Thomas at UCLA to add a GigE connection to that monitor. [Bud Hale] AMP site amp-umich (U. of Michigan, Ann Arbor) was discovered to have an ICMP block in the router on the monitor network. I was able to get site technicians to remove the block and it is back to full data collection. [Bud Hale] AMP site nai-a-utk (U. of Tennessee, Knoxville) was down due to a power cabling re-work at the site, but was quickly restored.
[Bud Hale] Site amp-wustl (Washington U. at St. Louis) was out. Site technicians were asked for support and replied that the machine had failed. The site technician reported that it appeared to power up but would not show video on a monitor. It appeared the best plan was a replacement monitor. The replacement was shipped, installed on site, and initialized, and the site is now back online. [Bud Hale] Site amp-ukans (U. of Kansas) had a short outage but came back online shortly after an inquiry. No report as yet as to the cause. [Bud Hale] Site amp-cudi (Internet2 site in Mexico City) went down over last weekend. A message was sent to the site technician and they reported back that the machine had been rebooted. It went down again, and another message of inquiry was sent; the site technician reported they were investigating. [Bud Hale] Site amp-csupomona (Cal. St. Pomona) went down early Friday. The site technician Ken Diliberto informed us that CENIC had changed the AMP monitor network to a /16. Unfortunately the notice was very short, so some additional steps were needed to get the machine back online. The machine is back online but only temporarily. It will be moved to another network soon. CENIC is in the process of network redesign at some of the Cal. State campuses. [Bud Hale] Otherwise on the AMP sites I am investigating some 100 percent losses between some sites such as amp-msoe (Milwaukee School of Engineering), amp-arizona (U. of Arizona), etc. [Bud Hale] ~ PMA machines After the PAM2004 conference Jörg went to Florida to attend to the AMPATH - nai-p-amp (AMPATH GigaPop in Miami at FIU) and UFL - nai-p-fla (U. of Florida at Gainesville) monitors:
After Jason Tasker replaced the fiber optic cable going to our monitor at the Texas GigaPop machine (nai-p-txg), collection was functioning well. Then, however, data collection was lost for some time. I thought the problem we were suffering earlier had returned. It turns out I had overlooked a PMALOCK file that had not been erased. I removed the file and data collection returned. In spite of this, the site continues to be an issue. I thought I had it licked, but it will require some additional attention. I fear we may be losing the enthusiasm of the local technicians. If Jörg cannot attend to this issue on his current trip, it may be worth making a trip myself. The data during good collection periods seems very worthwhile. [Jim Hale] I worked on preparing to install the new kernel on the SDA monitor (nai-p-sda). I learned a ton watching Jörg work on SDA. The problem with the SDA machine that Jörg tracked down was, to me, a very subtle coding issue. It was very instructive to see how he detected the difficulty, pursued and pinpointed where it resided, examined why it was an issue, and finally resolved it. We then installed the Intelligent Platform Management Interface feature on the SDA. It holds great promise for future monitors. I look forward to a better understanding of this capability. [Jim Hale] I was able to get MRA rebooted and back online, though it still does not collect data. I have messages in to Fred Rowe and Bert Rossi to check our connections to the splitters. Tests show the monitor is operating fine; we are just not seeing any data. This has to be due to something that occurred at the remote end. I look forward to hearing from Fred or Bert. [Jim Hale] Ten PMA sites are up and collecting data. In continued troubleshooting of nai-p-txg (Texas GigaPop), Jim discovered and removed a PMA lock file. It should be collecting again. Site nai-p-apn (APAN at the U. 
of Illinois, Chicago) apparently has the PMA lock file set. Jim has requested that Linda Winkler get it rebooted. [Bud Hale] As mentioned previously, three PMA units are to be returned to SDSC: from U. of Buffalo, U. of Tel Aviv, Israel and one from the MAX GigaPop in DC. The unit from U. of Buffalo was received last week. I have been regularly calling and e-mailing Dan Magorian of the U. of Maryland, College Park, Maryland asking him to return the equipment. The equipment included the OC48 monitor, the CDMA time receiver and the NetOptics splitter pair chassis in a rack mount bracket. This week we received the shipment but the NetOptics splitter pair chassis was not there. When I asked Dan Magorian about the splitter chassis, he replied that it would not be returned. He said he plans to leave it installed for future use. While being cautious to avoid additional misunderstandings, I will determine if other options need to be explored. [Bud Hale] The monitor from MAX GigaPop arrived. First thing, Chris Gross's home directory was moved to the PMA server. After that a post-mortem was performed, with interesting results. [Jim Hale] Jim transferred all of Chris Gross's code from the MAX machine to Chris's account on the PMA server. We will hold the machine intact for a while in case other material on the machine is needed. Then we will use it for another application. [Bud Hale] ~ Management And Administrative Weekly NLANR/MNA managers conference calls. [Hans-Werner Braun, Ronn Ritke, Tony McGregor, Jörg Micheel] Reports. [Maureen Curran, Lana Kennedy, Mike Gannis (January)] Compiled an overview of coding skills test results to date, listing which problems each test taker completed, the total of the relative point values, and the time each took on the test. I sent it to Tony so that he could make an early decision on the MN candidate (he has, by far, done the best on the test). 
We decided to invite him and to let the second preferential candidate know that he did not demonstrate the coding skills required for this position. Administered the last three coding skills tests. [Maureen C. Curran] Ran the test results of the top six test takers. I had multiple problems at first but, after getting more information, was able to run the code properly (at least for the ones that worked). The results of the test ruled out three people whom we thought we would be interviewing. As Tony mentioned in his email, the results prove the worth of testing applicants. Tony and I had a long conversation regarding some logistical aspects of the interview process. Had a couple of follow-up phone calls with candidates. [Maureen C. Curran] I sent the test follow-up emails to all test takers (how do you think you fared and what suggestions do you have for us for the future?). We had responses from all three finalists and found them very informative. Sent the overview page with information on us and the position to the three interviewees, with instructions and the time that we have agreed upon. Worked with Tony on the questions for the interviews and put them together for us. Earlier in the week Tony and I came up with 5 questions to send to the interviewees in advance so they could think them over. [Maureen C. Curran] Interviews of the top three candidates took place on Friday April 23 (arranged to be held after PAM2004 so that Tony could travel from France to the US before returning to NZ). Tony, Hans-Werner, Ronn, Bud, and I were the panel for the interviews. Tony divided up the questions which we had previously developed such that he handled the bulk of the technical questions, Bud handled documentation, Ronn the administrative and future goals, and I did the "how do they fit with us" questions. We discussed the interviewees and have narrowed it down to two, and it is a tough choice - different skill sets and strengths (as well as possible weaknesses). 
I will be conducting the reference checks not just to "anoint" our choice, but in fact to gather important additional information to help shape the decision. We added some candidate-specific questions to the usual list we use. [Maureen C. Curran] Travelled to SDSC and interviewed for the AMP FTE. We are down to two candidates with a difficult choice between them. Talked to others at SDSC too, including a about management of students and some joint work with CAIDA. [Tony McGregor] Planning for Klaus Mochalski's visit in July and August this year and Klaus Degner's visit October-January 2005. [Jörg Micheel] Ronn and I have also finished the research collaboration proposal NLANR-Endace with UCSD authorities and I have sent a copy back to New Zealand for feedback. [Jörg Micheel] - 30 -