EGEE-WLCG-OSG operations meeting
14th May 2007
The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=16286
OSG grid operations:......... Rob Quick
Asia Pacific ROC:............. Min Tsai
Central Europe ROC:......... Marcin Radecki
OCC / CERN ROC:........... Maite Barroso, John Shade, Antonio Retico, Nick Thackray, Steve Traylen
French ROC:.................... Gilles, Osman, Rolf, Pierre
German/Swiss ROC:......... Clemens Koerdt, Sven
Italian ROC:...................... Paolo
Northern Europe ROC:....... Per, Jules
Russian ROC:................... Absent
South West Europe ROC:.. Absent
South East Europe ROC:... Kostas, Ioannis Liabotis
UK/Ireland ROC:............... Philippa
WLCG service coord.......... Harry Renshall, Jamie Shiers
WLCG Tier 1 Sites
CERN site:....................... Ignacio Reguero
FNAL:.............................. Joe Kaiser
ATLAS:............................ Simone, Alessandro
Reports were not received from:
· WLCG T1 sites:............................. NDGF; NIKHEF; SARA
· VOs:................................................ LHCb
· EGEE ROCs (production sites):. North Europe; Russia
· EGEE ROCs (PPS sites):........... Italy; NE; Russia; SWE; UKI; AP
Feedback on last meeting's minutes
No comments during the meeting
From ROC DECH (backup: ROC SouthEast Europe) to ROC UK/I (backup: ROC AsiaPacific)
close : 52
1st mail: 33
Quarantine : 20
2nd mail : 9
Alarms are raised even when the certificate has not yet expired; why does the test raise alarms in that case? This will be changed to a notification one week before expiry; please report a bug.
Clemens: These are new alarms for COD.
Nick: Do you have any suggestions about it?
Clemens: I would say that these tests should not be critical at the moment.
It is generating too much noise and traffic in the CIC-on-Duty list.
Judit: We’ve got an external request to make these tests critical and of course that increases the load. I think that you should open a ticket describing whether the alarms are being incorrectly generated, and we’ll have a look and fix it.
Maite: This request also came from the operations meeting and ROC managers.
Marcin: I thought that this host certificate test would generate an alarm only if the host certificate has expired, but it is generating alarms before the expiration.
Judit: That is the idea, and I would like to ask you to open a GGUS ticket about this.
Nick: Can the COD team provide the number of tickets created for host certificates that have not yet expired?
The problem is actually not whether the certificate has expired: tickets are being generated for timeouts in the tests, not only for expired certificates.
Nick: Please submit GGUS tickets about these problems and we will fix them ASAP.
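The behaviour discussed above (alarm only on real expiry, a notification one week before, and no certificate alarm when the test merely times out) could be sketched as follows. This is a minimal illustration and not the SAM test itself; the function name, the result labels and the `timed_out` flag are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumption: "one week before" expiry, as mentioned in the minutes.
WARNING_WINDOW = timedelta(days=7)

def classify_cert_check(not_after, now, timed_out=False):
    """Classify a host certificate check (hypothetical helper).

    not_after: expiry date of the certificate
    now:       time at which the test ran
    timed_out: True if the test could not reach the host in time
    """
    if timed_out:
        # A timeout says nothing about the certificate, so it is not
        # reported as an expired-certificate alarm.
        return "unknown"
    if now >= not_after:
        return "alarm"          # certificate has really expired
    if not_after - now <= WARNING_WINDOW:
        return "notification"   # expires within a week: warn only
    return "ok"
```

For example, a certificate expiring on 29th May checked on 14th May would be classified as "ok", and the same certificate checked on 24th May would yield a "notification".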
Extract from agenda:
· PPS-Update 29 released to the PPS. This contains, among others, the following high-priority patches:
o #898 LCG-CE modifications for DGAS support
o #1144 R-GMA Server fix for bugs #21558, #20090 and #23052
o a new version of the gLite 3.1 Worker Node (glite-WN-3.1.0-3) for SL4/i386 which addresses all known issues.
· Integration of SRM2.2 test SEs into the PPS progressing:
o CERN_PPS is for the time being publishing end-points in US in the information system
o SAM tests are being submitted to all published SRMs.
o Atlas transmitted some requirements on FTS channels for preliminary tests. They are being implemented at CERN_PPS
o In addition to the sites originally involved in the SRMv2 pilot testing, also PPS sites PIC, IFIC, CNAF, Birmingham, DESY, FZK are getting involved in this activity
· Release process improved: from next week, 6 PPS sites will perform pre-deployment testing in the PPS. Mario David, at LIP, is coordinating this activity.
· Hand-over of the SAM PPS service to PPS-CYFRONET and PPS-RAL started (completion date: 8th June)
Administrators of SAM Admin's Page (SAMAP) requested PPS to dedicate two services (BDII and WMS) to support SAMAP service redundancy.
The request is reasonable and so we are asking here for any PPS sites to volunteer to provide these services.
Issues coming from the ROCs
· UPDATE 29 - FTS2 migration: The DB schema migration script can be run only once in the current release, so if it fails for any reason, it needs to be tweaked in order to run again. [ROC CERN]
· UPDATE 29 - VOBOX: VOBOX couldn't be upgraded because of a dependency problem. Bug reported (https://savannah.cern.ch/bugs/?26246). [ROC CERN]
· UPDATE 29 - PreGR-01-UoM: Applied PPS Update 29 on site following the guidelines in the Release Notes. The update caused a number of issues at the site and we are in the process of solving them. [ROC SEE]
EGEE issues coming from ROC reports
1. (ROC CentralEurope): [For information] Installing a top-level BDII on SLC4. We compiled a wiki page with instructions on how to set up a top-level BDII on SLC4: http://wiki.grid.cyfronet.pl/CoreServices/SLC4BDII An instance is running at zeus60.cyf-kr.edu.pl. We plan to put it in the production round-robin DNS this week. Any comments are appreciated.
2. (ROC CentralEurope): A recent YAIM release changed the mapping of SGM users from a single SGM account to a pool of accounts. How should the VO software in the SW_DIR directory be managed at sites now? The VO software must be readable by VO users, so we give the group read access and the sgm user write access to the directory (e.g. mode 0750), but now there are multiple users who need write access to that directory. A document on the impact of moving from single-account mapping to pool accounts, probably written by the YAIM team, would be useful.
Marcin: It seems that the YAIM people have started to implement the recommendation of not mapping many users to a single account; we think there might be some issues related to that. I would like to ask the YAIM team whether they have analysed the impact on the different service configurations, and to advise how to proceed.
INVITE YAIM TO COMMENT OFFLINE AND PREPARE FOR NEXT MEETING
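The permission problem raised in item 2 can be illustrated with a simplified model of POSIX owner/group/other checks. This is only a sketch: the account names (vosgm01, vosgm02, vo001), the group names and the alternative mode 0775 are hypothetical and are not a recommendation from the YAIM team.

```python
import stat

# Permission bits for (owner, group, other), per kind of access.
_BITS = {
    "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
    "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
}

def has_perm(mode, perm, dir_owner, dir_group, user, user_groups):
    """Simplified POSIX check of `perm` ('r' or 'w') on a directory."""
    owner_bit, group_bit, other_bit = _BITS[perm]
    if user == dir_owner:
        return bool(mode & owner_bit)
    if dir_group in user_groups:
        return bool(mode & group_bit)
    return bool(mode & other_bit)

# Old layout: one sgm account owns SW_DIR, VO group may read (0750).
# A second pool account in the same group can read but not write.
old = dict(mode=0o750, dir_owner="vosgm01", dir_group="vo")
first_sgm_writes = has_perm(perm="w", user="vosgm01", user_groups={"vo"}, **old)
second_sgm_writes = has_perm(perm="w", user="vosgm02", user_groups={"vo"}, **old)

# One possible alternative (assumption): a dedicated sgm group with
# group write, and read access for ordinary VO users via "other" (0775).
new = dict(mode=0o775, dir_owner="vosgm01", dir_group="vosgm")
pool_sgm_writes = has_perm(perm="w", user="vosgm02", user_groups={"vosgm"}, **new)
plain_user_writes = has_perm(perm="w", user="vo001", user_groups={"vo"}, **new)
plain_user_reads = has_perm(perm="r", user="vo001", user_groups={"vo"}, **new)
```

In the alternative layout the software becomes readable by every local account, not only by VO members, which is one of the trade-offs a site would have to weigh.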
3. (ROC France/IN2P3-CC): Could YAIM be improved to allow the publication of several sub-clusters per CE? GlueSubCluster defines the maximum memory a job may use, so if we could declare several sub-clusters, we could set a memory size limit per type of queue. For example, with only one sub-cluster per CE, we cannot currently express that the memory size of the medium queue is less than the memory size of the long queue. This is the reason for a lot of ATLAS job failures (as discussed with Simone Campana).
Nobody from YAIM. To be followed off line
4. (ROC SouthEasternEurope): We would appreciate an update from SA3/JRA1 regarding the status of the development / certification of SL4-based MW, both 32-bit and 64-bit. An indicative (or estimated) roadmap would also be helpful for us to plan ahead, as we've stopped deploying new application software in our regional VO while waiting for the major upgrade / switch to SL4, because it affects user/application software as well.
Nick: For now, the only thing we have in pre-production is the WN and I think that the UI is coming soon. Apart from that, I have no other information. ADD TO THE AGENDA FOR NEXT MEETING.
5. (ROC UK/I): Technical issues to do with the email that CIC-Portal Alarms send:
a) The From field should be CIC-Portal@in2p3.fr and not just CIC-Portal. Otherwise intervening mail relays add their own spurious @host info and so the mail can be misidentified by mail browsers.
b) All emails from CIC-Portal, and in2p3.fr generally, are given a Spam-Assassin rating of DNS_FROM_RFC_ABUSE 0.37, plus whatever other spam score the contents of the message might incur. This would be avoided if in2p3.fr got itself de-listed from www.rfc-ignorant.com - that shouldn't be hard!
Osman: About the first point, I’m going to change it and I will contact our network team about the second one. ACTION.
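Point a) can be illustrated with Python's standard email library: a bare "CIC-Portal" in the From field has no domain part, which is what lets intervening relays append their own host. The sketch below is only an illustration of the check; the helper names and the recipient address are hypothetical, and this is not the CIC-Portal code.

```python
from email.message import EmailMessage
from email.utils import parseaddr

def is_fully_qualified(from_header):
    """True if the From header carries a local part and a domain."""
    _, addr = parseaddr(from_header)
    local, sep, domain = addr.partition("@")
    return bool(local and sep and domain)

def build_alarm_mail(recipient, subject, body):
    """Build an alarm mail with a fully qualified sender (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = "CIC-Portal@in2p3.fr"  # address requested in item a)
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

mail = build_alarm_mail("site-admin@example.org", "Alarm", "Test failed.")
```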
Philippa: Could a timestamp and name be added to the RC reports so we can see who makes the comments? Yes, it will be done.
6. (ROC UK/I): Spam from "project-lcg-" mailing lists is currently at about 1 per hour, predominantly on project-lcg-security-* and project-lcg-vo-*. What is being done about this? E.g. change the names of these mailing lists, and then keep them quiet.
Nick: I can’t think of any solution for this problem. We’ll pass the comment to the list owners.
Site Reports vs. Availability Reports (WLCG tier-1 sites)
Clarification on Site Reports vs. Availability Reports for WLCG tier-1 sites:
The availability reports do not replace the site reports. They are complementary reports. Every site is still required to send the weekly site report, covering middleware updates, interventions, issues, main operation activities, etc.
The availability reports are currently only requested from tier-1 sites, and they should explain any unavailability period of 2 or more hours. Please note that the RC reports have 2 different text fields, one for each kind of report.
Tier 1 reports
The WLCG tier-1 site reports for this week can be found here: http://indico.cern.ch/materialDisplay.py?subContId=0&contribId=2&materialId=0&confId=16286
WLCG issues coming from ROC reports
(ROC SouthEasternEurope): GR-05-DEMOKRITOS reported that a CMS user is sending them too many jobs that simply sleep. The reply from the user was that he was trying to do a stress test of the WMS. SEE ROC believes that this is wasting production resources (CPU slots) and that this kind of test should be done in the pre-production service, not the production one. We are bringing this issue to the ops meeting because the user did not withdraw his jobs as he promised in his correspondence with the site admins, and we've had no reply to the ticket opened in GGUS. More info in the related GGUS ticket: https://gus.fzk.de/pages/ticket_details.php?ticket=21715
Kostas: I feel that by doing that we are not using the PPS as much as we could. I don’t think that this is the best way to test a WMS in production.
Nick: The problem is that in PPS we don’t have the scale to do this kind of stress test.
Simone: Do you agree that a VO can submit any kind of executable as long as it does not break the VO rules? This is not the case here. If a site doesn’t want to participate in these tests, it is just a matter of sending an email and we can remove it from the list.
Kostas: It is not a matter of participating or not. I just would like to see a better use for the resources.
Daniele: I will contact the user and talk to him in order to solve this problem.
Upcoming WLCG Service Interventions
Change of LCG VOMS certificate: the certificate expires on 29th May, and it was decided to change it on 24th May. 3 announcements will be sent. The RPMs will be released tomorrow.
FTS service review
See agenda for reports.
Gavin: We will choose a site per week and review its FTS service. The site of the week is CNAF.
We see ~60% failures for files transferred to CNAF. Is there an update on the status of the Castor 2 limitations, plus a timeline to solve them?
Daniele: The failures in FTS are caused by intermittent failures of some storage servers. There are other Castor-related issues, which I will ask about and report on.
ASGC: a lot of time is spent writing to the SRM.
Min: Castor 1 or 2? The production one. Min will check offline with the storage experts and mail a summary to Gavin.
See wiki pages (https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations) for more information.
1) Can we have an update on the status of job priorities in PPS (in particular, the configuration of the DENY tags)?
There is nobody from Valencia connected. We’ll contact them offline.
2) Can we have an update on the deployment (in PPS and/or production) of the LFC and DPM supporting secondary groups?
It is under certification.
Processing: 'Spring07' MC production continues (with CMSSW_1_3_1, 20M events produced in the last 13 days: 46M evts/month rate).
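The quoted monthly rate follows from scaling the 13-day total to a month. A minimal check of the arithmetic, assuming a 30-day month (the minutes do not state which month length was used):

```python
def monthly_rate(events, days, days_per_month=30):
    """Scale an event count over `days` to a per-month rate.

    days_per_month=30 is an assumption made for this example.
    """
    return events / days * days_per_month

# 20M events in 13 days, expressed in millions of events per month.
rate_in_millions = monthly_rate(20e6, 13) / 1e6
```

With these inputs the rate comes out at a little over 46 million events per month, consistent with the figure in the report.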
Last week was week 3 of Cycle 3 of the CMS LoadTest07. ~600-1000 MB/s of aggregate transfers over the WAN, ~300-600 MB/s of aggregate T1->T2 transfers (~30 T2s participating). Details at:
No report this week.
Nick: ALICE ran tests of the information system on the grid. Can you give us a brief comment on that?
Patricia: We created daemons for the four VOs and gathered job information from the information systems and from the batch systems, and we saw some big differences, sometimes of the order of hundreds of jobs.
Nick: Do you know what the timeout is for the batch system information to be reflected by the information system?
Patricia: It was about 10 minutes.
Patricia: We also realized that sometimes the LB provides unreliable information. In one situation it reported thousands of jobs when it should have reported only a few hundred.
Nick: Can you send the link for these reports?
Patricia: Yes, I will do it:
Sven: Are those hundreds of jobs for a specific VO or for different VOs?
Patricia: They were related specifically to ALICE.
Service Challenge Coordination
WLCG Collaboration workshop September 1-2 2007, Victoria, BC, Canada (co-located with CHEP 2007)
Nothing to report.
Review of action items
The updated list of action items can be found attached to the agenda.
** Please be aware that hotel accommodation in Stockholm is becoming limited for this week, so please register and book your hotel room as soon as possible. **
- Registration and general information:
The next meeting will be Monday, 21st May 2007 15:00 UTC (16:00 Swiss local time).
Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 15:00 UTC.
The WLCG section will start at the fixed time of 16:30.
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610