EGEE-WLCG-OSG operations meeting
7th May 2007
Agenda
The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=15963
Attendance:
OSG grid operations:........ Absent
EGEE
Asia Pacific ROC:............. Min Tsai
Central Europe ROC:........ Marcin Radecki
OCC / CERN ROC:........... Maite Barroso, John Shade, Alexandre Duarte, Nick Thackray
French ROC:.................... Helene, Osman, Rolf, Pierre
German/Swiss ROC:......... Clemens Koerdt
Italian ROC:...................... Alessandro Cavalli, Alfredo
Northern Europe ROC:...... Julius Wolfrat
Russian ROC:................... Lev Shamardin
South West Europe ROC:. Gonzalo Merino
South East Europe ROC:.. Kostas, Ioannis Liabotis
UK/Ireland ROC:............... Absent (apologies received)
GGUS:............................. Clemens Koerdt
OSCT:.............................. Absent
WLCG
WLCG service coord......... Harry Renshall
WLCG Tier 1 Sites
ASGC:............................. Min
BNL:................................ Absent
CERN site:....................... Yvan Calas
FNAL:.............................. Joe Kaiser
FZK:................................ Clemens
IN2P3:.............................. Pierre
INFN:............................... Paolo, Alfredo
NDGF:............................. Absent
PIC:................................. Gonzalo Merino
RAL:................................ Absent (apologies received)
Sara/NIKHEF:................... Jules, Ron
TRIUMF:........................... Absent
VOs
Alice:............................... Absent
ATLAS:............................ Simone, Alessandro
CMS:............................... Daniele
LHCb:.............................. Roberto
Reports were not received from:
· WLCG T1 sites: INFN, TRIUMF
· VOs:
· EGEE ROCs (production sites): Italy
· EGEE ROCs (PPS sites): Italy, Russia
Feedback on last meeting's minutes
No comments during the meeting
EGEE Items
Grid-Operator-on-Duty handover
From ROC Central Europe (backup: ROC AsiaPacific) to ROC DECH (backup: ROC SouthEast Europe)
Maite: I don’t see the reports from the last meeting. Please send them even when there is no operations meeting.
Tickets:
· Backup team
Opened new: 17
Closed: 31
New:
Quarantine: 14
2nd mail: 10
Extend: 22
· Issues:
1. Cannot use " " in the contents of a ticket.
Min: It is not a critical problem.
Maite: Please, open a ticket to GGUS.
Min: Ok. I will do that.
2. Some sites' SAM results did not update (alert #21446).
Osman: We had a problem with the web services and many SAM results were not updated. We fixed it this morning.
PPS reports
Extract from agenda:
o PPS-Update 29 released to the PPS. This contains:
§ 898 LCG-CE modifications for DGAS support
§ 1046 Condor plugin for lcg-info-dynamic-scheduler
§ 1079 Missing dependency for glite-CE
§ 1113 lcg-infosites obsoleting lcg-info-api-ldap
§ 1121 LFC/DPM 1.6.4-3
§ 1124 R-GMA Server fix for NumberFormatError
§ 1144 R-GMA Server fix for bugs #21558, #20090 and #23052
§ 1147 glite-yaim-3.0.1-15
o A meeting with all PPS sites (VRVS or phone conference) is being scheduled. The tentative date is: Thursday 03 May 2007 from 15:00 to 16:3
The preliminary agenda is available at http://indico.cern.ch/conferenceDisplay.py?confId=15191
o Issues coming from the ROCs
1. None this week.
Phase out of classic SE
Nick: We have been discussing phasing out the classic SE for a long time, but we realized that we cannot do it right now: the classic SEs provide a functionality (a POSIX interface to the local file system) that DPM will not provide before the end of this year.
EGEE issues coming from ROC reports
1. (ROC UKI, from last week): Do ad hoc, site-submitted SAM tests get published into the database used to calculate site availability?
Judit: Yes, but only when the tests are submitted by the OPS VO.
Clemens: How is the availability calculated? If a test fails and we immediately send another test from the SAM admin page, what happens?
Judit: I can’t say exactly how the numbers are calculated, but the availability is calculated hourly.
2. (ROC SEE): Release notes of Update 23 to gLite were not complete yet again: https://gus.fzk.de/pages/ticket_details.php?ticket=21392 The quality of release notes to updates should be improved, and they have to reflect actions that need to be taken on production sites, i.e. on services already running, and not just to consider deployment of new services. Transparency of recent updates (specifically 21 and 22) is highly dubious, since we encountered problems that caused loss of jobs. This needs to be highlighted in the release notes. Another example: https://gus.fzk.de/pages/ticket_details.php?ticket=21155
Maite: We agree that it is important to have complete release notes, and the release team is working hard on it. They answered the two tickets promptly.
Kostas: As the tickets were answered quickly the problem was solved.
3. (ROC SWE):
a) It would be nice to get the VO configuration template files for the new yaim (vo.d structure) from each VO to prevent misconfiguration.
Gergely: We don’t think we understood this question. Can you explain it better?
Gonçalo: It is related to the new YAIM, where there is a different file for each VO. It would be better if we could get this information as files instead of VO cards, so that everybody would have the same configuration file for each VO.
Gergely: We had a tool that VO administrators were supposed to use to create these configuration files, but the data in it became outdated. We can check again and try to update the tool and bring it back into use.
Helene: Jules was working on a tool to create these configuration files from the VO cards and make them available for download in the CIC site. I will check what the status of this tool is and I will forward it to the CIC mailing list (action 34).
Maite: We also need to verify that all the parameters generated by the tools are correct.
Helene: We can ask the VO administrators to check if the parameters in the VO cards are correct and if they are correctly exported to the files.
Maite: I would like to ask all VO managers present to check if these parameters are correct.
b) It would be nice to upgrade the documentation of gLite including description of the new yaim version. Small and new sites had problems deploying it.
Gergely: I didn’t understand this question either. Which documentation are you referring to?
Gonçalo: It is difficult for me to provide more detail now, but we will send it by mail later.
Kai: It is related to specifying which version of YAIM should be used for the installation.
4. (ROC SEE): Another problem is continuing nightmare with gCE SAM tests, which are mainly due to SAM WMS problems. We opened two GGUS tickets on this, but still there are no clues on what can be causing those problems: https://gus.fzk.de/pages/ticket_details.php?ticket=20732 https://gus.fzk.de/pages/ticket_details.php?ticket=21454 We even observe gCE SAM test where all individual tests pass with OK, but the overall status is JS. While the following GGUS ticket hints to a source of problems, we think that there may be other problems related to rb108.cern.ch where all these failures occur: https://gus.fzk.de/pages/ticket_details.php?ticket=20625.
Judit: We are investigating this issue since it really seems to be a problem with the WMS. We removed that problematic WMS and are not using it anymore to test submission.
Maite: Anything else from SEE?
Kostas: No. It is ok.
5. (ROC SWE): We still see the intermittent error "BDII Connection Timeout: bdii.pic.es:2170" from the replica manager test, but this only happens with the replica manager client. How is this related?
Judit: This could be related to network connections or timeout configuration, and the BDII at PIC should be investigated.
Per: We had the same problem with our national top-level BDII and we solved it by cutting some indexes in the BDII. I can send a link to a wiki page describing how to do that.
6. (ROC SWE): The gLite error "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE" is handled by the COD people as a site problem, even though it should be seen as a middleware problem.
Maite: Does anybody from the COD team have any reaction to this comment?
Min: We didn’t discuss it.
Maite: I think that this kind of issue should be discussed by the COD before it goes to the operations meeting. Is this fine with SWE?
Kai: Yes, this is fine.
Maite: Please, discuss this offline with Min.
Kostas: We also had this problem, and it was related to the Resource Broker and not to the site.
7. (ROC UKI): The other point we wish to raise is that trying to "research" through the web obscure error messages thrown by the middleware does not seem to be a useful or efficient way to tackle problems. This has been raised in the past and error messaging hasn't improved at all. Now that service level targets are becoming more important and are going to be based on SAM test results, being tagged red with a meaningless error message thrown by the middleware is quite unhelpful and won't necessarily reflect correct figures of a site's availability.
Maite: We will raise it again to know what is being done about it.
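As an illustration of item 3(a) above, a per-VO file for the vo.d structure could be generated from VO-card data roughly as follows. This is only a sketch under assumptions: the function name render_vo_file, the variable names and the example values are hypothetical, and the real set of variables is whatever the installed YAIM version expects.

```python
def render_vo_file(settings):
    """Return the text of a per-VO settings file: one KEY='value' line per entry."""
    lines = []
    # Sorted keys give a stable order, so files from different sites can be diffed.
    for key in sorted(settings):
        value = str(settings[key])
        # Values are written inside single quotes; refuse anything that would break them.
        if "'" in value or "\n" in value:
            raise ValueError(f"unsupported character in value for {key}")
        lines.append(f"{key.upper()}='{value}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical VO-card data; names and values are placeholders only.
    example_card = {
        "sw_dir": "/opt/exp_soft/demo",
        "default_se": "se.example.org",
    }
    print(render_vo_file(example_card), end="")
```

Generating the file from one central source, rather than copying values by hand from the VO card, is what would give every site the same configuration for a given VO.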
WLCG Items
Tier 1 reports
The WLCG tier-1 site reports for this week can be found here:
Plans for SRM v2.2 deployment in production
We are preparing for SRM v2.2 deployment in production.
For certification purposes we need sites to configure 2 REPLICA-ONLINE test spaces of 200 MB each, with the space token descriptions dteam_test1 and dteam_test2.
Maite: It should be covered by Flavia but unfortunately she couldn’t attend the meeting. If you have any questions about it we can discuss this next week. She asked for volunteers able to deploy the SRM 2.2 in production as soon as possible and to have this configuration ready to start the tests.
WLCG issues coming from ROC reports
Nothing this week.
Upcoming WLCG Service Interventions
Nothing this week.
FTS service review
See agenda for reports.
Gavin: We will choose a site per week and review its FTS service. The site of the week is FZK. We found some problems related to open connections. The problem is known and the dCache developers are working on it.
Clemens: If you want to know details I have to talk to our dCache people here and then come back to you. (after the meeting it was agreed that Doris Ressmann will contact Gavin).
ATLAS service
See wiki pages (https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations) for more information.
MC Production: the fix for the Job Priority (publication of DENY tags) has been successfully tested at NIKHEF. Tests are ongoing for Valencia (pre-production). If the latest tests are successful as well, we will ask for the fix to be pushed to the rest of Pre-Production early this week.
CMS service
· Job processing: 'Spring07' MC production based on CMSSW_1_3_0 started on April 25th and is in progress. All CMSSW_1_2_0 datasets and 95% of Spring07 GEN_SIM have already been migrated to DBS-2. CMSSW_1_3_1 production is also progressing (>5 Mevts in 7 days).
· Production data transfers: Spring07 GEN-SIM data must be shipped out of CERN/FNAL to make room for HLT processing: ~4 TB of data have been shipped; main destinations: FZK, ASGC, Legnaro, CNAF, Florida.
· Test data transfers: Last week was week 2 of Cycle 3 of the CMS LoadTest07 [*]. Since Tuesday noon (GVA time), ~800-1000 MB/s of aggregate transfers over the WAN.
[*]http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
LHCb service
Point 1.
Instability of SRM endpoints at T1.
The reconstruction activity, after a few days in which all T1 sites were running happily, started to degrade because the SRM response became very slow (srm-get-metadata or lcg-gt for staged files takes a long time). RAL and CERN are currently suffering from this problem: they are not doing as well as last week. Could sysadmins investigate?
IN2P3 seems to be much better since this morning.
NIKHEF and CNAF are OK.
We observed that a reboot of SRM would cure all problems.
Point 2.
Sites should have a sensor that regularly runs srm-get-metadata on an existing test file and measures the time it takes. If the response is slow, the sensor should trigger an alarm at site level; a similar test might also be part of the SAM test suite.
Gonçalo: We have a lot of files on tape, and I would like to know whether the tests will monitor and compare the latency of accessing files on disk only, or also files on tape.
Roberto: No, Gonçalo, we are querying only the file metadata. We are not accessing the files in any way. We are aware of the delay of the storage system.
Gonçalo: We had some problems with LHCb files being staged on tape and highly fragmented.
Roberto: I believe that it must be related to the pre-staging system that we have been using for a while. Now we are only checking the SRM so it will not affect our results.
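The sensor proposed in Point 2 could look roughly like the sketch below: it runs one probe command, measures the elapsed time and classifies the result. The function name timed_probe, the thresholds and the placeholder command are assumptions for illustration; a site would substitute its own metadata query on a known test file and hook the result into its own alarm system.

```python
import subprocess
import sys
import time


def timed_probe(command, slow_after_s, timeout_s):
    """Run command once and return (status, elapsed_seconds).

    status is "OK", "SLOW", "ERROR" or "TIMEOUT".
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The probe never answered within the hard limit.
        return "TIMEOUT", time.monotonic() - start
    except OSError:
        # The probe command could not be started at all.
        return "ERROR", time.monotonic() - start
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return "ERROR", elapsed
    if elapsed > slow_after_s:
        return "SLOW", elapsed
    return "OK", elapsed


if __name__ == "__main__":
    # Placeholder probe (assumption): replace with the site's own metadata
    # query on an existing test file.
    probe = [sys.executable, "-c", "pass"]
    status, elapsed = timed_probe(probe, slow_after_s=10.0, timeout_s=60.0)
    print(f"{status} {elapsed:.2f}s")
```

Run from a scheduler at a fixed interval, anything other than "OK" would be the trigger for the site-level alarm. Because the probe only queries metadata, the measured time does not include tape staging latency.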
ALICE service
ALICE has just updated the AliEn version to v2.13. We are updating the sites. The open issue I had last week (related to the test of different information providers: IS, GRIS, LB and batch system) can be closed.
We have been printing this information via MonALISA for weeks, and I have announced it via the support-eis list.
Service Challenge Coordination
Nothing to add.
OSG Items
Nothing to report.
Review of action items
The updated list of action items can be found attached to the agenda and also here.
AOB
Next Meeting
The next meeting will be Monday, 14th May 2007 15:00 UTC (16:00 Swiss local time).
Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 15:00 UTC.
The WLCG section will start at the fixed time of 16:30.
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610