EGEE-WLCG-OSG operations meeting
18th June 2007
The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=17780
OSG grid operations:........ Rob Quick
Asia Pacific ROC:............. Min Tsai
Central Europe ROC:........ Marcin Radecki
OCC / CERN ROC:........... Antonio Retico, Nick Thackray, Steve Traylen, Maite Barroso, Yvan Calas
French ROC:.................... Hélène Cordier
German/Swiss ROC:......... Clemens Koerdt, Sven Hermann
Italian ROC:...................... Luciano Gaido, Paolo
Northern Europe ROC:...... Jeff
Russian ROC:................... Lev
South West Europe ROC:. Gonzalo
South East Europe ROC:.. Kostas
UK/Ireland ROC:............... Philippa, Andy Newton, Jeremy Coles
WLCG service coord......... Jamie
WLCG Tier 1 Sites
ASGC:............................. Min Tsai
CERN site:....................... Yvan
FNAL:.............................. Jo Kaiser
RAL:................................ Derek Ross
ATLAS:............................ Simone, Alessandro
Reports were not received from:
· WLCG T1 sites:................... INFN, NDGF, PIC
· VOs:...................................... Alice, CMS
· EGEE ROCs (prod sites):.. North Europe
· EGEE ROCs (PPS sites):..
Feedback on last meeting's minutes
No comments during the meeting
COD handover: from ROC SEE (backup: ROC DECH) to ROC Asia Pacific (backup: ROC UKI)
· New tickets opened:
Steve: This site was already invited two weeks ago.
Kostas will follow-up with the site.
Nick: gLite 3.1 UI (SL4) was delivered to PPS. Currently undergoing pre-deployment testing.
Issues coming from the ROCs
(ROC DECH): Where can a downtime procedure for PPS be found?
Antonio (PPS): A downtime procedure was never agreed nor published. Some sites follow the production one, but it is not always appropriate; this needs further thought. The topic will be included in the review of the PPS/COD interactions to be done in the next two weeks.
Both Philippa and Hélène pointed out that there is a strong need for agreed procedures for CODs regarding PPS.
Nick: A proposal will be drawn up by the people involved in this discussion (Philippa, Hélène, Antonio, Nick, Clemens, Cyril) and the conclusions will be presented at the operations meeting.
EGEE issues coming from ROC reports
1. (ROC Central Europe): From the previous week: Critical Problems with DPM due to SAM submission certificate role change. Advice to sites: http://wiki.grid.cyfronet.pl/DPM_workaround_for_SAM_jobs_certificate_role_change
Maite: the issue can be fixed either by applying the workaround described in the wiki or by upgrading DPM to a version higher than 1.6.4.
CE: Reply accepted
2. (ROC Central Europe): Information for sites: in 2 weeks CYFRONET will start testing the infrastructure of our SAM replica; that means we will start sending SAM tests to all sites in the same manner as CERN's SAM does now. It will be done under the DTEAM VO. We will send a broadcast to all sites as soon as the exact dates are known. No particular load from these SAM jobs is expected, and the results of these jobs will not be taken as a problem indicator - this is only to test CYFRONET's infrastructure.
Marcin: The failover replica for SAM is almost finished and we are about to start submitting jobs. All sites should therefore expect jobs arriving with certificates issued by the Polish CA.
3. (ROC Italy): WN and SLC4 issues:
a) What is the situation on PPS? Are all VOs ready for moving to SL4? Do we have to contact each single VO to be sure of that (I'm not thinking only about LHC VOs...)?
b) We are working on a deployment plan for Italian sites to minimize the impact. What are the suggested/possible scenarios for migrating WNs to SL4? Any experience from Tier-1/Tier-2 sites is much appreciated.
4. (ROC Italy): CLASSIC_SE support: When will the Classic_SE profile be phased out? Is there a procedure for moving to DPM or dCache?
ANSWER: included in minutes http://indico.cern.ch/conferenceDisplay.py?confId=16563
Paolo: Is there any procedure to move from a classic SE to DPM?
Steve: The procedure is very well documented and a lot of sites have already applied it.
5. (ROC Italy): ACCOUNTS SGM and PRD: A broadcast from Maarten Litmaath ("sgm/prd account mapping vs. new YAIM") announced a new version of YAIM that will allow sites to keep using the traditional mapping of sgm/prd users to static accounts. We need more details about how it works in order to give better support to our sites. Is it available in PPS? Is there any documentation?
YAIM: The mapping of sgm and prd to static accounts is now available in the latest certified YAIM patch (https://savannah.cern.ch/patch/?1193, yaim 3.0.1-21). It will go into the PPS next Monday (25 June).
Luciano (ROC IT): The schedule for the release is acceptable, but we would need to study the YAIM function in advance in order to apply the changes in our customised version.
Antonio: As the YAIM function is not supposed to change during its pre-production phase, it is safe either to download it from the PPS repository or to take it directly from CVS.
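To give an idea of what the traditional static mapping looks like, YAIM drives it through the groups.conf and users.conf files. The fragment below is purely illustrative (VO, account names, UIDs and GIDs are invented); sites should take the actual format from the YAIM documentation shipped with the patch:

```
# groups.conf -- map VOMS FQANs to account classes (illustrative sketch)
# Format: "FQAN":group:gid:account_class:
"/atlas/ROLE=lcgadmin":::sgm:
"/atlas/ROLE=production":::prd:
"/atlas"::::

# users.conf -- static account definitions (UIDs/GIDs invented)
# Format: UID:LOGIN:GIDs:GROUPs:VO:FLAG:
30001:atlassgm:3000,3001:atlassgm,atlas:atlas:sgm:
30002:atlasprd:3002,3001:atlasprd,atlas:atlas:prd:
```

With such a configuration, jobs presenting the sgm or prd FQAN are mapped to the single static account instead of a dynamic pool account.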
6. (ROC SouthEast Europe): Detailed notes on installation and configuration process of Native SL4 gLite 3.1 wns are available on EGEE-SEE Wiki: http://wiki.egee-see.org/index.php/SL4_WN_glite-3.1
Tier 1 reports
The WLCG tier-1 site reports for this week can be found here: http://indico.cern.ch/materialDisplay.py?subContId=0&contribId=3&materialId=0&confId=17780
Maite: Special mention to BNL and SARA for the high quality of their reports.
Jamie: Very valuable reports, much appreciated by the quality group.
TCG proposal and plans for job priorities and YAIM
Extract from agenda:
Very Short term: the FQAN VOViews should disappear from the information system. The VO:atlas view will then show inclusive information for ATLAS jobs submitted with any role. This means the FQAN VOViews should no longer come with the default YAIM configuration (action for SA3) *and* they should disappear from the sites which have already deployed them, both via YAIM and by hand (action for SA1).
Short term: the DENY tag short-term solution should be considered. This means the official EGEE path for certification and deployment should be followed. The deny-tags approach should be tested in PPS and, once proved to work, cautiously deployed, starting from NIKHEF, then the other T1s, one by one, coordinated with the experiments for testing.
(not too) Long term: the job priority mechanism should be reconsidered, also taking into account the scalability issues of the current mechanism.
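As a rough illustration of what the Very Short term change means in the information system: sites currently publish per-FQAN GlueVOView entries whose access-control rules carry VOMS FQANs, and after the change only the plain VO:atlas view should remain. The sketch below flags FQAN-based views; the sample access-control rules are invented for illustration (in practice they would come from an ldapsearch against the site BDII):

```python
# Sketch: detect FQAN-based VOViews among Glue 1.x access-control rules.
# Sample values below are invented; real GlueCEAccessControlBaseRule
# values would be read from the site BDII.

SAMPLE_ACBRS = [
    "VO:atlas",                      # plain VO view -- should stay
    "VOMS:/atlas/Role=production",   # FQAN view -- should disappear
    "VOMS:/atlas/Role=lcgadmin",     # FQAN view -- should disappear
]

def fqan_views(acbrs):
    """Return the access-control rules that denote per-FQAN VOViews,
    i.e. VOMS rules rather than plain VO:<name> rules."""
    return [rule for rule in acbrs if rule.startswith("VOMS:/")]

if __name__ == "__main__":
    for rule in fqan_views(SAMPLE_ACBRS):
        print("FQAN VOView to be removed:", rule)
```

The plain `VO:atlas` rule is left untouched, matching the intention that the inclusive VO view absorbs jobs submitted with any role.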
It is important for ATLAS that the Very Short term solution is implemented as soon as possible.
Simone: By "short term" ATLAS means that we expect SA1 to arrange things so as to remove the tag no later than 2 weeks from now.
Sven (ROC DECH): A list of sites where to remove the tag would be useful.
Simone: A list of sites will be compiled and sent (action 39).
Antonio: Suggests opening a GGUS ticket (to be replicated for all the ROCs) in order to monitor the change process.
WLCG issues coming from ROC reports
The issue regarding WN and SL4 was discussed at this point:
Nick: The deployment of the natively compiled gLite 3.1 SL4 WN and UI is in progress (WNs delivered in production, UI currently in PPS). HEP VOs were asked to check the WNs against their software. Non-HEP VOs were not explicitly asked, as we could not wait for all VOs to test.
One issue found by ATLAS, due to a missing OS library on the WN (not required by the middleware), triggered a (reprise of the) discussion about the way dependencies in the experiments' software are dealt with. Several suggestions have been put forward to help the VOs, but so far the VOs are still supposed to deal directly with the sites to solve possible local problems.
Luciano (CNAF T1): It is very important to know what the problems are for VOs running specific software. The proposal of having one "dummy" package per VO in a repository, where the dependencies are defined, would be of great help for sites.
Nick: People to study the technical solutions are currently on leave, so the proposals put forward so far are still purely speculative.
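Since the proposals are still speculative, the "dummy package per VO" idea could, for example, take the form of an empty RPM that only declares the OS libraries the experiment software needs. The spec fragment below is a hypothetical sketch (package and library names invented for illustration), not an agreed format:

```
# vo-atlas-deps.spec -- hypothetical empty RPM carrying only dependencies
Name:      vo-atlas-deps
Version:   1.0
Release:   1
Summary:   OS dependencies required by the experiment software (illustrative)
License:   n/a
BuildArch: noarch

# The experiment lists its OS-level requirements here; installing this
# package on a WN pulls them in via the normal package manager.
Requires:  libgfortran
Requires:  compat-libstdc++-33

%description
Empty package: exists only to declare the experiment's OS dependencies.

%files
```

A site would then simply install the VO's dependency package on its WNs instead of negotiating individual library lists with each experiment.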
Luciano (CNAF-T1): We are under high pressure from the experiments for the installation of the new OS, but official information about the VOs' ability to run on the SLC4 OS is still missing.
Paolo (CNAF-T1): There are two ways of organising the migration to SLC4 for our site:
1) One-shot migration (the whole farm). This is not possible if we are not sure that all the VOs are able to run; an official statement from the VOs is missing here.
2) Gradual migration: additional CEs with queues supporting SLC4 WNs, to which only the experiments which are "ready" are directed.
Question: Do the HEP VOs need to be able to continue working also on SL3?
Maite: Before the discussion goes on, I would like to recall that this answer is supposed to come from the Management Board.
Roberto (LHCb): LHCb still has some known issues, which can be managed. It is vital for us that SL4 and SLC3 farms be correctly advertised in the information system. On that condition, LHCb is ready to run on SLC4.
Simone (ATLAS): This is not an official statement: as for LHCb, there are versions of the ATLAS software still in use which will never run on SLC4. Again, correctly published information is mandatory.
Luciano: It is possible to have queues for both versions of the OS, but it is a considerable overhead for sites, so we would like to receive from the VOs as many hints as possible, to help us rule out unneeded scenarios.
Luca dell'Agnello (CNAF-T1): In the last GDB Markus (SA3) said that the middleware for WNs was officially released. Why did the experiments not finish testing it in PPS? The milestone for sites (this month) has not been changed on account of that.
Maite: It is however reasonable to expect that big T1s maintain both versions for a while. This was done at CERN. One-shot migration is not practical for T1s.
Nick (PPS): We have done everything possible to encourage the experiments to test, but the experiments are also busy (they are running other tests in PPS as well). In particular the issue with the library was unfortunately found just after the release of the new middleware, but in fact it is a very generic, middleware-independent problem to be addressed by operations, and not specifically related to SLC4.
Roberto (LHCb): Points out that LHCb has been doing not only SLC4 testing, but also StoRM, SRMv2, FTSv2 ...
Upcoming WLCG Service Interventions
· Gavin: The FTS 2.0 intervention at CERN is completed, but the service is not yet back (fragmentation on the server). A broadcast is to be sent.
· FZK/GridKa will be unreachable on 27/6 from 05:00 UTC to 18:00 UTC. Network connections will be restored after 18:00 UTC, but maintenance will continue until 28/6 18:00 UTC. Services are impacted during the whole period from 27/6 05:00 UTC until 28/6 18:00 UTC.
Only for Tier-2 sites using a DPM: questionnaire about file size and file system
We would like to conduct a set of performance tests against different types of file systems. To tune the file system parameters, we need some realistic information.
To this purpose, we would appreciate it if you could fill in the questionnaire here:
Answers to the questionnaire should be sent to email@example.com as soon as possible.
Thanks in advance for your collaboration,
Lana: The purpose of the questionnaire is to gather realistic information in order to steer performance tests and try and optimise basic file system configuration.
Sven (ROC DECH): Was the questionnaire sent via a broadcast to the sites?
Maite: Yes, to all production sites on Friday.
FTS service review
See agenda for reports.
Still problems in the job priority mechanism.
ATLAS sites should NOT upgrade to FQAN VOViews.
Some ATLAS sites reported issues with production and sgm pool accounts as configured by the new YAIM
Simone (ATLAS): Due to the migration to version 0.3 of the distributed DM system, reduced activity from ATLAS is to be expected. In particular, this week no files should be moved within the T0 throughput test. The usual traffic from production is still to be expected.
Not present at the meeting.
GridKa issues to be escalated:
· NFS problem: it has to be optimized, because we experienced that when the number of jobs running on the site gets close to 500, the performance of the system goes dramatically down.
· SRM got overloaded by continuous transfer requests because the disk space got full. The subsequent slowness caused many troubles in the other activities. Would it be possible to fix the SRM implementation (by tuning some parameters) at GridKa so that SRM protects itself against such situations?
Sven (comment from GridKa): The issue has already been raised by the LHCb representative at GridKa. We hope to find a solution, but the problem did not reach the dCache experts before this morning. It is currently under investigation (now checking tape connections). If needed we can free disk space.
Roberto: Freeing space would be only a temporary benefit. As the overload seems to be caused by continuous transfer requests, one suggestion we received (from Maarten) is that a protection against DoS-like load could be enabled in dCache.
Sven : This is a point for developers. The best contact to investigate in this direction is Doris.
Not present at the meeting.
Service Challenge Coordination
Jamie: T0-T1 multi-VO transfer test next week.
Nothing to report.
Review of action items
The updated list of action items can be found attached to the agenda.
The next meeting will be Monday, 25th June 2007 14:00 UTC (16:00 Swiss local time).
Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 14:00 UTC.
The WLCG section will start at the fixed time of 16:30.
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610