EGEE-WLCG-OSG operations meeting
27th August 2007
Agenda
The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=20374
Attendance:
OSG grid operations:......... absent
EGEE
Asia Pacific ROC:............. Min
Central Europe ROC:......... Marcin
OCC / CERN ROC:........... Antonio, Nick, Steve T., Alexandre
French ROC:.................... Pierre, Rolf, Pierre-emanuel, Osman
German/Swiss ROC:......... Clemens
Italian ROC:...................... Paolo, Alessandro
Northern Europe ROC:....... Jules
Russian ROC:................... Lev, Alexander
South East Europe ROC:... Kostas
South West Europe ROC:.. absent
UK/Ireland ROC:............... absent
GGUS:............................. Helmut
OSCT:.............................. absent
WLCG
WLCG service coord.......... Harry
Grid database services...... Maria
WLCG Tier 1 Sites
ASGC:............................. Min
BNL:................................ absent
CERN site:....................... Alexandre
FNAL:.............................. Jo Kaiser
FZK:................................ Clemens, Doris
IN2P3:.............................. Pierre
INFN:............................... Paolo
NDGF:............................. Tore
NIKHEF:........................... Jules
PIC:................................. absent
RAL:................................ Derek, Matt
SARA:............................. Jules
TRIUMF:........................... Rod
VOs
Alice:............................... absent
ATLAS:............................ Alessandro, Simone
BioMed............................ absent
CMS:............................... absent
LHCb:.............................. Roberto
Reports were not received from:
· VOs:......................................
· EGEE ROCs (prod sites):..
· EGEE ROCs (PPS sites):..
Feedback on last meeting's minutes
No comments during the meeting.
EGEE Items
Grid-Operator-on-Duty handover
From: ROC UK/I / ROC CERN
To: ROC Italy / ROC Russia
Issues from Russian ROC:
· We didn't see any activity from the CERN ROC during the last week. There is a huge backlog now.
Nick will follow this up with the CERN ROC.
Migration to SL4 WNs
All Tier-1 sites believe that they have migrated enough WNs to SL4 to fulfil their pledges.
SE Europe (Kostas): When are we going to have the other services on SL4?
Steve: There is a page that gives you the progress on the migration.
Nick: At the pre-CHEP operations workshop it was announced that some services are going to be released in two to three weeks. I will ask Markus to send this information to the list and bring an update to the next Operations Meeting.
PPS reports
Extract from agenda:
Issues from EGEE ROCs:
· SAM Client at Cyfronet has been reconfigured to use glite-wms-* commands. There is an unknown problem during matchmaking of jobs directed to ce110.cern.ch - under investigation (with Ulrich Schwickerath) [ROC CE].
Release News:
No further questions or comments.
EGEE issues coming from ROC reports
1. NE: The SAM job never completes successfully if one of the tests times out or blocks/hangs. For our express queue this results in the SAM job running out of wallclocktime and getting killed, which in turn means no SAM results are being published and the entire site fails for all tests. Is there a way to prevent this from happening, by killing off hanging/blocking tests in the SAM job within a few minutes? Or else it would be nice to know what the expected runtime of the SAM job is, i.e. with what wallclock specifications should it run. Is anything like that specified anywhere?
Answer from SAM team (Piotr): The answer is simple. For the time being we have the following:
timeouts for SAM test jobs:
test timeout: 10 minutes
job timeout: 30 minutes
The first one applies to each individual test, the second one to the whole test job. However, as all the tests are executed in parallel, the job timeout should never be reached unless there is a serious problem with publishing the results.
Both are configurable in the SAM Client config file, and can be changed if needed and agreed.
Nick: Do you know how long the express queue is?
SARA: 10 minutes
Nick: I will pass this info to Piotr.
SARA: If I understand correctly, changing the timeout to 20 minutes should solve the problem. We will try it and see whether it does.
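Note: the behaviour described in the SAM team's answer (tests run in parallel, each with its own timeout, so a hanging test cannot hold up the whole job) can be illustrated with a minimal Python sketch. This is not the SAM Client code. The names run_test and run_all, the result labels and the example commands are hypothetical and chosen only for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_test(name, command, test_timeout):
    """Run one test command; kill it and report 'timeout' if it runs too long."""
    try:
        completed = subprocess.run(
            command,
            timeout=test_timeout,  # the child is killed when this expires
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return name, "timeout"
    return name, "ok" if completed.returncode == 0 else "error"


def run_all(tests, test_timeout):
    """Run all tests in parallel; total time is roughly bounded by test_timeout."""
    with ThreadPoolExecutor(max_workers=max(1, len(tests))) as pool:
        futures = [
            pool.submit(run_test, name, command, test_timeout)
            for name, command in tests.items()
        ]
        return dict(future.result() for future in futures)


if __name__ == "__main__":
    # Hypothetical stand-ins for real tests: one quick, one that hangs.
    demo_tests = {
        "quick": [sys.executable, "-c", "pass"],
        "hang": [sys.executable, "-c", "import time; time.sleep(60)"],
    }
    print(run_all(demo_tests, test_timeout=5))
```

With such a scheme the whole job needs roughly one test timeout plus the time to publish results, so a queue whose wallclock limit equals the test timeout (10 minutes in the case above) leaves no margin for publishing.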
2. Should the memory size published in GlueHostMainMemoryRAMSize be per WN or per CPU core?
Steve: It should be per job slot. This is used to know how much memory is available for the job in the WN.
Kostas: Can we make a point to document this somewhere, maybe on the yaim documentation?
Nick: Ok.
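Note: as a worked example of the "per job slot" answer, the sketch below divides a WN's physical memory by its number of job slots. The function name and the example figures (8192 MB, 4 slots) are hypothetical. It assumes the attribute is expressed in MB and that memory is shared equally among slots, neither of which was confirmed in the meeting.

```python
def ram_per_job_slot_mb(total_ram_mb, job_slots):
    """Return the memory available to one job, assuming an equal share per slot."""
    if job_slots <= 0:
        raise ValueError("job_slots must be positive")
    return total_ram_mb // job_slots


# Hypothetical WN: 8192 MB of RAM and 4 job slots.
print(ram_per_job_slot_mb(8192, 4))  # prints 2048
```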
3. Pierre: About Action 53. Is this a recommendation? Did the VOs agree with this decision?
Nick: I believe that the answer is yes. So, this should be the default setup for each VO. If they desire anything different they should specify in the VO card.
Pierre: Ok, but the VOs should be aware that there are some limitations with this approach since many different users can be mapped to the same prd account.
Maria: I think that this issue was discussed in the GDB, and the VO id card was chosen as the way for VOs to tell the sites what they want and how they want to be configured.
Pierre: So, I can say to my site that the VO id card is the reference for that.
Maria: Yes. They have to deploy what is specified in the VO id card in order to support a given VO at their site.
WLCG Items
WLCG issues coming from ROC reports
1. None this week.
WLCG Service Interventions
Extract from agenda:
· Decommissioning of SL3 WNs at DESY-HH: Queues of CEs grid-ce0 and grid-ce2.desy.de will be drained on 21.9.07 and finally shut down on 24.9.07. WNs will be reinstalled with SL4 and included in the existing new CE grid-ce3.desy.de.
Also:
· CCIN2P3 will be offline for one day next Tuesday, 18 September.
· RAL will put one of their RBs offline. It will be broadcast.
FTS service review
See agenda for reports.
INFN: We had some problems last week in our computing centre, but everything should be running much better this week.
Steve: There are some FTS workarounds for the latest version which can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/FtsKnownIssues20
ATLAS service
Extract from agenda:
We noticed that some sites, in order to run ATLAS prod, have tried to install python32. https://gus.fzk.de/pages/ticket_details.php?ticket=25690&from=allt We want to remind sites that python32 is not supported anymore, and we will discuss in the next ATLAS taskforce (one week from now) which policy is better for ATLAS.
CMS service
No report. No representative present.
LHCb service
Nothing to report.
ALICE service
No report. No representative present.
Service Challenge Coordination
Extract from agenda:
The ATLAS M4 cosmic ray run is scheduled from 23 August to 3 September.
See https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperatonsPlanningM4
The CMS CSA07 service challenge is due to start on 10 September and run
for 30 days. See https://twiki.cern.ch/twiki/bin/view/CMS/CSA07Plan
OSG Items
Maria: I would like to have a permanent point in the Operations Meeting to discuss the tickets with OSG.
Nick: We will start this from next Monday.
Review of action items
The updated list of action items can be found attached to the agenda.
AOB
· Pierre: Another comment about the CCIN2P3 downtime. Due to network problems the CIC portal will be unreachable for one hour on Tuesday 18 September.
Next Meeting
The next meeting will be Monday, 17th September 2007 14:00 UTC (16:00 Swiss local time).
Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 14:00 UTC.
The WLCG section will start at the fixed time of 16:30.
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610