EGEE-WLCG-OSG operations meeting

27th August 2007

Agenda

The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=20374

Attendance:

OSG grid operations:......... absent

EGEE

Asia Pacific ROC:............. Min

Central Europe ROC:......... Marcin

OCC / CERN ROC:........... Antonio, Nick, Steve T., Alexandre

French ROC:.................... Pierre, Rolf, Pierre-emanuel, Osman

German/Swiss ROC:......... Clemens

Italian ROC:...................... Paolo, Alessandro

Northern Europe ROC:....... Jules

Russian ROC:................... Lev, Alexander

South East Europe ROC:... Kostas

South West Europe ROC:.. absent

UK/Ireland ROC:............... absent

GGUS:............................. Helmut

OSCT:.............................. absent

WLCG

WLCG service coord.......... Harry

Grid database services...... Maria

WLCG Tier 1 Sites

ASGC:............................. Min

BNL:................................ absent

CERN site:....................... Alexandre

FNAL:.............................. Jo Kaiser

FZK:................................ Clemens, Doris

IN2P3:.............................. Pierre

INFN:............................... Paolo

NDGF:............................. Tore

NIKHEF:........................... Jules

PIC:................................. absent

RAL:................................ Derek, Matt

SARA:............................. Jules

TRIUMF:........................... Rod

VOs

Alice:............................... absent

ATLAS:............................ Alessandro, Simone

BioMed:........................... absent

CMS:............................... absent

LHCb:.............................. Roberto

 

Reports were not received from:

·         VOs:......................................

·         EGEE ROCs (prod sites):..

·         EGEE ROCs (PPS sites):..

Feedback on last meeting's minutes

No comments during the meeting.

EGEE Items

Grid-Operator-on-Duty handover

From: ROC UK/I / ROC CERN

To: ROC Italy / ROC Russia

Issues from Russian ROC:

·         We saw no activity from ROC CERN over the last week. There is now a huge backlog.

Nick will follow this up with the CERN ROC.

Migration to SL4 WNs

All Tier-1 sites believe that they have completed the migration of WNs to SL4 needed to fulfil their pledges.

SE Europe (Kostas):  When are we going to have the other services on SL4?

Steve:  There is a page that shows the progress of the migration.

Nick:  At the pre-CHEP operations workshop it was announced that some services are going to be released in two to three weeks.  I will ask Markus to send this information to the list and will bring an update to the next Operations Meeting.

PPS reports

Extract from agenda:

Issues from EGEE ROCs:

·         SAM Client at Cyfronet has been reconfigured to use glite-wms-* commands. There is an unknown problem during matchmaking of jobs directed to ce110.cern.ch; it is under investigation (with Ulrich Schwickerath) [ROC CE].

Release News:

No further questions or comments.

EGEE issues coming from ROC reports

1.       NE: The SAM job never completes successfully if one of the tests times out or blocks/hangs. For our express queue this results in the SAM job running out of wallclock time and getting killed, which in turn means that no SAM results are published and the entire site fails all tests. Is there a way to prevent this from happening, by killing off hanging/blocking tests in the SAM job within a few minutes? Otherwise it would be useful to know the expected runtime of the SAM job, i.e. with what wallclock specification it should run. Is anything like that specified anywhere?

Answer from SAM team (Piotr): The answer is simple. For the time being we have the following:

timeouts for SAM test jobs:

   test timeout: 10 minutes

   job timeout: 30 minutes

The first one applies to each individual test, the second one to the whole test job. However, as all the tests are executed in parallel, the job timeout should never be reached unless there is a huge problem with publishing the results.

Both are configurable in the SAM Client config file, and can be changed if needed and agreed.

Nick: Do you know how long the express queue is?

SARA: 10 minutes

Nick: I will pass this info to Piotr.

SARA: If I understand correctly, changing the timeout to 20 minutes should solve the problem. We will try it and see whether it does.
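The interaction between the queue limit and the SAM timeouts quoted above can be sketched as a simple check. This is only an illustration under stated assumptions: the names `SamTimeouts` and `queue_limit_is_safe` and the `publish_margin_min` parameter are hypothetical and are not part of the SAM Client; only the 10-minute and 30-minute values come from the SAM team's answer.

```python
from dataclasses import dataclass


@dataclass
class SamTimeouts:
    """Timeouts quoted by the SAM team, in minutes."""
    test_timeout_min: int = 10  # limit for each individual test
    job_timeout_min: int = 30   # limit for the whole test job


def queue_limit_is_safe(queue_wallclock_min: int,
                        timeouts: SamTimeouts,
                        publish_margin_min: int = 0) -> bool:
    """Return True if a hanging test is timed out before the batch
    system kills the job.

    publish_margin_min is a hypothetical allowance for publishing
    results after the test timeout has fired.
    """
    return queue_wallclock_min > timeouts.test_timeout_min + publish_margin_min


# A 10-minute express queue leaves no room after the 10-minute test timeout.
print(queue_limit_is_safe(10, SamTimeouts()))
# A queue limit equal to the 30-minute job timeout does leave room.
print(queue_limit_is_safe(30, SamTimeouts()))
```

With the values from the discussion, a 10-minute queue limit is reached at the same moment as the test timeout, which is consistent with the job being killed before any results are published.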

2.       Should the memory size published in GlueHostMainMemoryRAMSize be per WN or per CPU core?

Steve: It should be per job slot. This is used to know how much memory is available for the job on the WN.

Kostas: Can we make a point of documenting this somewhere, maybe in the yaim documentation?

Nick: Ok.
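As an illustration of the per-job-slot convention, the value to publish could be derived as below. This is a minimal sketch that assumes a WN's physical RAM is shared evenly between its job slots, which may not match every site's batch configuration; the function name `ram_per_job_slot_mb` is hypothetical and is not part of yaim.

```python
def ram_per_job_slot_mb(total_ram_mb: int, job_slots: int) -> int:
    """Memory available to one job, assuming an even split of the
    WN's physical RAM across its job slots (an assumption)."""
    if job_slots <= 0:
        raise ValueError("job_slots must be positive")
    return total_ram_mb // job_slots


# Example: a WN with 8192 MB of RAM and 4 job slots.
print(ram_per_job_slot_mb(8192, 4))  # 2048
```

Under this assumption, a WN with 8192 MB and 4 job slots would publish 2048 rather than 8192.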

3.       Pierre: About Action 53. Is this a recommendation? Did the VOs agree with this decision?

Nick: I believe that the answer is yes. So, this should be the default setup for each VO. If they want anything different they should specify it in the VO id card.

Pierre: Ok, but the VOs should be aware that there are some limitations with this approach since many different users can be mapped to the same prd account.

Maria: I think that this issue was discussed in the GDB, and the VO id card was chosen as the way for VOs to communicate to the sites what they want and how they want to be configured.

Pierre: So, I can tell my site that the VO id card is the reference for that.

Maria: Yes. They have to deploy what is specified in the VO id card in order to support a given VO at their site.

WLCG Items

WLCG issues coming from ROC reports

1.       None this week.

WLCG Service Interventions

Extract from agenda:

·       Decommissioning of SL3 WNs at DESY-HH: Queues of CEs grid-ce0 and grid-ce2.desy.de will be drained on 21.9.07 and finally shut down on 24.9.07. WNs will be reinstalled with SL4 and included in the existing new CE grid-ce3.desy.de.

Also:

·         CCIN2P3 will be offline for one day on Tuesday, 18 September.

·         RAL will put one of their RBs offline. It will be broadcast.

FTS service review

See agenda for reports.

INFN: We had some problems last week in our computing centre, but everything should be running much better this week.

Steve: There are some FTS workarounds for the latest version which can be found here:  https://twiki.cern.ch/twiki/bin/view/LCG/FtsKnownIssues20

ATLAS service

Extract from agenda:

We noticed that some sites, in order to run ATLAS prod, have tried to install python32 (see https://gus.fzk.de/pages/ticket_details.php?ticket=25690&from=allt ). We want to remind sites that python32 is not supported anymore, and we will discuss in the next ATLAS taskforce (one week from now) which policy is better for ATLAS.

CMS service

No report.  No representative present.

LHCb service

Nothing to report.

ALICE service

No report.  No representative present.

Service Challenge Coordination

Extract from agenda:

The ATLAS M4 cosmic ray run is scheduled from 23 August to 3 September.

See https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperatonsPlanningM4

The CMS CSA07 service challenge is due to start on 10 September and run for 30 days. See https://twiki.cern.ch/twiki/bin/view/CMS/CSA07Plan

OSG Items

Maria: I would like to have a permanent item in the Operations Meeting to discuss the tickets with OSG.

Nick:  We will start this from next Monday.

Review of action items

The updated list of action items can be found attached to the agenda.

AOB

·         Pierre: Another comment about the CCIN2P3 downtime. Due to network problems the CIC portal will be unreachable for one hour on Tuesday 18 September.

Next Meeting

The next meeting will be Monday, 17th September 2007 14:00 UTC (16:00 Swiss local time).

Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards. 

The meeting will start promptly at 14:00 UTC.

The WLCG section will start at the fixed time of 16:30. 

To dial in to the conference:

a. Dial +41227676000

b. Enter access code 0157610