EGEE-WLCG-OSG operations meeting

16 April 2007

Agenda

The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=14831

Attendance:

OSG grid operations:......... Rob

EGEE

Asia Pacific ROC:............. Min Tsai

Central Europe ROC:......... Marcin Radecki

OCC / CERN ROC:........... Nick Thackray, Maite Barroso, John Shade, Steve Traylen, Dusan Dragovic

French ROC:.................... Pierre Girard

German/Swiss ROC:......... Sven

Italian ROC:...................... Alessandro Cavalli, Paolo Veronesi

Northern Europe ROC:....... Johan Karlson

Russian ROC:................... Absent

South West Europe ROC:.. Kai Neuffer

South East Europe ROC:... Ioannis Liabotis

UK/Ireland ROC:............... Philippa Strange, John Gordon

US-ATLAS:....................... Absent

US-CMS:.......................... Absent

GGUS:............................. Absent

OSCT:.............................. Absent

WLCG

WLCG service coord.......... Harry Renshall, Alberto Aimar

WLCG Tier 1 Sites

ASGC:............................. Min

BNL:................................ Absent

CERN site:....................... Yvan Calas

FNAL:.............................. Absent

FZK:................................ Sven

IN2P3:.............................. Pierre

INFN:............................... Paolo, Alessandro

NDGF:............................. Leif Nixon

PIC:................................. Kai

RAL:................................ Absent

Sara/NIKHEF:................... Ron

TRIUMF:........................... Absent

VOs

Alice:............................... Patricia

ATLAS:............................ Absent

BioMed:........................... Absent

CMS:............................... Absent

LHCb:.............................. Absent

 

Reports were not received from:

·         WLCG T1 sites:   BNL; NDGF; SARA; TRIUMF

·         VOs:   ATLAS; LHCb

·         EGEE ROCs (production sites):   Russia

·         EGEE ROCs (PPS sites):   Italy, North Europe, Russia

Feedback on last meeting's minutes

No comments during the meeting

EGEE Items

Grid-Operator-on-Duty handover

From ROC CERN (backup: ROC SW Europe) to ROC Russia (backup: ROC UK/I)

·         Tickets:
New : 38
1st mail : 31
2nd mail : 13
close : 26
Quarantine : 23
Site OK : 47

·         Notes:

1.       General gCE middleware problem:
A general problem has been detected with gCE job submission. Jobs fail with: Got a job held event, reason: "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE". The developers confirmed that a bug in the communication between a gCE and the WMS causes this error.

2.       PPS middleware problem:
Some sites show the following problem:
Time to Match History : http://goc02.grid-support.ac.uk/cgi-bin/rb.py?RB=lcg2rb2.ific.uv.es
Publication Date (UTC) : Wed, 11 Apr 2007 06:35:02 +0000
/opt/edg/bin/edg-job-submit output : JobID : None
Selected Virtual Organisation name (from --config-vo option): ops
**** Error: API_NATIVE_ERROR **** Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lcg2rb2.ific.uv.es:7772)
**** Error: UI_NO_NS_CONTACT **** Unable to contact any Network Server

3.       There was also a problem with work sharing between the main and backup teams because of a problem with the Dashboard filter, which has been fixed.
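The PeriodicHold expression quoted in note 1 can be read as a simple rule: hold the job if it has not been matched and more than 900 seconds have passed since it was queued. The following is a minimal Python sketch of that logic only, not of the gCE or WMS code. The function name is hypothetical, and None is used here as a stand-in for an undefined Matched attribute (the "=!=" operator also evaluates to TRUE in that case).

```python
HOLD_TIMEOUT_SECONDS = 900  # taken from the PeriodicHold expression in note 1


def periodic_hold(matched, current_time, qdate):
    """Mirror 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched: True, False, or None (None stands in for an undefined attribute).
    current_time, qdate: times in seconds since the epoch.
    """
    # '=!=' means "is not identical to", so anything other than True counts.
    not_matched = matched is not True
    return not_matched and current_time > qdate + HOLD_TIMEOUT_SECONDS


if __name__ == "__main__":
    qdate = 1_176_000_000  # hypothetical queue time, for illustration only
    print(periodic_hold(None, qdate + 901, qdate))   # unmatched, past the limit
    print(periodic_hold(None, qdate + 900, qdate))   # unmatched, exactly at the limit
    print(periodic_hold(True, qdate + 5000, qdate))  # matched, never held by this rule
```

Under this reading, a job that is never reported as matched is held about 15 minutes after submission, which is consistent with the error being triggered by a communication problem rather than by the job itself.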

 

There were no questions about the handover.

PPS reports

Extract from agenda:

 

R-GMA report

Issues raised last week related to R-GMA: most of the problems are caused by the JobWrapper test, specifically the interaction between JobWrapper monitoring and R-GMA. Issues should be sent to the R-GMA team. At some sites it is working fine, and one of them solved the problem by restarting Tomcat every hour.

 

 

Update on move of MW to SLC4 – Laurence Field

The SLC4 WN has been in PPS for a week now. Laurence has not heard of any serious problems so far.

Dependency issues and run time problems, hoping to have it ready for next Monday.

Working with WMS, fighting with dependency problems, INFN is testing in parallel.

We are also trying to install DPM to solve dependency issues.

 

Questions:

https://grid-deployment.web.cern.ch/grid-deployment//cgi-bin/reports.cgi?action=index

 

For the rest it is nearly impossible to give an estimate, many factors and unknowns. We can give you the list of the priorities.

 

Job wrapper tests - Piotr

We have to disable the current version of the job wrapper tests. This has to be done in ALL WNs. The recipe is attached to the agenda.

RECOMMENDATION TO ALL SITES: disable the job wrapper tests in all WNs following the instructions attached to the agenda page

 

The next version will not use R-GMA, to avoid the present problems. It has to be certified, which will take a few weeks. This is a temporary solution while we decide the long term strategy.

John: will this also go through PPS? Yes. Were these problems seen there? No, the scale was different, with far fewer jobs. The problem mainly affects big sites with many job slots.

John: how do the sites get information about the introduction of new features in production? Minor changes are announced in the release notes, major ones through the ops meeting. This was discussed for the job wrapper tests, though we did not anticipate the present load problems as we did not experience them in certification/PPS.

 

Antony: link with instructions for monboxes, Tomcat Error Check Script:

http://hepunx.rl.ac.uk/egee/jra1-uk/r-gma/tomcat-error-check.html

 

EGEE issues coming from ROC reports

1.       (ROC SE Europe): For Information: SL4 Worker Nodes have been installed in AEGIS01-PHY-SCL and are running with no problems so far. Installation notes have been published in the SEE wiki: http://wiki.egee-see.org/index.php/SL4_WN

Nick: pure SLC4 WNs will be available soon; they are being set up at CERN to be used by the experiments.

2.       (ROC SE Europe): We would appreciate a clear and updated timeline regarding the availability of gLite components (per service, if possible) for SL4/64bit and SL4/32bit. Is it possible to setup a wiki page with such a timeline, updated regularly based on plans and changes according to the development/certification progress?

Some sites want to upgrade hardware, so it should be defined when the SL4/64bit version will be ready. A dashboard with states and a list of priorities has been created. It is hard to estimate when it will be ready, but approximately SL4/32bit will be ready in one or two months and SL4/64bit in three months. We need to see how many packages fail under SL4/64bit.

3.       (ROC SE Europe): It would be nice to have a wiki page with all the already available information on installation SL4 glite services (even with workarounds). Could it be possible for all to coordinate and put all related links to a wiki page? SEE ROC has already published some information on SL4 WN (see point 1)

Is there any GOC wiki entry for this? If so, please send us the link for the minutes. ACTION.

4.        (ROC SW Europe): At PIC we observe intermittent SAM failures on the SRM tests with the error message "BDII Connection Timeout: sam-bdii.cern.ch:2170". The CE SAM tests running in the WNs are already using the regional top-BDIIs, but it seems that the SRM, SE, etc. SAM tests (launched centrally from CERN) all use the CERN top-BDII, which is highly loaded and often times out. Could these central SAM tests use non-CERN top-BDIIs to balance the load? Is the new lcg-utils with 60sec timeout (GFAL>=1.8.1) being used for these SAM tests?

The SAM team is looking at this and trying to find a solution in the next couple of weeks: certification of the new version of the sensors, which query the BDII less intensively (reducing the number of queries), and installation of the new lcg-utils, which does not have the timeout problem.
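The load-balancing idea raised in point 4 can be sketched as follows: try the available top-BDIIs in random order and fall back to the next one when a query times out. This is only an illustration of the approach, not how the SAM sensors or lcg-utils are implemented. The function names and the host "top-bdii.example.org:2170" are hypothetical; the query itself is passed in as a callable so that no particular client library is assumed.

```python
import random


def query_with_failover(endpoints, query_fn, timeout_s=60, rng=random):
    """Try each endpoint in random order; return (endpoint, result) on first success.

    query_fn(endpoint, timeout_s) is supplied by the caller and is expected
    to raise TimeoutError when the endpoint does not answer in time.
    """
    order = list(endpoints)
    rng.shuffle(order)  # spread the load instead of always hitting one BDII
    timed_out = []
    for endpoint in order:
        try:
            return endpoint, query_fn(endpoint, timeout_s)
        except TimeoutError:
            timed_out.append(endpoint)
    raise TimeoutError(f"all top-BDIIs timed out: {sorted(timed_out)}")


if __name__ == "__main__":
    def fake_query(endpoint, timeout_s):
        # Stand-in for a real query: the overloaded CERN BDII times out.
        if endpoint.startswith("sam-bdii.cern.ch"):
            raise TimeoutError(endpoint)
        return "ok"

    endpoints = ["sam-bdii.cern.ch:2170", "top-bdii.example.org:2170"]
    print(query_with_failover(endpoints, fake_query))
```

The default of 60 seconds follows the lcg-utils timeout mentioned in the question; any other value can be passed in.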

WLCG Items

Tier 1 reports

The WLCG tier-1 site reports for this week can be found here:

http://indico.cern.ch/getFile.py/access?subContId=9&contribId=3&resId=0&materialId=slides&confId=14831

T0 and T1 site availability in the weekly reports – Alberto Aimar

See the presentation attached to the agenda page. Summary: for the past year, SAM test results have been aggregated to calculate the monthly site availability for T0/T1 sites. The current target is 88% availability, and by the end of the year it will rise to 93% for T1 sites. The plan is to change the report from monthly to weekly and to integrate it into the CIC portal site reports. All downtimes longer than two hours should be commented.

John: tests are failing at the SAM BDII (discussed at the MB last week). With this a site cannot do better than 94%, so either remove the tests or fix this.

Piotr: we are trying to fix it. One way is with a new version of the sensors with reduced number of queries to the BDII (being certified) and another is a new version of lcg-utils that we will deploy in a few days.
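To illustrate how a weekly figure relates to the targets above, here is a minimal Python sketch. It assumes that availability is simply the fraction of hours in the week not spent in downtime; the actual SAM aggregation algorithm is not described in these minutes and may differ. The function names and the example downtimes are hypothetical.

```python
WEEK_HOURS = 7 * 24
COMMENT_THRESHOLD_HOURS = 2   # downtimes longer than this should be commented
CURRENT_TARGET_PCT = 88       # current target quoted in the summary
END_OF_YEAR_TARGET_PCT = 93   # end-of-year target for T1 sites


def weekly_availability(downtime_hours):
    """Percentage of the week the site was up (assumed definition)."""
    down = sum(downtime_hours)
    return 100.0 * (WEEK_HOURS - down) / WEEK_HOURS


def downtimes_needing_comment(downtime_hours):
    """Downtimes strictly longer than the two-hour threshold."""
    return [h for h in downtime_hours if h > COMMENT_THRESHOLD_HOURS]


def meets_target(availability_pct, target_pct):
    return availability_pct >= target_pct


if __name__ == "__main__":
    downtimes = [1.5, 3.0, 12.0]  # hypothetical downtimes in hours
    availability = weekly_availability(downtimes)
    print(round(availability, 1))
    print(meets_target(availability, CURRENT_TARGET_PCT))
    print(meets_target(availability, END_OF_YEAR_TARGET_PCT))
    print(downtimes_needing_comment(downtimes))
```

With these example numbers the site would meet the current 88% target but not the 93% target, and two of the three downtimes would need a comment.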

WLCG issues coming from ROC reports

Nothing this week.

Upcoming WLCG Service Interventions

FTS service review

Nobody was present at the meeting from the FTS team to discuss the main issues. Please read the FTS report index and the Transfer Operations Wiki.

ATLAS service

See wiki pages (https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations) for more information.

LFC 1.6.3 - status of ASGC? Min: we were having problems and debugging with CERN support, in progress.

CMS service

·       Job processing: Status of left-overs of MC production with CMSSW_120 is being evaluated. Good news is that about 10M Min Bias DIGI-RECO events have been produced so far and are available for analysis on global DBS to CMS users. These are sufficient for the HLT group to start working with CMSSW_120; the rest will be DIGI-RECOed with 13X. The Minbias GEN-SIM production (up to 26M at the moment) will be continued by all teams until further notice. The needed new CMSSW versions (123/13x) are being installed CMS-wide, and a new round of MC production is starting soon.

·       Data transfers: last week was week-2 of Cycle-2 of the CMSLoadTest07 (see [*]) with focus on T0-T1 routes and T1-T2 regional routes. Operations were smooth. Concerning T1 participation: all 7 T1s took part on every day of the week. Concerning performance, we ran at 300-500 MB/s of aggregate transfer rate to all T1s (was 300-350 last week). Best day: 27/3, with >450 MB/s of aggregated daily average. T1-T2 exercises are still quite different from region to region. Concerning T2 participation: ~31 (/42) T2s. Concerning performance, we ran at ~500 MB/s of aggregate transfer rate from T1s to T2s (last week: 250-400 MB/s). Next week: focus also on T2-T1 and T1-T2 non-regional routes.
[*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm

There were no further comments.

ALICE service

We have been testing the number of running jobs vs. waiting jobs using 4 different information sources (batch system, RB, local GRIS and top BDII) at 3 sites: CERN, FZK and CNAF. We have seen large discrepancies between the RB and the rest of the information sources:

http://pcalimonitor.cern.ch:8889/display?page=SAM/compare

 
At GridKa, the LB reports ~9000 jobs while in reality there are only 40, so it seems the finished jobs are not properly logged.
This should be solved with the new gLite WMS, so we will begin to repeat the test with that system.

Because this test is of interest to the other VOs, we will begin to run it for ATLAS and CMS as well.
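The comparison described above can be sketched as a check of each information source against the batch system. This is a minimal Python illustration, not the code behind the monitoring page. It assumes the batch system is taken as the reference and that a 20% tolerance is acceptable; both are assumptions, as are the function name and the GRIS and BDII values, which are made up. Only the ~9000 vs. 40 figures come from the report.

```python
def find_discrepancies(counts, reference="batch system", tolerance=0.2):
    """Return {source: difference} for sources that disagree with the reference.

    counts: mapping of information source name to reported number of jobs.
    A source is flagged when it differs from the reference by more than
    tolerance * reference (with a minimum allowance of 1 job).
    """
    ref = counts[reference]
    allowed = max(1, tolerance * ref)
    flagged = {}
    for source, value in counts.items():
        if source == reference:
            continue
        if abs(value - ref) > allowed:
            flagged[source] = value - ref
    return flagged


if __name__ == "__main__":
    gridka = {
        "batch system": 40,   # real number of jobs, from the report
        "RB (LB)": 9000,      # approximate figure, from the report
        "local GRIS": 41,     # hypothetical value
        "top BDII": 38,       # hypothetical value
    }
    print(find_discrepancies(gridka))
```

A check of this kind makes the stale job records stand out: only the RB (LB) entry is flagged, while small differences between the other sources stay within the tolerance.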

LHCb service

Nobody present and no questions/comments.

Service Challenge Coordination

Nothing to report.

OSG Items

Nothing to report.

Review of action items

The updated list of action items can be found attached to the agenda and also here.

AOB

·         Sven: regarding Patricia’s report, there is a ticket open and we are investigating what the cause is.

Next Meeting

The next meeting will be Monday, 23rd April 2007 15:00 UTC (16:00 Swiss local time).

Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards. 

The meeting will start promptly at 15:00 UTC.

The WLCG section will start at the fixed time of 16:30. 

To dial in to the conference:

a. Dial +41227676000

b. Enter access code 0157610