EGEE-WLCG-OSG operations meeting
16 April 2007
Agenda
The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=14831
Attendance:
OSG grid operations:......... Rob
EGEE
Asia Pacific ROC:............. Min Tsai
Central Europe ROC:......... Marcin Radecki
OCC / CERN ROC:........... Nick Thackray, Maite Barroso, John Shade, Steve Traylen, Dusan Dragovic
French ROC:.................... Pierre Girard
German/Swiss ROC:......... Sven
Italian ROC:...................... Alessandro Cavalli, Paolo Veronesi
Northern Europe ROC:....... Johan Karlson
Russian ROC:................... Absent
South West Europe ROC:.. Kai Neuffer
South East Europe ROC:... Ioannis Liabotis
UK/Ireland ROC:............... Philippa Strange, John Gordon
US-ATLAS:....................... Absent
US-CMS:.......................... Absent
GGUS:............................. Absent
OSCT:.............................. Absent
WLCG
WLCG service coord.......... Harry Renshall, Alberto Aimar
WLCG Tier 1 Sites
ASGC:............................. Min
BNL:................................ Absent
CERN site:....................... Yvan Calas
FNAL:.............................. Absent
FZK:................................ Sven
IN2P3:.............................. Pierre
INFN:............................... Paolo, Alessandro
NDGF:............................. Leif Nixon
PIC:................................. Kai
RAL:................................ Absent
Sara/NIKHEF:................... Ron
TRIUMF:........................... Absent
VOs
Alice:............................... Patricia
ATLAS:............................ Absent
BioMed:........................... Absent
CMS:............................... Absent
LHCb:.............................. Absent
Reports were not received from:
· WLCG T1 sites: BNL; NDGF; SARA; TRIUMF
· VOs: ATLAS; LHCb
· EGEE ROCs (production sites): Russia
· EGEE ROCs (PPS sites): Italy, North Europe, Russia
Feedback on last meeting's minutes
No comments during the meeting
EGEE Items
Grid-Operator-on-Duty handover
From ROC CERN (backup: ROC SW Europe) to ROC Russia (backup: ROC UK/I)
· Tickets:
New : 38
1st mail : 31
2nd mail : 13
close : 26
Quarantine : 23
Site OK : 47
· Notes:
1. General gCE middleware problem:
A general problem has been detected with gCE job submission: Got a job held event, reason: "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"
The developers confirmed that a bug in the communication between a gCE and the WMS causes this error.
2. PPS middleware problem:
Some sites show the following problem:
Time to Match History : http://goc02.grid-support.ac.uk/cgi-bin/rb.py?RB=lcg2rb2.ific.uv.es
Publication Date (UTC) : Wed, 11 Apr 2007 06:35:02 +0000
/opt/edg/bin/edg-job-submit output :
JobID : None
Selected Virtual Organisation name (from --config-vo option): ops
**** Error: API_NATIVE_ERROR ****
Error while calling the "NSClient::multi" native api
IOException: Unable to connect to remote (lcg2rb2.ific.uv.es:7772)
**** Error: UI_NO_NS_CONTACT ****
Unable to contact any Network Server
3. There was also a problem with work sharing between the main and backup teams, caused by a problem with the Dashboard filter, which has been fixed.
No questions on the handover.
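The PeriodicHold expression quoted in note 1 puts a job on hold if it has not been matched within 900 seconds of its queue date. A minimal Python sketch of that condition, assuming Condor ClassAd semantics in which `=!=` treats an undefined `Matched` attribute as "not TRUE"; the function and parameter names are hypothetical and this is not the middleware's own code:

```python
from typing import Optional

HOLD_AFTER_SECONDS = 900  # taken from the quoted expression: QDate + 900


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int) -> bool:
    """Mirror 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched=None stands in for an undefined ClassAd attribute (assumption).
    """
    # '=!=' is true unless the attribute is exactly TRUE.
    not_matched = matched is not True
    return not_matched and current_time > qdate + HOLD_AFTER_SECONDS


# A job queued at t=0 and still unmatched at t=901.
print(periodic_hold(None, 901, 0))
# A job that has been matched, checked long after submission.
print(periodic_hold(True, 5000, 0))
```

Read this way, the hold event is a symptom: the job was never reported as matched within 15 minutes, which is consistent with the communication bug between the gCE and the WMS described above.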
PPS reports
Extract from agenda:
Issues raised last week related to R-GMA: most of the problems are caused by the JobWrapper test, specifically the interaction between JobWrapper monitoring and R-GMA. Issues should be sent to the R-GMA team. At some sites it is working fine, and one of them solved the problem by restarting tomcat every hour.
Update on move of MW to SLC4 – Laurence Field
The SLC4 WN has been in PPS for a week now. Laurence has not heard of any serious problems so far.
There are dependency issues and run-time problems; the hope is to have it ready by next Monday.
Work on the WMS is ongoing and is held up by dependency problems; INFN is testing in parallel.
We are also trying to install DPM to resolve dependency issues.
Questions:
https://grid-deployment.web.cern.ch/grid-deployment//cgi-bin/reports.cgi?action=index
For the rest it is nearly impossible to give an estimate, as there are many factors and unknowns. We can give you the list of priorities.
Job wrapper tests - Piotr
We have to disable the current version of the job wrapper tests. This has to be done on ALL WNs. The recipe is attached to the agenda.
RECOMMENDATION TO ALL SITES: disable the job wrapper tests on all WNs following the instructions attached to the agenda page.
The next version will not use R-GMA, to avoid the present problems. It has to be certified, which will take a few weeks. This is a temporary solution while we decide the long-term strategy.
John: will this also go through PPS? Yes. Were these problems seen there? No, the scale was different, with far fewer jobs. The problem mainly affects big sites with many job slots.
John: how do the sites get information about the introduction of new features in production? Minor changes are announced in the release notes, major ones through the ops meeting. This was discussed for the job wrapper tests, though we did not anticipate the present load problems as we did not experience them in certification/PPS.
http://hepunx.rl.ac.uk/egee/jra1-uk/r-gma/tomcat-error-check.html
EGEE issues coming from ROC reports
1. (ROC SE Europe): For Information: SL4 Worker Nodes have been installed in AEGIS01-PHY-SCL and are running with no problems so far. Installation notes have been published in the SEE wiki: http://wiki.egee-see.org/index.php/SL4_WN
Nick: pure SLC4 WNs will be available soon; they are being set up at CERN for use by the experiments.
2. (ROC SE Europe): We would appreciate a clear and updated timeline regarding the availability of gLite components (per service, if possible) for SL4/64bit and SL4/32bit. Is it possible to set up a wiki page with such a timeline, updated regularly based on plans and changes according to the development/certification progress?
Some sites want to upgrade hardware, so they need to know when the SL4/64bit version will be ready. A dashboard with states and a list of priorities has been created. It is hard to estimate when it will be ready, but approximately SL4/32bit will be ready in one to two months and SL4/64bit in about three months. We need to see how many packages fail under SL4/64bit.
3. (ROC SE Europe): It would be nice to have a wiki page with all the already available information on installing SL4 gLite services (even with workarounds). Would it be possible for everyone to coordinate and put all related links on a wiki page? SEE ROC has already published some information on SL4 WN (see point 1).
Is there a GOC wiki entry for this? If so, please send us the link for the minutes. ACTION.
4. (ROC SW Europe): At PIC we observe intermittent SAM failures on the SRM tests with the error message "BDII Connection Timeout: sam-bdii.cern.ch:2170". The CE SAM tests running on the WNs are already using the regional top-BDIIs, but it seems that the SRM, SE, etc. SAM tests (launched centrally from CERN) all use the CERN top-BDII, which is highly loaded and often times out. Could these central SAM tests use non-CERN top-BDIIs to balance the load? Is the new lcg-utils with 60sec timeout (GFAL>=1.8.1) being used for these SAM tests?
The SAM team is looking at this and trying to find a solution in the next couple of weeks: certification of a new version of the sensors that uses the BDII less intensively by reducing the number of queries, and installation of a new lcg-utils that does not have the timeout problem.
WLCG Items
Tier 1 reports
The WLCG tier-1 site reports for this week can be found here:
T0 and T1 site availability in the weekly reports – Alberto Aimar
See presentation attached to the agenda page. Summary: for the past year, the SAM test results have been aggregated to calculate the monthly site availability for T0/T1 sites. The current target is 88% availability, and by the end of the year it will rise to 93% for T1 sites. The plan is to change the report from monthly to weekly and to integrate it into the CIC portal site reports. All downtimes longer than two hours should be commented.
John: tests are failing at the SAM BDII, as discussed at the MB last week. Because of this a site cannot do better than 94%, so either remove the tests or fix this.
Piotr: we are trying to fix it. One way is with a new version of the sensors with reduced number of queries to the BDII (being certified) and another is a new version of lcg-utils that we will deploy in a few days.
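To make the availability targets concrete, here is a minimal Python sketch. It assumes availability is simply the fraction of hourly samples in which a site passed its tests, over a 168-hour week; the minutes do not describe the actual SAM aggregation, so the hourly sampling and the function names are illustrative assumptions only.

```python
import math
from typing import Sequence


def availability(hourly_ok: Sequence[bool]) -> float:
    """Fraction of sampled hours in which the site passed its tests."""
    if not hourly_ok:
        raise ValueError("no samples")
    return sum(hourly_ok) / len(hourly_ok)


def max_failed_hours(target: float, hours: int = 168) -> int:
    """Largest number of failed hours that still meets the target."""
    # The small epsilon guards against float rounding at exact boundaries.
    return math.floor(hours * (1 - target) + 1e-9)


# Hypothetical week: 10 failed hours out of 168.
week = [True] * 158 + [False] * 10
print(f"{availability(week):.1%}")
print(max_failed_hours(0.88), max_failed_hours(0.93))
```

Under these assumptions, moving the target from 88% to 93% roughly halves the number of failed hours a site can absorb in a week, which is why test failures outside a site's control matter.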
WLCG issues coming from ROC reports
Nothing this week.
Upcoming WLCG Service Interventions
FTS service review
Nobody from the FTS team was present at the meeting to discuss the main issues. Please read the FTS report index and the Transfer Operations Wiki.
ATLAS service
See wiki pages (https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations) for more information.
LFC 1.6.3 - status at ASGC? Min: we had problems and are debugging them with CERN support; this is in progress.
CMS service
· Job processing: The status of the left-overs of MC production with CMSSW_120 is being evaluated. The good news is that about 10M Min Bias DIGI-RECO events have been produced so far and are available to CMS users for analysis on the global DBS. These are sufficient for the HLT group to start working with CMSSW_120; the rest will be DIGI-RECOed with 13X. The Minbias GEN-SIM production (up to 26M at the moment) will be continued by all teams until further notice. The new CMSSW versions needed (123/13x) are being installed CMS-wide, and a new round of MC production is starting soon.
· Data transfers: Last week was week-2 of Cycle-2 of the CMSLoadTest07 (see [*]) with focus on T0-T1 routes and T1-T2 regional routes. Operations were smooth. Concerning T1's participation: all days of the week we had all 7 T1s. Concerning performances, we ran at 300-500 MB/s of aggregate transfer rate to all T1's (was 300-350 last week). Best day: 27/3, with >450 MB/s of aggregated daily average. T1-T2 exercises are still quite different from region to region. Concerning T2's participation: ~31 (/42) T2's. Concerning performances, we ran at ~500 MB/s of aggregate transfer rate from T1's to T2's (last week: 250-400 MB/s). Next week: focus also on T2-T1 and T1-T2 non-regional routes.
[*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
There were no further comments.
ALICE service
We have been comparing the number of running jobs vs. waiting jobs using 4 different information sources (batch system, RB, local GRIS and top BDII) at 3 sites: CERN, FZK and CNAF. We have seen large discrepancies between the RB and the other information sources:
http://pcalimonitor.cern.ch:8889/display?page=SAM/compare
At GridKa, the LB reports ~9000 jobs whereas in reality there are only 40, so it seems that finished jobs are not properly logged. This should be solved with the new gLite WMS, so we will begin to test it with this system. Because this test is of interest to the other VOs, we will begin to run it for ATLAS and CMS as well.
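The comparison described above can be sketched in a few lines of Python. This is an illustration only, not the ALICE monitoring code: the function name, the 20% tolerance, the choice of the batch system as the reference, and the GRIS and BDII figures are all hypothetical. Only the ~9000 vs. 40 figures come from the report.

```python
from typing import Dict


def discrepant_sources(
    counts: Dict[str, int],
    reference: str = "batch",
    tolerance: float = 0.2,
) -> Dict[str, int]:
    """Return sources whose job count differs from the reference source.

    A source is flagged when it deviates from the reference by more than
    `tolerance` (as a fraction of the reference count). The value returned
    for each flagged source is its signed difference from the reference.
    """
    ref = counts[reference]
    flagged = {}
    for source, value in counts.items():
        if source == reference:
            continue
        if abs(value - ref) > tolerance * max(ref, 1):
            flagged[source] = value - ref
    return flagged


# 9000 and 40 are from the report; the GRIS and BDII values are made up.
gridka = {"batch": 40, "rb_lb": 9000, "gris": 40, "bdii": 42}
print(discrepant_sources(gridka))
```

A relative tolerance is used so that small timing differences between sources are not reported as discrepancies.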
LHCb service
Nobody present and no questions/comments.
Service Challenge Coordination
Nothing to report.
OSG Items
Nothing to report.
Review of action items
The updated list of action items can be found attached to the agenda and also here.
AOB
· Sven: regarding the issue in Patricia's report, there is a ticket open and we are investigating what the cause is.
Next Meeting
The next meeting will be on Monday, 23rd April 2007 at 15:00 UTC (16:00 Swiss local time).
Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 15:00 UTC.
The WLCG section will start at the fixed time of 16:30.
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610