Weekly Operations Meeting Action List

Status as of 18 December 2006

Due Date colour key:

Red: action is overdue

Yellow: action is due at or before next meeting

White: action is due some time after the next meeting
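The colour key above can be read as a simple rule on three dates. The following is a minimal sketch only, not part of the original list: the function name due_date_colour is hypothetical, "overdue" is assumed to mean a due date strictly before the status date, and the next meeting is assumed to fall one week after the status date because the meeting is weekly.

```python
from datetime import date, timedelta


def due_date_colour(due: date, status_date: date, next_meeting: date) -> str:
    """Classify an action's due date using the colour key (hypothetical helper)."""
    if due < status_date:
        # Assumption: "overdue" means the due date is before the status date.
        return "red"
    if due <= next_meeting:
        # Due at or before the next meeting.
        return "yellow"
    # Due some time after the next meeting.
    return "white"


# Status date taken from this list; the next meeting date is an assumption
# (one week later, since the meeting is weekly).
status = date(2006, 12, 18)
next_meeting = status + timedelta(days=7)

print(due_date_colour(date(2006, 12, 15), status, next_meeting))  # action 67
print(due_date_colour(date(2007, 1, 15), status, next_meeting))   # action 70
```

Under these assumptions, action 67 (due 15/12/06) is classified as red and action 70 (due 15/01/07) as white.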

Open Action Items

Number

Description

Assigned To

Status

Due date

67

Timescale for move to Torque2?

Progress on 2006-10-30: In progress.

Progress on 2006-11-01: Expected to be in certification within 2-3 weeks.

Progress on 2006-11-20: The counter was restarted last Friday, so it will go to PPS in 2-3 weeks from now.

Estimated timeline: in PPS by ~15th Dec

Progress on 2006-12-18: It was released to PPS on Monday and removed on Tuesday because a critical problem was found.

OCC

In progress

15/12/06

70

Conclude on "Policy for security updates of third party software".

The gLite integration team policy is: the external packages are not guaranteed to be maintained. They are provided for convenience. They are maintained by their providers.

The reality is that they will be maintained on best effort.

To be clarified with the security team.

Progress on 2006-11-27: being discussed with OSCT and SA3

Progress on 2006-12-04: This item was discussed during the meeting.  Waiting for SA3 to create the final list of external packages which need to be maintained.

OCC

In progress

15/01/07

71

ATLAS to check if they know of any conflicts between SL kernel version 2.6 and either the application software or the middleware.

Progress on 2006-12-11: ATLAS was not present at the meeting; we will check offline.

Progress on 2006-12-18:

Alessandro DeSalvo says that, as far as the ATLAS application is concerned, there is no problem. The only issue that might arise (though this is not at all certain) comes from the Oracle client in the production system (which in any case has only 4 instances in the whole Grid) and the Data Management Clients in the VOBOXes. So as long as this discussion does not refer to VOBOXes, this is OK. The VOBOXes (only 10 nodes for ATLAS, one at each T1) will need to be considered some time soon. I will get in touch with Miguel for this.

The action can be closed.

Simone

Closed

11/12/06

75

Provide DPM to ATLAS for testing purposes in PPS service.

Nick

New

15/01/07

76

VOs to update their mailing lists of grid users so that grid operational messages are communicated to all users when necessary.

OCC / VOs

New

15/01/07


Closed Action Items

Number

Description

Assigned To

Due date

Date closed

65

Request from sites for certificate reload of LFC certificates on service start up and reload. Contact LFC developers  to ask them about this.

Progress on 2006-10-30: In progress

Progress on 2006-11-20: this is documented at the following link:

https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates

Check that this solves the original request

Progress on 2006-11-27: No comment received about the content of the previous link. We assume it is fine and close the action.

Action closed.

OCC

06/11/06

27/11/06

4/12/06

64

Does the security contacts list include the security contacts for the ROCs as well as for the sites?

Progress on 2006-10-30: The list now includes the ROC security contacts.

Ian Neilson

30/10/06

30/10/06

63

Russian ROC to suspend the site Kharkov-KIPT-LCG2.

Progress on 2006-10-23: Issue resolved. Close.

ROC Russia

24/10/06

24/10/06

2006-10-16—1

Alice VO to write down their proposal regarding using the sgm account for submitting jobs from the VO box and creating a new service account dedicated to software installation.

Progress on 2006-10-23: This was done and distributed to the Grid Operations mailing list on 20/10/06 (when it was received from Alice). See the minutes of the meeting regarding the outcome of the discussions.

Alice VO

23/10/06

23/10/06

2006-08-28—1

Are there any tools to help with cleaning up the MySQL database of an RB?  If not, are there any procedures described anywhere for doing this safely?

Progress on 2006-09-21: There are no such tools. However, Marcin noted that they have some tools in CE ROC and he will send information to everyone.

DECH: But we should have a procedure for this. Need an action on the developers to tell us what is the best way to do it.

OCC: Please open a ticket about it.

Closed.

OCC /
Maarten Litmaath

18/09/06

21/09/06

2006-08-07—1

Get VO operations contact mailing lists and include them as “generic contacts” in the EGEE broadcast tool

Progress on 2006-08-14: Waiting for mailing list information from LHCb.

Progress on 2006-08-21: Waiting for mailing list information from LHCb.

Progress on 2006-08-28: Now have this information. This would be interesting for all VOs, not only HEP VOs.  Close.

OCC

14/08/06

28/08/06

28/08/06

2006-07-24—2

Need to get a clear list of RPMs per service component (e.g. CE contains components: gridFTP, BDII, Globus, etc.)

Progress on 2006-07-31: More input from the French ROC:

In fact, you provide us the rpms list by node profile. Correct me if I'm wrong, but such a node profile is proposed by the project as a template.

Therefore, there are 2 ways to deal with it:

1) you blindly install it and hope it will work. In that case, you learn a lot about a node when it starts to go wrong ;). This approach becomes more difficult when you are already running a non-grid production and your environment is therefore already set up, or when you use an install tool other than YAIM.

2) you decide to customize your installation to adapt it to both your site environment and the way you operate your site. As an example, by default, the LCG CE was delivered with the GIIS service on it. This default installation is certainly well adapted for small sites, but for a site such as IN2P3-CC that runs core services, it was not a good idea. So 2 years ago, we decided to move the GIIS to a dedicated node (as you know, this is the current recommendation, made some time ago to the big sites). In order to be able to do so, you need to clearly know and understand the components/services of the type of node (as an example, an LCG CE is made of a gatekeeper, a GRIS, a gridftp, and globus as a common sub-component). If you want to deal separately with those CE components, you must know the RPMs list by component.

In the case of Cal, he requires the RPMs list to make the Quattor templates for the services (not for the node profiles). With those service templates, site administrators have the possibility to configure their site as they want, and in particular, hosting CE services in different machines if it is better for their site configuration.

The "node profile" approach, with a flat list of RPMs, was a very efficient and pragmatic way to start the LCG/EGEE project and then to be able to deploy the middleware over more than 150 sites from scratch.

But now production quality is becoming our priority, so it is important that site administrators be able to deal better with their local set-up. I think that adding a per-service packaging level will improve the understanding and thereby the job of site administrators.

The gLiteCE node is a good example.  It contains the following services:

1) bdii

2) gridftp (or globus if you want to lump them together)

3) dgas (possibly just the client)

4) rgma

One could also argue that the security configuration (CAs, mkgridmap, etc.) should be a service as well.

The worst node (and the most critical) is the UI, where the client code for every service is lumped into a single profile.  This is a disservice to system administrators and users alike.  They often want to try a single service; forcing them to recreate a list for the service or install everything isn't very "user friendly".

Progress on 2006-08-07: The release team acknowledges this request: We're well aware of this issue, it has been raised before, and it is the direction we'd like to proceed. In fact, we're likely to start trying to decompose the CE, but given the current priorities I don't want to make any promises, so the issue is best left 'on ice'.

The only other thing is to repeat that if anyone has actually done this work for their own reasons we'd be interested in trying to incorporate it.

French ROC requests an estimate of when this could be done (a few weeks, months?)

Progress on 2006-08-14: Waiting for estimate from SA3.

Progress on 2006-08-21: Answer from SA3: “I can't see us looking at this until after the SLC4 work, thus start of next year. But I repeat, this is not currently prioritised, so could easily be superseded.”

Progress on 2006-08-28: Maite: this would be a good candidate for collaborative community support. Otherwise there will probably be no progress until next year. This item can be closed and will be moved to the ROC managers meeting.

OCC / SA3

7/08/06

28/08/06

28/08/06

2006-07-10—3

Make public the SFT requirement to the sites: SFT jobs must run within 30 min

Progress on 2006-07-17: This information will be published.

Progress on 2006-07-24: Should be ready within 1-2 days. In progress.

Progress on 2006-07-31: Should be ready within 1-2 days.

Progress on 2006-08-07: No news

SFT team

31/07/06

14/08/06

2006-05-02—4

Chase Russian ROC on their failure to answer ATLAS trouble ticket.

Progress on 2006-05-08: Nobody from the Russian ROC. Simone got an answer last week just after it was raised in the operations meeting, but had a similar case later in the week.

Progress on 2006-05-22: The ticket is now closed. The action can be closed.

Nick

 

 

2006-05-02—3

Report on the procedure for disseminating new VOMS server certificates to all grid sites.

Progress on 2006-05-22: Maria Dimou: The VOMS server certificates are distributed in an RPM as part of the release. There are also plans to upload them to the VO ID page in the CIC portal. There is a ticket open to request that the VOMS server DN be used instead of the whole certificate (the DN does not change when the certificate expires), but some security people are not happy with this proposal. The action can be closed.

Maite

 

 

2006-05-02—1

Agenda item for next COD meeting: clarify the use of Suspended and Uncertified. Reflect the decision in the Ops manual.

Progress on 2006-05-22: This was done at the COD meeting. The action can be closed.

Hélène

 

 

2006-02-06—2

Check that the link to the Operations Manual is correct on the CIC portal.

Progress on 2006-02-20: Done.

Gilles/

Osman

 

 

2005-11-21—5

The R-GMA test in the SFT monitoring should be set to critical and the effects of doing this to be reported at next week's meeting.

Progress on 2005-12-12: The R-GMA test was set to critical. However, an urgent upgrade to R-GMA will be announced today (see AOB in the minutes) so this test will be set to non-critical for a short while. This action item will be left open to track when the test will be set back to critical.

Progress on 2006-01-09: The SFT test has been set back to non-critical as there are many sites having problems with secure R-GMA. This will be discussed at an R-GMA meeting this week and Piotr will report back next week.

Progress on 2006-02-20: The status of R-GMA in production is under investigation (see the minutes for this week).

Progress on 2006-02-27: There are no problems known about by the R-GMA developers. Fixes to common site problems can be found on the GOC wiki (URL will be included in the minutes). Any problems that can’t be fixed through reference to the wiki pages should be logged in GGUS.

Progress on 2006-03-06: No new problems reported, so the action will be closed.

Nick

 

 

2005-12-12—6

Flavia needs feedback on the following document: Summary of Open Issues reported by LHC experiments

Progress on 2006-01-09: Nick reiterated that feedback is needed on this document and that it should be given to Flavia by the end of January.

Progress on 2006-01-23: See meeting minutes.

Progress on 2006-02-06: Will ask Flavia to report on this item.

Progress on 2006-02-27: Some comments received from Tier-1s and some sites. Sites that have not yet provided feedback still have time to send it to Flavia.

Flavia

 

 

2005-12-19—4

Clarify the policy on whether sites have to publish job monitoring information.

Progress on 2006-02-27: Nobody remembers what this action was about. Closed.

Piotr /

Nick

 

 

2006-01-30—1

Osman to make available GGUS ticket earlier than 12 noon and for more days.

Progress on 2006-02-20: Osman: The action is still manual, but it is OK. This item can be closed.

Osman

 

 

2006-01-30—2

Piotr to draft the AUP for the new monitoring VO.

Progress on 2006-02-20: Done.

Piotr

 

 

2006-01-30—3

ROC Managers are requested to add details to the Savannah ticket that Maria is going to create dealing with VOMRS (see AP 2005-12-12-5).

Progress on 2006-03-06: Moved to ROC managers action list

ROC managers

 

 

2005-12-19—5

UK/I ROC raised the issue that there were a number of JS and JL errors that don't show on the automatically generated weekly report.

Progress on 2006-01-09: In fact there is a general problem, seen by several ROCs that problems which are seen in the SFT are not appearing in the auto-generated reports. Gilles and Osman will look into this and report back.

Progress on 2006-02-20: Gilles said that the fix for this problem will be in the next release of the CIC Portal (maybe sooner).

Progress on 2006-02-27: The problem did not appear any more, so it looks solved. The action will be closed and reopened in case the problem appears again.

Gilles /

Osman

 

 

2006-02-06—3

If a site is in scheduled down time but has monitoring set to on, do the monitoring results for that site appear in the weekly ROC reports and the metrics?

Progress on 2006-02-20: Piotr said that in the prototype metrics charts the status of "Scheduled Downtime" takes precedence over SFT test results, therefore sites in scheduled down time are shown as unavailable, but any SFT test results are ignored. This is the required behaviour.
However, it was not clear whether the weekly ROC reports show this behaviour. That is, even when a site is in scheduled downtime, the results of SFT tests might show up on the report. Gilles/Osman will investigate and fix this if necessary.

Progress on 2006-02-27: Downtimes are not stored in the SFT history, so SFT could run while a site is in scheduled downtime and the results logged in the ROC report. This will be solved in the new SAME framework. Meanwhile, it could be solved on the CIC-portal site, crosschecking SFT results and scheduled downtime.

Progress on 2006-03-06: This has been fixed in the present version of the CIC-portal; the sites in scheduled downtime do not appear any more in the ROC reports.

Gilles /

Osman / Piotr

 

 

2005-08-08—2

Nick to ask if the solution to the naming of the OS can be placed in the LCG release notes (or some other publication).

Progress on 2005-08-22: Nick spoke to Laurence Field. Laurence is deciding on the best place to put this information.

Progress on 2005-10-17: Nick to chase this up with Laurence and Markus again.

Progress on 2005-10-31: Nick to chase this up with Laurence and Markus again.

Progress on 2005-11-21: Nick has spoken to Jonathan Schaeffer, who is writing an administrator's guide to the glue schema. Jonathan will include this information in the document. The document is due by the end of the year. This action item will be kept open for tracking.

Progress on 2005-12-19: Jonathan is writing it, and it will be available on the wiki.

Progress on 2006-02-27: Nick requested an update from Jonathan

Progress on 2006-03-06: Jonathan has documented it. It is available at: http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name

Progress on 2006-03-20: No comments received, action closed.

Jonathan Schaeffer

 

 

2006-04-03—1

VOs to update/create their ID card in the CIC portal. (Maite to remind them by e-mail)

Progress on 2006-04-24: Done

VOs/Maite

 

 

2006-01-09—2

Put the process for decommissioning a site onto the COD-6 agenda.

Progress on 2006-01-23: This was discussed at COD-6 so Nick will put a proposal together and bring it before the meeting.

Progress on 2006-03-06: Action on Nick to write a proposal for site decommissioning procedure.

Progress on 2006-04-03: A new status of "Closed" will be available in the GOC database for decommissioned sites.

Nick

 

 

2005-12-12—5

Not all ROC managers have sufficient VOMS-RS privileges to create sub-groups.

Progress on 2005-12-19: Report from Maria:

- not everyone was correctly put into the system, apologies for that.

- forgot to prepare the ground for everyone as it should be.

- preparing the message.

- will go through VOMS-RS members and check for holes.

- from France, Pierre has registered, Fred, Helene, will register as dteam members, but don't know how many of them to make ROC managers.

- ROC managers should let Maria know who is the representative for their ROC (more than one is welcome).

Maria's proposal:

- delegation will go down the levels... (site manager will approve candidate users) - this is a terrible idea...

- either the ROC manager should be the one who approves candidate users, or at all costs make representatives at Site Manager level (pressure on VOMRS developers) - to decide - action point on all.

- Maria can't do it as there will be no info on site's users in GOC DB anymore.

Progress on 2006-01-09: Maria was not present. Nick will follow this up with Maria.

Progress on 2006-01-30: Dteam AUP: it must be changed to a new VO dealing with monitoring. Piotr will draft the new VO's AUP.

VOMRS: Maria will create a Savannah entry with the enhancements request

Progress on 2006-02-06: Piotr is setting up new VO to replace dteam for monitoring purposes.

Progress on 2006-02-20: The AUP has to be discussed and approved at the ROC-managers meeting tomorrow (21 Feb).

Progress on 2006-02-27: The AUP for the ops VO has been approved by the ROC mgrs. The registration service is being set up by Maria Dimou.

Progress on 2006-03-06: Maria Dimou will set up the registration server this week.

Progress on 2006-03-20: The new VO will be added to YAIM this week and then deployed to the sites.

Progress on 2006-04-03: Kostas still needs answers from Piotr.

Progress on 2006-04-24: Piotr answered Kostas in the ROC mgrs mailing list. No further questions. Action closed.

Piotr

 

 

2006-03-13—1

Some sites in SW Europe ROC seeing random failures on RM tests. Not understood yet.

UK + Northern ROC see this problem as well

Maite will collect relevant information (logs, etc) and send this problem to CERN operations and relevant developers so it can be debugged.

Progress on 2006-04-03: Seems that upgrading to LCG 2.7 helps/solves the problem. Will continue to monitor.

Progress on 2006-04-24: The problem did not reappear. Action closed.

Maite

 

 

2006-04-03—4

Rolf to send link for the BioMed data challenge document to the Ops Meeting mailing list.

Progress on 2006-04-24: The link was sent by Maite. Action closed.

Rolf

 

 

2006-05-08—1

Check with the ENOC support team what to do with sites with poor network quality. Issue raised by Asia-Pacific ROC

Progress on 2006-05-22: Maite and Nick will meet with them later this week and this is one of the points to discuss.

Progress on 2006-06-06: ENOC are monitoring network quality between CERN and Tier-1 sites. This will be expanded. Can close this item.

Nick/Maite

 

 

2006-05-08—3

Check the deployment status of the ops VO and report next week

Progress on 2006-05-15: This is being rolled out, but not yet complete

Progress on 2006-05-22: Piotr is working on the update plan, with tasks to be done and proposed dates. This will be discussed at the next meeting.

Progress on 2006-06-06: The plan is accepted by the ops meeting and ROC managers (i.e. no comments received). Today is the last chance to give comments.

Piotr

 

 

2006-05-22—1

Send information about existing SE sensors for disk space accounting, mail it to: sc@infn.it

Progress on 2006-06-06: Tiziana has already received information from Jeremy Coles of the UK.

All ROCs/Tier-1s

 

 

2006-05-22—2

Investigate SFT related problems:

- SFT RM tests having intermittent problems with non-local services (due to missing/invalid credentials) registering as a local site failure.

- SFT test jobs regularly submit themselves to 10min site queue but run for longer than 10 minutes and terminate prematurely, registering as a permanent error (even if all of the tests passed!).

- Remove the metric link (https://lcg-sft.cern.ch/sft/metrics.html) as it is not maintained any more

Piotr

 

 

2006-06-06—1

Maite to e-mail ROC managers to remind them that they are now responsible for extracting ALL relevant points for discussion at the operations meeting from the site reports of their region.

Progress on 2006-06-12: This was done. The issues were extracted this way starting today. The action can be closed.

Maite

 

 

2006-06-06—2

Service challenge team to decide if sites must upgrade their LCG flavour CE from LCG 2.7.0 to gLite 3.0.0.

Progress on 2006-06-12: Mail exchange between Sven, Jamie and the ROC managers, this is clear now. The action can be closed.

Maite

 

 

2006-06-06—4

SEE ROC to put their notes on migration to gLite 3.0.0 into the release “issues” page of the GOC wiki.

Progress on 2006-06-12: This was done. The action can be closed.

SEE ROC

 

 

2006-04-03—5

Thorsten to clarify the handover process for the TPMs and document it.

Progress on 2006-04-24: The GGUS team is waiting for the feedback/report from UKI. Jeremy is on vacation, the report is expected to be sent next week.

Progress on 2006-05-02: GGUS team still waiting for feedback from UK/Ireland ROC.

Progress on 2006-05-22: Feedback given at the ROC managers meeting in Krakow

Progress on 2006-06-06: Feedback was given to GGUS at Krakow. GGUS to discuss this at ESC meeting on Thursday. Written report will then be given. Assignment of action will be changed to GGUS team.

Progress on 2006-06-12: It was discussed with Alistair, Diana and Mario David. They will answer with a written report

Progress on 2006-06-26: This has been discussed at the operations workshop. The GGUS team is building a plan with all the issues to be fixed. It will be tracked by the ROC managers

GGUS team

26/06/06

26/06/06

2006-05-29—2

Compile the list of sites with wrong site contact mail in GOCDB

Progress on 2006-06-06: Maite has forwarded this to ROC managers.  All ROCs to make sure that this is fixed.

Progress on 2006-06-12: Reminder to ROCs to follow this with the listed sites

Progress on 2006-06-26: Gilles directly contacted ROC-support at a few sites with wrong email addresses, and now they're ok - can be CLOSED

ROC managers

26/06/06

26/06/06

2006-06-06—6

All ROCs to chase up trouble tickets submitted by FNAL regarding poor data transfer performance.

Progress on 2006-06-12: A few of them are resolved. More information on the ticket numbers will be sent offline.

Progress on 2006-06-26: These tickets are now resolved, it can be closed.

All ROCs

26/06/06

26/06/06

2006-06-12—1

Harry and Gavin to distribute information about how to configure the FTS channels with different shares, to serve the needs of the overlapping VO activities in the coming weeks. This was requested by Tiziana Ferrari (CNAF).

Progress on 2006-06-26: This has been done.

Harry, Gavin

26/06/06

26/06/06

2006-06-12—2

Combine the production and PPS reports on the CIC portal

Progress on 2006-06-26: This has been done.

CIC team

26/06/06

26/06/06

2006-03-20—2

Request to the GOC DB administrators to provide  functionality in the GUI to allow the deletion of a person from the list of contacts for a site.

Progress on 2006-04-03: Request is in the list for new functionality for GOC database.

Progress on 2006-05-02: The implementation of this will be ready in approximately 2 weeks from now.

Progress on 2006-05-22: Not ready yet according to the GOCDB list of feature requests

Progress on 2006-06-06: Matt is working on this. It should be available within 2 weeks.

Progress on 2006-06-12: will be available early next week.

Progress on 2006-06-26: Done. We wait one week so the sites have a chance to check it and raise any problem they might find

Progress on 2006-07-03: This was discussed in the meeting and can be closed.

Matt Thorpe (UK)

03/07/06

03/07/06

2006-05-02—5

Ask NA4 to push VOs to complete VO ID cards with at least the minimum required information.

Progress on 2006-05-22: NA4 is working on the list of VO registration fields to be filled in by all VOs. Once this is ready, it will be implemented in the CIC portal and they will chase all the VOs so it is filled in. Rolf: sites should open tickets when they need some VO information that is not available from the portal.

Progress on 2006-06-06: NA4 gave request to CIC portal developers to implement new fields in VO ID cards.
Gilles: work in progress. Prototype expected in 1-2 weeks for testing. Cal is involved.
Alessandra: where did the list come from?
Maite: all ROC managers have it. Please contact your ROC manager.

Progress on 2006-06-12: Gilles is currently implementing the new registration form according to the VO registration procedure and in interaction with Cal Loomis. A first prototype should be available for tests at the end of the week.

Progress on 2006-06-26: The registration form has been put online this afternoon in production.  Will be officially announced to VO-managers and ROC-managers.  VO ID card update form has also been changed.

Progress on 2006-07-03: NA4 (Cal Loomis and Frederic Schaer) are asking the VOs to populate the new VO ID cards.  This item can now be closed.

CIC portal team

03/07/06

03/07/06

2006-05-15—1

All T1 sites to define channels to all other T1s and supported T2s and demonstrate functionality of transfers between sites

Progress on 2006-05-22: Gavin will check the status of this action and send an update.

Progress on 2006-06-06: Gavin couldn’t attend the meeting. He will give an update off-line.

Progress on 2006-06-12: Gavin couldn’t attend the meeting. He will give an update off-line.

Progress on 2006-06-26: Update requested to the SC team

Progress on 2006-07-03: Gavin said that this can now be closed.

 

ASGC

TRIUMF

FNAL

BNL

NIKHEF/SARA

CNAF

IN2P3

PIC

NDGF

03/07/06

03/07/06

2006-05-29—1

Add explanation about the timestamp appearing on the RC reports in the CIC portal.

Investigate how to change the timestamp presented now (SFT execution time) to the one appearing on the SFT pages (publication time)

Progress on 2006-06-06: Osman: in progress

Progress on 2006-06-12: Osman: in progress

Progress on 2006-06-26: Osman: explanation added on the RC/ROC report for the new field, which is "publication time". Action will be closed next week if no problems are reported.

Progress on 2006-07-03: It was agreed that this can be closed and another action opened if needed in the future.

CIC portal team

03/07/06

03/07/06

2006-06-26—1

Block sites that did not upgrade yet to the fix for R-GMA critical bug, as they are bringing the service down

Progress on 2006-07-03: This was done.  Most R-GMA servers were fixed with the new RPM.  The blocks will be removed and any further problems will be dealt with in the usual way.

Maite

03/07/06

26/06/06

2006-04-03—2

VOs to give feedback on whether they can supply information on what software they have installed, where it is and what version it is. (Maite to remind by e-mail)

Progress on 2006-04-24: Maite to follow if this one was already discussed last week

Progress on 2006-05-02: Maite to contact Jeremy to find out the status of this.

Progress on 2006-05-22: Information supplied by Alice, LHCb and CMS attached to the agenda under AOB. Please read it to discuss it next week.

Progress on 2006-06-06: Jeremy now has all the information. He will report at a future meeting.

Progress on 2006-07-03: Philippa will ask Jeremy to submit a written report for the meeting as he will be on holiday for several weeks.

Progress on 2006-07-10: this point was discussed at the meeting. See the minutes for more information. The action can be closed.

Philippa/
Jeremy

 

03/07/06

 

10/07/06

2006-04-03—3

Nick/Maite to ask Judit to implement new functionality in FCR tool.

Progress on 2006-04-24: Implementation started. It will be ready by 20th May.

Progress on 2006-05-02: Implementation is still due to be 20th May.

Progress on 2006-05-22: Some delay due to the porting of FCR to PPS

Progress on 2006-06-06: Judit is re-engineering FCR to use the SAME Oracle DB. Probably 1-2 weeks of development left.

Progress on 2006-06-12: implementation ongoing

Progress on 2006-06-26: prototype ready, target to make it public is next week

Progress on 2006-07-03: Judit wasn’t available for the meeting so Nick will get a status update.

Update on 2006-07-05: Judit expects to have the service in production by 19 July.

Progress on 2006-07-10: This is now implemented in the new FCR version. The action can be closed

Judit

10/07/06

 

10/07/06

 

2006-05-08—2

Find proper place/tool to make VO YAIM settings available. Issue raised by central Europe. The proposed option is the combination of Oliver’s tool plus CIC portal.

Progress on 2006-05-22: After talking to Oliver, Nick sent the proposal to the CIC portal team. Once they agree, he will circulate it for comments before the implementation starts.

Progress on 2006-06-06: Gilles - in progress.

Progress on 2006-06-12: The proposal is simple enough to synchronize just by mail. Gilles will contact Oliver and Dimitar to get a read access to their DB, and discuss with them what information we should share.

Progress on 2006-06-26: Gilles- has all information needed. Will now implement it.  Estimate is ~ 2 weeks

Progress on 2006-07-03: In progress.

Progress on 2006-07-10: It has been implemented and broadcast, so it can start to be used. The action can be closed.

CIC portal team

10/07/06

10/07/06

2006-06-06—5

Need to get gssklog working at CERN-PROD site (for LHCb).

Progress on 2006-06-12: This is one solution that will enable LHCb and Alice to write into AFS. It has 2 parts: a daemon that people run and that needs to be configured. What does one do on the grid side? One of the experts on this has left CERN and the other has left the FIO group. Harry will chase them. Ongoing. It goes through LCMAPS now.

Progress on 2006-06-26: Harry and Roberto will discuss offline to get more details

Progress on 2006-07-03: In progress

Progress on 2006-07-10: it has been fixed, users will get tokens and be able to install in AFS. No feedback received so it will be put into production. The action can be closed.

Harry

10/07/06

10/07/06

2006-06-26—2

Gather all problems reported today related to the site/ROC reports in the CIC portal, and discuss/solve them with the CIC team

Progress on 2006-07-03: In progress.  All the information has been gathered and discussions are still ongoing.

Progress on 2006-07-10: This is done, and all problems have been fixed. The action can be closed.

DECH ROC

 

03/07/06

 

10/07/06

2006-07-3—3

ROC managers to decide if they want their sites to inform the ROC if the site would like to change their site name in the GOC database.

Progress on 2006-07-10: A couple of things can get broken (gstat, accounting) at the moment, so sites should contact their ROCs before changing the names, agreed.

OCC

10/07/06

24/07/06

2006-07-3—4

SARA seeing problems with installing the WN on Debian using the tarball.  The relocatable tarball should support Debian “out-of-the-box”.

Update 2006-07-05: Tickets are being converted to Savannah bugs where appropriate. The deployment team say that they will fix what they can for the 3.0.2 tarball and will aim to get the rest at least documented.

Progress on 2006-07-10: Sara: all reported problems are fixed or being fixed. The action can be closed.

OCC

10/07/06

10/07/06

2006-07-3—5

The latest APEL RPM is not included in the apt-repository of 3.0.

Update 2006-07-05: Nick raised this at the EMT meeting.  Status can be tracked in GGUS here: https://savannah.cern.ch/bugs/?func=detailitem&item_id=17391

and in Savannah here: https://savannah.cern.ch/bugs/?func=detailitem&item_id=17391

Progress on 2006-07-10: links included, it is being tracked. The action can be closed.

OCC

10/07/06

10/07/06

2006-06-06—7

Harry to put all HEP data challenge info into VO data challenge pages in CIC portal.

Progress on 2006-06-12: Ongoing. Harry will send the link to Gilles and he will add it.

Progress on 2006-06-26: Harry has sent the info to Gilles - Gilles will publish it on the CIC portal.  It will be editable by VO managers, and readable by all.

Progress on 2006-07-03: In progress

Progress on 2006-07-10: Harry was not present

Progress on 2006-07-17: This has been done by Gilles. The action can be closed.

Harry Renshall

17/07/06

03/07/06

10/07/06

2006-06-26—3

Take all input from today’s discussion plus operations WS to provide more information about releases/upgrades: web page, single channel for announcements, standard subject

Progress on 2006-07-03: In progress

Progress on 2006-07-10: Proposal discussed with roc managers, to be discussed here next week and used for next release

Progress on 2006-07-17: Will be put in place for next set of updates. The action can be closed.

OCC

17/07/06

10/07/06

19/07/06

2006-07-3—6

Remove GGUS stats from the top of the weekly ROC reports (generated in CIC portal).  This needs to be checked with the ROC managers.

Progress on 2006-07-10: Additional Request to remove it from CIC reports

Progress on 2006-07-17: The GGUS stats have been removed from the ROC and RC reports. The action can be closed.

OCC

17/07/06

17/07/06

2006-07-3—7

Roberto to find out from Nick Brooke if the data coming out of CERN during the rest of SC4 should be kept.

Progress on 2006-07-10: LHCb just left the meeting, to be followed up offline

Progress on 2006-07-17: These exact data from SC4 will *not* be kept but will be replaced with some other sensible data. So the message is: the storage required by LHCb will be reused for other purposes by the community after October 2006. The action can be closed.

Roberto

17/07/06

10/07/06

2006-07-10—1

Investigate the problem of mapping VOMS roles in the SE, as reported by LHCb (Joel Closier). Report at next meeting.

Progress on 2006-07-17: Ricardo from LHCb investigated this at PIC. They are happy with the VOMS mapping both in the CE and SE. The action can be closed.

RAL, PIC and LHCb

17/07/06

17/07/06

2006-07-10—2

LHCb experienced problems with the CE ranking last week. The problem occurs when a CE disappears. It mostly affects the Data Management tools: they query and find a CE, and then, on a second lookup within the same query, the CE has disappeared. LHCb should submit GGUS tickets to the sites concerned, giving more details, so that it can be investigated.

Progress on 2006-07-17: Roberto has opened a generic ticket with all this information. The problems are being investigated by the IS developers and will be solved in coming updates. The action can be closed.

LHCb

17/07/06

17/07/06

2006-06-06—3

Run time environment to show whether CE is gLite or LCG flavour.  This will be available in the next update to gLite 3.0.0.

Progress on 2006-07-03: In progress.

Progress on 2006-07-10: Nick will check with Kostas (to understand the original request) and with the deployment team

Progress on 2006-07-21: Nick spoke with the Deployment team (Oliver Keeble). It was agreed that the labels “gLite-CE” and “LCG-CE” should be used in the GlueHostApplicationSoftwareRunTimeEnvironment parameter to denote whether a CE is a gLite flavour CE or an LCG flavour CE.

Nick

10/07/06

19/07/06

24/07/06
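A minimal sketch of how a client could read the agreed labels from the run time environment tags of a CE. Only the two label strings come from the action above; the function name and the input format (a plain list of tag strings already retrieved from the information system) are assumptions made for illustration.

```python
from typing import Iterable, Optional

# The two labels agreed with the deployment team (from the action text).
GLITE_LABEL = "gLite-CE"
LCG_LABEL = "LCG-CE"


def ce_flavour(runtime_env_tags: Iterable[str]) -> Optional[str]:
    """Return "gLite" or "LCG" for a CE, or None if the tags are ambiguous.

    runtime_env_tags is assumed to be the list of values published in
    GlueHostApplicationSoftwareRunTimeEnvironment for one CE.
    """
    tags = set(runtime_env_tags)
    has_glite = GLITE_LABEL in tags
    has_lcg = LCG_LABEL in tags
    if has_glite == has_lcg:
        # Neither label, or both labels: the flavour cannot be decided.
        return None
    return "gLite" if has_glite else "LCG"
```

A caller would pass the tags published for one CE, for example `ce_flavour(["VO-lhcb-sw", "gLite-CE"])`; the tag `VO-lhcb-sw` is a made-up value.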

2006-06-12—3

Atlas Tier-0/Tier-1 starts 19th June; they are starting to ramp up now. We are waiting to have the SRM names and paths, we have them from 6 sites, we are missing Taiwan, BNL (missing paths), NDGF and NIKHEF.

Progress on 2006-06-26: Now only NDGF is missing.  The Northern ROC should follow up this with NDGF.

Progress on 2006-07-10: NDGF still missing: Anders will hunt them

Progress on 2006-07-17: NDGF still missing

Progress on 2006-07-24: This has been addressed and can be closed.

NDGF and NE ROC

03/07/06

19/07/06

24/07/06

2006-07-3—1

Phone call to ROC needs to be reinstated as part of the escalation procedure.

Progress on 2006-07-10: Will be discussed in next COD meeting

Progress on 2006-07-17: Waiting for COD minutes

Progress on 2006-07-24: At COD9 it was agreed to avoid the phone call and instead to invite the ROC and the site to the weekly ops meeting. Close.

OCC

19/07/06

24/07/06

2006-07-3—2

Needs to be explicitly stated in the operations manual that it is the responsibility of the ROC to ensure that tickets raised by the grid operator-on-duty team are updated (so that the grid operator-on-duty team can see progress and don’t start the escalation procedure).

Progress on 2006-07-10: Will be discussed in next COD meeting

Progress on 2006-07-17: Waiting for COD minutes

Progress on 2006-07-24: Added in. Waiting for approval by ROC managers. Close.

OCC

10/07/06

24/07/06

2006-07-24—1

Ask COD team if they can carry out correlation in time of sites failing SAM/SFT, to try to determine when the failures are due to the failure of a core service.

Progress on 2006-07-31: To be requested from the COD team.

Progress on 2006-08-07: Requested. The CIC-SAM interface will be ready in September. This action can be closed.

OCC

7/08/06

7/08/06
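The correlation requested in the action above could be sketched as follows. This is an illustration only, not the CIC-SAM interface: the input format (pairs of site name and failure time), the one-hour window and the threshold of three sites are all assumptions. Many distinct sites failing in the same window hints at a core service failure rather than independent site problems.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlated_failure_windows(failures, window=timedelta(hours=1), min_sites=3):
    """Group test failures into fixed time windows and keep the busy ones.

    failures: iterable of (site_name, naive datetime) pairs (assumed format).
    Returns a list of (window_start, sorted site names) for every window in
    which at least min_sites distinct sites failed, in time order.
    """
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(set)
    for site, when in failures:
        # Integer index of the window that contains this failure.
        index = (when - epoch) // window
        buckets[index].add(site)
    return [
        (epoch + index * window, sorted(sites))
        for index, sites in sorted(buckets.items())
        if len(sites) >= min_sites
    ]
```

The windows are fixed rather than sliding, so a burst that straddles a window boundary can be split in two; a sliding window would avoid that at the cost of more code.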

2006-07-24—3

Discuss with LHCb regarding their problems with sites in scheduled downtime.

Progress on 2006-07-31: in progress, being investigated

Progress on 2006-08-07: no progress this week

Progress on 2006-08-14: LHCb did not attend the meeting and no off-line input was received from them so this action item could not be progressed.

Progress on 2006-08-21: No progress as Roberto is on vacation.

Progress on 2006-08-28: Roberto will set up a meeting with Nick, Maite and LHCb to discuss possible solutions.

Progress on 2006-09-21: The meeting was not set up yet. Will chase with Roberto.

Progress on 2006-10-05: this has been superseded by FCR accidentally being enabled on main CERN top-level BDII. Can be closed.

LHCb

28/08/06

05/10/06

2006-08-07—2

Check with RB developers the impact of limitation of number of queued jobs per VO or per user in the batch system on RB and users

Progress on 2006-08-14: Still under investigation

Progress on 2006-08-21: Answer from Maarten: “The old RB cannot handle such a limit.  Jobs will be matched to the CE, sent there, and fail.  I do not know if the new WMS can deal with a limit. At CERN we have multiple LCG-CEs fronting the batch system and we reboot a CE if it gets overloaded.  The gLite-CE should have fewer scalability problems (certainly in the long run).”

We will also check with the RB developers from INFN.

Progress on 2006-08-28: The answer to this depends on what information is published by the CE, regarding free slots in queues, that the WMS and RB can use. Keep open until this information is known.

Progress on 2006-09-21: In progress. There is no information about this in the IS; it will be requested from the IS developers for future releases. The RB does not handle this limit. The question has been forwarded to the RB developers to understand the effects on it.

Progress on 2006-10-05: this information is not currently in GLUE schema and so cannot be published. A request has been submitted for it to be in the next version of GLUE.  The match-making code will need to be modified to use this information. The action can be closed

OCC

14/08/06

05/10/06

2006-08-14—1

Check with BioMed VO if there are sufficient WMSs in the production service to meet their data challenge needs.

Progress on 2006-08-21: Answer from Biomed: we need as many WMSs as possible; they used 19 in their last challenge. We’ll try to invite them for next meeting.

Progress on 2006-08-28: We are in discussion with BioMed. This action will be handed over to the WLCG Resource Scheduling Meeting.

Progress on 2006-09-21: The Biomed VO representatives attended this week’s meeting. Yannick said it was not clear if they would use the gLite WMS or not.

Progress on 2006-10-05: Biomed is not ready to use the gLite WMS, they will only use the LCG RBs for their data challenge needs. This action can be closed.

BioMed

21/08/06

05/10/06

2006-10-05—3

Change SFT/SAM admin tool to only allow a site/ROC admin to submit if they have admin rights over the site (taken from GOCDB)

Progress on 2006-10-09: According to Rafal this is already the case. Gilles: yes, I tried it and this is the case. This action can be closed.

SFT admin team (Rafal)

09/10/06

16/10/06

2006-10-05—2

Clarify this point with the developers: (NE ROC): A major concern for the Netherlands is the possible drop of support for VOMS-enabled Pre-WS GRAM on the gLite-CE. A number of the VOs that we support use Nimrod to submit jobs which works on Pre-WS (VOMS-enabled) GRAM. At least as long as Globus packages are in their toolkit. Also see remarks made for SARA-MATRIX site.

Mail already sent to Oliver Keeble and John White.

Progress on 2006-10-09: Oliver: The glite-CE still has a gatekeeper, but only the fork jobmanager. Does this constitute 'drop of support for VOMS-enabled Pre-WS GRAM on the gLite-CE'? There is certainly no schedule to change this. Can we close this action? (Raised by NE ROC)

Progress on 2006-10-16: Nick: Is there an answer to Oliver’s question? Per & Ron: We will check and report back next week.

Progress on 2006-10-23: Ron to put together a case for gLite CEs supporting traditional GRAM interfaces.

Progress on 2006-11-23: This is to be brought to the attention of the TCG by Jeff Templon for discussion.

Progress on 2006-11-13: Decision from the TCG:

The strategy is that we basically follow the plan that Ian outlined: we move to the (standard VDT) GT4 pre-web-service GRAM on the gLite CE; the LCG-CE will continue to be supported as is, and we hope to cease that support by June. This depends on the quality of the gLite CE. People who want to configure all the standard VDT job managers on the gLite CE are free to do so; however, for the moment we will not provide a certification of that. We invite sites which do that to become part of the certification/pre-production service, but it is not part of the core SA3 responsibility.

The practicalities of this will be discussed between the affected sites (in particular NIKHEF) and SA3, and the TCG will be kept informed.

At the same time we ask CREAM whether it would be possible to expose a GT4 WS interface in addition to the CREAM one.

The general policy of EGEE is to support multiple interfaces on the CE to the extent that this is feasible and required by the EGEE applications and/or EGEE sites.

This action can be closed.

Per / Ron

16/10/06

20/11/06

66

Create and publish recipe for sites which want to install CA updates using APT autoupdate.

Progress on 2006-10-30: In progress.

Progress on 2006-11-01: The easiest (only?) way to do this is to run a cron job.

OCC

13/11/06

20/11/06

68

Nick to raise GGUS tickets 14461, 14462 and 14463 at the EMT.

Progress on 2006-11-01: Done. These will be taken into account during the work related to action item 67. Close.

Nick

06/11/06

20/11/06

2006-08-14—2

Some HEP VOs are contacting UK sites regarding missing VO software. This needs to be investigated and if necessary raised at the LCG Technical meeting (mid-September).

Progress on 2006-08-21: This will be discussed at the WLCG Service Challenge Technical Meeting on 15th September.

Progress on 2006-08-28: This can be closed. A report will be made at this meeting after the discussion at the WLCG SC Technical Meeting.

Progress on 2006-10-05: no progress

Progress on 2006-10-09: Meeting to be organized with VO software managers. Do any other ROCs/sites also see this as an issue? No answer.

Progress on 2006-10-16: Nick to check if this meeting has taken place.

Progress on 2006-10-23: No meeting has taken place. It’s not clear that this is an issue outside of the UK/I region.  UK/I ROC need either to show this is a wider problem, or close the action.

Progress on 2006-11-06: UK/I reported that the same problem had been reported by BaBar in Italian sites.

Progress on 2006-11-20: No more information about this, and no other ROCs/sites are seeing this problem. The problem is considered internal to the UKI ROC. Action closed.

UK/I ROC

25/09/06

 

 

 

 

 

 

20/11/06

2006-10-05—1

Change 2 days expiration limit for COD tickets to take into account only working days (exclude weekends)

Progress on 2006-10-09: Gilles: this is already on the to do list for the CIC portal. Estimation? Before next COD.

Progress on 2006-10-23: In progress.

Progress on 2006-11-20: It is not finished yet but is in progress. I think that we can close this task because we have it in our to-do list in the CIC Portal:

It is action ID 6 at http://cic.in2p3.fr/index.php?section=home&page=currentdevelopments

Action closed

CIC portal team

13/11/06

20/11/06
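The rule requested in the action above (count only working days towards the 2-day expiration of COD tickets) can be illustrated with a short sketch. It is not the CIC portal code: the function name is hypothetical, and public holidays are ignored because the action only mentions weekends.

```python
from datetime import datetime, timedelta


def expiry_after_working_days(opened: datetime, working_days: int = 2) -> datetime:
    """Return the expiry time of a ticket, skipping Saturdays and Sundays.

    The time of day is preserved; only Monday to Friday count towards
    the limit. Public holidays are not handled (assumption).
    """
    expiry = opened
    remaining = working_days
    while remaining > 0:
        expiry += timedelta(days=1)
        if expiry.weekday() < 5:  # 0..4 are Monday..Friday
            remaining -= 1
    return expiry
```

With this rule a ticket opened on Friday 6 October 2006 expires on the following Tuesday instead of on Sunday.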

2006-10-16—2

Request from UKI ROC: It would be desirable if concise instructions on installation of host certificates for the different services would be collected in a single location. This URL could then be passed to the admins along with the new certificates.  Nick will ask by when this could be done.

Progress on 2006-10-23: In progress.

Progress on 2006-10-30: SA3 are working on this.

Progress on 2006-11-13: Should be done next week

Progress on 2006-11-20: This is now done, available at:

https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates

Action closed

OCC / SA3

23/10/06

20/11/06

69

OSG need help with understanding how to fill in the weekly forms.

Progress on 2006-11-13: Nobody from OSG attended the meeting

Progress on 2006-11-20: Help was provided after the meeting to Dantong Yu, from BNL. Anyone else having problems filling in ROC reports should contact Maite and Nick.

Progress on 2006-11-27: No further request received. Action closed.

OCC

20/11/06

27/11/06

72

Need to follow up the questions regarding raising of alarms and auto-generated content of site reports.  This should be done at the ROC managers’ meetings.

Progress on 2006-12-11: The action has been moved to the ROC managers. Once they agree on a proposal, it will be presented at the operations meeting. The action can be closed.

OCC

15/01/07

18/12/06

73

Need FTS team to comment on the following:

“FTS configuration still problematic. Dynamically created services.xml seems crazy. Not all sites in BDII at the same time. Why not maintain one centrally and distribute that? [raised by TRIUMF]”

Progress on 2006-12-11: Answer from the FTS team:

It's not optimal. The long-term correct solution is to switch on the direct BDII interface for FTS so that it doesn't need a local services.xml file at all, but given not all sites are always there, this can be a problem, and the caching mechanism needs to be tested further before we recommend it. We also need to optimise the code in order to not overload the BDII.

 

Regarding specifics:

 

> FTS configuration still problematic. Dynamically created services.xml

> seems crazy. Not all sites in bdii at the same time.

 

This is covered by the script - if you run it as recommended it will keep existing service entries, even if these entries are currently not in BDII at the time you run the script.

 

> Why not maintain one centrally and distribute that?

 

There is one already on the wiki page that provides the script, but it may not be up-to-date. I will update it for people to the current latest version (this is now done). Bear in mind that this version will point your FTS clients to the CERN-PROD FTS instance, not your one, so you will still need to edit it.

 

I'd still recommend generating it yourself using the specified procedure.

The action can be closed.

OCC / FTS

11/12/06

18/12/06
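The behaviour described in the FTS team's answer above (existing service entries are kept even if they are missing from the BDII when the script runs) amounts to a merge in which fresh entries update old ones but nothing is dropped. The sketch below shows only that idea. It is not the actual script, and the dictionary layout and names are hypothetical.

```python
def merge_service_entries(existing, from_bdii):
    """Merge freshly discovered service entries into the existing ones.

    existing:  dict mapping service name -> endpoint, as previously saved.
    from_bdii: dict of the entries visible in the BDII right now.
    Entries missing from the BDII are kept; entries present in both
    take the fresh value. Neither input is modified.
    """
    merged = dict(existing)
    merged.update(from_bdii)
    return merged
```

A site that is temporarily absent from the BDII therefore stays in the generated file until it is removed by hand.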

74

ATLAS would like a statement from the UK/I ROC and/or the Cambridge site as to whether the SE at that site will really be in scheduled downtime from 27 November until Xmas.

Progress on 2006-12-11: This has been clarified between ATLAS, the ROC and the site. The SE was only unavailable for 1 day. The general procedure to communicate and escalate these issues will be discussed at the next meeting. The action can be closed.

UK/I ROC

11/12/06

18/12/06