fighting for truth, justice, and a kick-butt lotus notes experience.

DAOS problems after update to Domino 12.0.1 IF1 with reproducible server crash

February 15, 2022 03:06:38 PM

Preliminary:

Based on the experience below, I cannot currently recommend Domino Server 12.0.1 IF1 if you use DAOS. I would wait until this is resolved before updating to Domino 12.0.1.

We have a support case open with HCL on this and hope this can be resolved quickly.

Update 2022-02-16:
HCL has already looked into it and offered us a new hotfix (HF24) via the support case. If you have run into the same issue, you should open a support case and request the hotfix, too.

Update 2022-02-22:
HCL published a new Technote today, quoted below:
https://support.hcltechsw.com/csm?id=kb_article&sysparm_article=KB0096771

HCL developers are actively working on these issues. Our Performance team was able to reproduce these issues under a heavy workload and is in the process of testing our fixes under that workload.


If you are encountering any of these issues or something similar, please open a Support ticket to have your issue analyzed and escalated. (Include console logs and NSD if applicable.) If your issue is determined to be one of the issues that HCL has tested and verified, Support can provide a hotfix to you.


HCL will produce a 12.0.1 IF2 release containing the fixes as soon as possible.



Update 2022-03-05:
HCL published a new 12.0.1 IF2, which contains four DAOS fixes.


DCKTCARNVR        Fixed an issue where error may result in long held locks on daoscat.nsf during replication
SPPPCAMM6Y        Fixed an issue where there were multiple locks on daoscat.nsf
HPRHCASE7N        Fixed Domino crashes related to DAOS
BSPRCBQLLJ        Fixed deadlock and performance issues related to DAOS


We are planning to try IF2 this week to see whether it resolves our issue as well.


Update 2022-04-13:
The update of the Domino servers to Domino v12.0.1 IF2 was successful and without any DAOS issues.
So if you are planning to upgrade to Domino v12.0.1, you should install IF2.
If you are already running 12.0.1, you should install IF2, too.

One last hint and lesson learned: if you need to rebuild the DAOS catalog because it is corrupted or missing, run the command offline, not from the console while the server is up and running.



So what happened?


After a successful update from v11 to Domino v12.0.1 with Interim Fix 1 (Hotfix 11), the first restart was normal.
But after about 30 minutes, "Long Held Lock Dump" messages appeared, and a while later the server became unresponsive for users.


On the server console we saw many messages like this:

[22C4:0142-27D8] LkMgr BEGIN Long Held Lock Dump ------------------
[22C4:0142-27D8] Lock(Mode=X  * LockID(CONTLONGKEY DB=f:\Domino\data\daoscat.nsf RRV=14545618 len=48 hKey=0xC0190341 SkipLastDWORD)) Waiters countNonIntentLocks = 1 countIntentLocks = 0, queuLength = 2
[22C4:0142-27D8]    Req(Status=Granted Mode=X Class=Manual Nest=0 Cnt=1 0000
[22C4:0142-27D8]        Tran=0 Func=N/A x\ehashr6.c:899 [27C8:0002-000000000000275C])
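
To see which databases are affected by such dumps, the DB= field of the Lock(...) lines can be counted across a console log. The following is only a minimal Python sketch, assuming the log lines have exactly the format shown above. The function name count_long_held_locks and the regular expression are my own illustration and not an HCL tool.

```python
import re
from collections import Counter

# Matches the DB=... field of a "Lock(Mode=...)" line as shown in the dump above.
# Assumption: the database path ends at the " RRV=" field that follows it.
LOCK_DB = re.compile(r"Lock\(Mode=\S+.*?DB=(?P<db>.+?) RRV=")


def count_long_held_locks(lines):
    """Count long held lock entries per database path in console log lines."""
    counts = Counter()
    for line in lines:
        match = LOCK_DB.search(line)
        if match:
            counts[match.group("db")] += 1
    return counts


# Two lines copied from the console output above.
sample = [
    r"[22C4:0142-27D8] LkMgr BEGIN Long Held Lock Dump ------------------",
    r"[22C4:0142-27D8] Lock(Mode=X  * LockID(CONTLONGKEY DB=f:\Domino\data\daoscat.nsf RRV=14545618 len=48 hKey=0xC0190341 SkipLastDWORD)) Waiters countNonIntentLocks = 1 countIntentLocks = 0, queuLength = 2",
]

for db, n in count_long_held_locks(sample).items():
    print(f"{db}: {n}")
```

If daoscat.nsf dominates the result, that would match the behavior we saw. Other databases showing up would point in the direction of a more general locking problem.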



After restarting and checking the DAOS status, we saw that the DAOS status was out of sync. We then submitted a load daosmgr resync.
But the resync never finished and the server became unresponsive again, showing these messages:

semaphore invalid or not allocated


Notes clients were no longer able to connect to the server, and even the server console could no longer send console commands.

In the end, we decided in our situation to downgrade back to 11.0.1FP4 and rebuild the DAOS catalog. After that, no more errors occurred.


The same behavior occurred on a second large mail server as well. That server also became unavailable to clients and could only be terminated hard via nsd -kill.


The problem was supposed to be solved with 12.0.1 IF1, but unfortunately it is not:


https://support.hcltechsw.com/csm?id=kb_article&sysparm_article=KB0095401
Comments

1  Christian Henseler  02/15/2022 4:01:23 PM

Isn't it related to the issue Yuriy recently blogged about:

{ Link }

2  Detlev Poettgen  02/15/2022 4:45:10 PM

Yes, the problem looks similar.

HCL is working on the solution and sent us Hotfix 24 via the HCL case today.

3  Adam Osborne  02/17/2022 10:30:34 PM

Thanks for the information on this issue.

Is there an SPR or associated problem number we can reference with HCL to get the hotfix and track the problem?

Cheers

Adam

4  Heinrich Nellen  02/21/2022 9:33:08 AM

Did Hotfix 24 solve the problem?

Cheers Heinrich

5  Detlev Poettgen  02/21/2022 2:58:09 PM

The customer affected above was offered Hotfix 24 via the Support Case, but as described we made the decision to go back to v11 right after the problems occurred.

Currently, the affected customer is not willing to try an update to Domino 12.0.1 with HF24 again, but would like to wait until a new official version with a solution to the problem is available.

In general, a downgrade is of course always critical and definitely not recommended. In our case, no ODS versions had been upgraded yet and we knew that the DAOS catalog would need to be rebuilt. To get the server back online quickly, a solution had to be found quickly, and we could not open a new HCL case and then wait five minutes, one hour, five hours or a day to get something ...

6  Heinrich Nellen  02/28/2022 4:29:04 PM

Sorry, I missed that you did a downgrade.

In our experience, the problem is not strictly linked to DAOS; it seems to happen with heavily used databases:

"An error occurred while deleting old named foundsets in clubusy.nsf: The caller's SemWait timeout expired." (like in Yuriy's blog)

It ended with "Recovery Manager: Log file is full" and we had to restart the server.

The problem did not occur immediately after the update from 10.x to 12.0.1 IF1. It began 2 days later. It did not occur on the secondary cluster node.

We got the hotfix (HF 29 for Linux) and implemented it. Now we are monitoring whether the issue is solved.

Cheers Heinrich

7  Detlev Poettgen  03/01/2022 10:03:48 AM

Thank you for sharing, Heinrich.
