Thursday, January 23, 2014

Lync 2013 WebConf instability, events 41024, 41026, 41025, 42001, 41999


We see errors on the Lync 2013 Front End at irregular intervals, sometimes as often as every 20-30 minutes. They are somewhat more frequent at night and on weekends, but an exact time pattern is very hard to find. Later on you will see why the errors appear during quiet periods rather than under full load.



Log Name:      Lync Server
Source:        LS Data MCU
Date:          1/22/2014 8:50:06 AM
Event ID:      41024
Task Category: (1018)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      xx
Description:
No connectivity with one of the Web Conferencing Edge Servers.


Edge Server Machine FQDN: yy, Port:8057
If the problem persists this event will be logged again after 20 minutes
Cause: Service may be unavailable or Network connectivity may have been compromised.


Log Name:      Lync Server
Source:        LS Data MCU
Date:          1/22/2014 8:50:06 AM
Event ID:      41026
Task Category: (1018)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      xx
Description:
No connectivity with any of Web Conferencing Edge Servers. External Lync clients cannot use Web Conferencing modality.


Cause: Service may be unavailable or Network connectivity may have been compromised.
Resolution:
Verify all Web Conferencing Edge Services in the topology are running, and network connectivity is available.


Log Name:      Lync Server
Source:        LS Data MCU
Date:          1/22/2014 8:50:06 AM
Event ID:      41025
Task Category: (1018)
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      xx
Description:
Connection to the Web Conferencing Edge Server has succeeded
Edge Server Machine FQDN: yy, Port:8057



At the same time, the Edge server logs the other side of the same issue:
Log Name:      Lync Server
Source:        LS Web Conferencing Edge Server
Date:          1/22/2014 5:07:45 PM
Event ID:      42001
Task Category: (1023)
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      yy
Description:
Web Conferencing Server disconnected


Connection from Web Conferencing Server from xx  disconnected.
This event is reported only once in 30 minutes even if other Web Conferencing Servers will disconnect during said period.
Cause: This can happen if the Web Conferencing Server was unavailable or taken down for maintenance
Resolution:
Make sure that the Web Conferencing Server is up and running



Log Name:      Lync Server
Source:        LS Web Conferencing Edge Server
Date:          1/22/2014 4:44:13 PM
Event ID:      41999
Task Category: (1023)
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      yy
Description:
Web Conferencing Server connected successfully


Web Conferencing Server with FQDN xx connected successfully
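
Since the events above blame service availability or network connectivity, a first sanity check is whether the Front End can open a TCP connection to the Edge on port 8057 at all. Below is a minimal Python sketch of such a probe. The function name and the host name in the usage comment are placeholders of ours, not anything from Lync. Keep in mind that a successful probe only rules out a hard block; it says nothing about a session that gets dropped later.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage, run from the Front End (the host name is a placeholder):
#   can_connect("edge.example.internal", 8057)
```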


Our initial suspects (backup load, antivirus, TOE, RSS, and TCP offload on the virtual machine and host) all turned out to be dead ends. Now it is time to check the network:
Between the Front End and the Edge there is a Palo Alto firewall. Its default session timeout is 3600 sec.



SSL application timeout is 1800 sec.



In the Monitor tab we can look up the session on the Edge WebConf port 8057. The session was detected as ssl, so its timeout is set to 1800 seconds. If you keep refreshing, you will see the TTL value for the session:




The Edge sends session-keeping heartbeat packets every 300 seconds (5 minutes).



But the Palo Alto does not register that the session is alive: the TTL keeps ticking down even though packets arrive every 5 minutes. As a result it drops the session after 1800 seconds. Lync then tries to send a keepalive packet, but because the session has been dropped we see several TCP retransmissions; after that Lync raises the errors and tries to establish a new session. During quiet periods these heartbeats are the only traffic on the session, which would explain why the errors show up at night and on weekends rather than under full load.
This Palo Alto behavior is a side effect of the session offload mechanism used to gain performance: https://live.paloaltonetworks.com/docs/DOC-3950
For a session that carries only these keepalives to accumulate 16 packets, it has to last 16 x 300 = 4800 sec.
So the solution is to override the application setting on the Palo Alto and set the session timeout to 4800 seconds; see https://live.paloaltonetworks.com/docs/DOC-1071
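
The numbers behind this can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 16-packet figure is the one used in this post and is not verified here, and the function names are ours.

```python
HEARTBEAT_INTERVAL_S = 300   # Edge heartbeat interval observed above
SSL_APP_TIMEOUT_S = 1800     # Palo Alto timeout for the ssl application
PACKETS_NEEDED = 16          # packet count assumed in this post


def heartbeats_within(timeout_s: int, interval_s: int) -> int:
    """Number of heartbeats that fit into one timeout window."""
    return timeout_s // interval_s


def min_timeout_for(packets: int, interval_s: int) -> int:
    """Session length needed for that many heartbeats to be sent."""
    return packets * interval_s


print(heartbeats_within(SSL_APP_TIMEOUT_S, HEARTBEAT_INTERVAL_S))  # 6
print(min_timeout_for(PACKETS_NEEDED, HEARTBEAT_INTERVAL_S))       # 4800
```

Only 6 heartbeats fit into the 1800-second window, well short of 16, which is why the timeout has to be raised to 4800 seconds for a heartbeat-only session.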

 




3 comments:

Unknown said...

Great article. I have hunted this problem for a while now. But is the solution only applicable to Palo Alto firewalls or is it a general advice to set the session timeout for SSL to 4800s on every firewall type?

Daniyar said...

Session timeout for general SSL should be 1800s

jake george said...
This comment has been removed by a blog administrator.