
Error 18456 (Severity: 14, State: 11) on Virgin SQL Server 2008 R2 Cluster by SQL Ranger


Status: Closed as By Design


Type: Bug
ID: 663166
Opened: 4/18/2011 11:00:07 PM
Access Restriction: Public
Workarounds: 0
Users who can reproduce this bug: 0

Description

When we fail over, most of the time (but not always) we get the following errors logged in ERRORLOG:

Source - Logon
Message - Error: 18456, Severity: 14, State: 11.
Source - Logon
Message - Login failed for user 'DOMAIN\COMPUTER-NODE2$'. Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors. [CLIENT: 172.21.0.229]

If we stop and restart SQL Server on the same node we do not seem to get this error.

So the error is always generated from the "passive" node's IP address.

It's difficult to troubleshoot this problem as it happens around when automatic recovery completes. There seem to be no resources on the internet for this error.

The majority of the time there are 9 instances of this error, but occasionally there are fewer.

I tried TF 4013, hoping that it would log some more information about the process at the root of the problem, to no avail.

Various error logs are occasionally reporting different errors (such as SQLWEP).

(I can attach various error logs if you'd like.)
Details
Posted by SQL Ranger on 5/9/2011 at 7:02 PM
Thanks Lyudmilla!

Hi Seth!

As I suspected, this is a false positive of sorts, or certainly something that you should not be concerned about.

However it would be better if these errors did not occur.

Seth, this problem ONLY occurs during a failover of the cluster. And ONLY for the first couple of seconds as the SQL Server instance is starting up. Then we get no errors in the ERRORLOG.

So when the "active node" fails over to the "remote node" the SQLWEP generates some initial errors when we start up remotely and then it disappears. I assume that is because the SQLWEP is running under a local privileged account which does not have network access to the "remote node". But I am not sure what happens after those initial few errors. Does the "system" recognise that SQL Server's EXE is no longer running in memory and therefore stop querying? Or does something else happen?

I am wondering if there is some "elegant" solution to prevent these error messages from being generated in our cluster environment.
Posted by Microsoft on 5/8/2011 at 1:08 PM
Closing feedback as issue appears resolved.
Posted by Sethu Srinivasan on 5/4/2011 at 5:28 PM
Hi Victor,
The SQL Agent WMI alerting mechanism issues a WQL query to WMI and collects the results. If the WQL query is issued against the SQL Server namespace, the WMI service host process runs the SQL Event Notification Provider (SQLWEP) in the security context of the WMI host process.
SQLWEP gathers event data from the Service Broker queue (WMIEventProviderNotificationQueue) in msdb.
The following MSDN link gives you more information on how the WMI provider for server events works.
http://msdn.microsoft.com/en-us/library/ms181893(v=SQL.90).aspx
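As a rough illustration of the pieces named above, the following T-SQL sketch checks the state of the WMIEventProviderNotificationQueue queue in msdb and lists any server-scoped event notifications currently registered. It only reads catalog views. Whether SQLWEP's notifications show up at a given moment is an assumption: it depends on a WMI subscription being active at the time the query runs.

```sql
-- Check the Service Broker queue that SQLWEP reads from.
SELECT q.name, q.is_receive_enabled, q.is_enqueue_enabled
FROM msdb.sys.service_queues AS q
WHERE q.name = N'WMIEventProviderNotificationQueue';

-- Event notifications are delivered through Service Broker,
-- so confirm the broker is enabled in msdb.
SELECT name, is_broker_enabled
FROM sys.databases
WHERE name = N'msdb';

-- List the server-scoped event notifications that currently exist.
SELECT en.name, en.service_name, en.create_date
FROM sys.server_event_notifications AS en;
```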

The following MSDN link gives you more information on the accounts that the WMI host process can run under. http://msdn.microsoft.com/en-us/library/aa392783(v=vs.85).aspx

You can take the following steps:
1) Grant the WMI host process account the necessary minimum permissions.
Please take a look at "Permissions and Event Notification Scope" in http://msdn.microsoft.com/en-us/library/ms186371.aspx
2) Run SQL Profiler, enable the WMI alert and check whether any connections or queries fail due to insufficient permissions.
3) Verify that the WMI alert is triggered as expected when an alert condition is reached.
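Step 1 might look something like the sketch below. It assumes the WMI host process on the other node runs as LocalSystem and therefore reaches SQL Server as that node's machine account. The account name is the placeholder from the error message and must be replaced with the real one. The two server-level permissions shown are the ones used for server-scoped DDL and trace event notifications; the "Permissions and Event Notification Scope" page linked above remains the authority on what is actually required, and more may be needed (for example in msdb).

```sql
-- ASSUMPTION: DOMAIN\COMPUTER-NODE2$ is a placeholder for the machine
-- account that appears in the 18456 error. Substitute the real name.
USE master;
GO

-- Create a Windows login for the machine account if it does not exist yet.
IF NOT EXISTS (SELECT 1 FROM sys.server_principals
               WHERE name = N'DOMAIN\COMPUTER-NODE2$')
    CREATE LOGIN [DOMAIN\COMPUTER-NODE2$] FROM WINDOWS;
GO

-- Server-level permissions for creating server-scoped event notifications.
GRANT CREATE DDL EVENT NOTIFICATION TO [DOMAIN\COMPUTER-NODE2$];
GRANT CREATE TRACE EVENT NOTIFICATION TO [DOMAIN\COMPUTER-NODE2$];
GO
```

Granting a login to a machine account widens access to anything running as LocalSystem on that node, so this is a trade-off rather than a clear recommendation.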

Let us know if this works for you

Thanks
Sethu Srinivasan [MSFT]
SQL Server
http://blogs.msdn.com/sqlagent


Posted by Microsoft on 5/3/2011 at 5:55 PM
Hi Victor,

The WMI provider is a local (non clustered) resource, and will not failover. It needs to be accessed locally (per instance).
So, making the WMI service a "clustered app" is not supported, unfortunately.
You can disable WMI Event Alerts in the cluster.
Of course you can ignore the errors, but there won't be much sense in running the WMI service in this case.
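For reference, disabling WMI event alerts can be sketched in T-SQL roughly as follows. The alert name used here is hypothetical; the first query lists the alerts that have a WMI namespace set, and each can then be disabled by name.

```sql
USE msdb;
GO

-- WMI event alerts are the SQL Agent alerts with a WMI namespace set.
SELECT name, enabled, wmi_namespace, wmi_query
FROM dbo.sysalerts
WHERE wmi_namespace IS NOT NULL
  AND wmi_namespace <> N'';

-- Disable one alert by name ('My WMI Alert' is a hypothetical name).
EXEC dbo.sp_update_alert
     @name = N'My WMI Alert',
     @enabled = 0;
```

Disabling rather than deleting keeps the alert definition available if it is needed again later.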

Thanks,
Lyudmila
Posted by SQL Ranger on 5/2/2011 at 5:16 PM
Hi Lyudmila,

Yes, it does seem to be WMI Alerts that are the cause. Worked it out yesterday when I went back to the client's site. Just before I was about to go home, so I did not reply here. Sorry.

I looked at the default trace which indicated that it was Process ID 6936 which was causing the login errors.

So then it was a matter of running TASKLIST.EXE on the other node to see what Process ID 6936 was:

Image Name,PID,Session Name,Session#,Mem Usage,Status,User Name,CPU Time,Window Title
WmiPrvSE.exe,6936,Services,0,15,384 K,Unknown,NT AUTHORITY\SYSTEM,0:00:03,N/A
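For anyone repeating this, the default-trace lookup can be sketched along the following lines. It assumes the default trace is enabled and that failed logins are captured in it. The ClientProcessID column is what ties a failed login back to a process on the client machine.

```sql
-- Locate the current default trace file.
DECLARE @trace_path nvarchar(260);

SELECT @trace_path = path
FROM sys.traces
WHERE is_default = 1;

-- Read failed-login events. This reads from the current file onward;
-- older rollover files would have to be named explicitly.
SELECT t.StartTime,
       t.LoginName,
       t.HostName,
       t.ApplicationName,
       t.ClientProcessID,   -- PID to look up with TASKLIST on the client
       t.TextData
FROM sys.fn_trace_gettable(@trace_path, DEFAULT) AS t
JOIN sys.trace_events AS e
  ON e.trace_event_id = t.EventClass
WHERE e.name = N'Audit Login Failed'
ORDER BY t.StartTime DESC;
```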

Once I disabled the WMI Event Alert and the corresponding T-SQL job the error no longer appears when we fail over the cluster.

So WMI is running in its default configuration, under the LocalSystem account.

No, it has not been configured to run as a clustered application.

What would be the "official" thing to do here?

a) Run WMI Service under account with higher privileges?
b) Disable WMI Event Alerts in cluster
c) Make the WMI service a "clustered app" (Which I assume is not a good idea)
d) Ignore these false positive errors (They only manifest themselves upon startup, and there's only about 3-9 of them normally)

Could you please advise what the "official" Microsoft recommendation would be?

THANK YOU VERY MUCH,

Victor
Posted by Microsoft on 5/2/2011 at 4:03 PM
One more question: can you check whether the WMI service is running under the LocalSystem account? Did you configure the WMI service as part of the cluster configuration? It looks like the WMI service (used by the Alert app) is running under LocalSystem and, after failover, still tries to connect to SQL Server. Since SQL Server is now on a different box, the LocalSystem account is seen outside its own box as the machine account.

Lyudmila
Posted by Microsoft on 5/2/2011 at 12:06 PM
Thank you for the clarification. However I need some more help from you.
How do you configure SQL Server Event Alerts and WMI event alerts? Do you use SQL Agent as WMI App?
Can you disable these alerts and see if the error persists?

Thank you,
Lyudmila
Posted by SQL Ranger on 4/27/2011 at 7:19 PM
Thanks Alex and Lyudmila,

This is a brand-new SQL Server 2008 R2 Failover Cluster that has yet to be made into a PROD environment, so there is very little customisation beyond Alerts and Operators. That's why I am so surprised that we are getting these errors.

I have not come across any errors with such machine accounts before. I would assume that the SQL Server setup takes care of everything there. There is no need to grant them SQL access, etc.

So there is no service that is running under the machine account that is trying to connect to SQL Server.

Thus there is no login for it.

We checked the service account. It has not been locked out. The password has been set to never expire. Password cannot be changed. So that does not seem to be it.

(As an aside, I have seen some "chatter" on the internet that attributes the problem to time not being synchronised between AD and the different nodes, to problems with SPNs (duplicate entries in AD), or to insufficient rights. I am not sure how relevant these are.)

But my problem is that I cannot identify what process is spawning these errors.

In any case I have attached the output of RING_BUFFER_SECURITY_ERROR after a failover. I have also attached the SQL Server error log (plus the previous one), and some Windows logs.

I hope this helps.

Thanks muchly (Spasibo),

Victor

My "suspicion" is that some clustering component that runs under the machine account is responsible. The problem seems to manifest itself straight after a failover only. It does not seem to occur after that.

Posted by Microsoft on 4/27/2011 at 4:28 PM
Also, you may hit this error if the service account has expired but the system has not been restarted, or if the service account's password has expired. Let me know if this is the case.
As additional information, could you collect the output of the following query right after you see this error in the error log:
Select * from sys.dm_os_ring_buffers
where ring_buffer_type = 'RING_BUFFER_SECURITY_ERROR'
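The record column comes back as one XML string per row, which is hard to read. A possible way to shred it into columns is sketched below. The element names (SPID, APIName, CallingAPIName, ErrorCode) are assumptions based on the commonly seen layout of this undocumented ring buffer and may differ between builds. The event time is only an approximation derived from the tick counter.

```sql
SELECT
    -- Approximate wall-clock time: offset the record's tick count
    -- from the server's current tick count, in whole seconds.
    DATEADD(second,
            CAST((r.record_time - si.ms_ticks) / 1000 AS int),
            GETDATE())            AS approx_event_time,
    r.spid,
    r.api_name,
    r.calling_api_name,
    r.error_code
FROM (
    SELECT
        x.rec.value('(/Record/@time)[1]', 'bigint')                        AS record_time,
        x.rec.value('(/Record/Error/SPID)[1]', 'int')                      AS spid,
        x.rec.value('(/Record/Error/APIName)[1]', 'varchar(255)')          AS api_name,
        x.rec.value('(/Record/Error/CallingAPIName)[1]', 'varchar(255)')   AS calling_api_name,
        x.rec.value('(/Record/Error/ErrorCode)[1]', 'varchar(30)')         AS error_code
    FROM (
        SELECT CAST(record AS xml) AS rec
        FROM sys.dm_os_ring_buffers
        WHERE ring_buffer_type = N'RING_BUFFER_SECURITY_ERROR'
    ) AS x
) AS r
CROSS JOIN sys.dm_os_sys_info AS si
ORDER BY r.record_time;
```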

Thanks,
Lyudmila
Posted by Microsoft on 4/27/2011 at 1:18 PM
This error shows that some service running under the machine account (CREDITCORP\CREDIT-NODE2$) is trying to log in, but the account either doesn't exist in SQL Server or has had its connect permission revoked.
-Is there any known service that runs under the machine account and may try to connect to SQL Server?
-Does this account exist in SQL Server as a login (or through group membership)?
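The second question can be checked with something like the following sketch. The account name is a placeholder and should be replaced with the machine account from the error message. xp_logininfo may return an error or an empty result if the account has no access path at all.

```sql
-- ASSUMPTION: placeholder account name; substitute the real machine account.
DECLARE @account sysname = N'DOMAIN\COMPUTER-NODE2$';

-- Is there an explicit login for the account?
SELECT name, type_desc, is_disabled
FROM sys.server_principals
WHERE name = @account;

-- Has CONNECT SQL been granted, denied or revoked for that login?
SELECT pr.name, pe.permission_name, pe.state_desc
FROM sys.server_permissions AS pe
JOIN sys.server_principals AS pr
  ON pr.principal_id = pe.grantee_principal_id
WHERE pr.name = @account
  AND pe.permission_name = N'CONNECT SQL';

-- Does the account get access through Windows group membership?
EXEC xp_logininfo @acctname = @account, @option = 'all';
```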

Thanks,
Lyudmila
Posted by Microsoft on 4/19/2011 at 10:38 AM
Thank you for reporting this issue - we are investigating and we will get back to you shortly.

Thanks,

Alex Grach
File Name Submitted By Submitted On File Size  
ERRORLOG.log (restricted) 4/18/2011 -
Log files.zip (restricted) 4/27/2011 -