Acknowledge rmclomv so it shuts up? - Systems Maintenance (Archived)

We have a ticket opened up about this in our ticket system so it gets fixed:
Aug 26 10:41:56 ourhost rmclomv: [ID 974741 kern.error] PSU # PS0 has FAILED.
How do I make rmclomv shut up about it? The constant reporting is causing our monitoring system (and us) grief.
Thanks for any advice! 

Aug 26 10:41:56 ourhost rmclomv: [ID 974741 kern.error] PSU # PS0 has FAILED.
Well, you probably ought to figure out what the system is telling you before "shutting it up." I believe this message indicates a problem with the system's #0 power supply.

As I said, we already opened a ticket in our ticketing system so that this hardware problem gets fixed.
Now we want to shut it up. 

The appropriate way to shut it up is to replace the faulty power supply. The inappropriate way is to remove kern.error from syslog.conf and then bounce syslogd. But then any other messages sent to kern.error also get dropped.
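For the record, this is roughly what that (not recommended) change looks like on Solaris. The selector line shown is the stock Solaris default and may differ on your system, and kern.none suppresses all kernel messages to that file, not just the PSU complaints:

# /etc/syslog.conf - original line (fields are tab-separated):
*.err;kern.debug;daemon.notice;mail.crit        /var/adm/messages
# Filtered line:
*.err;kern.none;daemon.notice;mail.crit         /var/adm/messages
# Then bounce syslogd:
svcadm restart svc:/system/system-log:default   # Solaris 10
# /etc/init.d/syslog stop; /etc/init.d/syslog start   # Solaris 9 and earlier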
alan

Related

System rebooted suddenly

Dear Friends,
One of our production systems was rebooted yesterday at 5 PM, without any manual intervention by the administrators. The output of the last command is also strange: we can see a "system boot" entry without any "system down" entry.
root#prdapp2 # last | more
root pts/1 sys2 Sat Dec 6 17:38 - 17:44 (00:05)
reboot system boot Sat Dec 6 17:11
root pts/1 sys1 Sat Dec 6 09:27 - 09:44 (00:17)
As you know, the "system down" entry should be there in "last" even if the shutdown is due to power loss.
My Investigations
-------------------------
I could not find anything unusual in the system logs except in /var/adm/messages.mem_err.
root#srvr2 # tail -5 messages.mem_err
Aug 6 11:30:40 srvr2 AFSR 0x00000002<CE>.00000023 AFAR 0x000000a1.f5389320
Aug 6 11:30:40 srvr2 Fault_PC <unknown> Esynd 0x0023 Slot A: J2901
Aug 6 11:30:40 srvr2 SUNW,UltraSPARC-IV: [ID 365676 kern.info] [AFT0] errID 0x0042daa3.8b095a00 Corrected Memory Error on Slot A: J2901 is Persistent
Aug 6 11:30:40 srvr2 SUNW,UltraSPARC-IV: [ID 747193 kern.info] [AFT0] errID 0x0042daa3.8b095a00 Data Bit 46 was in error and corrected
Aug 6 11:30:40 sevr2 unix: [ID 566906 kern.warning] WARNING: [AFT0] Most recent 3 soft errors from Memory Module Slot A: J2901 exceed threshold (N=2, T=24h:00m) triggering page retire
Based on this output I ran some memory tests using cediag, and everything seems to be fine.
root#srvr2 # /opt/SUNWcest/bin/cediag -L
cediag: Revision: 1.94 # 2007/07/26 22:55:08 UTC
cediag: Analysed System: SunOS 5.9 with KUP 118558-10 (MPR active)
cediag: Pages Retired: 0 (0.00%)
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4a
cediag: findings: 0 DIMMs with a failure pattern matching rule#4b
cediag: findings: 0 DIMMs with a failure pattern matching rule#5
cediag: findings: 0 DIMMs with a failure pattern matching rule#5 supplemental
How can the system reboot suddenly without writing anything to "messages"?
Any thoughts would be appreciated.
Hello,
While booting up, some messages have been written to "messages".
-------------------------------------------------------------------------------
Dec 6 17:11:35 prdapp2 savecore: [ID 570001 auth.error] reboot after panic: bad kernel MMU miss at TL 2
Dec 6 17:11:35 prdapp2 savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/prdapp2/*.0
-------------------------------------------------------------------------------
Thank,
Sal. 
Two things to mention:
1) Get a stack trace via mdb (a sketch follows below).
2) How quickly do you need this fixed? As it is a production system, I would replace all the memory first; if that doesn't work, change the system board.
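For reference, this is roughly how that looks against the saved crash dump (file names follow the savecore defaults under /var/crash/<hostname>; adjust the dump number to match what savecore reported):

cd /var/crash/prdapp2
mdb unix.0 vmcore.0
> ::status     # panic string and dump summary
> ::msgbuf     # console messages leading up to the panic
> $c           # stack backtrace of the panic thread
> $q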
Frank 
A stacktrace?
Yes, that could help.
RAM problems? System board issues?
No, that's utterly wrong.
The tiny excerpt in the original post shows timestamps that are far from the event.
The snippet for the CE event on the DIMM J2901 of board A was on August 6th.
The panic event was on December 6th.
I'd be more concerned with the ancient kernel patch level.
My suggestion is to open a support case with Sun and let them go through an Explorer and also the core files. 
Dec 6 17:11:35 prdapp2 savecore: [ID 570001 auth.error] reboot after panic: bad kernel MMU miss at TL 2
Dec 6 17:11:35 prdapp2 savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/prdapp2/*.0
MMU misses are more likely to be related to hardware. The MMU is responsible for creating the mapping between physical and virtual pages, and if the kernel is having trouble managing memory, that looks hardware-related to me, since the kernel MMU code is pretty stable compared with flaky hardware.
I would suggest that the previous poster read up on the internals of the kernel before blaming dodgy patches and dodgy kernel versions.
As for Sun, they'll have a look at your crash dump, ask you to patch, but ultimately advise you to swap out hardware.
Frank 
Dear Friends,
What Frank suggested was correct. The system board has to be replaced, as one processor might be defective. The following are the findings from the core dump analysis.
==== stack # 0x2a1008d4d00 (sp: 0x2a1008d4501) ====
0x0(jmpl?)()
unix:syscall_trap+0x88()
unix:trap(0x2a1008d7ba0) - frame recycled
unix:ktl0+0x48()
-- trap data type: 0x31 (data access MMU miss) rp: 0x2a1008d7370 LEAF --
pc: 0x100c42c unix:bzero+0x4: ldx [%g7 + 0x80], %o5
npc: 0x100c430 unix:bzero+0x8: brz,pt %o5, unix:bzero+0x24
!!  The system panicked due to stack overflow because
!!  the panic thread repeatedly got the trap 0x31 when
!!  doing cpu_fast_ecc_error().  The reason is because
!!  %g7 is bogus.
!!  Given that the cpu got fast ecc error and %g7 got
!!  corrupted, CPU 2 could be defective.
Thanks to all those who helped with my post.
Sal.

NFS4ERR_EXPIRED repeatedly creating performance issues

I am running several X4500s and serve the filesystems over NFS4. I have noticed that, frequently and without any traceable reason, clients struggle to maintain the mounts:
Mar 12 13:32:09 rosalind nfs: [ID 581112 kern.info] NOTICE: [NFS4][Server: c-store4][Mntpt: /data/rw16]NFS Starting recovery for mount /data/rw16 (mi 0x30028479000 mi_recovflags [0x1]) on server c-store4, rnode_pt1 ./mccarthy/METAL_GRADIENTS/1_1_100_100_small_b/hi_time_res (0x6006e278138), rnode_pt2 ./mccarthy/METAL_GRADIENTS/1_1_100_100_small_b/hi_time_res/snapshot_011.hdf5 (0x300227da7d0)
Mar 12 13:32:09 rosalind nfs: [ID 273629 kern.info] NOTICE: [NFS4][Server: c-store4][Mntpt: /data/rw16]NFS Recovery done for mount /data/rw16 (mi 0x30028479000) on server c-store4, rnode_pt1 ./mccarthy/METAL_GRADIENTS/1_1_100_100_small_b/hi_time_res (0x6006e278138), rnode_pt2 ./mccarthy/METAL_GRADIENTS/1_1_100_100_small_b/hi_time_res/snapshot_011.hdf5 (0x300227da7d0)
Mar 12 13:32:09 rosalind nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: c-store4][Mntpt: /data/rw16]NFS op OP_OPEN got error NFS4ERR_EXPIRED causing recovery action NRCLIENTID.
This is repeated quite a few times, and then the problem stops.
I trawled the web, but could find no indication of any problems.
This type of behaviour causes delays in file access, even on small files, which can last several seconds or even longer.
Any ideas?
Lydia 
I am also experiencing these issues with T5220 NFS servers and Blade 2500 workstations. Only about 30 workstations auto-mount this share, but an application depends on its availability; when the mount drops out it locks the application. The file the app was working on is intact, but it is a huge inconvenience to the user. I would be curious about any statistics to gather to help resolve this issue.
Sol 10 5/08 with recommended patch Cluster from August 08.
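For what it's worth, a few plain Solaris commands that might help capture client-side state when the recovery messages appear (the mount point and server name below are taken from the post above; substitute your own):

nfsstat -m /data/rw16                   # per-mount options, server and failover state
nfsstat -c                              # client-side RPC/NFSv4 call and error counters
snoop -o /tmp/nfs4.cap host c-store4    # packet capture while the problem is occurring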
Thanks,
Josh

var: file system full, then store server fails

Mail server stopped. This problem started a few days ago. Several threads indicated this similar error.
Solaris 10, JES 6.2
Started seeing "/var: file system full" on the console. /var is at 46%. In looking at some of the threads, we did the following:
1. stop-msg
2. rm -r files in /opt/SUNWmsgsr/data/store/mboxlist
3. start-msg
4. reconstruct -m
This did resolve the issue for about 15 hours.
Now, even trying to run start-msg store fails.
kite:root bash /opt/SUNWmsgsr/sbin # ./start-msg
Connecting to watcher ...
Launching watcher ...
Starting ens server ... 7806
Starting store server .... 7807
checking store server status ..... failed
On the console the error displayed is as follows:
Jun 24 06:46:22 kite.lethbridgecollege.ab.ca CNS Transport[7151]: abort on signal: 6 check for core file in: /var/tmp/cc-transport
Jun 24 06:46:23 kite.lethbridgecollege.ab.ca ufs: NOTICE: alloc: /var: file system full
Thoughts and/or suggestions?
First, a note about "fast recovery":
Removing the mboxlist databases and doing reconstruct -m really should be an absolutely last resort. I don't want to generate a lot of support calls by saying "never do that", but it really should be needed VERY rarely.
And after you do that, you should look at the default log to see what stored did when it started up. It should have restored the DBs from the last snapshot. So if you have a recurring problem and you repeatedly get around it by doing this, then the problem may also exist in the snapshot.
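For example (assuming the default /opt/SUNWmsgsr layout; the actual log location is whatever the logfile.default.logdir configutil option points at on your system):

/opt/SUNWmsgsr/sbin/configutil -o logfile.default.logdir   # confirm where the default log lives
ls -lt /opt/SUNWmsgsr/data/log | head                      # then read the newest "default" log file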
Now, back to your problem at hand:
If you see messages on the console saying that /var is full, that sounds like an OS or filesystem issue rather than a Messaging Server issue. If the file system is full -- or more to the point, if the normal interfaces indicate it is full -- then Messaging Server is going to fail. The MTA tries to be graceful about this: it will stop accepting messages into the channel queues if the file system is nearly full. Other things may not fail as gracefully. And if something unrelated to Messaging is filling a file system used by Messaging, it may do so too quickly for the normal monitoring to notice in time.
Please cut-n-paste the actual messages about the file system being full. They should be in your /var/adm/messages as well as on the console.
If you think the system is erroneously complaining about the file system being full, you need to open a support case with the OS or filesystem support people about that.
If the filesystem filling up has caused some damage to the mboxlist databases, fast recover and/or causing stored to recover from previous snapshot is probably the best alternative. But you really need to get to the root cause of the system complaining that the file system is full. 
In looking at the /var/adm/messages file, things look normal until June 22 at 12:53. Here's a snippet:
Jun 22 12:58:52 kite.lethbridgecollege.ab.ca CNS Transport[865]: [ID 515452 daemon.warning] MLM ping curl error: 'couldn't connect to host', code: 7
Jun 22 13:02:36 kite.lethbridgecollege.ab.ca CNS Transport[865]: [ID 235725 daemon.warning] MLM ping error: curl msg 'couldn't connect to host', curl rc: 7
Jun 22 13:02:37 kite.lethbridgecollege.ab.ca CNS Transport[865]: [ID 126843 daemon.notice] cctransport started
Jun 22 13:02:38 kite.lethbridgecollege.ab.ca CNS Transport[865]: [ID 451796 daemon.error] abort on signal: 6 check for core file in: /var/tmp/cc-transport
Jun 22 13:02:43 kite.lethbridgecollege.ab.ca xntpd[305]: [ID 774427 daemon.notice] time reset (step) -1.791793 s
Jun 22 13:06:24 kite.lethbridgecollege.ab.ca CNS Transport[3154]: [ID 515452 daemon.warning] MLM ping curl error: 'couldn
You originally mentioned an error on the console about /var being full. I don't see anything in the messages you pasted about that. But there is a message suggesting that something dumped core. None of those messages seem to have anything to do directly with Messaging Server. But if something is dumping core, that could be what is filling your disk.
You should probably use the coreadm command to control what happens with process core dumps. See My Oracle Support knowledge article 1001674.1, which you should also be able to find in SunSolve: http://sunsolve.sun.com/search/document.do?assetkey=1-71-1001674.1-1 
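For example, a minimal coreadm setup that keeps core files off /var (the /export/cores path is just an example; pick a file system with space to spare):

mkdir -p /export/cores
coreadm -g /export/cores/core.%f.%p -e global   # global core file pattern, outside /var
coreadm -e log                                  # log a syslog entry whenever a core is dumped
coreadm                                         # verify the current settings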
Thanks for the info. Can't get to SunSolve... user not authorized. Sorry.
From df -h
/dev/md/dsk/d3 34G 28G 5.1G 85% /var
From Console
Jun 24 07:49:57 kite.lethbridgecollege.ab.ca CNS Transport[11407]: abort on signal: 6 check for core file in: /var/tmp/cc-transport
Jun 24 07:57:31 kite.lethbridgecollege.ab.ca CNS Transport[12063]: abort on signal: 6 check for core file in: /var/tmp/cc-transport
Jun 24 07:57:34 kite.lethbridgecollege.ab.ca ufs: NOTICE: alloc: /var: file system full
Jun 24 08:05:05 kite.lethbridgecollege.ab.ca CNS Transport[12402]: abort on signal: 6 check for core file in: /var/tmp/cc-transport
Jun 24 08:05:07 kite.lethbridgecollege.ab.ca ufs: NOTICE: alloc: /var: file system full
Those errors are all about something called "Sun Update Connection" aka SunUC, which has nothing to do with Messaging Server. I found stuff about SunUC in this forum [http://forums.sun.com/forum.jspa?forumID=871].
Something - perhaps SunUC, perhaps Messaging Server, perhaps something else - is filling your disk and causing Messaging Server to fail. You will need to investigate what is using the disk space and solve that root cause.
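A quick sketch of how one might chase that down with standard tools (the paths are examples):

du -sk /var/* | sort -n | tail -10              # largest top-level directories under /var
find /var -type f -name 'core*' -mtime -2 -ls   # recently written core files
ls -l /var/tmp/cc-transport                     # the directory named in the abort messages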
Since this seems to have caused problems with the mboxlist databases, recovering Messaging Server will mean moving those aside and restarting stored so it can recover from the most recent snapshot and then run reconstruct -m. But until you solve the root cause, you are likely to continue having that problem.
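In rough outline, using the paths from earlier in this thread (a sketch, not a canned procedure; take a backup first):

cd /opt/SUNWmsgsr/sbin
./stop-msg
mv /opt/SUNWmsgsr/data/store/mboxlist /opt/SUNWmsgsr/data/store/mboxlist.bad
mkdir /opt/SUNWmsgsr/data/store/mboxlist
./start-msg                     # stored should restore the DBs from the latest good snapshot
./reconstruct -m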
Using coreadm to select where core dumps will be placed (and placing them somewhere other than where the Messaging Server data is stored) may allow you to avoid this problem causing a catastrophe for Messaging Server.
As a generally best practice, you should have your Messaging Server data on a separate file system (often multiple separate file systems) from everything else both for performance reasons and to avoid things like this.
I see you have a support case open about this. So we should continue the discussion in that case rather than here. 
FWIW... the conclusion was that the partitions were on a separate file system on an array, but mboxlist and dbdata/snapshots were on /var on the internal disk. One of the occurrences of /var being full must have coincided with a snapshot, causing that snapshot to fail. And subsequent snapshots failed, probably because of the bad one. So the most recent snapshot for stored to try to use was always that bad one. So stored started, restored the snapshot, and then dumped core. Moving the bad snapshot out of the way solved the problem.
Also moved mboxlist to the array. Moving dbdata to the array might also be a good idea. And setting store.dbtmpdir to something like /tmp/msgdbtmp would also be a good idea.
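If it helps, that last change would look something like this with configutil (standard Messaging Server syntax; restart around it to be safe):

cd /opt/SUNWmsgsr/sbin
./stop-msg
./configutil -o store.dbtmpdir -v /tmp/msgdbtmp
./start-msg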

SRSS 4.2 readMessage::socket looping limit exceeded

I have 1 user out of 45 Sun Rays that keeps getting rebooted.
The Sun Ray session terminates, the /var/opt/SUNWut/log/messages says:
sunray-servers utauthd: [ID XXXXXX user.info] Worker1 NOTICE: readMessage::socket looping limit exceeded. Close it.
The unit power cycles, then reconnects to the RDP session, and within less than a minute it reboots again.
The user can take her smart card to another Sun Ray terminal and work without any issue. I replaced the Sun Ray2 unit with a new one, same result. I switched the network jack on the wall, and I switched the network port on the switch, still same result.
HELP! 
More information:
Now the user has tried to use her smart card at another Sun Ray, and that Sun Ray is rebooting too. I tried a spare smart card and got the same issue.
sunray-servers utauthd: [ID 715479 user.info] Worker7 UNEXPECTED: during send to: java.net.SocketOutputStream#xxxxxx error=java.net.SocketException: Broken pipe

v490 Temperature sensor problem

Aug 28 21:09:45 GNMS-APP picld[139]: [ID 690984 daemon.error] WARNING: Temperature Sensor CPU0_DIE_TEMPERATURE_SENSOR returning faulty temp
The V490 shuts down after this message. How do I resolve this problem?
prtdiag -v shows the CPU0 & CPU2 temperatures as 127 degrees centigrade.
user8891763 wrote:
How do I resolve this problem?
prtdiag -v shows the CPU0 & CPU2 temperatures as 127 degrees centigrade.

Stop whatever is causing the overheating issue!
Seriously!
You need someone to power the system down, take it apart and physically examine it.
Go look at whatever is near board A.
There may be accumulated crud inside.
There may be failed fans inside.
There may be something covering the outside of the chassis that is preventing air from flowing properly.
There isn't anything complicated about this.
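Once it has been cleaned up and reassembled, the sensors can be re-read from Solaris; a quick sketch (the grep pattern is just an example):

/usr/platform/`uname -i`/sbin/prtdiag -v | grep -i temp   # re-check the reported temperatures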
