Antelope Release 5.10      Linux CentOS release 7.6.1810 (Core) 3.10.0      2020-05-12

NAME

debugging - debugging problems with the Antelope real time system

OVERVIEW

Problems crop up regularly when running a real time system. The causes are diverse: human error, configuration problems, hardware failures, and software bugs. This suggests some general approaches.

Understand the symptoms

The first step is to figure out as precisely as possible what the problem is. Look carefully at it, and make sure it is what you first thought. Try to verify the symptom(s) in several different ways. Suspect the tools as well as the system.

Read!

Collect data

If the problem is still happening, this may be a good time to collect data about it for later examination. You might be tempted to try a quick fix -- something like rebooting or restarting a program. Resist this temptation for a little while, and aim for a more lasting solution. Quick fixes are likely to be temporary, and it may be quite difficult to make a particular problem recur, so collect all the data you can while it is happening. Without that information, you are unlikely to solve the real problem, and it will probably recur at the least convenient moment.

Run rtsnapshot to collect a variety of general information. Then use your knowledge of the problem to collect more specific data. Try to save this information into a file that is an exact recording of what you try and what the result is. Look at script(1) for one way of accomplishing this; otherwise, direct output of various commands to files. See the section DEBUGGING RUNNING PROCESSES for some suggestions for looking at a problem process.
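
One way to record a complete debugging session is with script(1); the log file name and the commands run inside the session below are only examples:

% script /tmp/debug-session.txt
Script started, file is /tmp/debug-session.txt
% df -h
% ps -ef | grep orb
% exit
Script done, file is /tmp/debug-session.txt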

Quick Fixes

After you have investigated the situation carefully and collected as much information about the problem as you can into files, then you might try quick fixes like killing a process or shutting down the whole system and restarting.

However, if the problem is due to resources -- i.e., disks full or not enough memory -- you should shut down the whole system, fix the resource problem, and then restart. For disk-full problems, this may mean cleaning up old data files, perhaps with rtdbclean, or excess log files, using truncate_log. If the problem is memory, you might try adding swap space; the better long-term solution, however, is to add physical memory.
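
On Linux, a swap file can be added as a stopgap roughly as follows; the path and size are only placeholders, and the commands must be run as root:

% dd if=/dev/zero of=/export/swapfile bs=1M count=4096
% chmod 600 /export/swapfile
% mkswap /export/swapfile
% swapon /export/swapfile
% swapon -s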

Analysis

With time to reflect on the problem, try to reproduce it in more restricted circumstances, perhaps on a different machine. Try to reduce it to the smallest possible example. If inspiration fails, just try changing random things to see if you can make the problem show up or disappear.

If you can isolate the problem to a situation which is reproducible, you are well on the way to solving it. If the problem is in a BRTT program, you're probably ready to submit a bug report. See bugs(5).

It's essential to be able to reproduce the problem, preferably in much simpler circumstances. BRTT is very unlikely to be able to help if you can't get to this point. And whether the problem is in Antelope or some local programs or configuration, you can't be certain you've fixed the problem unless you have a way of reproducing it. Blind fixes are difficult to verify. Maybe other circumstances have caused the problem to disappear, or maybe you've just succeeded in making some problem less frequent.

Dig deeper

Don't settle for superficial solutions. Look for the ultimate cause of a problem and fix that.

RESOURCE PROBLEMS

Not Enough Memory

If you have a machine which runs out of memory, things will get very, very slow and you will see messages like "can't fork" and "no more processes". For smooth processing, you must ensure that you never run out of memory. This means having enough physical memory for the typical processing requirements, and configuring enough swap space to handle any brief requirements for much more memory. Your usual processing load should fit easily into memory. If you're interested in fast locations during earthquakes, this is especially important.
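
A few standard Linux commands give a quick picture of memory and swap usage; the exact output format varies by release:

% free -m
% vmstat 5 5
% dmesg | grep -i "out of memory"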

Sometimes, processes do not manage their memory correctly, and their memory usage continuously grows. This is a big problem for programs which run all the time. If you notice that a program is always using more memory, you need to fix it (or get it fixed by BRTT if it's an Antelope program).
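
One crude way to check whether a long-running process is leaking memory on Linux is to sample its resident and virtual size over time; the pid 12345 and the log file are placeholders:

% ps -o pid,rss,vsz,etime,comm -p 12345
% top -b -d 600 -p 12345 >> /tmp/memory-growth.log

If the RSS and VSZ columns keep climbing over hours or days, the process is probably leaking.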

Not Enough CPU

If you have a machine where the load average is always above 1 or the CPU usage is consistently more than 50%, it is probably overloaded. During a swarm of earthquakes, it may slow down to the point where it falls well behind real time. The only solution is to get a faster machine, or more of them, and share processes out across the machines.
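
To check whether a machine is CPU-bound, look at the load averages and at which processes are consuming the CPU; these are standard Linux commands, and the output format varies:

% uptime
% top -b -n 1 | head -20
% ps -eo pcpu,pid,etime,comm --sort=-pcpu | head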

Not Enough Disk

If you're filling up your disks, check out the new disks on the market. Disk space is cheap.
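
To see which filesystems are full and which directories are consuming the space (the path below is only an example):

% df -h
% du -sh /export/home/rt/* | sort -h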

DEBUGGING RUNNING PROCESSES

If a program seems to be hung up, you may be able to determine where, and what it's doing. The method of doing this varies from architecture to architecture, however.

Linux has an equivalent of the Solaris truss utility named strace, and running ldd on the executable gives information similar to Solaris pldd. A pstack equivalent is often provided by the gdb package (as pstack or gstack). You can also inspect the /proc/<pid> filesystem directly for some information.
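
A sketch of inspecting a stuck process on Linux; 12345 is a placeholder pid, and you generally need to be root or the owner of the process:

% strace -p 12345
% pstack 12345
% ls -l /proc/12345/fd
% cat /proc/12345/status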

Mac OS X has none of the /proc tools, and nothing like truss/strace or pstack. (However, newer releases have the DTrace tools, such as dtruss.)

On any of the architectures, you can use gdb (or some other debugger) to attach to a running program and step through it by hand. This is less likely to be useful, however.
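
A minimal example of attaching gdb to a running process, collecting stack traces, and detaching without killing it; 12345 is again a placeholder pid:

% gdb -p 12345
(gdb) bt
(gdb) info threads
(gdb) thread apply all bt
(gdb) detach
(gdb) quit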

DATA FLOW

In the real time system, after hardware or resource problems (memory or disk limitations) are eliminated, the underlying data flow is a good place to look. Try to track down the packets from around the time the problem first occurred. Use orbstat(1) to inspect the packets, and save them into forb(5) files. See if you can reproduce the problem by running a program directly from the forb files. If the problem packets have already fallen off the orb, consider setting up a very large orbserver64 (Solaris only) or an orb2disk buffer to hold 24 or more hours of data. Then if the problem recurs, you should be able to find the right packets. (You might also be able to recover lost data.)

You should be familiar with the interactive mode of orbstat. Using this tool, you can frequently drill down into the nitty-gritty details of a problem -- at least for processes which use the orb. Run orbstat on the orbserver which your process connects to, select the packet(s) of interest with select and reject, move to the general time of the problem with after, and then use commands like + and - to inspect the packets. Configure the level of packet detail shown with the commands terse, peek, hdr, unstuff, or dump.

Once you find the right packets, you can save them to a file with save and reap. Here's an example:

% orbstat -i :your-port
> select TA_Q12A.*/M1
1 sources selected
> [
#40277115 'TA_Q12A/MGENC/M1':  4/02/2007 (092)  7:32:44.000 : 259 bytes
> ]
#40274800 'TA_Q12A/MGENC/M1':  4/04/2007 (094) 22:23:40.000 : 259 bytes
> after 4/3/2007 15:35:10
seeking to  4/03/2007 (093) 15:35:10.000
new pktid is 8430060
> peek
> .
#8430060 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:35:11.000 : 256 bytes
     0 : TA       Q12A     LHZ                1.000/s calib=   1.5894587 calper=-1.000 segtype=V 10 samps Tue 2007-093 Apr 03 15:35:11.00000 - 15:35:21.00000
              740      922      959      795      621      466      462      746      975      968
     1 : TA       Q12A     LHN                1.000/s calib=   1.5894587 calper=-1.000 segtype=V 10 samps Tue 2007-093 Apr 03 15:35:11.00000 - 15:35:21.00000
             -174     -152     -154      -91      -23      -14     -123     -276     -220      -68
     2 : TA       Q12A     LHE                1.000/s calib=   1.5894587 calper=-1.000 segtype=V 10 samps Tue 2007-093 Apr 03 15:35:11.00000 - 15:35:21.00000
              126      153      237      207      150      118      105      184      230      181
> hdr
> .
#8430060 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:35:11.000 : 256 bytes
     0 : TA       Q12A     LHZ                1.000/s calib=   1.5894587 calper=-1.000 segtype=V 10 samps Tue 2007-093 Apr 03 15:35:11.00000 - 15:35:21.00000
     1 : TA       Q12A     LHN                1.000/s calib=   1.5894587 calper=-1.000 segtype=V 10 samps Tue 2007-093 Apr 03 15:35:11.00000 - 15:35:21.00000
     2 : TA       Q12A     LHE                1.000/s calib=   1.5894587 calper=-1.000 segtype=V 10 samps Tue 2007-093 Apr 03 15:35:11.00000 - 15:35:21.00000
> terse
> save /tmp/packets
saving packets to '/tmp/packets'
> reap -n 10
<Enter any character to stop>
#8433186 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:35:21.000 : 258 bytes
#8436406 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:35:31.000 : 259 bytes
#8439331 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:35:41.000 : 258 bytes
#8442657 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:35:51.000 : 259 bytes
#8445693 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:36:01.000 : 258 bytes
#8448729 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:36:11.000 : 256 bytes
#8451706 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:36:21.000 : 256 bytes
#8455134 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:36:31.000 : 259 bytes
#8457885 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:36:41.000 : 258 bytes
#8461239 'TA_Q12A/MGENC/M1':  4/03/2007 (093) 15:36:51.000 : 256 bytes

reaped 10 packets
> <^D>
%

Now you should be able to run the problem program against these packets, and perhaps reproduce the problem:

% program /tmp/packets other-arguments

FILES

There are a few files which can affect the global behavior of Antelope programs in ways which may be surprising or obscure. Be careful how you modify them.

Read the man pages for more information: elog(3), trdefaults.pf(5), site.pf(5).
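
If it is unclear which copy of a parameter file is actually being picked up, pfecho(1) prints a parameter file as the library resolves it; the parameter file name below is only an example:

% pfecho trdefaults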

ENVIRONMENT

If your path puts other directories ahead of the $ANTELOPE/bin directory, some other program may shadow an Antelope program. Try which program to see which executable is actually being run.

Sometimes a local parameter file or a parameter file that is seen because of your PFPATH environment variable can cause surprising behavior.

The setup files setup.csh and setup.sh look for environment variables which may cause execution problems. If they complain, you should change your environment until they don't.
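
A few quick environment checks, assuming a csh-style shell; the program name and the setup file location are examples and may differ at your site:

% which orbstat
% echo $PATH
% echo $PFPATH
% source /opt/antelope/5.10/setup.csh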

SEE ALSO

bugs(5)
reporting(5)
http://catb.org/~esr/faqs/smart-questions.html
rtsnapshot(1)
dbsnapshot(1)
rtincident(1)
orbstat(1)

BUGS AND CAVEATS

If you think you have found a problem with the software, please report bugs to "support@brtt.com".

Please bear in mind that a complete description of the problem, including an example of how to generate it, what you expected and what you actually saw, and any error messages generated is essential in order to diagnose and fix problems. See bugs(5) for further suggestions on how to report problems.

Do not send problem reports to individual email addresses at BRTT: they will not be answered. To receive a response, use only support@brtt.com. Requests sent to support@brtt.com are read by multiple people and will be responded to in a timely manner.

AUTHOR

Daniel Quinlan