NOTE: ksymoops is useless on 2.6.  Please use the Oops in its original format
(from dmesg, etc).  Ignore any references in this or other docs to "decoding
the Oops" or "running it through ksymoops".  If you post an Oops from 2.6 that
has been run through ksymoops, people will just tell you to repost it.

Quick Summary
-------------

Find the Oops and send it to the maintainer of the kernel area that seems to be
involved with the problem.  Don't worry too much about getting the wrong person.
If you are unsure, send it to the person responsible for the code relevant to
what you were doing.  If it occurs repeatably, try to describe how to recreate
it.  That's worth even more than the oops itself.

If you are totally stumped as to whom to send the report, send it to
linux-kernel@vger.kernel.org.  Thanks for your help in making Linux as
stable as humanly possible.
  18
  19Where is the Oops?
  20----------------------
  21
  22Normally the Oops text is read from the kernel buffers by klogd and
  23handed to syslogd which writes it to a syslog file, typically
  24/var/log/messages (depends on /etc/syslog.conf).  Sometimes klogd dies,
  25in which case you can run dmesg > file to read the data from the kernel
  26buffers and save it.  Or you can cat /proc/kmsg > file, however you
  27have to break in to stop the transfer, kmsg is a "never ending file".
  28If the machine has crashed so badly that you cannot enter commands or
  29the disk is not available then you have three options :-
  30
  31(1) Hand copy the text from the screen and type it in after the machine
  32    has restarted.  Messy but it is the only option if you have not
  33    planned for a crash. Alternatively, you can take a picture of
  34    the screen with a digital camera - not nice, but better than
  35    nothing.  If the messages scroll off the top of the console, you
  36    may find that booting with a higher resolution (eg, vga=791)
  37    will allow you to read more of the text. (Caveat: This needs vesafb,
  38    so won't help for 'early' oopses)
  39
  40(2) Boot with a serial console (see Documentation/serial-console.txt),
  41    run a null modem to a second machine and capture the output there
  42    using your favourite communication program.  Minicom works well.
  43
  44(3) Use Kdump (see Documentation/kdump/kdump.txt),
  45    extract the kernel ring buffer from old memory with using dmesg
  46    gdbmacro in Documentation/kdump/gdbmacros.txt.

Full Information
----------------

NOTE: the message from Linus below applies to the 2.4 kernel.  I have
preserved it for historical reasons, and because some of the information
in it still applies.  In particular, please ignore any references to
ksymoops.

From: Linus Torvalds <torvalds@osdl.org>

How to track down an Oops.. [originally a mail to linux-kernel]

The main trick is having 5 years of experience with those pesky oops
messages ;-)

Actually, there are things you can do that make this easier.  I have two
separate approaches:

        gdb /usr/src/linux/vmlinux
        gdb> disassemble <offending_function>

That's the easy way to find the problem, at least if the bug report is
well made (like this one was - run through ksymoops to get the
information of which function and the offset in the function that it
happened in).

Oh, it helps if the report happens on a kernel that is compiled with the
same compiler and similar setups.

The other thing to do is disassemble the "Code:" part of the bug report:
ksymoops will do this too with the correct tools, but if you don't have
the tools you can just write a silly program:

        char str[] = "\xXX\xXX\xXX...";
        int main(void) { return 0; }

and compile it with gcc -g and then, in gdb, do "disassemble str" (where
the "XX" stuff are the values reported by the Oops - you can just
cut-and-paste and do a replace of spaces to "\x" - that's what I do, as
I'm too lazy to write a program to automate this all).

Alternatively, you can use the shell script in scripts/decodecode.
Its usage is:  decodecode < oops.txt

The hex bytes that follow "Code:" may (in some architectures) have a series
of bytes that precede the current instruction pointer as well as bytes at and
following the current instruction pointer.  In some cases, one instruction
byte or word is surrounded by <> or (), as in "<86>" or "(f00d)".  These
<> or () markings indicate the current instruction pointer.  Example from
i386, split into multiple lines for readability:

Code: f9 0f 8d f9 00 00 00 8d 42 0c e8 dd 26 11 c7 a1 60 ea 2b f9 8b 50 08 a1
64 ea 2b f9 8d 34 82 8b 1e 85 db 74 6d 8b 15 60 ea 2b f9 <8b> 43 04 39 42 54
7e 04 40 89 42 54 8b 43 04 3b 05 00 f6 52 c0
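
If neither gcc nor scripts/decodecode is at hand, the same trick can be
scripted with a few lines of C.  The helper below is only a rough sketch
(the name code2bin is made up here for illustration; it is not part of
the kernel tree): it skips the "Code:" prefix, strips the <> or ()
markers, writes the bytes out as raw binary, and leaves the actual
disassembly to objdump.

/*
 * Illustrative sketch only - scripts/decodecode does this job properly.
 * Read the "Code:" hex bytes on stdin and write them as raw binary on
 * stdout, skipping the <> or () marker around the faulting instruction:
 *
 *      ./code2bin < code.txt > code.bin
 *      objdump -D -b binary -m i386 code.bin
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char tok[32];
	unsigned int byte;

	while (scanf("%31s", tok) == 1) {
		char *p = tok;

		if (strcmp(tok, "Code:") == 0)
			continue;
		while (*p == '<' || *p == '(')	/* "<8b>" -> "8b>" */
			p++;
		p[strcspn(p, ">)")] = '\0';	/* "8b>"  -> "8b"  */
		if (sscanf(p, "%2x", &byte) == 1)
			putchar(byte);
	}
	return 0;
}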

Finally, if you want to see where the code comes from, you can do

        cd /usr/src/linux
        make fs/buffer.s        # or whatever file the bug happened in

and then you get a better idea of what happens than with the gdb
disassembly.

Now, the trick is just then to combine all the data you have: the C
sources (and general knowledge of what it _should_ do), the assembly
listing and the code disassembly (and additionally the register dump you
also get from the "oops" message - that can be useful to see _what_ the
corrupted pointers were, and when you have the assembler listing you can
also match the other registers to whatever C expressions they were used
for).

Essentially, you just look at what doesn't match (in this case it was the
"Code" disassembly that didn't match with what the compiler generated).
Then you need to find out _why_ they don't match. Often it's simple - you
see that the code uses a NULL pointer and then you look at the code and
wonder how the NULL pointer got there, and if it's a valid thing to do
you just check against it..

Now, if somebody gets the idea that this is time-consuming and requires
some small amount of concentration, you're right. Which is why I will
mostly just ignore any panic reports that don't have the symbol table
info etc looked up: it simply gets too hard to look it up (I have some
programs to search for specific patterns in the kernel code segment, and
sometimes I have been able to look up those kinds of panics too, but
that really requires pretty good knowledge of the kernel just to be able
to pick out the right sequences etc..)

_Sometimes_ it happens that I just see the disassembled code sequence
from the panic, and I know immediately where it's coming from. That's when
I get worried that I've been doing this for too long ;-)

                Linus

---------------------------------------------------------------------------
Notes on Oops tracing with klogd:

In order to help Linus and the other kernel developers there has been
substantial support incorporated into klogd for processing protection
faults.  In order to have full support for address resolution at least
version 1.3-pl3 of the sysklogd package should be used.

When a protection fault occurs the klogd daemon automatically
translates important addresses in the kernel log messages to their
symbolic equivalents.  This translated kernel message is then
forwarded through whatever reporting mechanism klogd is using.  The
protection fault message can be simply cut out of the message files
and forwarded to the kernel developers.

Two types of address resolution are performed by klogd.  The first is
static translation and the second is dynamic translation.  Static
translation uses the System.map file in much the same manner that
ksymoops does.  In order to do static translation the klogd daemon
must be able to find a system map file at daemon initialization time.
See the klogd man page for information on how klogd searches for map
files.

Dynamic address translation is important when kernel loadable modules
are being used.  Since memory for kernel modules is allocated from the
kernel's dynamic memory pools there are no fixed locations for either
the start of the module or for functions and symbols in the module.

The kernel supports system calls which allow a program to determine
which modules are loaded and their location in memory.  Using these
system calls the klogd daemon builds a symbol table which can be used
to debug a protection fault which occurs in a loadable kernel module.

At the very minimum klogd will provide the name of the module which
generated the protection fault.  There may be additional symbolic
information available if the developer of the loadable module chose to
export symbol information from the module.

Since the kernel module environment can be dynamic there must be a
mechanism for notifying the klogd daemon when a change in module
environment occurs.  There are command line options available which
allow klogd to signal the currently executing daemon that symbol
information should be refreshed.  See the klogd manual page for more
information.

A patch is included with the sysklogd distribution which modifies the
modules-2.0.0 package to automatically signal klogd whenever a module
is loaded or unloaded.  Applying this patch provides essentially
seamless support for debugging protection faults which occur with
kernel loadable modules.

The following is an example of a protection fault in a loadable module
processed by klogd:
---------------------------------------------------------------------------
Aug 29 09:51:01 blizard kernel: Unable to handle kernel paging request at virtual address f15e97cc
Aug 29 09:51:01 blizard kernel: current->tss.cr3 = 0062d000, %cr3 = 0062d000
Aug 29 09:51:01 blizard kernel: *pde = 00000000
Aug 29 09:51:01 blizard kernel: Oops: 0002
Aug 29 09:51:01 blizard kernel: CPU:    0
Aug 29 09:51:01 blizard kernel: EIP:    0010:[oops:_oops+16/3868]
Aug 29 09:51:01 blizard kernel: EFLAGS: 00010212
Aug 29 09:51:01 blizard kernel: eax: 315e97cc   ebx: 003a6f80   ecx: 001be77b   edx: 00237c0c
Aug 29 09:51:01 blizard kernel: esi: 00000000   edi: bffffdb3   ebp: 00589f90   esp: 00589f8c
Aug 29 09:51:01 blizard kernel: ds: 0018   es: 0018   fs: 002b   gs: 002b   ss: 0018
Aug 29 09:51:01 blizard kernel: Process oops_test (pid: 3374, process nr: 21, stackpage=00589000)
Aug 29 09:51:01 blizard kernel: Stack: 315e97cc 00589f98 0100b0b4 bffffed4 0012e38e 00240c64 003a6f80 00000001
Aug 29 09:51:01 blizard kernel:        00000000 00237810 bfffff00 0010a7fa 00000003 00000001 00000000 bfffff00
Aug 29 09:51:01 blizard kernel:        bffffdb3 bffffed4 ffffffda 0000002b 0007002b 0000002b 0000002b 00000036
Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128]
Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3
---------------------------------------------------------------------------

Dr. G.W. Wettstein           Oncology Research Div. Computing Facility
Roger Maris Cancer Center    INTERNET: greg@wind.rmcc.com
820 4th St. N.
Fargo, ND  58122
Phone: 701-234-7556


---------------------------------------------------------------------------
Tainted kernels:

Some oops reports contain the string 'Tainted: ' after the program
counter. This indicates that the kernel has been tainted by some
mechanism.  The string is followed by a series of position-sensitive
characters, each representing a particular tainted value.

  1: 'G' if all modules loaded have a GPL or compatible license, 'P' if
     any proprietary module has been loaded.  Modules without a
     MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by
     insmod as GPL compatible are assumed to be proprietary.

  2: 'F' if any module was force loaded by "insmod -f", ' ' if all
     modules were loaded normally.

  3: 'S' if the oops occurred on an SMP kernel running on hardware that
     hasn't been certified as safe to run multiprocessor.
     Currently this occurs only on various Athlons that are not
     SMP capable.

  4: 'R' if a module was force unloaded by "rmmod -f", ' ' if all
     modules were unloaded normally.

  5: 'M' if any processor has reported a Machine Check Exception,
     ' ' if no Machine Check Exceptions have occurred.

  6: 'B' if a page-release function has found a bad page reference or
     some unexpected page flags.

  7: 'U' if a user or user application specifically requested that the
     Tainted flag be set, ' ' otherwise.

  8: 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.

  9: 'A' if the ACPI table has been overridden.

 10: 'W' if a warning has previously been issued by the kernel.
     (Though some warnings may set more specific taint flags.)

 11: 'C' if a staging driver has been loaded.

 12: 'I' if the kernel is working around a severe bug in the platform
     firmware (BIOS or similar).

 13: 'O' if an externally-built ("out-of-tree") module has been loaded.

The primary reason for the 'Tainted: ' string is to tell kernel
debuggers if this is a clean kernel or if anything unusual has
occurred.  Tainting is permanent: even if an offending module is
unloaded, the tainted value remains to indicate that the kernel is not
trustworthy.
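
While debugging on a live system, the current taint state can also be
read as a bit mask from /proc/sys/kernel/tainted.  The sketch below is
not part of the kernel sources and assumes that bit 0 of the mask
corresponds to entry 1 in the list above ('P'), bit 1 to entry 2 ('F'),
and so on up to bit 12 ('O'); check the TAINT_* constants in
include/linux/kernel.h for the authoritative bit assignments.

/*
 * Rough example, not part of the kernel tree: rebuild the "Tainted:"
 * flag string from /proc/sys/kernel/tainted, assuming bit N of the
 * mask matches entry N+1 in the list above (bit 0 = 'P' ... bit 12 = 'O').
 */
#include <stdio.h>

int main(void)
{
	static const char flags[] = "PFSRMBUDAWCIO";
	unsigned long tainted = 0;
	FILE *f = fopen("/proc/sys/kernel/tainted", "r");
	int i;

	if (!f) {
		perror("/proc/sys/kernel/tainted");
		return 1;
	}
	if (fscanf(f, "%lu", &tainted) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* position 1 is 'G' (clean) or 'P' (proprietary module loaded) */
	printf("Tainted: %c", (tainted & 1) ? 'P' : 'G');
	for (i = 1; i < 13; i++)
		putchar((tainted & (1UL << i)) ? flags[i] : ' ');
	putchar('\n');
	return 0;
}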