.. SPDX-License-Identifier: GPL-2.0

==============================
How To Write Linux PCI Drivers
==============================

:Authors: - Martin Mares <mj@ucw.cz>
          - Grant Grundler <grundler@parisc-linux.org>

The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is that PCI support
in the Linux kernel is not as trivial as one would wish. This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:
https://lwn.net/Kernel/LDD3/.

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.


Structure of PCI drivers
========================
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.

Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (dma_alloc_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI parts of the chip (i.e. LAN/SCSI/etc)
  - Enable DMA/processing engines

When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and coherent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Release MMIO/IOP resources
  - Disable the device
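
As a rough sketch of how the two lists above fit together, the following
models a probe routine that acquires resources in order and unwinds them in
reverse on failure, plus a remove routine that follows the shutdown list.
It is a userspace model, not driver code: the kernel functions are replaced
by logging stand-ins with the same names and argument lists, ``demo_probe()``,
``demo_remove()``, ``call_log`` and ``fail_on`` are hypothetical names, and
dma_set_mask_and_coherent() is the generic DMA API call assumed here for
setting both masks at once.

```c
#include <string.h>

/*
 * Userspace stand-ins so the sketch builds outside the kernel. Names and
 * argument lists mirror the kernel functions; the bodies only log the
 * call and can simulate a failure.
 */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

struct device { int unused; };
struct pci_dev { struct device dev; };

static char call_log[256];      /* order of calls, for inspection */
static const char *fail_on;     /* name of the call that should fail */

static int log_call(const char *name)
{
    strcat(call_log, name);
    strcat(call_log, ";");
    return (fail_on && strcmp(fail_on, name) == 0) ? -1 : 0;
}

static int pci_enable_device(struct pci_dev *pdev)
{ (void)pdev; return log_call("enable"); }
static void pci_disable_device(struct pci_dev *pdev)
{ (void)pdev; log_call("disable"); }
static int pci_request_regions(struct pci_dev *pdev, const char *name)
{ (void)pdev; (void)name; return log_call("request_regions"); }
static void pci_release_regions(struct pci_dev *pdev)
{ (void)pdev; log_call("release_regions"); }
static int dma_set_mask_and_coherent(struct device *dev, unsigned long long mask)
{ (void)dev; (void)mask; return log_call("dma_mask"); }
static void pci_set_master(struct pci_dev *pdev)
{ (void)pdev; log_call("set_master"); }
static void pci_clear_master(struct pci_dev *pdev)
{ (void)pdev; log_call("clear_master"); }

/* Hypothetical probe routine: acquire in order, unwind in reverse. */
static int demo_probe(struct pci_dev *pdev)
{
    int err;

    err = pci_enable_device(pdev);
    if (err)
        return err;

    err = pci_request_regions(pdev, "demo");
    if (err)
        goto err_disable;

    err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
    if (err)
        goto err_release;

    /* Control data, config space and request_irq() would go here. */

    pci_set_master(pdev);       /* enable DMA only once all is set up */
    return 0;

err_release:
    pci_release_regions(pdev);
err_disable:
    pci_disable_device(pdev);
    return err;
}

/* Hypothetical remove routine: the shutdown list, top to bottom. */
static void demo_remove(struct pci_dev *pdev)
{
    /* Quiescing device IRQs and free_irq() would come first. */
    pci_clear_master(pdev);     /* stop DMA */
    /* Free DMA buffers and unregister from other subsystems here. */
    pci_release_regions(pdev);
    pci_disable_device(pdev);
}
```

Setting ``fail_on`` to ``"dma_mask"`` before calling ``demo_probe()`` leaves
``call_log`` showing that the regions are released and the device disabled
before the error is returned.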

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions that are
either completely empty or just return an appropriate error code, to avoid
lots of ifdefs in the drivers.


pci_register_driver() call
==========================

PCI device drivers call ``pci_register_driver()`` during their
initialization with a pointer to a structure describing the driver
(``struct pci_driver``):

.. kernel-doc:: include/linux/pci.h
   :functions: pci_driver

The ID table is an array of ``struct pci_device_id`` entries ending with an
all-zero entry.  Definitions with static const are generally preferred.

.. kernel-doc:: include/linux/mod_devicetable.h
   :functions: pci_device_id

Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
a pci_device_id table.
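
To make the role of the ID table concrete, here is a minimal userspace model
of how a table entry is compared against a device. The structures are
simplified and the ``model_`` and ``demo_`` names are hypothetical; the ID
values are arbitrary examples. Only the comparison rule (PCI_ANY_ID acts as
a wildcard, the class is compared under class_mask) and the all-zero
terminator are meant to reflect the kernel's behaviour.

```c
#include <stdint.h>
#include <stddef.h>

#define PCI_ANY_ID (~0U)

/* Simplified model of struct pci_device_id (driver_data etc. omitted). */
struct model_pci_id {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class, class_mask;
};

/* Same shape as PCI_DEVICE(): the subsystem IDs become wildcards. */
#define MODEL_PCI_DEVICE(vend, dev) \
    .vendor = (vend), .device = (dev), \
    .subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID

/* IDs as they would be read from a device's config space. */
struct model_pci_dev {
    uint32_t vendor, device;
    uint32_t subsystem_vendor, subsystem_device;
    uint32_t class;     /* 24-bit class code: base, sub, prog-if */
};

static const struct model_pci_id demo_ids[] = {
    { MODEL_PCI_DEVICE(0x8086, 0x100e) },   /* one specific device */
    { .vendor = PCI_ANY_ID, .device = PCI_ANY_ID,
      .subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID,
      .class = 0x010601, .class_mask = 0xffffff },  /* any device of a class */
    { 0 }                                   /* terminating all-zero entry */
};

/* Return the first matching entry, or NULL if the table has none. */
static const struct model_pci_id *
model_match(const struct model_pci_id *id, const struct model_pci_dev *dev)
{
    for (; id->vendor || id->subvendor || id->class_mask; id++) {
        if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
            (id->device == PCI_ANY_ID || id->device == dev->device) &&
            (id->subvendor == PCI_ANY_ID ||
             id->subvendor == dev->subsystem_vendor) &&
            (id->subdevice == PCI_ANY_ID ||
             id->subdevice == dev->subsystem_device) &&
            !((id->class ^ dev->class) & id->class_mask))
            return id;
    }
    return NULL;
}
```

An entry with a class_mask of 0 matches every class, which is why entries
built from vendor and device alone do not need to set the class fields.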

New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below::

  echo "vendor device subvendor subdevice class class_mask driver_data" > \
  /sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory; the others are optional. Users
need to pass only as many optional fields as necessary:

  - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
  - class and class_mask fields default to 0
  - driver_data defaults to 0UL.
  - override_only field defaults to 0.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver. This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.
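
The defaulting rules above can be illustrated with a small userspace model
of parsing a new_id line. ``struct model_new_id`` and
``model_parse_new_id()`` are hypothetical names, and the kernel's real
handling does more (for example the driver_data check described above);
this sketch only shows how omitted trailing fields keep their defaults.

```c
#include <stdio.h>

#define PCI_ANY_ID (~0U)

struct model_new_id {
    unsigned int vendor, device, subvendor, subdevice;
    unsigned int class, class_mask;
    unsigned long driver_data;
};

/* Returns 0 on success, -1 if the mandatory vendor/device pair is missing. */
static int model_parse_new_id(const char *buf, struct model_new_id *id)
{
    int fields;

    /* Defaults for the optional fields, as listed above. */
    id->vendor = 0;
    id->device = 0;
    id->subvendor = PCI_ANY_ID;
    id->subdevice = PCI_ANY_ID;
    id->class = 0;
    id->class_mask = 0;
    id->driver_data = 0UL;

    /* Fields that are not present in buf are left untouched by sscanf(). */
    fields = sscanf(buf, "%x %x %x %x %x %x %lx",
                    &id->vendor, &id->device,
                    &id->subvendor, &id->subdevice,
                    &id->class, &id->class_mask, &id->driver_data);
    if (fields < 2)
        return -1;
    return 0;
}
```

Parsing ``"8086 10d3"`` with this model yields wildcard subsystem IDs and a
zero class_mask, so such an entry would match the device regardless of its
subsystem IDs or class.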

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.


"Attributes" for driver functions/data
--------------------------------------

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

        ======          =================================================
        __init          Initialization code. Thrown away after the driver
                        initializes.
        __exit          Exit code. Ignored for non-modular drivers.
        ======          =================================================

Tips on when/where to use the above attributes:
        - The module_init()/module_exit() functions (and all
          initialization functions called _only_ from these)
          should be marked __init/__exit.

        - Do not mark the struct pci_driver.

        - Do NOT mark a function if you are not sure which mark to use.
          Better to not mark the function than mark the function wrong.


How to find PCI devices manually
================================

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is that one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID::

        struct pci_dev *dev = NULL;
        while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)))
                configure_device(dev);

Searching by class ID (iterate in a similar way)::

        pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID::

        pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev)

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID.  This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe. They increment the reference count on
the pci_dev that they return. You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().


Device Initialization Steps
===========================

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (dma_alloc_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI parts of the chip (i.e. LAN/SCSI/etc)
  - Enable DMA/processing engines

The driver can access PCI config space registers at any time.
(Well, almost. When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).


Enable the PCI device
---------------------
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will:

  - wake up the device if it was in suspended state,
  - allocate I/O and memory regions of the device (if BIOS did not),
  - allocate an IRQ (if BIOS did not).

.. note::
   pci_enable_device() can fail! Check the return value.

.. warning::
   OS BUG: we don't check resource allocations before enabling those
   resources. The sequence would make more sense if we called
   pci_request_regions() before calling pci_enable_device().
   Currently, the device drivers can't detect the bug when two
   devices have been allocated the same range. This is not a common
   problem and unlikely to get fixed soon.

   This has been discussed before but not changed as of 2.6.19:
   https://lore.kernel.org/r/20060302180025.GC28895@flint.arm.linux.org.uk/


pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if
it's set to something bogus by the BIOS.  pci_clear_master() will
disable DMA by clearing the bus master bit.

If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi().  This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate.  Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.


Request MMIO/IOP resources
--------------------------
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.

See Documentation/driver-api/io-mapping.rst for how to access device registers
or device memory.

The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.

.. tip::
   See OS BUG comment above. Currently (2.6.19), the driver can only
   determine MMIO and IO Port resource availability _after_ calling
   pci_enable_device().

Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.

Also see pci_request_selected_regions().


Set the DMA mask size
---------------------
.. note::
   If anything below doesn't make sense, please refer to
   Documentation/core-api/dma-api.rst. This section is just a reminder that
   drivers need to indicate DMA capabilities of the device and is not
   an authoritative source for DMA interfaces.

While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters.  In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.

Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.

Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.
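
What a DMA mask expresses can be shown with a few lines of arithmetic. The
macro below has the same definition as the kernel's DMA_BIT_MASK();
``model_dma_reachable()`` is a hypothetical helper that only illustrates the
idea of a mask limiting which physical addresses a device can reach, and is
not how the DMA layer is implemented.

```c
#include <stdint.h>
#include <stdbool.h>

/* Same definition as the kernel's DMA_BIT_MASK(). */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Hypothetical helper: can a device with this mask address every byte of
 * the buffer [addr, addr + size)? Written to avoid overflow in addr + size.
 */
static bool model_dma_reachable(uint64_t mask, uint64_t addr, uint64_t size)
{
    if (size == 0 || addr > mask)
        return false;
    return size - 1 <= mask - addr;
}
```

With a 32-bit mask, a buffer that ends exactly at the 4G boundary is still
reachable, while one that starts at 4G is not; with a 64-bit mask both are.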


Setup shared control data
-------------------------
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory.  See Documentation/core-api/dma-api.rst for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.


Initialize device registers
---------------------------
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset.
E.g. clearing pending interrupts.


Register IRQ handler
--------------------
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.

All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
can be shared).
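
The following sketch shows the shape of a handler that is safe on a shared
line: it uses the devid to find its own device and returns IRQ_NONE when
that device did not raise the interrupt. It is a userspace model;
``irqreturn_t`` is a stand-in for the kernel type, and ``struct demo_dev``
with its ``irq_status`` field is a hypothetical replacement for reading a
device-specific status register.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's irqreturn_t. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

/* Hypothetical per-device state; irq_status models a device register. */
struct demo_dev {
    uint32_t irq_status;    /* non-zero while the device asserts its IRQ */
    unsigned int handled;   /* number of interrupts serviced */
};

/* Same signature as a handler passed to request_irq(). */
static irqreturn_t demo_irq_handler(int irq, void *dev_id)
{
    struct demo_dev *dd = dev_id;

    (void)irq;
    if (!dd->irq_status)
        return IRQ_NONE;    /* some other device on the shared line */

    dd->irq_status = 0;     /* acknowledge; how is device specific */
    dd->handled++;
    return IRQ_HANDLED;
}
```

Returning IRQ_NONE for interrupts that belong to another device is what lets
the kernel notice when nobody handles a shared line.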

request_irq() will associate an interrupt handler and device handle
with an interrupt number. Historically interrupt numbers represent
IRQ lines which run from the PCI device to the Interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".

request_irq() also enables the interrupt. Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.

MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated. MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.

MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This
causes the PCI support to program CPU vector data into the PCI device
capability registers. Many architectures, chip-sets, or BIOSes do NOT
support MSI or MSI-X and a call to pci_alloc_irq_vectors() with just
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
specify PCI_IRQ_LEGACY as well.

Drivers that have different interrupt handlers for MSI/MSI-X and
legacy INTx should choose the right one based on the msi_enabled
and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors().
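
Why PCI_IRQ_LEGACY matters can be seen in a small userspace model of the
fallback order: MSI-X is tried first, then MSI, then a single legacy
interrupt. ``struct model_irq_caps`` and ``model_alloc_irq_vectors()`` are
hypothetical; the flag values match <linux/pci.h>, the ``mode`` output
stands in for the msi_enabled/msix_enabled flags, and details such as MSI
vector counts being powers of two are deliberately ignored.

```c
#include <errno.h>

/* Flag values as defined in <linux/pci.h>. */
#define PCI_IRQ_LEGACY  (1 << 0)
#define PCI_IRQ_MSI     (1 << 1)
#define PCI_IRQ_MSIX    (1 << 2)

/* Hypothetical description of what device and platform can deliver. */
struct model_irq_caps {
    int msix_vecs;  /* 0 if MSI-X is unavailable */
    int msi_vecs;   /* 0 if MSI is unavailable */
};

/* Number of vectors usable in [min_vecs, max_vecs], or 0 if too few. */
static int usable_vecs(int avail, int min_vecs, int max_vecs)
{
    if (avail < min_vecs)
        return 0;
    return avail < max_vecs ? avail : max_vecs;
}

/* Returns the number of vectors granted, or a negative errno value. */
static int model_alloc_irq_vectors(const struct model_irq_caps *caps,
                                   int min_vecs, int max_vecs,
                                   unsigned int flags, unsigned int *mode)
{
    int n;

    if (flags & PCI_IRQ_MSIX) {
        n = usable_vecs(caps->msix_vecs, min_vecs, max_vecs);
        if (n > 0) {
            *mode = PCI_IRQ_MSIX;
            return n;
        }
    }
    if (flags & PCI_IRQ_MSI) {
        n = usable_vecs(caps->msi_vecs, min_vecs, max_vecs);
        if (n > 0) {
            *mode = PCI_IRQ_MSI;
            return n;
        }
    }
    /* A legacy INTx line can only ever provide one interrupt. */
    if ((flags & PCI_IRQ_LEGACY) && min_vecs == 1) {
        *mode = PCI_IRQ_LEGACY;
        return 1;
    }
    return -ENOSPC;
}
```

On a modelled platform without MSI or MSI-X, asking for only
``PCI_IRQ_MSI | PCI_IRQ_MSIX`` fails, while adding ``PCI_IRQ_LEGACY`` yields
one vector.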

There are (at least) two really good reasons for using MSI:

1) MSI is an exclusive interrupt vector by definition.
   This means the interrupt handler doesn't have to verify
   its device caused the interrupt.

2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
   to be visible to the host CPU(s) when the MSI is delivered. This
   is important for both data coherency and avoiding stale control data.
   This guarantee allows the driver to omit MMIO reads to flush
   the DMA stream.

See drivers/infiniband/hw/mthca/ or drivers/net/ethernet/broadcom/tg3.c
for examples of MSI/MSI-X usage.


PCI device shutdown
===================

When a PCI device driver is being unloaded, most of the following
steps need to be performed:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and consistent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Disable device from responding to MMIO/IO Port addresses
  - Release MMIO/IO Port resource(s)


Stop IRQs on the device
-----------------------
How to do this is chip/device specific. If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.

When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled. Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line. Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later). Once the shared IRQ is masked, the remaining devices
will stop functioning properly. Not a nice situation.

This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.


Release the IRQ
---------------
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.


Stop all DMA activity
---------------------
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data. Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.

Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.

While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.


Release DMA buffers
-------------------
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there are any.

Then clean up "consistent" buffers which contain the control data.

See Documentation/core-api/dma-api.rst for details on unmapping interfaces.


Unregister from other subsystems
--------------------------------
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.


Disable Device from responding to MMIO/IO Port addresses
--------------------------------------------------------
Unmap MMIO or IO Port resources (e.g. with iounmap() or pci_iounmap())
and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().


Release MMIO/IO Port Resource(s)
--------------------------------
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.


How to access PCI config space
==============================

You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
space of a device represented by `struct pci_dev *`. All these functions return
0 when successful or an error code (`PCIBIOS_...`) which can be translated to a
text string by pcibios_strerror. Most drivers expect that accesses to valid PCI
devices don't fail.

If you don't have a struct pci_dev available, you can call
`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device
and function on that bus.

If you access fields in the standard portion of the config header, please
use symbolic names of locations and bits declared in <linux/pci.h>.

If you need to access PCI Capability registers, just call
pci_find_capability() for the particular capability and it will find the
corresponding register block for you. PCI Express Extended Capabilities,
which live above the first 256 bytes of config space, are located with
pci_find_ext_capability() instead.
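
To show what finding a capability involves, here is a minimal userspace
model that walks the capability linked list in a 256-byte copy of config
space. The register offsets and IDs are the standard ones declared in the
kernel headers; ``model_find_capability()`` and the ``cfg`` array are
hypothetical, and a real driver should simply call pci_find_capability().

```c
#include <stdint.h>

/* Offsets and bits from the standard PCI configuration header. */
#define PCI_STATUS              0x06
#define PCI_STATUS_CAP_LIST     0x10    /* device has a capability list */
#define PCI_CAPABILITY_LIST     0x34    /* offset of the first entry */
#define PCI_CAP_LIST_ID         0       /* capability ID within an entry */
#define PCI_CAP_LIST_NEXT       1       /* offset of the next entry */
#define PCI_CAP_ID_PM           0x01    /* Power Management */
#define PCI_CAP_ID_MSI          0x05    /* Message Signaled Interrupts */

/* Returns the config space offset of the capability, or 0 if not found. */
static uint8_t model_find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
    int ttl = 48;   /* bound the walk in case the list loops */
    uint8_t pos;

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAPABILITY_LIST] & ~3;
    while (pos >= 0x40 && ttl--) {
        if (cfg[pos + PCI_CAP_LIST_ID] == cap_id)
            return pos;
        pos = cfg[pos + PCI_CAP_LIST_NEXT] & ~3;
    }
    return 0;
}
```

The returned offset is the start of that capability's register block, which
is then read and written with the config space accessors described above.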


Other interesting functions
===========================

=============================   ================================================
pci_get_domain_bus_and_slot()   Find pci_dev corresponding to given domain,
                                bus, and slot number. If the device is
                                found, its reference count is increased.
pci_set_power_state()           Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()           Find specified capability in device's capability
                                list.
pci_resource_start()            Returns bus start address for a given PCI region
pci_resource_end()              Returns bus end address for a given PCI region
pci_resource_len()              Returns the byte length of a PCI region
pci_set_drvdata()               Set private driver data pointer for a pci_dev
pci_get_drvdata()               Return private driver data pointer for a pci_dev
pci_set_mwi()                   Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()                 Disable Memory-Write-Invalidate transactions.
=============================   ================================================


Miscellaneous hints
===================

When displaying PCI device names to the user (for example when a driver wants
to tell the user what card it has found), please use pci_name(pci_dev).
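
For orientation, the string returned by pci_name() has the form
``domain:bus:slot.function``. The userspace sketch below reproduces that
layout from raw numbers; the three macros have the same definitions as in
the kernel headers, while ``model_pci_name()`` is a hypothetical helper that
exists only for illustration. Drivers should keep calling pci_name().

```c
#include <stdio.h>

/* Same definitions as the kernel's PCI_DEVFN(), PCI_SLOT() and PCI_FUNC(). */
#define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn)         (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)         ((devfn) & 0x07)

/* Format "dddd:bb:ss.f" into buf, all fields in hexadecimal. */
static void model_pci_name(char *buf, size_t len, unsigned int domain,
                           unsigned int bus, unsigned int devfn)
{
    snprintf(buf, len, "%04x:%02x:%02x.%u",
             domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
}
```

For domain 0, bus 0x3a, slot 0x1f and function 2 this produces
``0000:3a:1f.2``, the same notation lspci uses.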

Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one. Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.

Don't try to turn on Fast Back to Back writes in your driver.  All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.


Vendor and device identifications
=================================

Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers.  You can add private definitions in
your driver if they're helpful, or just use plain hex constants.

The device IDs are arbitrary hex numbers (vendor controlled) and normally used
only in a single location, the pci_device_id table.

Please DO submit new vendor/device IDs to https://pci-ids.ucw.cz/.
There's a mirror of the pci.ids file at https://github.com/pciutils/pciids.


Obsolete functions
==================

There are several functions which you might come across when trying to
port an old driver to the new PCI interface.  They are no longer present
in the kernel as they aren't compatible with hotplug, PCI domains, or
sane locking.

=================       ===========================================
pci_find_device()       Superseded by pci_get_device()
pci_find_subsys()       Superseded by pci_get_subsys()
pci_find_slot()         Superseded by pci_get_domain_bus_and_slot()
pci_get_slot()          Superseded by pci_get_domain_bus_and_slot()
=================       ===========================================

The alternative is the traditional PCI device driver that walks PCI
device lists. This is still possible but discouraged.


MMIO Space and "Write Posting"
==============================

Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes. Specifically, "write posting"
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this. I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue. Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device. HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.

Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work.  The classic "bit banging"
sequence works fine for I/O Port space::

       for (i = 0; i < 8; i++, val >>= 1) {
               outb(val & 1, ioport_reg);      /* write bit */
               udelay(10);
       }

The same sequence for MMIO space should be::

       for (i = 0; i < 8; i++, val >>= 1) {
               writeb(val & 1, mmio_reg);      /* write bit */
               readb(safe_mmio_reg);           /* flush posted write */
               udelay(10);
       }

It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.

Another case to watch out for is when resetting a PCI device. Use PCI
Configuration space reads to flush the writel(). This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl().  Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
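
The effect of write posting, and why a read flushes it, can be pictured with
a toy userspace model. Everything here is hypothetical (``struct
model_bridge`` and the ``model_`` functions are not kernel interfaces, and
for simplicity the read targets the same register that was written); the
only property taken from PCI is that a read cannot complete until the writes
issued before it have reached the device.

```c
#include <stdint.h>

#define POST_DEPTH 8

/* Toy model of a bridge that posts (buffers) MMIO writes. */
struct model_bridge {
    uint8_t pending[POST_DEPTH];    /* writes accepted but not delivered */
    int npending;
    uint8_t device_reg;             /* what the device has actually seen */
};

/* Deliver all buffered writes to the device, oldest first. */
static void model_flush(struct model_bridge *b)
{
    int i;

    for (i = 0; i < b->npending; i++)
        b->device_reg = b->pending[i];
    b->npending = 0;
}

/* writeb() analogue: returns as soon as the write is queued. */
static void model_writeb(struct model_bridge *b, uint8_t val)
{
    if (b->npending == POST_DEPTH)
        model_flush(b);
    b->pending[b->npending++] = val;
}

/* readb() analogue: earlier writes must land before the read completes. */
static uint8_t model_readb(struct model_bridge *b)
{
    model_flush(b);
    return b->device_reg;
}
```

After ``model_writeb()`` returns, ``device_reg`` is still unchanged; only the
following ``model_readb()`` guarantees the device has seen the value, which
is why a delay placed directly after a posted write may not be timed from
the moment the device sees it.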