Saturday, January 13, 2024

PCIe Study Notes

Chapter 1: Background

Nice References

PCI

#TODO Study further: reflected-wave signaling and its impact on the maximum number of electrical loads. Higher frequencies -> fewer loads allowed on the bus. PCI frequency cannot go faster than 66MHz.
Transaction Models: Programmed I/O, DMA, Peer-to-Peer.
PCI is a Shared Bus Architecture; PCI arbitration algorithm must be "fair" and not starve any device for access.
PCI Inefficiencies: 1) Retry: initiator starts a transaction, target is not ready within 16 clocks (T) -> target signals Retry to the initiator (STOP#); 2) Disconnect: target transfers some data, then is not ready for more than 8 clocks -> target signals Disconnect to the initiator (STOP#)
Address Space Map: Memory map, IO map, PCI Configuration Space (256B per function)
PCI Configuration Header: Type 0: non-bridge (6 BARs); Type 1: bridge (2 BARs)
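
A minimal sketch of how software could tell the two header layouts apart. It assumes the Header Type register at offset 0Eh of the configuration header, where bits 6:0 give the layout and bit 7 flags a multi-function device. The function name `decode_header_type` is made up for illustration.

```python
def decode_header_type(header_type_byte):
    """Decode the 8-bit Header Type register (config offset 0x0E)."""
    layout = header_type_byte & 0x7F                 # bits 6:0: header layout
    multi_function = bool(header_type_byte & 0x80)   # bit 7: multi-function device
    if layout == 0x00:
        kind, bars = "Type 0 (non-bridge)", 6
    elif layout == 0x01:
        kind, bars = "Type 1 (bridge)", 2
    else:
        kind, bars = "other", 0   # layouts not covered in these notes
    return kind, bars, multi_function
```

For example, a raw value of 0x81 would decode as a Type 1 header on a multi-function device.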

PCI-X

PCI-X transfer efficiency ~85% vs PCI ~50-60%
PCI-X Split-Transaction Model: Requester-Completer. The completer memorizes the transaction (address, transaction type, byte count, requester ID) and signals a split response.
PCI-X adds an Attribute phase
PCI-X adds MSI (Message Signaled Interrupts): eliminates the need to share interrupts across multiple devices. An MSI is a memory write transaction to a pre-defined address range, and its data is the interrupt vector of that device. The CPU avoids the overhead of finding which device generated the interrupt. No sideband pins are needed.
PCI-X uses a common (distributed) clock, so its timing is limited by: 1) signal skew; 2) clock skew between multiple devices; 3) flight time budget

PCI-X 2.0 is a point-to-point design instead of a shared-bus architecture.
PCI-X 2.0 uses a source-synchronous clocking model

Chapter 2: PCIe Architecture Overview

PCIe serial data transmission.

Transaction Layer

        Full Duplex
        A Transaction is defined as the combination of a Request packet that delivers a command to a targeted device, together with any Completion packets the target sends back in reply.
        posted transactions and non-posted transactions:
                Memory Read -> Non-Posted
                Memory Write -> Posted
                Memory Read Lock -> Non-Posted
                IO Read -> Non-Posted
                IO Write -> Non-Posted
                Configuration Read (Type 0 and Type 1) -> Non-Posted
                Configuration Write (Type 0 and Type 1) -> Non-Posted
                Message -> Posted
        Non-posted write (IO write and Cfg write) and Memory Read Lock can only come from the processor.
        TLP (Transaction Layer Packet)
        QoS: Virtual Channel Buffer
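
The posted / non-posted list above can be captured as a small lookup table. This is only a sketch of the notes, not spec code; the names `IS_NON_POSTED` and `expects_completion` are made up for illustration.

```python
# Request type -> True if the request is non-posted (a completion comes back).
IS_NON_POSTED = {
    "Memory Read": True,
    "Memory Write": False,
    "Memory Read Lock": True,
    "IO Read": True,
    "IO Write": True,
    "Configuration Read": True,
    "Configuration Write": True,
    "Message": False,
}


def expects_completion(request_type):
    """Non-posted requests wait for a completion; posted ones do not."""
    return IS_NON_POSTED[request_type]


# The only posted request types in the table are Memory Write and Message.
posted = sorted(name for name, np in IS_NON_POSTED.items() if not np)
```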
                

Data Link Layer

        Layer main function:
                TLP error detection and correction: Ack/Nak, with a Nak causing TLP replay
                flow control
                some Link power management
        DLLP fixed length: 8B on the Link (4B content + 2B CRC, plus 2B of framing)
        DLLPs (Data Link Layer Packets) are only transferred between the Data Link Layers of the two neighboring devices on a Link, and are not routed anywhere else.
        Ack DLLP carries the Sequence Number of the last good TLP received, and the transmitter flushes all TLPs up to and including that sequence number from its Replay Buffer. The sequence number is generated by the Data Link Layer, so it is always in order!
        Nak DLLP: the receiver returns a Nak to the transmitter when it detects a TLP error and drops that TLP. The transmitter then replays all unacknowledged TLPs.
        Error Check and Retry: Replay Buffer
        Flow Control: 
        #TODO: when a Nak occurs on an MRd and the completer resends the CplD, there may be a new MWr from another requester in the meantime, so the old CplD data may not be up to date. Is this expected?
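
The Ack/Nak behaviour can be sketched with a toy transmitter-side Replay Buffer. This is a simplified model under stated assumptions: sequence numbers are 12 bits wide, an Ack or Nak names the last good TLP, and timers, CRC and wrap-around comparisons are ignored. The class `ReplayBuffer` and its methods are hypothetical names.

```python
from collections import deque

SEQ_MOD = 4096  # assumption: 12-bit TLP sequence numbers


class ReplayBuffer:
    """Toy transmitter-side replay buffer (illustrative, not spec code)."""

    def __init__(self):
        self.next_seq = 0
        self.pending = deque()  # (seq, tlp) pairs, oldest first

    def send(self, tlp):
        """Assign the next sequence number and keep a copy until it is acked."""
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % SEQ_MOD
        self.pending.append((seq, tlp))
        return seq

    def _purge_through(self, seq):
        """Drop every stored TLP up to and including sequence number seq."""
        if not any(s == seq for s, _ in self.pending):
            return  # nothing stored under that number (e.g. a duplicate Ack)
        while self.pending:
            s, _ = self.pending.popleft()
            if s == seq:
                break

    def ack(self, seq):
        """Ack: the receiver got everything through seq without error."""
        self._purge_through(seq)

    def nak(self, seq):
        """Nak: seq is the last good TLP; return the TLPs to replay, in order."""
        self._purge_through(seq)
        return [tlp for _, tlp in self.pending]
```

Usage: after sending A, B, C, D (sequence numbers 0 to 3), `ack(1)` leaves C and D in the buffer, and a following `nak(2)` purges C and returns `["D"]` for replay.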

Physical Layer

        #TODO: transmission order on the physical line.
        Logical Part
                #TODO: a long-term ppm offset between the two clocks will eventually cause buffer overflow; how do Ethernet and PCIe resolve this?
                Elastic Buffer
                Byte Striping?
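
On the byte striping question: as generally described for multi-Lane Links, consecutive bytes of the outgoing stream are spread across the Lanes in round-robin order and re-interleaved at the receiver. A toy sketch that ignores framing and ordered sets; `stripe_bytes` and `unstripe_bytes` are made-up names.

```python
def stripe_bytes(data, num_lanes):
    """Distribute bytes round-robin: byte i goes to lane i % num_lanes."""
    lanes = [[] for _ in range(num_lanes)]
    for i, b in enumerate(data):
        lanes[i % num_lanes].append(b)
    return lanes


def unstripe_bytes(lanes):
    """Receiver side: interleave the lanes back into the original byte order."""
    out = []
    longest = max(len(lane) for lane in lanes)
    for i in range(longest):
        for lane in lanes:
            if i < len(lane):
                out.append(lane[i])
    return bytes(out)
```

On a x4 Link, bytes 0..7 would land as lane 0 = [0, 4], lane 1 = [1, 5], lane 2 = [2, 6], lane 3 = [3, 7].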

        Link Training and Initialization: several things are checked or established to ensure proper and optimal operation, such as:
                Link width
                Link data rate
                Lane reversal - Lanes connected in reverse order
                Polarity inversion - Lane polarity connected backward
                Bit lock per Lane - Recovering the transmitter clock
                Symbol lock per Lane - Finding a recognizable position in the bit-stream
                Lane-to-Lane de-skew within a multi-Lane Link
        
        Electrical Part

        Ordered Sets: always a multiple of 4B. They always terminate at the neighboring device and are not routed through the PCIe fabric.
                used in 1) link training; 2) clock tolerance (ppm) compensation; 3) entering/exiting low-power states on the Link
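
A back-of-the-envelope calculation for the clock compensation case. It assumes each side's reference clock may be off by up to 300 ppm (this tolerance is an assumption not stated in these notes), so the worst-case mismatch is 600 ppm. The function name `symbols_per_slip` is made up.

```python
def symbols_per_slip(tx_ppm, rx_ppm):
    """Symbols that pass before two free-running clocks drift apart by one symbol."""
    worst_case_ppm = abs(tx_ppm) + abs(rx_ppm)  # clocks off in opposite directions
    return 1_000_000 / worst_case_ppm


# With 300 ppm on each side the receiver gains or loses one symbol
# roughly every 1667 symbols.
drift_interval = symbols_per_slip(300, 300)
```

So the elastic buffer only stays within bounds if compensation ordered sets arrive more often than once per drift interval, letting the receiver add or drop a symbol without touching packet data.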

Chapter 3: Configuration Overview

Enumeration

        PCIe Buses: up to 256 buses; Bus 0 is typically assigned by HW to the RC. Bus 0 consists of a virtual PCI bus with integrated endpoints and virtual PCI-to-PCI (P2P) bridges, which are hard-coded with a Device No. and Function No.
        PCIe Devices: 32 devices per bus; a PCIe point-to-point link means that the device on it will always be Device 0. Each Device must implement Function 0.
        PCIe Functions: Functions do not need to be implemented sequentially. SW must check every one of the 8 functions to decide which are present.

        Errors should not normally be reported during enumeration (the enumeration SW could fail, since it is typically written to execute before the OS or other error-handling SW is available).

        Device not present: SW get Vendor ID of FFFFh, reserved for "device not present"
        #TODO: enumeration started after 100ms+link training after reset, how to do it in DV simulation?
        #TODO: what will enum sw do if there is no EP attached to a pci-pci bridge?
        #multi-root system: what if start bus number 64 is assigned to an RC, but it later turns out that the total number of buses is more than 256-64? Is this an error?
        It is important to remember that each port on a switch is a Bridge, and thus has its own configuration space with a Type 1 Header. In reality, each port (each P2P Bridge) has its own Type 1 Header and performs the same two checks on TLPs when they are seen by that port.
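
A toy depth-first enumeration over a simulated topology, to make the bus numbering concrete. Everything here is an assumption for illustration: the tree layout, the vendor IDs, and the helper names `read_vendor_id` and `enumerate_bus`. Real software would issue configuration reads and program the bridge's secondary/subordinate bus number registers instead of touching a dict.

```python
NOT_PRESENT = 0xFFFF  # Vendor ID returned when no function responds


def read_vendor_id(bus_map, dev, fn):
    """Stand-in for a config read of the Vendor ID (hypothetical helper)."""
    node = bus_map.get((dev, fn))
    return node["vendor"] if node else NOT_PRESENT


def enumerate_bus(bus_map, bus_num, found):
    """Depth-first scan of one bus; returns the highest bus number assigned."""
    highest = bus_num
    for dev in range(32):
        if read_vendor_id(bus_map, dev, 0) == NOT_PRESENT:
            continue  # every device must implement Function 0
        for fn in range(8):  # functions may be sparse, so probe all 8
            if read_vendor_id(bus_map, dev, fn) == NOT_PRESENT:
                continue
            node = bus_map[(dev, fn)]
            found.append((bus_num, dev, fn))
            if "behind" in node:  # bridge (Type 1 header): scan the bus below it
                node["secondary"] = highest + 1
                highest = enumerate_bus(node["behind"], highest + 1, found)
                node["subordinate"] = highest
    return highest


# Made-up tree: Bus 0 holds a root port (dev 0) and an integrated endpoint (dev 1).
# Behind the root port sits a switch with one upstream and two downstream ports.
topology = {
    (0, 0): {"vendor": 0x1234, "behind": {              # root port
        (0, 0): {"vendor": 0x1234, "behind": {          # switch upstream port
            (0, 0): {"vendor": 0x1234, "behind": {      # downstream port 0
                (0, 0): {"vendor": 0xABCD},             # endpoint, function 0
                (0, 3): {"vendor": 0xABCD},             # endpoint, function 3
            }},
            (1, 0): {"vendor": 0x1234, "behind": {}},   # downstream port 1, empty
        }},
    }},
    (1, 0): {"vendor": 0x5678},                         # integrated endpoint
}

found = []
last_bus = enumerate_bus(topology, 0, found)
```

In this sketch the empty downstream port still consumes one bus number (its secondary and subordinate numbers end up equal), which is one plausible answer to the question above about a bridge with no endpoint attached.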


Chapter 4: Address Space & Transaction Routing

Configuration

        Configuration Space Total 4KB/function:
                1. (256B/64DW) PCI-compatible space, accessible by legacy PCI software or by the PCIe Enhanced Configuration Access Mechanism: 64B for the Configuration Header (Type 0/Type 1); 192B for Function-Specific Configuration Header Space.
                2. (960DW) PCIe Extended space, only accessible by the PCIe Enhanced Configuration Access Mechanism: for optional extended capability registers: ...
        Enhanced Configuration Mechanism memory-mapped Address Range:
                A[63:28]: upper bits of the 256MB-aligned base address of the address range allocated for PCIe enhanced configuration mechanism.
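
A sketch of how the rest of the memory-mapped address is formed below A[63:28]: Bus in A[27:20], Device in A[19:15], Function in A[14:12], and the register offset within the 4KB space in A[11:0]. The base address used in the example is a made-up value, and `ecam_address` is a hypothetical helper name.

```python
def ecam_address(base, bus, dev, fn, reg):
    """Memory address of a config register under the enhanced mechanism."""
    assert base % (256 * 1024 * 1024) == 0, "base must be 256MB-aligned"
    assert 0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8
    assert 0 <= reg < 4096, "offset must fall inside the 4KB function space"
    return base | (bus << 20) | (dev << 15) | (fn << 12) | reg


# Example with an assumed base of 0xE0000000: Bus 1, Device 0, Function 0,
# offset 0x10 (the first BAR in a Type 0 header).
addr = ecam_address(0xE0000000, 1, 0, 0, 0x10)
```

The 256MB alignment follows from the field widths: 256 buses x 32 devices x 8 functions x 4KB = 256MB.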
        A BAR is only 32 bits, so how are addresses >4GB handled? 64-bit memory address decoding uses 2 consecutive BARs.
        BARs are used to decide which system addresses can be used to access the PCIe function's internal locations.
        A device has 6 BARs, but not all need to be implemented; a hard-coded value of all 0s means the BAR is not implemented.
        All BARs must be evaluated sequentially.
        Where is the BAR? It's inside each function (in its configuration header).
        #TODO: how do base/limit registers indicate an unused range? E.g. if IO address space is not used, how should the IO base/limit registers be configured?
        Base/Limit only needed in type 1 Bridge ports.
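
A sketch of the usual BAR sizing probe for a 32-bit memory BAR: write all 1s, read back, mask off the low attribute bits, invert, add 1. The `FakeBar` class is a made-up model of the hardware (address bits below the size are hard-wired to 0), and `size_memory_bar` is a hypothetical helper name.

```python
class FakeBar:
    """Toy 32-bit memory BAR; size must be a power of two (illustrative model)."""

    def __init__(self, size, prefetchable=False):
        self.size = size
        self.low_bits = 0x8 if prefetchable else 0x0  # bits 3:0 are read-only
        self.value = self.low_bits

    def write(self, v):
        # Address bits below the size are hard-wired to 0; low 4 bits stay fixed.
        self.value = (v & ~(self.size - 1) & 0xFFFFFFFF) | self.low_bits

    def read(self):
        return self.value


def size_memory_bar(bar):
    """Return the size in bytes requested by a 32-bit memory BAR (0 if unused)."""
    original = bar.read()
    bar.write(0xFFFFFFFF)
    readback = bar.read()
    bar.write(original)  # restore whatever was programmed before
    if readback == 0:
        return 0  # hard-coded all zeros: BAR not implemented
    mask = readback & 0xFFFFFFF0  # drop bits 3:0 (type/prefetchable attributes)
    return (~mask & 0xFFFFFFFF) + 1
```

For a BAR that requests 4KB, the readback is FFFFF000h and the computed size is 1000h. A 64-bit BAR would be probed the same way across the two consecutive BARs, which this sketch does not model.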

TLP Routing

        Routing based on TLP header FMT/Type fields.
        TLP:
                MRd
                MRdLk
                MWr
                IORd
                IOWr
                CfgRd0
                CfgWr0
                CfgRd1
                CfgWr1
                Msg
                MsgD
                Cpl
                CplD
                CplLk
                CplDLk
                FetchAdd
                Swap
                CAS
                LPrfx
                EPrfx
        Address routing: MRd, MRdLk, MWr, AtomicOp, IORd, IOWr, Msg/MsgD
        ID routing: CfgRd/Wr, Cpl/CplD, CplLk/CplDLk
        Implicit routing: Msg/MsgD, for messages to mimic side-band signals.
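
        A simplified sketch of the forwarding decision a bridge (switch port) makes for a TLP arriving on its primary side. It assumes the Type 1 header fields are available as a dict with made-up keys and example values; only the memory window is modeled (IO requests would use the IO base/limit pair instead), and implicit routing is left out because it depends on the message's routing subfield.

```python
def forwards_downstream(bridge, tlp):
    """Would the bridge pass this TLP on to its secondary side?"""
    if tlp["routing"] == "id":
        # ID routing: the target bus must lie in [secondary, subordinate].
        return bridge["secondary"] <= tlp["bus"] <= bridge["subordinate"]
    if tlp["routing"] == "address":
        # Address routing: the address must fall inside the base/limit window.
        return bridge["mem_base"] <= tlp["addr"] <= bridge["mem_limit"]
    raise ValueError("implicit routing is not modeled in this sketch")


# Hypothetical bridge settings, chosen only for the example.
bridge = {
    "secondary": 2,
    "subordinate": 4,
    "mem_base": 0x90000000,
    "mem_limit": 0x9FFFFFFF,  # inclusive last byte of the window
}

cpl_hit = forwards_downstream(bridge, {"routing": "id", "bus": 3})
mwr_miss = forwards_downstream(bridge, {"routing": "address", "addr": 0xA0000000})
```

        Here the completion aimed at Bus 3 is forwarded, while the memory write to A0000000h falls outside the window and is not.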

        








C Programming

Header Files and Includes https://cplusplus.com/forum/articles/10627/ https://stackoverflow.com/questions/2762568/c-c-include-header-file-or...