Thursday, January 30, 2025

fuzzy search

Below are from answers get from Deepseek

### **How Fuzzy Search Works**

Fuzzy search is designed to find strings that approximately match a given query, even if the query contains typos, misspellings, or incomplete words. The key idea is to rank potential matches based on how closely they align with the query.

#### Key Concepts:

1. **Subsequence Matching**:

- Fuzzy search looks for whether the characters in the query appear in the target string **in the same order**, but not necessarily consecutively.

- For example, the query `"leaders"` matches `"leadership skills"` because the letters `l`, `e`, `a`, `d`, `e`, `r`, `s` appear in order within the target string.

2. **Scoring**:

- Matches are ranked based on how closely they align with the query. Factors like:

- **Proximity of characters**: Characters that appear closer together in the target string are preferred.

- **Completeness**: Matches that include more of the query characters are ranked higher.

3. **Efficiency**:

- Fuzzy search algorithms are optimized to quickly filter and rank potential matches, even in large datasets.

---

### **How FuzzyFinder Implements Fuzzy Search**

FuzzyFinder uses a simple yet effective approach to implement fuzzy search. Here’s a high-level overview of how it works:

#### Algorithm Steps:

1. **Preprocessing**:

- Convert both the query and the target strings to lowercase (to make the search case-insensitive).

2. **Subsequence Matching**:

- For each target string, check if the characters in the query appear in the same order within the target string.

- Example:

- Query: `"leaders"`

- Target: `"leadership skills"`

- Check if `l`, `e`, `a`, `d`, `e`, `r`, `s` appear in order in `"leadership skills"`.

3. **Scoring**:

- Assign a score to each match based on:

- **Character proximity**: Characters that appear closer together in the target string are preferred.

- **Match completeness**: Matches that include more of the query characters are ranked higher.

4. **Ranking**:

- Sort the matches based on their scores and return the results.

---

### **Example Implementation of Fuzzy Search**

If you wanted to implement a simple version of fuzzy search yourself, here’s how you could do it in Python:

#### Code:

```python

def fuzzy_search(query, choices):

query = query.lower()

matches = []

for choice in choices:

choice_lower = choice.lower()

i = 0 # Index for query

j = 0 # Index for choice

score = 0

# Check if query characters appear in order in the choice

while i < len(query) and j < len(choice_lower):

if query[i] == choice_lower[j]:

i += 1

score += 1 # Increase score for each matching character

j += 1

# If all query characters were found in order, add to matches

if i == len(query):

matches.append((choice, score))

# Sort matches by score (higher scores first)

matches.sort(key=lambda x: x[1], reverse=True)

return [match[0] for match in matches]

# Example usage

terms = [

"team leaders",

"project leaders",

"leadership skills",

"manager",

"ceo",

"director",

"lead developer",

"head of department",

"lead engineer",

"leadership training"

]

query = "leaders"

matches = fuzzy_search(query, terms)

print("Fuzzy matches for 'leaders':", matches)

```

#### Output:

```

Fuzzy matches for 'leaders': ['team leaders', 'project leaders', 'leadership skills', 'leadership training']

```

---

### **Why FuzzyFinder is Efficient**

FuzzyFinder is optimized for performance. It avoids complex calculations (like Levenshtein distance) and instead focuses on **subsequence matching** and **scoring**, which makes it fast and suitable for real-time applications.

---

### **Comparison with Other Algorithms**

1. **Levenshtein Distance**:

- Measures the number of edits (insertions, deletions, substitutions) required to transform one string into another.

- More accurate but computationally expensive for large datasets.

2. **FuzzyWuzzy**:

- Uses Levenshtein distance and provides similarity scores.

- Slower than FuzzyFinder but more precise.

3. **Regular Expressions**:

- Can be used for pattern matching but doesn’t handle typos or approximate matches.

---

### **When to Use FuzzyFinder**

- When you need **fast, approximate matching** for large datasets.

- When you want a lightweight solution without heavy dependencies.

- When you don’t need the precision of Levenshtein distance.

---

### **Conclusion**

FuzzyFinder’s algorithm is not overly complicated—it relies on **subsequence matching** and **scoring** to efficiently find approximate matches. If you’re curious, you can even implement a basic version yourself (as shown above) to understand how it works under the hood.

If you’d like to dive deeper into the implementation or explore more advanced fuzzy search techniques, let me know! 😊

Thursday, February 8, 2024

SVN Quick Reference

 this is a short summary for my reference based on the great official tutorial at:

Basic Work Cycle (red-bean.com)


Check out a working copy from a repository

$ svn checkout -r 1234 url://repository/path

Update Your Working Copy

%svn update    #this will download latest version file from repository to local working copy
#NOTE: will overwrite any local changes not commited?

%svn up -r 1234        #Checkout a specific revision from SVN

Make Your Changes

%svn add <file>
%svn delete <file>
%svn copy <file>
%svn move <file>
%svn mkdir <dir>

Changing the Repository Without a Working Copy

#not interested in this at this stage

Review Your Changes

%svn status    #only show the local files different from local checkouted version in repository
#Note: even without internet work, svn can still show the changes, this is because svn keeps an origianl file in local.
%svn status stuff/fish.c        #If you pass a specific path to svn status, you get information about that item alone
%svn status -v                        #verbose options, shows status of every local file

%svn status -u -v                              #-u options means -update, it shows also show you if any files has newer version in repository than you local version. outdated version and changed files will need resolve

$svn diff                                #difference details of locally changes files
%svn diff --old <file0> --new <file1>        #like diff command, shows differences between two files

most common codes that svn status displays:
? item #The file, directory, or symbolic link item is not under version control.
A item #The file, directory, or symbolic link item has been scheduled for addition into the repository.
C item #The file item is in a state of conflict. That is, changes received from the server during an update overlap with local changes that you have in your working copy (and weren't resolved during the update). You must resolve this conflict before committing your changes to the repository.
D item #The file, directory, or symbolic link item has been scheduled for deletion from the repository.
M item #The contents of the file item have been modified.
! item               #Item is missing (e.g., you moved or deleted it without using svn). This also indicates that a directory is incomplete (a checkout or update was interrupted).

#Note: advanced ussage of patch command from svn1.7 a future topic to study.

Fix Your Mistakes

%svn revert <file>        #revert any changes you made to local copy of files
#Note: But note that svn revert can undo any scheduled operation—for example, you might decide that you don't want to add a new file after all


Resolve Any Conflicts

#TODO

Commit Your Changes

%svn commit -m "Corrected number of cheese slices."

Examining History

%svn diff -r 3 rules.txt            #If a single --revision (-r) number is passed, your working copy is compared to the specified revision in the repository
%svn diff -r 2:3 rules.txt        #If two revision numbers, separated by a colon, are passed via --revision (-r), the two revisions are directly compared:
%svn diff -c 3 rules.txt            #A more convenient way of comparing one revision to the previous revision is to use the --change (-c) option:
%svn diff -r{2024-10-02} -r{2024-10-06} --summarize    #list changed filenames between two dates
%svn log <file>                        #history message of a file
%svn log -q <file>                    #-q quiet, only show file name
%svn diff --summarize -r42:41  #just pirnt the filenames that has changes between 2 commits

Checkout a Specific Version of files

svn checkout -r 100 https://example.com/svn/repo/trunk/folder [/path/to/workspace/folder]
svn checkout -r {2023-01-01} https://example.com/svn/repo/trunk/folder [/path/to/workspace/folder]


Sometimes You Just Need to Clean Up

#you can just delete the local working copy without the need to notify server
#before removing local working copy, remommend to run svn status just to make sure you dont have anything changes that you want to keep

Saturday, January 13, 2024

PCIe Study Notes

Chapter 1: Background

Nice References

PCI

#TO Study Further: Reflected-Wave Signaling and the impact on number of max electrical load. higher frequencies -> less fewer load allowed on the bus. PCI frequency cannot faster than 66MHz.
Transaction Models: Programmed I/O, DMA, Peer-to-Peer.
PCI is a Shared Bus Architecture; PCI arbitration algorithm must be "fair" and not starve any device for access.
PCI Inefficiencies: 1) Retry: init->target not ready for >16T -> target retry initiator (STOP#); 2) Discconnect: init -> target transfer some data, then not ready for >8T -> target signal disconnect to initiator (STOP#)
Address Space Map: memory Map, IO Map, PCI Configuration Space( 256B per function)
PCI Configuration Header: Type0: non-bridge (6BARs), Type1: bridge (2 BARs)

PCI-X

PCI-X transfer efficiency ~85% vs PCI ~50%~60%
PCI-X Split-Transaction Model: Reqeust-Completer. Completer will memorize transaction (address, transaction type, count, requester ID) and signal a split response.
PCI-X add Attribute stage
PCI-X add MSI(Message Signaled Interrupt): eliminate the need to share interrupts accross multiple deivces. memory write trans, to pre-defined address range, data is interrupt vector of that device. cpu avoids the overhead of finding which device generated the interrupt. no sideband pins are needed.
PCI-X uses common or distributted clock: 1. signal skew; 2) clock skew between multiple devices; 3) flight time budget

PCI-X 2.0 is point-to-point design, instead of shared bus arch.
PCI-X 2.0 uses Source-synchronous clocking model

Chapter 2: PCIe Architecture Overview

PCIe serial data transmission.

Transaction Layer

        Full Duplex
        A Transaction is defined as the combination of a Request packet that a delivers a command to a targeted device, together with any completion packets the target sends back in reply.
        posted transactions and non-posted transactions:
                Memory Read -> Non-Posted
                Memory Write -> Posted
                Memory Read Lock -> Non-Posted
                IO Read -> Non-Posted
                IO Write -> Non-Posted
                Configuration Read (Type 0 and Type 1) -> Non-Posted
                Configuration Write (Type0 and Type 1) -> Non-Posted
                Message -> Posted
        Non-posted write (IO write and Cfg write) and Memory Read Lock can only come from the processor.
        TLP (Transaction Layer Packet)
        QoS: Virtual Channel Buffer
                

Data Link Lyaer

        Layer main function:
                TLP error correction: Ack/Nack causing LP retry
                flow control
                some Link power managment
        DLLP fixed length: 8B
        DLLP (Data Link Layer Packets) only transferred between Data Link Layers of the two neighboring devices on a Link, and not routed anywhere else.
        Ack DLLP has the last good TLP sequence Number, and the transmitter flushes all the TLPs that wer send before the acked sequence number. the sequence number is generated by Data Link Layer, so it's always in order!
        Nack DLLP: Receiver truturns a Nack to the transmitter when it detectrs a TLP error and drops this TLP. the transmitter then will replay all unacknowledged TLPs.
        Error Check and Retry: Replay Buffer
        Flow Control: 
        #TODO: hwen in NAK of MRd, while completer resend the CplD, there maybe new MWr from other requester, so the old CplD data may not be up to date. Is this expected?

Physical Layer

        #TODO: transmission order on the physical line.
        Logical Part
                #TODO: ppm long term will cause overflow, how does ethernet and PCIe resolve this?
                Elastic Buffer
                Byte-Stripped?

        Link Training and Initialization: several things are checked or established to ensure proper and optimal operation, such as:
                Link width
                Link data rate
                Lane erversal - Lanes connected in reverse order
                Polarity inversion - Lane polarity connected backward
                Bit lock per Lane - Recovering the transmitter clock
                Symbol lock per Lane - Findding a recognizable position in the bit-stream
                Lane-to-Lane de-skew within a multi-Lane Link
        
        Electriclal Part

        Ordered Sets: always 4B. always terminate at the neighboring deivecs and are not routed through the PCIe fabric.
                used in 1) link training; 2) ppe compensation; 3) enter/exit of low power state on the link

Chapter 3: Configuration Overview

Enumeration

       PCIe Buses: up to 256 Bus; Initial Bus 0 is tpoically assigned by HW to RC. Bus 0 consists of a Virtual PCI bus with integrated endpoints and Virtual PCI-to-PCI Bridges(P2P), which are hard-coded with a Device No. and Function No.
        PCIe Devices: 32 devices per bus; PCIe point-to-point link means that device will always be Device 0. Deach Device must implement Function 0.
        PCIe Functions: Functions do not need to be implemented sequenctially. SW must check every one of the 8 functions to decide which are present.

         Errors should not normally be reported during enumeration.(the enumeration sw could fail since it's typically writtent o execute before the OS or other error handling SW is available.

        Device not present: SW get Vendor ID of FFFFh, reserved for "device not present"
        #TODO: enumeration started after 100ms+link training after reset, how to do it in DV simulation?
        #TODO: what will enum sw do if there is no EP attached to a pci-pci bridge?
        #multi-root system: what if rc assign start bus number 64 to rc, but later find that the total buses are more then 265-64? is this an error?
        It is important to remember that each port on a switch is aBridge, and thus has its own configuration space with a Type 1 Header. In reality, each port (each P2P Bridge) has its own Type 1 Header and performs the same two checks on TLPs when they are seen by that port.


Chapter 4: Address Space & Transaction Routing

Confugration

        Confugartion Space Total 4KB/function:
                1. (256B/64DW) PCI-compatible spece access by legacy PCI software or PCIe enhanced configuration access mechanism: 64B for Configuration Headers (Type0/Type 1); 192B for Function-Specific Configuration Header Space.
                2. (960DW) PCIe Extended space is only accessible by PCIe Ehanhanced Configuration Access Mechanism: for optional extended capability registers: ...
        Enhanced Configuration Mechanism memory-mapped Address Range:
                A[63:28]: upper bits of the 256MB-aligned base address of the address range allocated for PCIe enhanced configuration mechanism.
        BAR only 32bits, how to handle >4GB address? Mem address 64-bit decoding uses 2 BARs.
        BARs are used for decide which system address can be used to access the PCIe functions' internal locations.
        BARs device has 6, but not all needs to be implemented, harded coded value of all 0 means it's not implemented.
        All BARs must be evaluated sequentially.
        Where is the BAR? it's inside each function?
        #TODO: how base/limit registers to mean unused registers? e.g. if does not use IO address space, how to cfg IO base/limit register?
        Base/Limit only needed in type 1 Bridge ports.

TLP Routing

        Routing based on TLP header FMT/Type fields.
        TLP:
                MRd
                MRdLk
                MWr
                IORd
                IOWr
                CfgRd0
                CfgWr0
                CfgRd1
                CfgWr1
                Msg
                MsgD
                Cpl
                CplD
                CplLk
                CplDLk
                FetchAdd
                Swap
                CAS
                LPrfx
                EPrfx
        Address routing: MRd, MRdLk, MWr, AtomicOp, IORd, IOWr, Msg/MsgD
        ID routing: CfgRd/Wr, Cpl/CplD, CplLk/CplDLk
        Implicit routing: Msg/MsgD, for messages to mimic side-band signals.

        








Sunday, December 31, 2023

how to build a cpu

Adder

 oscillator

SR latch
Level triggered D-type flip-flop
Edge triggered flip-flop: needed for accumulator

Sunday, November 5, 2023

QEMU/KVM virtual machines on linux

I'm running Debian linux 12 on macbook pro.


I find the official Debian Wiki is as always very useful:

KVM - Debian Wiki

and also this Youtube video tutorial for beginners:

QEMU/KVM for absolute beginners by Veronica Explains


install needed packages

%sudo apt install --no-install-recommends qemu-system libvirt-clients libvirt-daemon-system

%sudo apt install virtinst #if you want to create guest os on command line instead of gui


Creating a new guest:

virt-install --virt-type kvm --name bullseye-amd64 \
--location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \
--os-variant debian11 \
--disk size=20 --memory 4000 \
--graphics none \
--console pty,target_type=serial \
--extra-args "console=ttyS0"

Connect to a VM through command line

#Use the virsh console command to log in to a running VM called ‘centos7’ type: 
%virsh console centos7 
#To exit a virsh console session, type 
CTRL + Shift + ]



raspberry pi gpio controls

#gpiozero library https://gpiozero.readthedocs.io/en/stable/#