1. bloom filter
If you're going to crawl a large site, the technique above of simply using a set won't hold up: the set will grow huge and use a lot of memory. The solution is a Bloom filter. I won't go into much detail, but a Bloom filter essentially lets us record which URLs we have seen and then ask "Have I seen this URL before?", getting an answer that is very likely (though not guaranteed) to be correct. It can wrongly report an unseen URL as seen (a false positive), but it never reports a seen URL as unseen, and it does this using very little memory. pybloomfiltermmap is a good Bloom filter implementation for Python.
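To make the idea concrete, here is a minimal pure-Python sketch of a Bloom filter. It is not how pybloomfiltermmap is implemented; the class name `BloomFilter`, its `capacity` and `error_rate` parameters, and the SHA-256 double-hashing trick are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter: a bit array plus k derived hash positions."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing formulas: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest
        # (double hashing) instead of running k separate hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "Probably seen" only if every derived bit is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter(capacity=100_000, error_rate=0.001)
seen.add("http://example.com/a")
print("http://example.com/a" in seen)  # True: an added item is always found
print("http://example.com/b" in seen)  # almost certainly False
```

With these settings the filter needs roughly 180 KB for 100,000 URLs, regardless of how long the URLs are, which is where the memory saving over a plain set comes from.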
Important: Make sure you normalize your URL before inserting it into the set / Bloom filter. Depending on the site, you may want to remove all query params, etc. You don't want to store the same URL many times.
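A rough sketch of what normalization could look like, using only the standard library. The function name `normalize_url` and the `keep_query` flag are made up for this example, and the specific rules (lowercase scheme and host, drop the fragment, strip a trailing slash, drop the query string by default) are assumptions that will not suit every site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, keep_query=False):
    """Return a canonical form of url for de-duplication (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Assumes the URL carries no user:password part, which would be lowercased too.
    netloc = parts.netloc.lower()
    # Treat "/a/" and "/a" as the same page; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    query = parts.query if keep_query else ""
    # The fragment never reaches the server, so it is always dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


seen = set()
for raw in ["HTTP://Example.com/a/?x=1#top", "http://example.com/a"]:
    seen.add(normalize_url(raw))

print(seen)  # {'http://example.com/a'}: both inputs collapse to one entry
```

The same normalized string would be what gets passed to the Bloom filter's add and membership check, so that variants of one URL are only stored once.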
2. sorting
3. greatest common divisor (GCD)?
4. (Fast) Fourier Transform
An Interactive Guide To The Fourier Transform
Digital Signal Processing
The Scientist & Engineer's Guide to Digital Signal Processing
5. calculus
Paul's online notes
Trig Cheat Sheet
6. parity calculation