Saturday, September 15, 2018

learn some programming algorithms


1. bloom filter

If you're going to be crawling a large site, then simply keeping every visited URL in a set won't hold up. The set will grow huge and start using a ton of memory. The solution to this is a Bloom filter. I won't go into much detail, but a Bloom filter essentially lets us record which URLs we have seen, then ask the question "Have I seen this URL before?" and get an answer that is very likely correct. It may occasionally say "yes" for a URL it has never seen (a false positive), but it never says "no" for a URL that was added. It can do this using very little memory. pybloomfiltermmap is a good Bloom filter implementation for Python.
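To make the idea concrete, here is a minimal pure-Python sketch of a Bloom filter. It is not the pybloomfiltermmap API. The class name `BloomFilter`, its parameters `capacity` and `error_rate`, and the choice of SHA-256 with double hashing are all assumptions made for illustration. For a real crawl you would likely want a tested library instead.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter (illustrative only): a bit array plus k hash positions."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing formulas for n items and false-positive rate p:
        #   bits   m = -n * ln(p) / (ln 2)^2
        #   hashes k = (m / n) * ln 2
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if all derived bits are set. This can be a false positive,
        # but an item that was added is never reported as missing.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter(capacity=1000, error_rate=0.01)
seen.add("http://example.com/page1")
print("http://example.com/page1" in seen)
```

In a crawler you would check `url in seen` before fetching and call `seen.add(url)` afterwards. With these settings the filter uses roughly 1.2 KB for 1000 URLs, regardless of how long the URLs are.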

Important: Make sure you normalize your URL before inserting it into the set / Bloom filter. Depending on the site you may want to remove all query params etc. You don't want to be storing the same URL a ton of times.
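One possible normalization, sketched with the standard library's `urllib.parse`. The function name `normalize_url` and the `keep_params` argument are hypothetical, and the exact rules (lowercase scheme and host, drop default ports, fragments and user info, keep only whitelisted query params) are assumptions that will differ from site to site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url, keep_params=()):
    """Return a canonical form of url so duplicates compare equal (illustrative rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname drops user info and the port

    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"

    path = parts.path or "/"

    # Keep only the query params the caller asked for, in sorted order,
    # so that "?a=1&b=2" and "?b=2&a=1" normalize to the same string.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted((k, v) for k, v in pairs if k in keep_params))

    # An empty fragment argument drops anything after "#".
    return urlunsplit((scheme, host, path, query, ""))


print(normalize_url("HTTP://Example.com:80/a?utm_source=x&id=3#frag", keep_params=("id",)))
```

Passing an empty `keep_params` removes all query params, which matches the "remove all query params" case. For sites where a param selects the page content (an `id`, for example), list it in `keep_params` so distinct pages are not collapsed into one.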

2. sorting



6. parity calculation

C Programming

Header Files and Includes
https://cplusplus.com/forum/articles/10627/
https://stackoverflow.com/questions/2762568/c-c-include-header-file-or...