Search Ranking Algorithms, Duplicate Content, and Spam Detection
This is the first of eight videos in the Machine Learning for Marketers series I’m putting together for you. This series debunks the misinformation I constantly see when marketers talk about machine learning and the future of ________ [search, social, etc.; pick your poison]. Machine learning isn’t magic and it isn’t hand-wavy, and understanding it is far more within reach than you might realize.
By the end of the eight videos, you’ll understand far more precisely how to navigate the current and future impact of machine learning in search marketing, link building, content marketing, and reputation management. You’ll also see how it applies to the social graph and big data, and why machine learning is an important step toward artificial intelligence but isn’t quite there yet. I’m boiling it all down for you in clear and simple terms.
While I’d love for you to sign up for ontolo, none of this is a sales pitch. Just direct explanations of how machine learning is influencing the current marketing landscape.
This week’s video, at about 28 minutes, is going to be the longest in the series, but it will help you understand a tremendous amount about how a search engine actually ranks documents, identifies duplicate content, and detects spam. In it, I head up from Boulder, past Gold Hill, and end up getting a bit lost in the woods. It all worked out, aside from a bit of a beating taken by the 4Runner.
More About Search Ranking Algorithms, Duplicate Content, and Spam Detection
Here are some other links you might want to check out if you’re looking for more information on any of this. I’ll continue to add to this list as I find new, interesting, and relevant stuff, so check back in regularly:
- Wikipedia: Inverted Index
- Part of a really fantastic, technical series on how a search index works.
- Another really fantastic, technical series on how a search index works.
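If you’d rather see the idea in code, here is a minimal sketch of an inverted index. It is a toy, not how any particular search engine builds one: the function names `build_inverted_index` and `search` are mine, tokenizing is just lowercasing and splitting on whitespace, and a query matches only documents that contain every query term.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: whitespace tokenization is good enough for a demo.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing ALL of the query terms."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start with the postings for the first term, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = [
    "cheap flights to denver",
    "denver hiking trails and denver maps",
    "machine learning for marketers",
]
index = build_inverted_index(docs)
matches = search(index, "denver hiking")
```

The point to notice is that the lookup goes from term to documents, so a query only touches the postings for its own terms instead of scanning every document.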
On-Page Ranking Algorithms (TF/IDF, BM25, etc)
- Wikipedia: TF/IDF
- Wikipedia: BM25
- Implementing BM25 from a modified TF/IDF in Lucene (powers Solr and Elasticsearch)
- BM25 vs. Lucene similarity on the Elasticsearch blog.
- …BM25 isn’t always a better algorithm for all data sets.
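To make the two scoring approaches concrete, here is a rough sketch of both over a tiny set of made-up documents. Everything in it is an assumption for illustration: the function names, the whitespace tokenizer, the plain `log(N / df)` form of IDF, and the BM25 variant with the commonly used defaults `k1=1.2` and `b=0.75` and an IDF that is kept non-negative by adding 1 inside the log. Real implementations differ in the details.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def document_frequencies(tokenized_docs):
    """Count how many documents each term appears in."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def tf_idf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = document_frequencies(tokenized)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                idf = math.log(n / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document with a common BM25 formulation."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = document_frequencies(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            f = tf[term]
            if f == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is length-normalized (b).
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "cheap flights to denver",
    "denver hiking trails and denver maps",
    "machine learning for marketers",
]
tfidf = tf_idf_scores("denver hiking", docs)
bm25 = bm25_scores("denver hiking", docs)
```

On this toy data both functions rank the second document first and give the third a score of zero. The practical difference is that BM25 dampens the benefit of repeating a term and adjusts for document length.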
Duplicate Content Detection: Cosine Similarity
- Wikipedia: Cosine Similarity
- Coursera: Machine Learning, Clustering and Retrieval: Cosine Similarity. Math-heavy.
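Here is a small sketch of cosine similarity applied to duplicate detection, assuming each page is reduced to a simple bag-of-words term vector. The helper names `term_vector` and `cosine_similarity` and the example sentences are mine, and a real system would pick its own weighting and cutoff.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts; word order is deliberately ignored."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = term_vector("the quick brown fox jumps over the lazy dog")
near_dup = term_vector("the quick brown fox leaps over the lazy dog")
unrelated = term_vector("machine learning for marketers")

near_score = cosine_similarity(original, near_dup)
unrelated_score = cosine_similarity(original, unrelated)
```

Swapping a single word leaves the two vectors pointing in almost the same direction, so `near_score` comes out close to 1, while the unrelated sentence shares no terms and scores 0.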
Spam Detection: Bayes Theorem
- A fantastic 20-minute documentary on how Bayes Theorem was used to find sunken treasure 200 miles off the coast of South Carolina.
- Wikipedia: Bayes Theorem
- One of the best explanations of Bayes Theorem I’ve ever seen.
- Khan Academy: Conditional probability with Bayes Theorem
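As a worked example of Bayes Theorem in a spam setting, here is a minimal sketch that updates the probability a page is spam after seeing one piece of evidence, such as a particular word. The function name `spam_posterior` and all three input probabilities are invented for illustration; they are not measured from any real data set.

```python
def spam_posterior(prior_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) via Bayes Theorem.

    P(spam | word) = P(word | spam) * P(spam) / P(word)
    where P(word) sums over both the spam and not-spam cases.
    """
    prior_ham = 1.0 - prior_spam
    evidence = p_word_given_spam * prior_spam + p_word_given_ham * prior_ham
    if evidence == 0:
        # The word never appears in either class, so it tells us nothing.
        return prior_spam
    return p_word_given_spam * prior_spam / evidence


# Made-up numbers: 20% of pages are spam, the word shows up on 60% of
# spam pages and on 5% of legitimate pages.
posterior = spam_posterior(0.2, 0.6, 0.05)
```

With those made-up inputs the result is 0.12 / (0.12 + 0.04) = 0.75, so a single telling word moves the estimate from 20% to 75%.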
Machine Learning for Marketers: Schedule
- Introduction to Machine Learning for Marketers:
Cantankerous and Cranky.
Importance of Machine Learning for Your Future in Marketing.
- Search Indexes and On-Page Factors:
Inverted Indexes vs Relational Databases.
TF/IDF, Cosine Similarity, and BM25.
- Topics and Categorizing Web Pages:
Classification with Term Vectors.
Bag of Words and Window Sizes, and Clustering.
- Link Value Graphs:
Applications to SEO.
- Link Relationship Graphs:
Nearest Neighbor and Shortest Path.
Applications to Social Media Marketing.
- Modeling the Human Brain:
Artificial and Convolutional Neural Networks.
Extra: Putting it All together.
- Not Quite AI: The Human Element:
Normalization, Data Structures, and Training Models.
Extra: How Penalties, including “Manual” Penalties, Work.
- Understanding the “Big” in Big Data:
Limitations in the Physical World.
Data Storage, Parallel Computing (and GPUs), Distributed Computing.
You Probably Don’t Need (And Shouldn’t Use) Machine Learning.
I hope you enjoy it as much as I’m enjoying putting it together for you. If you think of anything I’ve left out, let me know.
P.S. The series is two weeks behind for two reasons:
- This first video took far longer than I anticipated. This is my first real foray into the medium of video, but I’ve decided to commit to video for the long term and will be doing it for far longer than this eight-week series. So, it’s important for me to do this well for you, to a higher degree of quality, information density, and entertainment than you’ve seen before. Expect to see the quality improve as time goes on and I get my legs under me.
- While good gear doesn’t guarantee better quality, it definitely helps at a certain point. I have some new equipment coming in a couple of days that will greatly improve the video quality. So I’m going to do my best to get caught up on weeks 2 and 3 over the next couple of weeks, but either way, this will be wrapped up the week before Thanksgiving.