I have had this blog up for one day, almost 24 hours, and currently have 132 comments on the first “hello world” post. Needless to say, this is all SPAM.
There are, of course, many anti-spam solutions available for WordPress; however, for the sake of this exercise, we’re going to ignore those. If I were building this for a client, I would install a plugin and move on with my day.
If I were installing this for a client, I would also not be self-hosting WordPress, but that’s another story.
Okay. First step. Managing SPAM Comments.
In the WordPress admin panel, you can navigate to /wp-admin/edit-comments.php and manage the comments one by one, or eight at a time. You can mark them as spam or delete them.
At the current count of … 152 SPAM comments, this is time-consuming and not fun.
So. Let’s explore other ways to manage these SPAM comments.
Before I figure out how I want to manage these SPAM comments, I want to get a good idea of the data.
To do that, I’m going to download the comments and then review the data with Python and pandas.
Getting the Data
I like to use WP-CLI as a tool to manage my WordPress installations. There are install instructions at the link provided. WP-CLI is a command line tool that allows a developer to run commands against their WordPress site. If you are comfortable with working in a Linux terminal, I would recommend you take a look at WP-CLI.
As an example, for this project, I can log in to my server (or run this remotely, which I do) and run a command to get all the comments from my site:
wp comment list
And that provides the following output:
That’s a helpful start. There are, however, other fields available for comments that are not included in the default list. You can review the docs here: https://developer.wordpress.org/cli/commands/comment/list/
For this project, three fields I want that are not in the default list are comment_author_IP, comment_author_url, and comment_content. I want to see where these SPAM comments are coming from, what the author posted as their URL, and what the content of the comment was.
I also want the output to be a CSV file because I’m going to parse it and look at it further and I want to save it to a file. This changes my command to the following:
wp comment list --fields=comment_ID,comment_date,comment_author_IP,comment_author_url,comment_content --format=csv > data/spam_report.csv
I run that command and it saves the output to my local machine as a CSV file.
Parsing the data
Now that I’ve downloaded my comments as a CSV file, I want to parse the data and see what we can see.
I use Python, pandas, and Jupyter Notebook for this task. Installing those is a discussion by itself and perhaps its own future article, although there are already plenty of sources for that. For now I will just focus on parsing the data.
First, after starting my Jupyter Notebook server, I import pandas and read the downloaded CSV file into a DataFrame.
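Since the notebook itself is not reproduced here, this is a minimal sketch of what that loading step might look like. The helper name load_comments and the inline sample rows are assumptions for illustration only (the IPs and URLs are made-up documentation addresses); the column names match the --fields list in the command above.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/spam_report.csv so the sketch runs anywhere.
# These rows are invented; they are not the real comments from the site.
SAMPLE_CSV = """comment_ID,comment_date,comment_author_IP,comment_author_url,comment_content
2,2020-01-01 10:00:00,192.0.2.10,http://example.com/a,Nice post
3,2020-01-01 10:05:00,192.0.2.10,http://example.org/b,Great read
4,2020-01-01 10:09:00,203.0.113.7,,Buy now
"""


def load_comments(source):
    """Read a WP-CLI CSV export into a DataFrame, parsing the date column."""
    return pd.read_csv(source, parse_dates=["comment_date"])


# With the real export this would be: df = load_comments("data/spam_report.csv")
df = load_comments(io.StringIO(SAMPLE_CSV))

print(len(df))    # number of comments in the export
print(df.head())  # first few rows, similar to what a notebook cell displays
```

In a notebook, leaving df as the last expression in a cell renders the table, and len(df) gives the comment count.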
From the screenshot, we can see I have, at this count, 166 comments that are, safe to say, all spam.
One thing I’m curious about: where are these coming from? For that, I’ll group by the IP address and get a count.
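A rough sketch of that grouping, assuming the DataFrame has the comment_author_IP column from the export. The function name count_by_ip and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from the exported CSV.
df = pd.DataFrame(
    {
        "comment_ID": [2, 3, 4, 5],
        "comment_author_IP": [
            "192.0.2.10",
            "192.0.2.10",
            "203.0.113.7",
            "192.0.2.10",
        ],
    }
)


def count_by_ip(comments):
    """Return the number of comments per author IP, most frequent first."""
    return (
        comments.groupby("comment_author_IP")
        .size()
        .sort_values(ascending=False)
    )


ip_counts = count_by_ip(df)
print(ip_counts)
```

The same result can be had with df["comment_author_IP"].value_counts(); the groupby form is shown because it extends naturally to grouping on more than one column.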
Out of the list, I have five very similar IP addresses that seem to be causing the majority of the comment SPAM.
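One way to make "very similar" measurable is to bucket addresses by network. The sketch below assumes the similar addresses are IPv4 and share their first three octets (a /24 block), which may or may not hold for the real data. The helper subnet_of and the sample addresses are hypothetical.

```python
import ipaddress

import pandas as pd


def subnet_of(ip, prefix=24):
    """Return the network containing `ip`, e.g. 192.0.2.10 -> 192.0.2.0/24."""
    # strict=False lets us pass a host address instead of a network address.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


# Hypothetical sample addresses from the reserved documentation ranges.
ips = pd.Series(["192.0.2.10", "192.0.2.11", "192.0.2.200", "203.0.113.7"])

# Count comments per /24 block instead of per individual address.
subnet_counts = ips.map(subnet_of).value_counts()
print(subnet_counts)
```

If most of the spam collapses into one or two blocks, that is a hint the traffic comes from a single provider or botnet range.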
Let’s take a look at the author URLs as well.
And let’s take a look at the domains.
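A minimal sketch of pulling the domain out of each author URL with the standard library, assuming some URLs may be empty or missing a scheme. The helper domain_of, the choice to strip a leading "www.", and the sample URLs are assumptions for illustration.

```python
from urllib.parse import urlparse

import pandas as pd


def domain_of(url):
    """Return the lowercase host of a URL, or None if there is no usable URL."""
    if not isinstance(url, str) or not url.strip():
        return None  # covers NaN/None from empty CSV cells
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = urlparse(url).hostname
    if host and host.startswith("www."):
        host = host[4:]
    return host


# Hypothetical sample URLs; the real ones come from comment_author_url.
urls = pd.Series(
    ["http://www.example.com/a", "https://example.com/b", None, "example.org"]
)

# value_counts() ignores the missing entries by default.
domain_counts = urls.map(domain_of).value_counts()
print(domain_counts)
```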
So I have a process now where I can download the comment spam (or I guess all comments) and analyze some basic information.
There are a couple of things I can do now. I could programmatically mark them all as spam. I can also programmatically delete them. However, that only solves for this batch. What I want to do is see if I can create my own spam filter and block this on my own, without a third party paid service.
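For the "programmatically" part, one option is to drive WP-CLI from Python. This is only a sketch: it assumes wp is on the PATH and is run from the WordPress directory, and the helper names are hypothetical. It relies on the existing wp comment spam and wp comment delete subcommands, both of which accept one or more comment IDs.

```python
import subprocess


def build_wp_comment_command(action, comment_ids, force=False):
    """Build the argv list for `wp comment spam` or `wp comment delete`."""
    if action not in ("spam", "delete"):
        raise ValueError(f"unsupported action: {action!r}")
    ids = [str(int(i)) for i in comment_ids]  # int() rejects non-numeric IDs
    if not ids:
        raise ValueError("no comment IDs given")
    cmd = ["wp", "comment", action, *ids]
    if action == "delete" and force:
        cmd.append("--force")  # skip the trash and delete permanently
    return cmd


def run_wp_comment_command(action, comment_ids, force=False):
    """Run the command; raises CalledProcessError if WP-CLI reports failure."""
    cmd = build_wp_comment_command(action, comment_ids, force=force)
    return subprocess.run(cmd, check=True)


# Preview the command without touching the site.
print(build_wp_comment_command("spam", [2, 3, 4]))
```

Keeping the command builder separate from the function that runs it makes it easy to print and check the command before letting it modify any comments.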
Again, if I were doing this for a client, I would go grab one of those existing services and be done with it.
Next, I will work on attempting to determine which of these comments are spam and which are not.
More to come….