The purpose of a spam filter is to prevent unsolicited email (spam) from reaching the inbox. Spam filters analyze various criteria gathered from the email to determine whether it is spam. Mailbox providers then create a spam score based on those criteria to determine whether a campaign passes through the filter. The spam score criteria vary by mailbox provider, so a campaign can get through some filters and fail to get through others. Generally, a campaign must trigger multiple negative criteria checkpoints to be filtered as spam.
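As a rough illustration of how a spam score can combine several criteria, the sketch below sums a weight for each triggered criterion and compares the total with a provider-specific threshold. The criterion names, weights, and thresholds are all hypothetical; mailbox providers do not publish theirs.

```python
# Hypothetical criteria and weights, chosen only for illustration.
WEIGHTS = {
    "fails_authentication": 2.5,
    "sender_on_blocklist": 3.0,
    "high_complaint_rate": 2.0,
    "url_on_blocklist": 2.0,
    "image_only_body": 1.0,
}


def spam_score(triggered):
    """Sum the weights of every criterion the message triggered."""
    return sum(WEIGHTS[name] for name in triggered)


def is_filtered(triggered, threshold):
    """Treat a message as spam once its score reaches the threshold."""
    return spam_score(triggered) >= threshold


# The same message can pass one provider's filter and fail another's.
message = ["fails_authentication", "image_only_body"]
print(spam_score(message))                  # total weight for this message
print(is_filtered(message, threshold=3.0))  # a stricter, assumed provider
print(is_filtered(message, threshold=5.0))  # a more lenient, assumed provider
```

With these assumed weights, a single minor criterion such as an image-only body is not enough on its own to reach either threshold, which mirrors the point that several negative criteria usually have to be triggered together.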
Although there are more than a few ways to develop a spam filter, the three methodologies most often employed are:
- Algorithms: A set of rules that are followed based on email and sender characteristics.
- Heuristics: A problem-solving technique used to find an approximate solution when an exact solution cannot be found.
- Bayesian: A more advanced form of heuristics that relies on mathematical and statistical techniques to estimate the probability that something is true, such as the probability that a message is spam.
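To make the Bayesian approach more concrete, here is a minimal naive Bayes sketch that learns word counts from labeled messages and estimates the probability that a new message is spam. The training messages are invented, the tokenizer is a plain whitespace split, and the code assumes at least one spam and one non-spam training message; production filters are considerably more elaborate.

```python
import math
from collections import Counter


def train(messages):
    """Count word occurrences per label from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Estimate P(spam | words) with add-one (Laplace) smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    spam_tokens = sum(counts[True].values())
    ham_tokens = sum(counts[False].values())
    # Start from the prior odds, then add the evidence from each known word.
    log_odds = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_spam = (counts[True][word] + 1) / (spam_tokens + len(vocab))
        p_ham = (counts[False][word] + 1) / (ham_tokens + len(vocab))
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Invented training data, for illustration only.
TRAINING = [
    ("win free prize now", True),
    ("free money now", True),
    ("meeting agenda for monday", False),
    ("project status update", False),
]
counts, totals = train(TRAINING)
print(spam_probability("free prize", counts, totals))
print(spam_probability("monday meeting", counts, totals))
```

Working in log odds keeps the arithmetic stable when many words contribute evidence, and the smoothing prevents a word seen in only one class from forcing the probability to exactly 0 or 1.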
Spam filters use hundreds of data points to determine whether an email is spam, but they generally focus on four main aspects:
- The sender's identity
- The IP and domain reputation of the sender
- Subscriber behavior
- The content of the email
Sender identity
One aspect that spam filters commonly analyze is the identity used by the sender. A mailbox provider uses it to record the IP and domain reputation associated with the sender, which helps determine trustworthiness. Analyzing sender identity allows the filter to recognize spammers faster and then move their messages to the spam folder or block the sender completely.
Spam filters combine the IP address and the domains found within the email to create a sender identity and history, which is recorded and used to identify and filter future email. Senders also have the option of using authentication techniques that make it easier for a spam filter to identify them; this is highly recommended but not required. Spam filters may or may not use all three forms of identity authentication as part of their filtering criteria.
The three forms of authentication are:
- Sender Policy Framework (SPF): An SPF record is a type of Domain Name System (DNS) record that identifies which email servers are permitted to send email on behalf of the domain. The purpose of an SPF record is to prevent spammers from sending messages with forged From addresses at the domain. Recipients can refer to the SPF record to determine whether a message purporting to be from a domain comes from an authorized email server.
- DomainKeys Identified Mail (DKIM): DKIM lets an organization take responsibility for a message that is in transit. The organization is a handler of the message, either as its originator or as an intermediary. Its reputation is the basis for evaluating whether to trust the message for further handling, such as delivery. Technically, DKIM provides a method for validating a domain name identity that is associated with a message through cryptographic authentication.
- Domain-based Message Authentication, Reporting & Conformance (DMARC): A DMARC record builds on the SPF and DKIM authentication protocols to tell mailbox providers how to treat email with a forged sending domain. A sender can instruct a mailbox provider to take no action on, quarantine, or reject email whose sender identity cannot be verified through SPF or DKIM upon receipt.
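The way a DMARC policy is applied can be sketched in a few lines. The example below parses a DMARC TXT record into its tags and returns the requested handling for a message, given whether the SPF and DKIM checks passed for the sending domain. The record uses the reserved example.com domain, the function names are hypothetical, and the fallback to "none" when the `p` tag is missing is a simplifying assumption rather than the full specification.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a tag -> value dictionary."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_disposition(record, spf_aligned_pass, dkim_aligned_pass):
    """Return the handling a receiver is asked to apply to a message.

    The two booleans are assumed to be computed elsewhere: each says
    whether that check passed for the domain in the From address.
    """
    tags = parse_dmarc(record)
    if tags.get("v") != "DMARC1":
        return "no-policy"  # not a DMARC record at all
    if spf_aligned_pass or dkim_aligned_pass:
        return "pass"       # one passing, aligned check is enough
    return tags.get("p", "none")  # none, quarantine, or reject


# Illustrative record for a reserved example domain.
RECORD = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"
print(parse_dmarc(RECORD))
print(dmarc_disposition(RECORD, spf_aligned_pass=False, dkim_aligned_pass=False))
print(dmarc_disposition(RECORD, spf_aligned_pass=True, dkim_aligned_pass=False))
```

The published policy is a request to the mailbox provider, which may still apply its own local filtering on top of it.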
IP and domain reputation
Reputation-based filters automatically apply the mailbox provider's email flow policies based on the reputation score of the sender. As the filter receives inbound email, it performs a threat assessment of the sender. This assessment returns a reputation score, which is linked to email flow policies specified by the mailbox provider's administrator. Generally, the lower the reputation score, the more email is filtered or blocked. Some of the parameters used to generate a reputation score are:
- Complaints
- Unknown users
- Spam traps
- Message composition
- Volume
- Blocklists
- Allowlists
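The link between a reputation score and an email flow policy can be pictured as a simple lookup. In the sketch below, the 0 to 100 scale, the band boundaries, and the policy names are all assumptions made for illustration; each mailbox provider defines its own.

```python
# Hypothetical policy bands: (minimum score, policy), highest band first.
POLICY_BANDS = [
    (80, "accept"),
    (50, "throttle"),
    (20, "spam-folder"),
]


def flow_policy(reputation_score):
    """Map a sender reputation score (assumed 0-100) to a mail flow policy."""
    for minimum, policy in POLICY_BANDS:
        if reputation_score >= minimum:
            return policy
    return "block"  # anything below the lowest band is refused


for score in (95, 60, 30, 5):
    print(score, flow_policy(score))
```

Keeping the bands in a table rather than in nested conditions makes it straightforward for an administrator to adjust the policy without touching the lookup logic.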
Subscriber behavior
Another important aspect of spam filtering is the way users interact with messages from a sender. Subscriber behavior is recorded and used by the spam filter to help determine a sender's reputation and reputation score. Those reputation scores then help the spam filter determine which email is spam. Some of the important aspects of subscriber behavior used by spam filters are:
- Complaints: When a user marks an email as spam (This is Spam [TIS]), the spam filter counts it toward a complaint rate based on the volume from the sender. The higher the rate, the higher the spam score is likely to be. This is seen as a negative interaction by the user towards the sender.
- Rescuing email from the spam folder: When a user moves a message out of the spam folder (This is Not Spam [TINS]), it is also used to determine a rate based on the volume from the sender. The higher the rate, the lower the spam score is likely to be. This is seen as a positive interaction by the user towards the sender.
- Reads: Users are less likely to read email they consider to be spam. Reads are typically time-based: when a user opens a message and keeps it open for a period of time, it implies they are reading the contents. Reads are used to determine a rate based on the volume from the sender. The higher the rate, the lower the spam score is likely to be. This is seen as a positive interaction by the user towards the sender.
- Deleting without opening: Users tend to delete messages without opening them when they are not interested in the content. If a large number of users consistently delete messages without opening them or unsubscribing, the spam score is likely to rise. This is seen as a negative interaction by the user towards the sender.
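Each of the behaviors above turns into a rate measured against the sender's volume. The sketch below shows that calculation for one campaign; the field names and the sample counts are hypothetical, and how a provider weighs each rate is not public.

```python
from dataclasses import dataclass


@dataclass
class EngagementCounts:
    delivered: int         # messages delivered from this sender
    complaints: int        # "This is Spam" (TIS) actions
    rescues: int           # "This is Not Spam" (TINS) actions
    reads: int             # messages opened and kept open
    deleted_unopened: int  # messages deleted without being opened


def engagement_rates(counts):
    """Express each behavior as a fraction of delivered volume."""
    if counts.delivered == 0:
        return {}  # no volume, so no meaningful rates
    return {
        "complaint_rate": counts.complaints / counts.delivered,
        "rescue_rate": counts.rescues / counts.delivered,
        "read_rate": counts.reads / counts.delivered,
        "delete_unopened_rate": counts.deleted_unopened / counts.delivered,
    }


# Invented numbers for a single campaign.
campaign = EngagementCounts(
    delivered=10_000, complaints=20, rescues=5, reads=2_500, deleted_unopened=4_000
)
for name, rate in engagement_rates(campaign).items():
    print(f"{name}: {rate:.2%}")
```

Complaint and delete-without-opening rates count against the sender, while rescue and read rates count in the sender's favor.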
Content
Content analysis technology scans everything in an email: the header, footer, code, HTML markup, images, text color, timestamp, URLs, subject line, text-to-image ratio, language, attachments, and more. Some content filters ignore no part of the message. Others look only at the structure of an email, while still others simply parse URLs out of the message and check them against a blocklist (like the Spam Uniform Resource Identifier Real-time Block List [SURBL]).
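The URL-based approach can be sketched as follows: pull the URLs out of the message body, reduce each to its host name, and check the host and its parent domains against a blocklist. Here a small in-memory set stands in for the blocklist, and its entries are made-up domains under the reserved .example suffix; a real filter would query an actual blocklist service instead.

```python
import re
from urllib.parse import urlparse

# Simple pattern for http/https URLs; stops at whitespace, quotes, and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

# Hypothetical blocklist entries, standing in for a real blocklist lookup.
BLOCKLISTED_DOMAINS = {"bad.example", "phish.example"}


def extract_hosts(body):
    """Return the set of lowercase host names found in URLs in the body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def blocklisted_hosts(body):
    """Return hosts whose name or any parent domain is on the blocklist."""
    hits = set()
    for host in extract_hosts(body):
        labels = host.split(".")
        # Check the full host, then each parent domain (a.b.c, then b.c).
        for i in range(len(labels) - 1):
            if ".".join(labels[i:]) in BLOCKLISTED_DOMAINS:
                hits.add(host)
                break
    return hits


BODY = 'Reset <a href="http://login.bad.example/reset?id=1">here</a> or read https://www.example.com/news.'
print(blocklisted_hosts(BODY))
```

Checking parent domains matters because spammers often vary the subdomain while reusing the same registered domain.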
One well-known method of analyzing content is called fingerprinting. Technology providers like Cloudmark are known for creating fingerprints of email content. A fingerprint is a compact signature derived from a message, which lets a filter recognize later messages with matching content and make decisions about them.
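The general idea of fingerprinting can be illustrated with a deliberately simple sketch: normalize the message body so that trivial variations disappear, then hash the result. This is not Cloudmark's method, which is proprietary; the normalization steps and the use of SHA-256 are assumptions chosen for illustration, and commercial systems typically rely on fuzzier signatures that tolerate larger changes.

```python
import hashlib
import re


def fingerprint(body):
    """Return a hex digest that is stable across trivial message variations."""
    text = re.sub(r"<[^>]+>", " ", body)        # drop HTML tags
    text = re.sub(r"\d+", "#", text.lower())     # mask numbers such as order IDs
    text = " ".join(re.findall(r"[a-z#]+", text))  # keep words, collapse spacing
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = "<p>Claim your PRIZE now, order 12345!</p>"
second = "Claim your prize now,   order 98!"
other = "Meeting moved to Tuesday"

print(fingerprint(first) == fingerprint(second))  # same content after normalizing
print(fingerprint(first) == fingerprint(other))   # different content
```

Once a fingerprint has been associated with spam, any later message that produces the same fingerprint can be treated the same way without re-analyzing its content.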