FAQ: Spam Filter Web Service

Updated 10.02.2020


Q:  Why is the spam filter not catching all my fraud? Why is the spam filter incorrectly tagging some legitimate applications as fraud?

The initial model was built on historical data in which fraud was tagged by a small group of about 8 colleges. In 3 weeks of production data covering 162,000 applications, the colleges confirmed 453 false positives (about 0.28%). There will also be a small percentage of false negatives, which has yet to be determined. The current model will be in use only for a short time, until we enable continuous re-training. We expect the performance of the model to continue improving, as it now uses tagged data from all the colleges.
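As a quick check of the figures above, the share of confirmed false positives can be computed directly from the two counts. This is only an arithmetic sketch: the function and variable names are made up for illustration, and the result is the share of all applications, not a false positive rate measured against legitimate applications only.

```python
# Counts quoted in the answer above (3 weeks of production data).
total_applications = 162_000
confirmed_false_positives = 453


def false_positive_share(false_positives: int, total: int) -> float:
    """Return confirmed false positives as a percentage of all applications."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * false_positives / total


share = false_positive_share(confirmed_false_positives, total_applications)
print(f"{share:.2f}%")  # prints 0.28%
```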


In continuous nightly re-training, the model uses the feedback provided by the colleges to improve and evolve. If a fraud signature is not caught, the feedback provided by the college application specialists accumulates and improves the model’s response at the next re-training. The same goes for signatures that the model incorrectly tagged as fraud: once these are retagged as not fraud, subsequent runs of the model will handle these signatures correctly.
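One way to picture this feedback loop is a labeling step that runs before each nightly re-training, where a college's tag replaces the model's own tag whenever the two disagree. The sketch below is a hypothetical illustration, not the service's actual code: the `Application` class, its field names, and the label strings are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Application:
    app_id: str
    model_label: str                     # "fraud" or "not_fraud", as tagged by the model
    college_label: Optional[str] = None  # set only when a college has retagged it


def build_training_labels(apps: List[Application]) -> Dict[str, str]:
    """Use the college's tag where one exists, otherwise keep the model's tag."""
    return {
        app.app_id: app.college_label if app.college_label is not None else app.model_label
        for app in apps
    }


apps = [
    Application("A1", model_label="not_fraud", college_label="fraud"),  # missed fraud
    Application("A2", model_label="fraud", college_label="not_fraud"),  # false positive
    Application("A3", model_label="not_fraud"),                         # no feedback given
]
labels = build_training_labels(apps)
print(labels)
```

In this sketch both kinds of correction (missed fraud and false positives) flow into the next training set in the same way, which matches the description above.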

Q:  Why does the model tag some applications as fraud and others as not fraud when I see similar signatures in both groups of applications?

The model is built to use more than 200 features. These features are a combination of application fields and engineered features. The model uses all of these features in determining an answer. As a result, it is not straightforward to explain why the model made a given decision. Where reasonable, we can dig into specific applications or groups of applications to study the decisions made by the model so that we can explain them better.


As explained in the previous question, when some of the decisions made by the model are inaccurate, the feedback from the colleges changes the way the model uses all the features, so that it comes up with a more accurate answer.

Q:  What about new fraud signatures?

If the signature is entirely new, the model will rely on initial feedback from the colleges to start learning it. As tagged applications with this signature accumulate, the model will learn the signature and predict it accurately.

Q:  How long will the model take to learn a new signature?

This depends on how much tagged data is available for the model to capture the underlying patterns. With accurately tagged data, the model will pick up on the underlying patterns and signatures very quickly.

Q:  How can the colleges help?

The model is designed to be a human-in-the-loop system. Feedback from the colleges is critical for the model to continue to perform well. The colleges can aid the system in the following two ways:


Ensure that all applications in the suspend folder are reviewed and closed. If an application is marked as fraud in the suspend folder, we require confirmation by the college before it can be used in subsequent re-training. The larger the suspend folder, the slower the model evolves.


Identify new fraud signatures in a timely manner. When new fraud is tagged quickly, the model gets to learn from it and capture it automatically in subsequent runs.
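To illustrate why an unreviewed suspend folder slows the model down, the sketch below splits a folder into applications that have been reviewed and closed (and can therefore feed the next re-training) and applications that are still waiting. It is a minimal, hypothetical example: the `SuspendedApplication` class and its fields are assumptions and do not describe the real system's data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SuspendedApplication:
    app_id: str
    model_label: str                      # tag assigned by the model, e.g. "fraud"
    reviewed_label: Optional[str] = None  # None means no college has reviewed it yet


def split_for_retraining(
    folder: List[SuspendedApplication],
) -> Tuple[List[SuspendedApplication], List[SuspendedApplication]]:
    """Return (usable, pending): only reviewed applications are usable for re-training."""
    usable = [app for app in folder if app.reviewed_label is not None]
    pending = [app for app in folder if app.reviewed_label is None]
    return usable, pending


folder = [
    SuspendedApplication("S1", "fraud", reviewed_label="fraud"),      # confirmed fraud
    SuspendedApplication("S2", "fraud", reviewed_label="not_fraud"),  # corrected by college
    SuspendedApplication("S3", "fraud"),                              # still unreviewed
]
usable, pending = split_for_retraining(folder)
print(len(usable), "usable,", len(pending), "pending")  # prints 2 usable, 1 pending
```

Every application left in the pending group is training data the model cannot use yet, so a large backlog directly delays what the model can learn.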