Development: Spam Filter Web Service

Of the several million applications submitted through CCCApply each year, the vast majority are valid: they come from legitimate applicants who want to attend a California community college. These applications contain personally identifiable information and other critical data that needs to reach the college as quickly and safely as possible. A percentage of applications, however, are submitted through CCCApply for nefarious purposes with the intent to commit fraud. For these, we've developed a system that analyzes, flags, suspends, and ultimately blocks the fraud attempt through a spam filter web service and user interface.

Development of the spam filter web service and user interface began in early 2017 to help colleges make accurate, informed decisions about whether an application is fraudulent. The tool consists of three main components: the post-submission web service, the machine-learning model and prediction service, and the user interface used to review and confirm identified fraud.

This page describes the development project, what it includes, and how it operates.

1 Post-Submission Web Service Process
2 Workflow Process
3 Post-submission Development

Post-Submission Web Service Process

At the end of the CCCApply application process, after the applicant has entered all the application data and confirmed, under penalty of perjury, that the data being submitted is valid and correct, the applicant clicks the "Submit" button to push the application data to the college they are applying to. Everything that happens after that point is considered the post-submission process. This is the point at which the application is routed to the college via the Download Client or through the College Adapter (SuperGlue) for real-time integration with the college student information system.

With the development of the Spam Filter Web Service, every application is intercepted after submission and routed to the spam filter machine learning model and prediction service, which checks whether the data meets the criteria for spam or fraud.

The applications that are legitimate and do not meet the criteria for spam are quickly passed through to the college via their selected data delivery method.

For applications that may be fraudulent, however, the model extracts the data and looks for "identifiers", which are then fed into the machine learning algorithm for full analysis. The prediction service then calculates the probability that the application is bad; in other words, it "suggests a level of confidence" between 1 and 100. The closer the number is to 100, the more likely the application is fraudulent. This is called the Confidence Threshold.
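The confidence level described above can be sketched as a mapping from a model probability to a number between 1 and 100. This is a minimal illustration, not the prediction service's actual code: the function names, the assumption that the model emits a probability in the range 0.0 to 1.0, and the cutoff value of 80 are all hypothetical and do not come from this page.

```python
def confidence_score(probability: float) -> int:
    """Map a probability in [0.0, 1.0] to a confidence level from 1 to 100."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # Scale to a percentage and clamp to the 1..100 range described above.
    return max(1, min(100, round(probability * 100)))


# Hypothetical cutoff; the real value is not stated on this page.
ASSUMED_CUTOFF = 80


def is_suspected_fraud(probability: float, cutoff: int = ASSUMED_CUTOFF) -> bool:
    """Return True when the confidence level reaches the assumed cutoff."""
    return confidence_score(probability) >= cutoff
```

With these assumptions, a probability of 0.97 yields a confidence level of 97 and would be treated as suspected fraud, while 0.12 yields 12 and would pass through.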

At the heart of the web service is the continuously trained machine learning model. The model does NOT make any decisions; it only predicts whether an application matches the "identifiers" that the model has collected from thousands of applications already confirmed as fraud by the colleges.

Read more about the Machine Learning Model and Prediction Service here.

Workflow Process

The post-submission workflow looks like this:

Application is submitted to CCCApply
Application is stored with a fraud status flag set to PENDING
Application is posted to the prediction service, where the model is applied
Prediction service returns the probability rating that the application is fraudulent
Based on the probability rating, the fraud status flag is updated to “Checked Fraud” or “Checked Not Fraud”
Applications set with “Checked Fraud” are sent to the Suspension folder (User Interface) awaiting confirmation by college staff
College staff confirm fraud labels via User Interface
Application fraud label confirmation trains the machine learning model
Model is refined over time to better identify and filter fraudulent applications
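The steps above can be sketched as a small status workflow. This is an illustrative outline only: the class and function names, the 0.8 cutoff, and the CONFIRMED_FRAUD status are assumptions, and the `predict` callable stands in for the real prediction service, whose interface is not described on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple


class FraudStatus(Enum):
    PENDING = "PENDING"
    CHECKED_FRAUD = "CHECKED_FRAUD"              # spelling inferred from "Checked Fraud"
    CHECKED_NOT_FRAUD = "CHECKED_NOT_FRAUD"
    CONFIRMED_FRAUD = "CONFIRMED_FRAUD"          # assumed name, not listed on this page
    CONFIRMED_NOT_FRAUD = "CONFIRMED_NOT_FRAUD"


@dataclass
class Application:
    app_id: int
    data: Dict[str, str]
    # Every application is stored as PENDING when it is submitted.
    fraud_status: FraudStatus = FraudStatus.PENDING


def check_application(
    app: Application,
    predict: Callable[[Dict[str, str]], float],
    cutoff: float = 0.8,  # hypothetical cutoff
) -> Application:
    """Post the application to the predictor and update its status flag."""
    probability = predict(app.data)
    if probability >= cutoff:
        app.fraud_status = FraudStatus.CHECKED_FRAUD
    else:
        app.fraud_status = FraudStatus.CHECKED_NOT_FRAUD
    return app


def confirm_label(
    app: Application,
    is_fraud: bool,
    training_labels: List[Tuple[int, bool]],
) -> None:
    """Record a college staff decision and keep it as a training label."""
    app.fraud_status = (
        FraudStatus.CONFIRMED_FRAUD if is_fraud else FraudStatus.CONFIRMED_NOT_FRAUD
    )
    training_labels.append((app.app_id, is_fraud))
```

In this sketch the confirmed labels accumulate in a plain list; in the real system they would feed the ongoing training of the model.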

Post-submission Development

Download client:
The major change to the download client is that applications will not be available to download unless they have a fraud_status of LEGACY, NOT_CHECKED, CONFIRMED_NOT_FRAUD, or CHECKED_NOT_FRAUD.
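The download rule can be expressed as a simple allow-list check. The four status values come from the description above; the function name and the dictionary shape used for an application are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Statuses that make an application available to the download client.
DOWNLOADABLE_STATUSES = frozenset(
    {"LEGACY", "NOT_CHECKED", "CONFIRMED_NOT_FRAUD", "CHECKED_NOT_FRAUD"}
)


def downloadable(applications: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only applications whose fraud_status is on the allow-list."""
    return [
        app for app in applications
        if app.get("fraud_status") in DOWNLOADABLE_STATUSES
    ]
```

An allow-list means that any status not named here, including PENDING, keeps the application out of the download until it is resolved.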

Export for training:
The Apply team will develop a new tool that can be used to export applications. This tool will dump applications into a CSV file, PGP-encrypt the file, and copy it to an S3 bucket for Infiniti. The file will contain the application data and the fraud status for each application. Infiniti will use this file to perform ongoing training of its prediction model.
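The CSV step of the export might look like the sketch below, which uses only the Python standard library. The function name, the field names in the usage example, and the in-memory output are assumptions; the PGP encryption and the copy to the S3 bucket are deliberately left out because this page does not describe how they are performed.

```python
import csv
import io
from typing import Dict, Iterable, Sequence


def export_to_csv(
    applications: Iterable[Dict[str, object]],
    data_fields: Sequence[str],
) -> str:
    """Write application data plus fraud_status for each application as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(data_fields) + ["fraud_status"],
        extrasaction="ignore",  # skip keys that are not part of the export
    )
    writer.writeheader()
    for app in applications:
        writer.writerow(app)
    return buffer.getvalue()
```

For example, calling `export_to_csv(apps, ["app_id", "email"])` produces a header row of `app_id,email,fraud_status` followed by one row per application.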