Fraudulent Applications: Machine Learning Research Study

Soon after the first wave of fraudulent applications was identified in June 2016, the CCC Technology Center took immediate steps to strengthen the security of the CCCApply system and protect our students' personally identifiable data (read more about all the ways we are addressing fraud in CCCApply). At the same time, we contracted with a machine learning data research team to analyze several thousand examples of fraudulent applications that we had collected from the colleges that initially reported the spam.

Research Objectives

The objectives for the research project were simple:

Understand why we are seeing an influx of fraudulent applications across the CCC system
Understand the motivations behind these fraudulent attacks
Identify trends, commonalities and patterns in the data
Identify the tools and techniques being used by spammers
Determine what CCCApply can do to prevent fraud now and in the future
Determine what the colleges can do to prevent fraud now and in the future

In addition, the research team will work with the CCCApply product manager and support team to launch a small pilot with a group of colleges. The pilot will help develop a process for the ongoing collection of fraudulent applications and related data for continuous analysis, and for disseminating information to the colleges.

Data Analysis

Based on that initial review, we initiated a multi-part data analysis (without using any student personal information). In the first data review, we focused on one college that provided a large number of bad applications between June 1, 2016 and August 15, 2017; the second analysis looked at all other colleges that provided examples of bad applications in the same time frame; and the third pull looked at all remaining colleges and their submitted application data. It was important to compare the bad applications to good applications in order to start detecting trends and patterns in the fraudulent "formula".

After reviewing all three data pulls, even without including personally identifiable information, we learned a great deal.

The majority of bad applications identified were submitted in under 3 minutes, and most of those were submitted in under 2.5 minutes. This information alone told us that robots are likely submitting the applications using keystrokes.

Among the applications identified as fraudulent, other patterns were prevalent:

Time to completion: 2.25 minutes (average)
Permanent Address State: NOT California
Current Mailing Address State: NOT California
Gender: Male
Race: White
HS Ed Level: No high school completion
Interest in Financial Aid: NO
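
As a rough illustration of how patterns like these could be checked against a submitted application, here is a minimal Python sketch. It is not the research team's method. The record layout and field names are hypothetical (they are not the CCCApply schema), the 3-minute cutoff comes from the finding above, and the `min_matches` threshold is an arbitrary value chosen for the example. Only a subset of the listed patterns is used.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical record layout: these field names are illustrative assumptions,
# not the actual CCCApply schema.
@dataclass
class Application:
    seconds_to_complete: float
    permanent_state: str
    mailing_state: str
    completed_high_school: bool
    wants_financial_aid: bool


def fraud_indicators(app: Application) -> List[str]:
    """Return the names of the observed fraud patterns this application matches."""
    checks = {
        # The study found most bad applications were submitted in under 3 minutes.
        "submitted_in_under_3_minutes": app.seconds_to_complete < 180,
        "permanent_address_outside_ca": app.permanent_state != "CA",
        "mailing_address_outside_ca": app.mailing_state != "CA",
        "no_high_school_completion": not app.completed_high_school,
        "no_interest_in_financial_aid": not app.wants_financial_aid,
    }
    return [name for name, matched in checks.items() if matched]


def needs_review(app: Application, min_matches: int = 4) -> bool:
    """Flag an application for manual review when enough patterns match.

    min_matches is an arbitrary illustrative cutoff, not a value from the study.
    """
    return len(fraud_indicators(app)) >= min_matches
```

A single matching pattern says little on its own (many legitimate applicants live outside California, for example), which is why the sketch only flags an application when several patterns coincide, and only for review rather than rejection.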

Research Outcomes

By identifying characteristics common to the fake applications collected by colleges (such as volume, average submission time, patterns in the submitted data, and user profiling) and comparing that information to non-fraudulent applications, we are able to take steps to prevent this threat through enhanced security, short-term stopgap fixes as needed, and the development of a spam filter web service. These aren't the only solutions, but as we continue to better understand the motivations behind these attacks, they can be used as part of an overall enhanced security strategy.

One outcome of the research study was the recommendation to develop a spam filter web service that would prevent bad applications from reaching the colleges through their download system, while continuously re-training the prediction service model.

After the initial review, the data analysts recommended developing a spam filter service based on a continuous learning/training model: a custom algorithm that will get smarter each time an application is flagged as "spam". This filter service is being built for the CCCApply Standard application, with a back-end user interface that will be accessible in the new CCCApply Administrator (deploying in June). Both the spam filter service and the admin interface are under development now, with an expected release date of June 2018. This is a huge project and will require the cooperation and participation of all colleges, not just the colleges being targeted with spam, in order to "train" the algorithm with accurate data: both good, legitimate applications and bad, fraudulent applications.
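
To make the idea of a model that "gets smarter each time an application is flagged" more concrete, here is a minimal Python sketch of incremental (online) training. The custom algorithm itself is not described here, so the sketch swaps in a plain online logistic regression as a stand-in. The class name, feature names, and learning rate are all hypothetical.

```python
import math
from typing import Dict


class OnlineSpamFilter:
    """Logistic regression updated one labeled application at a time.

    A stand-in for illustration only; not the algorithm used by the research team.
    """

    def __init__(self, learning_rate: float = 0.5) -> None:
        self.learning_rate = learning_rate
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def predict_proba(self, features: Dict[str, float]) -> float:
        """Estimated probability (0 to 1) that the application is spam."""
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in features.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, features: Dict[str, float], is_spam: bool) -> None:
        """Take one gradient step using a single application labeled by a college."""
        error = self.predict_proba(features) - (1.0 if is_spam else 0.0)
        self.bias -= self.learning_rate * error
        for k, v in features.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v


# Hypothetical, non-personal features derived from an application.
spam_example = {"fast_submit": 1.0, "out_of_state": 1.0}
good_example = {"fast_submit": 0.0, "out_of_state": 0.0}

spam_filter = OnlineSpamFilter()
for _ in range(50):
    spam_filter.learn(spam_example, is_spam=True)
    spam_filter.learn(good_example, is_spam=False)
```

The sketch also shows why both kinds of examples matter: each call to `learn` nudges the weights toward the supplied label, so a model that only ever sees flagged spam has nothing to contrast it with.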

A comprehensive communication plan is mapped out, beginning with the announcement of the Spam Filter as part of the new CCCApply Administrator release, going out the week of March 19. Training webinars and user guides are being developed to accompany the new system.

Meanwhile, we continue to work with the machine learning team and several colleges in a pilot project to build and train the algorithm with any bad applications submitted by colleges. The email tomorrow will also specify how colleges can submit their fraudulent applications to the Tech Center for this purpose (we need them formatted in a specific way, and we need to ensure colleges know not to include any student personally identifiable information).

We are also working with the CCCApply Steering Committee to better understand the motivations of these spammers. What are they after?

Research Outcomes: What We've Learned

Trends & Motivation for Fraudulent Activity

We've identified several motivating factors and are working with our security office to publish some best practices to help colleges prevent bad applications from being submitted in the first place.

We've found that the majority of spammers are seeking financial gain and are targeting colleges that give away something for free at the time of application (specifically, .edu email addresses and free software licenses) before the applicant has been officially admitted to the college (through registration or another vetting process).

Among other things, it appears these spammers are using the .edu email addresses to:

seek special discounts on technology hardware and software
sell the addresses on eBay and Craigslist (we've found them there)
create fake identities, using the emails and other auto-responses that acknowledge their "California" address / residency
apply for financial aid

To confirm our suspicions, we surveyed the colleges that have reported fraudulent applications, and every one of them confirmed that it has been giving new applicants a .edu address automatically upon application submission.

Other Motivating Factors

Some colleges are giving applicants free software licenses (Office 365). These licenses are being sold to end-users.
In some instances, confirmation emails sent to applicants confirm their residency status (based on self-reported data). These emails are then being used to create fake identities.
Student IDs and other "identification codes" are allowing these fraudulent applicants to access the colleges' SIS (again, this is happening prior to registration).

From a security standpoint, allowing students to access a college's student information system prior to the registration or matriculation process is a high risk. Our Chief Security Officer, Jeff Holden, is also investigating what can be done from a systemwide perspective.