3 Data Science - Insurance Claims(Python)

Insurance Claims - Fraud Detection

Business case:
Insurance fraud is a huge problem in the industry. It's difficult to identify fraud claims. IHS is in a unique position to help the Auto Insurance industry with this problem.

Problem Statement:
Data is stored in different systems and its difficult to build analytics using multiple data sources. Copying data into a single platform is time consuming.

Business solution:
Use S3 as a data lake to store different sources of data in a single platform. This allows data scientists / analysis to quickly analyze the data and generate reports to predict market trends and/or make financial decisions.

Technical Solution:
Use Databricks as a single platform to pull various sources of data from API endpoints, or batch dumps into S3 for further processing. ETL the CSV datasets into efficient Parquet formats for performant processing.

In this example, we will be working with some auto insurance data to demonstrate how we can create a predictive model that predicts if an insurance claim is fraudulent or not. This will be a Binary Classification task, and we will be creating a Decision Tree model.

With the prediction data, we are able to estimate what our total predicted fradulent claim amount is like, and zoom into various features such as a breakdown of predicted fraud count by insured hobbies - our model's best predictor.

We will cover the following steps to illustrate how we build a Machine Learning Pipeline:

  • Data Import
  • Data Exploration
  • Data Processing
  • Create Decision Tree Model
  • Measuring Error Rate
  • Model Tuning
  • Zooming in on Prediction Data

Data Import

First download this csv file locally

The data used in this example was from a CSV file that was imported using the Tables UI. Databases and Tables

After uploading the data using the UI, we can run SparkSQL queries against the table, or create a DataFrame from the table.
In this example, we will create a Spark DataFrame.

display(dbutils.fs.ls("/FileStore/tables/dgshzbe11508270992746/"))
dbfs:/FileStore/tables/dgshzbe11508270992746/insurance_claims.csvinsurance_claims.csv266964
# data = spark.table("insurance_claims")

fileStorePath = "/FileStore/tables/dgshzbe11508270992746/insurance_claims.csv"

data = spark.read.format("csv")\
          .options(inferSchema="true", header="true")\
          .load(fileStorePath)\
          .drop("_c39")
      
df = data.withColumn("policy_bind_date", data.policy_bind_date.cast("string"))\
         .withColumn("incident_date", data.incident_date.cast("string"))
# Preview data
display(df)
328485215852014-10-17 00:00:00OH250/50010001406.910466132MALEMDcraft-repairsleepinghusband5330002015-01-25 00:00:00Single Vehicle CollisionSide CollisionMajor DamagePoliceSCColumbus9935 4th Drive51YES12YES7161065101302052080Saab92x2004Y
228423428682006-06-27 00:00:00IN250/50020001197.225000000468176MALEMDmachine-op-inspctreadingother-relative002015-01-21 00:00:00Vehicle Theft?Minor DamagePoliceVARiverwood6608 MLK Hwy81?00?50707807803510MercedesE4002007Y
134296876982000-09-06 00:00:00OH100/30020001413.145000000430632FEMALEPhDsalesboard-gamesown-child3510002015-02-22 00:00:00Multi-vehicle CollisionRear CollisionMinor DamagePoliceNYColumbus7121 Francis Lane73NO23NO346507700385023100DodgeRAM2007N
256412278111990-05-25 00:00:00IL250/50020001415.746000000608117FEMALEPhDarmed-forcesboard-gamesunmarried48900-624002015-01-10 00:00:00Single Vehicle CollisionFront CollisionMajor DamagePoliceOHArlington6956 Maple Drive51?12NO634006340634050720ChevroletTahoe2014Y
228443674552014-06-06 00:00:00IL500/100010001583.916000000610706MALEAssociatesalesboard-gamesunmarried66000-460002015-02-17 00:00:00Vehicle Theft?Minor DamageNoneNYArlington3041 3rd Ave201NO01NO650013006504550AccuraRSX2009N
256391045942006-10-12 00:00:00OH250/50010001351.10478456FEMALEPhDtech-supportbungie-jumpingunmarried002015-01-02 00:00:00Multi-vehicle CollisionRear CollisionMajor DamageFireSCArlington8973 Washington St193NO02NO641006410641051280Saab952003Y
137344139782000-06-04 00:00:00IN250/50010001333.350441716MALEPhDprof-specialtyboard-gameshusband0-770002015-01-13 00:00:00Multi-vehicle CollisionFront CollisionMinor DamagePoliceNYSpringfield5846 Weaver Drive03?00?7865021450715050050NissanPathfinder2012N
165374290271990-02-03 00:00:00IL100/30010001137.030603195MALEAssociatetech-supportbase-jumpingunmarried002015-02-27 00:00:00Multi-vehicle CollisionFront CollisionTotal LossPoliceVAColumbus3525 3rd Hwy233?22YES515909380938032830AudiA52015N
27334856651997-02-05 00:00:00IL100/3005001442.990601734FEMALEPhDother-servicegolfown-child002015-01-30 00:00:00Single Vehicle CollisionFront CollisionTotal LossPoliceWVArlington4872 Rock Ridge211NO11YES277002770277022160ToyotaCamry2012N
212426365502011-07-25 00:00:00IL100/3005001315.680600983MALEPhDpriv-house-servcampingwife0-393002015-01-05 00:00:00Single Vehicle CollisionRear CollisionTotal LossOtherNCHillsdale3066 Francis Ave141NO21?423004700470032900Saab92x1996N
235425436102002-05-26 00:00:00OH100/3005001253.124000000462283FEMALEMastersexec-managerialdancingother-relative3840002015-01-06 00:00:00Single Vehicle CollisionFront CollisionTotal LossPoliceNYNorthbend1558 1st Ridge221YES22?8701079101582063280FordF1502002N
447612146181999-05-29 00:00:00OH100/30020001137.160615561FEMALEHigh Schoolexec-managerialskydivingother-relative0-510002015-02-15 00:00:00Multi-vehicle CollisionFront CollisionMajor DamageFireSCSpringfield5971 5th Hwy213YES12YES114920176801768079560AudiA32006N
60238426431997-11-20 00:00:00OH500/10005001215.363000000432220MALEMDprotective-servreadingwife002015-01-22 00:00:00Single Vehicle CollisionRear CollisionTotal LossAmbulanceSCNorthbend6655 5th Drive91YES10NO565204710942042390Saab952000N
121346268082012-10-26 00:00:00OH100/3001000936.610464652FEMALEMDarmed-forcesbungie-jumpingwife52800-328002015-01-08 00:00:00Parked Car?Minor DamageNoneSCSpringfield6582 Elm Lane51NO11NO7280112011205040ToyotaHighlander2010N
180386440811998-12-28 00:00:00OH250/50020001301.130476685FEMALECollegemachine-op-inspctboard-gamesnot-in-family41300-555002015-01-15 00:00:00Single Vehicle CollisionRear CollisionTotal LossPoliceSCSpringfield6851 3rd Drive121NO02YES462004200840033600DodgeNeon2003Y
473588928741992-10-19 00:00:00IN100/30020001131.40458733FEMALEMDtransport-movingmoviesother-relative5570002015-01-29 00:00:00Multi-vehicle CollisionSide CollisionMajor DamageOtherWVHillsdale9573 Weaver Ave124YES00NO63120105201052042080AccuraMDX1999Y
70265589382005-06-08 00:00:00OH500/100010001199.445000000619884MALECollegemachine-op-inspcthikingown-child6360002015-02-22 00:00:00Multi-vehicle CollisionRear CollisionMajor DamageOtherNYRiverwood5074 3rd St03?12YES521105790579040530NissanMaxima2012N
140312752652004-11-15 00:00:00IN500/1000500708.646000000470610MALEHigh Schoolmachine-op-inspctreadingunmarried5350002015-01-06 00:00:00Single Vehicle CollisionSide CollisionTotal LossPoliceWVNorthbend4546 Tree St91NO02YES7788014160708056640SuburuLegacy2015N
160379212022014-12-28 00:00:00OH500/10005001374.220472135FEMALEMDcraft-repairyachtingother-relative45500-378002015-01-19 00:00:00Single Vehicle CollisionSide CollisionTotal LossOtherNYNorthbrook3842 Solo Ridge191YES10NO7293066301326053040AccuraTL2015N
196391439721992-08-02 00:00:00IN500/100020001475.730477670FEMALEHigh Schoolhandlers-cleanerscampingown-child57000-273002015-02-22 00:00:00Multi-vehicle CollisionSide CollisionMajor DamagePoliceVAColumbus8101 3rd Ridge83?20NO604006040604048320NissanPathfinder2014N
460621834302002-06-25 00:00:00IN250/50010001187.964000000618845MALEJDother-servicebungie-jumpingown-child002015-01-01 00:00:00Multi-vehicle CollisionRear CollisionMinor DamagePoliceNYColumbus5380 Pine St203NO10?471600524041920SuburuImpreza2011N
217414318762005-11-27 00:00:00IL500/10002000875.150442479FEMALEAssociatemachine-op-inspctskydivingown-child4670002015-02-10 00:00:00Multi-vehicle CollisionSide CollisionTotal LossPoliceSCArlington8957 Weaver Drive153?12?378400473033110AccuraRSX1996N
370552854961994-05-27 00:00:00IL100/3002000972.180443920MALEHigh Schoolprof-specialtypaintballother-relative72700-682002015-01-11 00:00:00Multi-vehicle CollisionRear CollisionMajor DamageAmbulanceSCHillsdale2526 Embaracadero Ave203NO00YES7152017880596047680SuburuForrestor2000Y
413551153991991-02-08 00:00:00IN100/30020001268.790453148MALEMDpriv-house-servchessown-child0-310002015-01-19 00:00:00Single Vehicle CollisionFront CollisionTotal LossAmbulanceWVNorthbend5667 4th Drive151?22?9816081801636073620DodgeRAM2011Y
237407368821996-02-02 00:00:00IN100/3001000883.310434733MALECollegecraft-repairkayakinghusband0-535002015-02-24 00:00:00Single Vehicle CollisionRear CollisionMinor DamageOtherVARiverwood2502 Apache Hwy61NO13NO7788070801416056640FordEscape2005N
8356990442013-12-05 00:00:00OH100/30020001266.920613982MALEMasterssalespoloown-child002015-01-09 00:00:00Multi-vehicle CollisionRear CollisionMajor DamageOtherOHArlington3418 Texas Lane163NO13YES71500165001100044000FordEscape2006Y
257438632361990-09-20 00:00:00IN100/30020001322.10436984MALEHigh Schoolprof-specialtygolfown-child0-292002015-01-28 00:00:00Parked Car?Minor DamagePolicePAArlington2533 Elm St41YES13YES902016408206560ToyotaCamry2005N
202346085132002-07-18 00:00:00IN100/300500848.073000000607730MALEJDexec-managerialchessnot-in-family31000-302002015-01-07 00:00:00Vehicle Theft?Minor DamageNoneVANorthbrook3790 Andromedia Hwy51YES21?572010405204160SuburuForrestor2003Y
224409140881990-02-08 00:00:00OH100/30020001291.70609837FEMALEJDsaleskayakingnot-in-family0-556002015-01-08 00:00:00Single Vehicle CollisionSide CollisionMinor DamageOtherSCNorthbend3220 Rock Drive211NO10YES6984077601552046560DodgeNeon2009N
241455967852014-03-04 00:00:00IL500/100020001104.50432211FEMALEPhDmachine-op-inspctbasketballunmarried002015-02-15 00:00:00Single Vehicle CollisionRear CollisionMinor DamagePoliceSCNorthbrook2100 Francis Drive51NO22NO91650141001410063450AccuraTL2011N
64259086162000-02-18 00:00:00IL250/5001000954.160473328MALEMastersprof-specialtyvideo-gameshusband5320002015-01-18 00:00:00Multi-vehicle CollisionSide CollisionMajor DamageAmbulanceSCColumbus4687 5th Drive224NO00?75600126001260050400ToyotaCorolla2005N
166376663332008-06-19 00:00:00IL100/30020001337.288000000610393MALEJDcraft-repairreadinghusband2750002015-02-28 00:00:00Multi-vehicle CollisionSide CollisionMajor DamagePoliceWVRiverwood9038 2nd Lane103NO22?671407460746052220FordF1502006Y
155353366142003-08-01 00:00:00IL500/100010001088.340614780FEMALEAssociateadm-clericalyachtingother-relative8110002015-02-24 00:00:00Multi-vehicle CollisionFront CollisionTotal LossPoliceNYArlington6092 5th Ave163YES23NO297903310331023170BMW3 Series2008N
114305848591992-04-04 00:00:00IL100/30010001558.290472248MALEHigh Schoolfarming-fishingvideo-gameswife51400-640002015-01-09 00:00:00Multi-vehicle CollisionFront CollisionMajor DamageAmbulanceNYHillsdale8353 Britain Ridge13NO12?77110140201402049070SuburuImpreza2015N
149379904931991-01-13 00:00:00IL500/10005001415.680603381MALEPhDprof-specialtyyachtingown-child002015-02-12 00:00:00Single Vehicle CollisionSide CollisionTotal LossFireWVHillsdale3540 Maple St171YES01YES6480010800540048600AudiA31999N
147331298722010-08-08 00:00:00OH100/30010001334.156000000479224MALEHigh Schoolcraft-repairreadingnot-in-family53300-492002015-01-24 00:00:00Single Vehicle CollisionFront CollisionMajor DamageOtherWVSpringfield3104 Sky Drive151YES20YES5310010620531037170MercedesC3001995Y
62282001522003-03-09 00:00:00IL100/3001000988.450430141FEMALEMastersprotective-servcampingunmarried002015-01-09 00:00:00Single Vehicle CollisionRear CollisionTotal LossPoliceNYNorthbrook4981 Weaver St31?11YES602006020602048160SuburuForrestor2004Y
289499332931993-02-03 00:00:00IL500/100020001222.480620757FEMALEJDpriv-house-servgolfunmarried002015-01-18 00:00:00Parked Car?Minor DamageNoneWVArlington6676 Tree Lane161NO11YES533012308203280SuburuLegacy2001N
431544856642002-11-25 00:00:00IN500/100020001155.550615901FEMALEMDcraft-repairbungie-jumpingunmarried6570002015-01-21 00:00:00Multi-vehicle CollisionRear CollisionMajor DamagePoliceNYHillsdale3930 Embaracadero St43?20?6230012460623043610JeepWrangler2007N
199379828711997-07-27 00:00:00IN250/5005001262.080474615MALEJDtech-supportvideo-gameswife4850002015-01-08 00:00:00Single Vehicle CollisionFront CollisionMajor DamageAmbulanceNCColumbus3422 Flute St41?03NO60170109401094038290NissanPathfinder2011Y
79262062131995-05-08 00:00:00IL100/3005001451.620456446MALEAssociatetech-supportkayakingnot-in-family0-557002015-01-03 00:00:00Single Vehicle CollisionRear CollisionMinor DamageAmbulanceWVColumbus4862 Lincoln Hwy191NO22?400008000400028000BMWM52010N
116346163372012-08-30 00:00:00IN250/5005001737.660470577MALEAssociatetransport-movingchessunmarried0-241002015-01-01 00:00:00Single Vehicle CollisionSide CollisionMajor DamagePoliceWVNorthbrook5719 2nd Lane11?11?97080161801618064720BMWX52001Y
37234489612006-04-30 00:00:00IL500/10005001475.930441648FEMALECollegeprof-specialtyhikinghusband0-674002015-01-16 00:00:00Multi-vehicle CollisionSide CollisionMinor DamageOtherSCSpringfield3221 Solo Ridge173YES10NO516605740574040180DodgeRAM2010N
106307904422003-04-13 00:00:00OH250/500500538.170433782FEMALEPhDtransport-movingreadingown-child49700-602002015-02-10 00:00:00Single Vehicle CollisionRear CollisionTotal LossOtherNCArlington6660 MLK Drive231NO22NO511205680568039760MercedesE4002005N
269441088442007-12-05 00:00:00IL100/30020001081.080468104MALEJDpriv-house-servreadingunmarried36400-287002015-02-14 00:00:00Single Vehicle CollisionFront CollisionMinor DamageOtherSCSpringfield1699 Oak Drive141YES02?56400112801128033840ToyotaHighlander2014N
265404300292006-08-21 00:00:00IL250/50010001454.430459407FEMALEMDprotective-servyachtinghusband002015-02-21 00:00:00Multi-vehicle CollisionRear CollisionTotal LossOtherNYArlington4234 Cherokee Lane173NO23?551206890048230AccuraMDX2002N
163335291121990-01-08 00:00:00IN100/3005001240.470472573FEMALEAssociateother-servicepolohusband3530002015-02-18 00:00:00Multi-vehicle CollisionRear CollisionTotal LossFireNCNorthbend7476 4th St113YES11?7711001402063090HondaCivic2014N
355479396311990-03-18 00:00:00OH500/100020001273.74000000433473MALECollegeother-servicekayakinghusband002015-01-10 00:00:00Multi-vehicle CollisionFront CollisionMajor DamageFireWVArlington8907 Tree Ave193NO21NO628006280628050240AudiA32003Y
175348669312008-01-07 00:00:00IN500/100010001123.878000000446326FEMALEPhDprotective-servdancingother-relative002015-02-26 00:00:00Vehicle Theft?Trivial DamagePoliceNYArlington6619 Flute Ave51?20YES72908108105670VolkswagenPassat1995N
192355820111997-03-10 00:00:00IL100/30010001245.890435481FEMALEMastersexec-managerialmovieswife0-403002015-01-01 00:00:00Single Vehicle CollisionRear CollisionTotal LossOtherWVSpringfield6011 Britain St191NO00?7660015320766053620MercedesC3002000N
430596911892004-01-10 00:00:00OH250/50020001326.627000000477310MALEMDother-servicebungie-jumpingown-child002015-01-03 00:00:00Multi-vehicle CollisionFront CollisionMinor DamageFireNYRiverwood5104 Francis Drive193?03YES8180016360818057260NissanPathfinder1998N
91275375461994-08-20 00:00:00IL100/30020001073.830609930FEMALEJDfarming-fishingpolohusband002015-01-17 00:00:00Vehicle Theft?Trivial DamageNoneNYArlington2280 4th Ave41?12?726013206605280BMWM52008N
217393949752002-06-02 00:00:00IN100/30010001530.520603993MALECollegearmed-forcesbasketballnot-in-family002015-02-22 00:00:00Vehicle Theft?Minor DamageNoneWVNorthbend2644 Elm Drive81?21YES43004304303440ToyotaCorolla2000N
223407296341994-04-28 00:00:00IN100/3005001201.410437818FEMALEJDpriv-house-servmovieshusband88400-465002015-01-27 00:00:00Multi-vehicle CollisionSide CollisionMajor DamagePoliceNCColumbus7466 MLK Ridge73YES10?70510128201282044870SuburuForrestor1999N
195392821952014-08-17 00:00:00OH250/50010001393.570478423MALEPhDmachine-op-inspctmoviesnot-in-family47600-396002015-02-27 00:00:00Parked Car?Minor DamagePoliceVANorthbend5821 2nd St51NO01YES26404804801680FordF1502009N
22264208102007-08-11 00:00:00OH100/30010001276.570467784MALEPhDcraft-repairskydivingnot-in-family7150002015-01-06 00:00:00Single Vehicle CollisionRear CollisionMinor DamageFireNYArlington6723 Best Drive31YES12NO7890015780789055230ChevroletSilverado1995N
439565248362008-11-20 00:00:00IN250/5005001082.490606714FEMALEPhDprof-specialtychessunmarried36100-550002015-02-28 00:00:00Multi-vehicle CollisionFront CollisionMajor DamageFireSCColumbus4866 4th Hwy123?23?564300627050160HondaCRV2014N
94323071951995-10-18 00:00:00IN500/100010001414.740464691FEMALEMastersadm-clericalhikingown-child002015-02-22 00:00:00Parked Car?Minor DamageNoneVARiverwood5418 Britain Ave191NO13NO24003003001800ChevroletSilverado2014N
11396236481993-05-19 00:00:00IL250/50020001470.060431683MALEPhDother-serviceyachtinghusband56600-458002015-01-07 00:00:00Single Vehicle CollisionFront CollisionTotal LossAmbulanceWVRiverwood4296 Pine Hwy221YES01NO657907310731051170Saab932007N
151364853722005-02-26 00:00:00OH250/5002000870.630431725FEMALEMDadm-clericalkayakingown-child94800-585002015-01-06 00:00:00Multi-vehicle CollisionSide CollisionMinor DamagePoliceVAHillsdale2299 1st St123NO11NO6292011440572045760FordEscape2000N
154345985541990-02-14 00:00:00IN100/300500795.230609216MALEPhDmachine-op-inspctbase-jumpingother-relative3690002015-01-10 00:00:00Multi-vehicle CollisionRear CollisionMajor DamagePoliceNYSpringfield6618 Cherokee Drive153YES21?6948015440054040NissanMaxima2014Y
245443039871993-09-30 00:00:00IL500/100010001168.20452787MALEJDhandlers-cleanersbasketballhusband6910002015-02-11 00:00:00Multi-vehicle CollisionSide CollisionTotal LossOtherOHSpringfield7459 Flute St233NO03NO442807380369033210HondaAccord1997N
119323431612014-06-10 00:00:00IL500/10001000993.510468767MALEHigh Schoolarmed-forcesbungie-jumpingunmarried0-495002015-01-12 00:00:00Single Vehicle CollisionSide CollisionMinor DamageOtherWVHillsdale3567 4th Drive121NO03YES5630056301126039410BMWM52011N
215425193122008-10-28 00:00:00OH500/10005001848.810435489MALEJDtransport-movingvideo-gamesown-child0-490002015-02-06 00:00:00Multi-vehicle CollisionFront CollisionMajor DamageFireWVNorthbend2457 Washington Ave203YES22YES6852011420571051390SuburuLegacy2003Y
295421329022007-04-24 00:00:00OH250/50020001641.735000000450149MALEPhDsaleschessnot-in-family6240002015-01-20 00:00:00Multi-vehicle CollisionRear CollisionTotal LossFireVARiverwood1269 Flute Drive163NO00NO591306570657045990FordEscape2006Y
254393328671993-12-13 00:00:00IN100/3005001362.870458364FEMALEMDexec-managerialchessother-relative3570002015-02-22 00:00:00Multi-vehicle CollisionFront CollisionMinor DamageAmbulanceNYArlington1218 Sky Hwy63YES22NO8232013720686061740DodgeNeon1995Y
107313565902011-08-17 00:00:00IN250/5005001239.227000000476458FEMALEHigh Schooltech-supportpaintballnot-in-family43400-912002015-01-30 00:00:00Single Vehicle CollisionSide CollisionMinor DamageFireSCSpringfield9169 Pine Ridge121YES01NO89700138001380062100AudiA52009Y
478643460021990-08-20 00:00:00OH250/500500835.020602433FEMALEAssociateadm-clericalreadingunmarried5960002015-02-02 00:00:00Multi-vehicle CollisionSide CollisionMinor DamageFireWVHillsdale8538 Texas Lane173NO11NO339300377030160BMWX61998N
128305005331994-02-11 00:00:00OH100/30010001061.330478575MALEMDmachine-op-inspctmoviesown-child43300-662002015-01-10 00:00:00Single Vehicle CollisionFront CollisionMajor DamageAmbulanceWVNorthbrook5783 Oak Ave81NO03NO6853012460623049840AudiA51997N
338493482091994-02-22 00:00:00IN500/100010001279.080449718MALEMDother-servicekayakingown-child0-515002015-02-27 00:00:00Parked Car?Minor DamageNoneNCRiverwood7721 Washington Ridge131NO01?43008608602580FordF1502004N
271424866762011-08-15 00:00:00OH100/3005001105.490463181FEMALEAssociateprof-specialtysleepingown-child56200-500002015-02-20 00:00:00Multi-vehicle CollisionSide CollisionMajor DamageOtherSCHillsdale8006 Maple Hwy122?23?6831012420621049680AudiA32003Y
222412608451998-11-11 00:00:00OH100/30020001055.530441992FEMALEMDarmed-forcescross-fitnot-in-family37800-503002015-02-08 00:00:00Single Vehicle CollisionFront CollisionTotal LossOtherWVNorthbrook6751 Pine Ridge71NO02NO612906810681047670HondaCivic1995Y
199416570451995-12-04 00:00:00OH250/5001000895.830452597FEMALEAssociatesalespaintballhusband002015-02-11 00:00:00Single Vehicle CollisionRear CollisionMinor DamageAmbulanceNCArlington2324 Texas Ridge101NO12NO301003010027090ChevroletMalibu1999N
215377611892002-12-28 00:00:00IN100/3005001632.930614417FEMALECollegetransport-movinggolfnot-in-family0-429002015-02-23 00:00:00Multi-vehicle CollisionRear CollisionMinor DamageFireSCRiverwood7923 Elm Ave73NO20YES571209520476042840MercedesC3002002N
192401751772004-04-15 00:00:00IL100/30010001405.990472895FEMALEAssociatesalesyachtingwife002015-03-01 00:00:00Multi-vehicle CollisionSide CollisionMinor DamageAmbulanceVASpringfield4755 Best Lane183YES10YES429309540477028620BMWX62005N
120351167002001-02-02 00:00:00OH100/30010001425.540475847FEMALEHigh Schooltransport-movingbungie-jumpingother-relative7830002015-01-15 00:00:00Multi-vehicle CollisionFront CollisionTotal LossAmbulanceSCRiverwood5053 Tree Drive223NO20NO5121011380569034140FordFusion2010N
270451662642010-01-12 00:00:00OH500/100010001038.090476978FEMALECollegehandlers-cleanersgolfhusband0-197002015-01-14 00:00:00Multi-vehicle CollisionFront CollisionMinor DamageFireNYSpringfield2078 3rd Ave183NO11YES8940014900745067050SuburuLegacy1998N
319475279451992-04-14 00:00:00IN250/5005001307.110600648MALECollegetransport-movingdancingnot-in-family002015-02-17 00:00:00Multi-vehicle CollisionFront CollisionTotal LossPoliceWVNorthbrook2804 Best St223NO02?59730108601086038010AudiA32005N
194396275402010-05-21 00:00:00OH500/100010001489.246000000608335FEMALEJDother-servicekayakingwife0-450002015-01-24 00:00:00Vehicle Theft?Minor DamageNoneSCSpringfield7877 Sky Lane151YES22YES8060124012405580Saab952004N
227382794222013-10-27 00:00:00OH500/1000500976.670471600FEMALEPhDhandlers-cleanerspolounmarried002015-01-21 00:00:00Single Vehicle CollisionRear CollisionMajor DamageFireSCNorthbrook6530 Weaver Ave161?12?7220014440722050540BMWM52013Y
137314842001994-10-12 00:00:00OH250/50020001340.430441175MALEHigh Schoolexec-managerialpaintballhusband52700-406002015-02-19 00:00:00Multi-vehicle CollisionSide CollisionMinor DamageAmbulanceNCArlington3087 Oak Hwy63NO12NO50800101601016030480AccuraMDX2005N
244406452581997-07-04 00:00:00OH500/100020001267.815000000603123FEMALEMastersexec-managerialvideo-gameswife002015-01-03 00:00:00Vehicle Theft?Trivial DamageNoneNCNorthbrook7098 Lincoln Hwy101?21?660066013204620AccuraTL2005N
78296946622011-02-15 00:00:00IL250/50010001234.26000000457767MALEMastersother-servicebungie-jumpingother-relative002015-01-29 00:00:00Vehicle Theft?Minor DamagePoliceNYNorthbrook5124 Maple St31YES22NO750075015005250NissanMaxima2002N
200359606801994-08-21 00:00:00IN250/50020001318.060618498MALEHigh Schoolexec-managerialvideo-gameswife57300-806002015-01-19 00:00:00Vehicle Theft?Trivial DamageNoneVAHillsdale2333 Maple Lane131NO03YES6490118011804130VolkswagenJetta2002N
284484981401997-05-15 00:00:00IN500/10002000769.950605486MALEMastersprof-specialtymoviesnot-in-family0-442002015-01-19 00:00:00Multi-vehicle CollisionSide CollisionMajor DamageAmbulanceNYHillsdale1012 5th Lane162?23NO6094055401108044320AudiA32013Y
275414988751996-10-26 00:00:00OH100/30020001514.720617970MALEHigh Schooltransport-movingboard-gamesown-child3570002015-02-02 00:00:00Multi-vehicle CollisionFront CollisionMajor DamageFireNYNorthbrook7477 MLK Drive133YES01?5830058301166040810SuburuLegacy2007N
153347981772006-03-04 00:00:00IL500/10001000873.644000000432934FEMALEAssociatepriv-house-servyachtinghusband80002015-01-30 00:00:00Multi-vehicle CollisionFront CollisionMinor DamageOtherSCColumbus9489 3rd St93NO21?68400114001140045600FordF1502007N
134326147631991-01-02 00:00:00IL500/10005001612.430456762FEMALEMDother-serviceyachtingown-child3640002015-01-08 00:00:00Single Vehicle CollisionSide CollisionTotal LossFireVASpringfield2087 Apache Ave21?21YES64240116801168040880BMW3 Series2015N
31366793701999-08-15 00:00:00IL500/100020001318.249000000601748FEMALECollegeprof-specialtykayakingnot-in-family0-786002015-01-30 00:00:00Parked Car?Trivial DamageNoneWVArlington5540 Sky St91NO01YES47009404703290DodgeNeon2002N
41259588571992-01-15 00:00:00IN100/30010001226.830607763FEMALECollegeexec-managerialexercisenot-in-family0-561002015-01-07 00:00:00Multi-vehicle CollisionSide CollisionMajor DamageOtherSCColumbus7238 2nd St123YES20?451200564039480AccuraMDX2011Y
127296868161999-12-07 00:00:00OH250/50020001326.445000000436973FEMALEHigh Schoolsalesboard-gamesown-child002015-02-24 00:00:00Multi-vehicle CollisionFront CollisionTotal LossFireSCArlington8442 Britain Hwy122YES11?66950103001030046350Saab931995N
61231277541993-06-06 00:00:00IL250/50020001136.834000000471300FEMALEAssociatetech-supportexerciseown-child0-624002015-02-02 00:00:00Single Vehicle CollisionSide CollisionMajor DamagePoliceNYColumbus1331 Britain Hwy141NO03?9834089401788071520HondaAccord2004Y
207429186292000-10-03 00:00:00IL250/50020001322.780453277MALEPhDfarming-fishingyachtingown-child5520002015-02-28 00:00:00Parked Car?Trivial DamageNoneWVSpringfield5260 Francis Drive91NO01NO59005905904720BMWX52007N
219437314502010-12-29 00:00:00IN100/30010001483.250465100FEMALEMDexec-managerialexercisenot-in-family90700-208002015-02-09 00:00:00Multi-vehicle CollisionFront CollisionMajor DamageAmbulanceNCRiverwood1135 Solo Lane33NO11?7068058901178053010FordFusion2009N
271423074471990-03-17 00:00:00IL100/3005001515.30603248FEMALEHigh Schoolmachine-op-inspcthikingnot-in-family002015-01-19 00:00:00Multi-vehicle CollisionRear CollisionTotal LossAmbulanceSCNorthbend9737 Solo Hwy213NO10NO9372017040852068160MercedesML3502005N
80259921452012-03-01 00:00:00IL100/30020001075.185000000601112FEMALEPhDarmed-forcesexercisehusband67700-584002015-02-21 00:00:00Vehicle Theft?Minor DamageNoneOHNorthbrook3289 Britain Drive51NO20YES693012606305040ToyotaHighlander2001N
325479006282006-02-05 00:00:00IN500/100010001690.270438830FEMALEAssociateprotective-servhikingnot-in-family6150002015-01-14 00:00:00Single Vehicle CollisionSide CollisionMajor DamageFireVASpringfield6550 Andromedia St111YES03NO729306630663059670DodgeRAM2006Y
29252352202014-11-01 00:00:00IL250/50020001352.830464959MALEMastersfarming-fishingskydivingown-child0-717002015-01-22 00:00:00Multi-vehicle CollisionRear CollisionMinor DamageOtherSCHillsdale1679 2nd Hwy44YES12YES648907210721050470NissanPathfinder2013Y
295487400192009-06-17 00:00:00OH250/50010001148.730439787FEMALECollegemachine-op-inspctkayakingwife002015-02-22 00:00:00Parked Car?Trivial DamageNoneWVColumbus3998 Flute St61?12YES54009009003600Saab951999N
239422468821999-09-20 00:00:00IL100/3001000969.50464839MALECollegeexec-managerialreadingnot-in-family002015-01-26 00:00:00Vehicle Theft?Trivial DamageNoneNCNorthbrook2430 MLK Ave101NO00?56007007004200AudiA32007N

Data Exploration

We have several string (categorical) columns in our dataset, along with some ints and doubles.

display(df.dtypes)
months_as_customerint
ageint
policy_numberint
policy_bind_datestring
policy_statestring
policy_cslstring
policy_deductableint
policy_annual_premiumdouble
umbrella_limitint
insured_zipint
insured_sexstring
insured_education_levelstring
insured_occupationstring
insured_hobbiesstring
insured_relationshipstring
capital-gainsint
capital-lossint
incident_datestring
incident_typestring
collision_typestring
incident_severitystring
authorities_contactedstring
incident_statestring
incident_citystring
incident_locationstring
incident_hour_of_the_dayint
number_of_vehicles_involvedint
property_damagestring
bodily_injuriesint
witnessesint
police_report_availablestring
total_claim_amountint
injury_claimint
property_claimint
vehicle_claimint
auto_makestring
auto_modelstring
auto_yearint
fraud_reportedstring

Count number of categories for every categorical column (Count Distinct).

# Create a List of Column Names with data type = string
stringColList = [i[0] for i in df.dtypes if i[1] == 'string']
print stringColList
['policy_bind_date', 'policy_state', 'policy_csl', 'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship', 'incident_date', 'incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city', 'incident_location', 'property_damage', 'police_report_available', 'auto_make', 'auto_model', 'fraud_reported']
from pyspark.sql.functions import *

# Create a function that performs a countDistinct(colName)
distinctList = []
def countDistinctCats(colName):
  count = df.agg(countDistinct(colName)).collect()
  distinctList.append(count)
# Apply function on every column in stringColList
map(countDistinctCats, stringColList)
print distinctList
[[Row(count(DISTINCT policy_bind_date)=951)], [Row(count(DISTINCT policy_state)=3)], [Row(count(DISTINCT policy_csl)=3)], [Row(count(DISTINCT insured_sex)=2)], [Row(count(DISTINCT insured_education_level)=7)], [Row(count(DISTINCT insured_occupation)=14)], [Row(count(DISTINCT insured_hobbies)=20)], [Row(count(DISTINCT insured_relationship)=6)], [Row(count(DISTINCT incident_date)=60)], [Row(count(DISTINCT incident_type)=4)], [Row(count(DISTINCT collision_type)=4)], [Row(count(DISTINCT incident_severity)=4)], [Row(count(DISTINCT authorities_contacted)=5)], [Row(count(DISTINCT incident_state)=7)], [Row(count(DISTINCT incident_city)=7)], [Row(count(DISTINCT incident_location)=1000)], [Row(count(DISTINCT property_damage)=3)], [Row(count(DISTINCT police_report_available)=3)], [Row(count(DISTINCT auto_make)=14)], [Row(count(DISTINCT auto_model)=39)], [Row(count(DISTINCT fraud_reported)=2)]]

We have identified that some string columns have many distinct values (900+). We will remove these columns from our dataset in the Data Processing step to improve model accuracy.

  • policy number (1000 distinct)
  • policy bind date (951 distinct. Possible to narrow down to year/month to test model accuracy)
  • insured zip (995 distinct)
  • insured location (1000 distinct)
  • incident date (60 distinct. Excluding, but possible to narrow down to year/month to test model accuracy)

Like most fraud datasets, our label distribution is skewed.

display(df)
0200400600800YNTOOLTIPfraud_reportedfraud_reported
# Count number of frauds vs non-frauds
display(df.groupBy("fraud_reported").count())
Y247
N753

We can quickly create one-click plots using Databricks built-in visualizations to understand our data better.

Click 'Plot Options' to try out different chart types.

# Fraud Count by Incident State
display(df)
+N/A0-5050-100100-150150-200200-250250-300
# Breakdown of Average Vehicle claim by insured's education level, grouped by fraud reported
display(df)
02004006008001,0001,2001,400YN02004006008001,0001,2001,400YN02004006008001,0001,2001,400YN02004006008001,0001,2001,400YN02004006008001,0001,2001,400YN02004006008001,0001,2001,400YN02004006008001,0001,2001,400YNTOOLTIPMDPhDAssociateMastersHigh SchoolCollegeJDfraud_reportedpolicy_annual_premiuminsured_education_levelinsured_education_levelMDPhDAssociateMastersHigh SchoolCollegeJD

Data Processing

Next, we will clean up the data a little and prepare it for our machine learning model.

We will first remove the columns that we have identified earlier that have too many distinct categories and cannot be converted to numeric.