2 SQL DataFrames - Python

ACCESSY_KEY_ID = "AKIAJBRYNXGHORDHZB4A"
SECERET_ACCESS_KEY = "a0BzE1bSegfydr3%2FGE3LSPM6uIV5A4hOUfpH8aFF" 

mounts_list = [
{'bucket':'databricks-corp-training/sf_open_data/', 'mount_folder':'/mnt/sf_open_data'}
]

for mount_point in mounts_list:
  bucket = mount_point['bucket']
  mount_folder = mount_point['mount_folder']
  try:
    dbutils.fs.ls(mount_folder)
    dbutils.fs.unmount(mount_folder)
  except:
    pass
  finally: #If MOUNT_FOLDER does not exist
    dbutils.fs.mount("s3a://"+ ACCESSY_KEY_ID + ":" + SECERET_ACCESS_KEY + "@" + bucket,mount_folder)

The SF OpenData project was launched in 2009 and contains hundreds of datasets from the city and county of San Francisco. Open government data has the potential to increase the quality of life for residents, create more efficient government services, better public decisions, and even new local businesses and services.

In our analysis of SF Fire Department calls, we will be seeking answers the following questions:

How many different types of calls were made to the Fire Department?
How many incidents of each call type were there?
How many incidents of each call type were there?
How many service calls were logged in the past 7 days?
Which neighborhood in SF generated the most calls last year?
What was the primary non-medical reason most people called the fire department from the Tenderloin last year?

%fs ls /mnt/sf_open_data/fire_dept_calls_for_service/


dbfs:/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv	Fire_Department_Calls_for_Service.csv	1634673683

spark

Out[3]: <pyspark.sql.session.SparkSession at 0x7fb731df43d0>

fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType

# Note that we are removing all space characters from the col names to prevent errors when writing to Parquet later

fireSchema = StructType([StructField('CallNumber', IntegerType(), True),
                     StructField('UnitID', StringType(), True),
                     StructField('IncidentNumber', IntegerType(), True),
                     StructField('CallType', StringType(), True),                  
                     StructField('CallDate', StringType(), True),       
                     StructField('WatchDate', StringType(), True),       
                     StructField('ReceivedDtTm', StringType(), True),       
                     StructField('EntryDtTm', StringType(), True),       
                     StructField('DispatchDtTm', StringType(), True),       
                     StructField('ResponseDtTm', StringType(), True),       
                     StructField('OnSceneDtTm', StringType(), True),       
                     StructField('TransportDtTm', StringType(), True),                  
                     StructField('HospitalDtTm', StringType(), True),       
                     StructField('CallFinalDisposition', StringType(), True),       
                     StructField('AvailableDtTm', StringType(), True),       
                     StructField('Address', StringType(), True),       
                     StructField('City', StringType(), True),       
                     StructField('ZipcodeofIncident', IntegerType(), True),       
                     StructField('Battalion', StringType(), True),                 
                     StructField('StationArea', StringType(), True),       
                     StructField('Box', StringType(), True),       
                     StructField('OriginalPriority', StringType(), True),       
                     StructField('Priority', StringType(), True),       
                     StructField('FinalPriority', IntegerType(), True),       
                     StructField('ALSUnit', BooleanType(), True),       
                     StructField('CallTypeGroup', StringType(), True),
                     StructField('NumberofAlarms', IntegerType(), True),
                     StructField('UnitType', StringType(), True),
                     StructField('Unitsequenceincalldispatch', IntegerType(), True),
                     StructField('FirePreventionDistrict', StringType(), True),
                     StructField('SupervisorDistrict', StringType(), True),
                     StructField('NeighborhoodDistrict', StringType(), True),
                     StructField('Location', StringType(), True),
                     StructField('RowID', StringType(), True)])

#Notice that no job is run this time
# Python is not typed language so Datasets don't exists. Instead, we always get back a DataFrame
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, schema=fireSchema)

display(fireServiceCallsDF.limit(5))

142480332

B02

14086309

Alarms

09/05/2014

09/04/2014

09/05/2014 03:15:13 AM

09/05/2014 03:17:26 AM

09/05/2014 03:18:18 AM

09/05/2014 03:20:30 AM

09/05/2014 03:24:11 AM

04/25/2016 01:15:16 PM

Fire

09/05/2014 03:33:20 AM

1600 Block of HAIGHT ST

San Francisco

94117

B05

12

4525

3

true

Alarm

1

CHIEF

3

5

Haight Ashbury

(37.7695711762103, -122.449920089485)

142480332-B02

153022542

T02

15115908

Structure Fire

10/29/2015

10/29/2015 03:39:06 PM

10/29/2015 03:39:25 PM

10/29/2015 03:39:49 PM

10/29/2015 03:40:55 PM

10/29/2015 03:43:21 PM

04/25/2016 01:07:30 PM

Fire

10/29/2015 03:51:21 PM

BATTERY ST/VALLEJO ST

San Francisco

94111

B01

13

1155

3

false

Alarm

1

TRUCK

4

1

3

Financial District/South Beach

(37.7995314468258, -122.401240243673)

153022542-T02

143451112

AM04

14122741

Medical Incident

12/11/2014

12/11/2014 09:02:07 AM

12/11/2014 09:03:01 AM

12/11/2014 09:03:11 AM

12/11/2014 09:06:19 AM

12/11/2014 09:20:16 AM

12/11/2014 09:20:26 AM

12/11/2014 09:43:41 AM

Code 2 Transport

12/11/2014 10:06:26 AM

300 Block of BUENA VISTA AVE

San Francisco

94117

B05

21

5136

3

false

Potentially Life-Threatening

1

PRIVATE

1

5

8

Castro/Upper Market

(37.7668035178194, -122.440704687809)

143451112-AM04

141660300

E01

14057129

Medical Incident

06/15/2014

06/14/2014

06/15/2014 02:04:57 AM

06/15/2014 02:06:42 AM

06/15/2014 02:10:01 AM

06/15/2014 02:12:55 AM

06/15/2014 02:24:55 AM

04/25/2016 01:16:45 PM

Code 2 Transport

06/15/2014 02:51:39 AM

0 Block of HALLAM ST

San Francisco

94103

B03

1

2313

2

true

Non Life-threatening

1

ENGINE

2

6

South of Market

(37.7756902570435, -122.408609057895)

141660300-E01

152633454

E36

15100829

Outside Fire

09/20/2015

09/20/2015 08:15:00 PM

09/20/2015 08:15:53 PM

09/20/2015 08:16:17 PM

09/20/2015 08:18:07 PM

04/25/2016 01:08:14 PM

Fire

09/20/2015 08:22:11 PM

MARKET ST/VAN NESS AV

San Francisco

94103

B02

36

3211

3

true

Fire

1

ENGINE

1

2

6

Mission

(37.7751470741622, -122.419255607214)

152633454-E36

CallNumber	UnitID	IncidentNumber	CallType	CallDate	WatchDate	ReceivedDtTm	EntryDtTm	DispatchDtTm	ResponseDtTm	OnSceneDtTm	TransportDtTm	HospitalDtTm	CallFinalDisposition	AvailableDtTm	Address	City	ZipcodeofIncident	Battalion	StationArea	Box	OriginalPriority	Priority	FinalPriority	ALSUnit	CallTypeGroup	NumberofAlarms	UnitType	Unitsequenceincalldispatch	FirePreventionDistrict	SupervisorDistrict	NeighborhoodDistrict	Location	RowID

+

Exploring the City of San Francisco public data with Apache Spark 2.0

Introduction to Spark

Introduction to Fire Department Calls for Service